This topic explains how to use Crash Diagnostics (Crashd) to diagnose unstable or unresponsive workload clusters based on Photon OS in Tanzu Kubernetes Grid with a standalone management cluster.
For how to use Crashd to diagnose workload clusters deployed by a vSphere with Tanzu Supervisor, see How to collect a diagnostic log bundle from a Tanzu Kubernetes Guest Cluster on vSphere with Tanzu in the VMware Knowledge Base.
Crashd examines the bootstrap workload cluster that the tanzu cluster create
process creates locally using kind
before deploying the cluster to cloud infrastructure.
Crashd is an open source project that makes it easy troubleshoot problems with Kubernetes clusters.
Crashd uses a script file written in Starlark, a Python-like language, that interacts with your management or workload clusters to collect infrastructure and cluster information.
Crashd can collect diagnostics from supported infrastructures including:
Crashd takes the output from the commands run by the script and adds the output to a tar
file. The tar
file is then saved locally for further analysis.
Tanzu Kubernetes Grid includes signed binaries for Crashd and a diagnostics script file for Photon OS workload clusters.
To install or upgrade crashd
, follow the instructions below.
In the version drop-down, select 2.3.1.
Download Crashd for your platform.
Use the tar
command to unpack the binary for your platform.
Linux:
tar -xvf crashd-linux-amd64-v0.3.7-vmware.7.tar.gz
macOS:
tar -xvf crashd-darwin-amd64-v0.3.7-vmware.7.tar.gz
The previous step creates a directory named crashd
with the following files:
crashd
crashd/args
crashd/diagnostics.crsh
crashd/crashd-PLATFORM-amd64-v0.3.7+vmware.7
Move the binary into the /usr/local/bin
folder.
Linux:
mv ./crashd/crashd-linux-amd64-v0.3.7+vmware.7 /usr/local/bin/crashd
macOS:
mv ./crashd/crashd-darwin-amd64-v0.3.7+vmware.7 /usr/local/bin/crashd
When Crashd runs, it takes argument values from an args
file and passes them to a script file, diagnostics.crsh
. The script runs commands to extract information that can help diagnose problems on Photon OS workload clusters.
Prior to running Crashd script diagnostics.crsh
, your local machine must have the following programs on its execution path:
kubectl
scp
ssh
Notewhen investigating problems with a bootstrap cluster, you will need
kind
(v0.7.0 or higher) command installed locally.
Additionally, before you can run Crashd, you must follow these steps:
kubeconfig
file for the management cluster by using command tanzu mc kubeconfig get <management-cluster-name>
.kubeconfig
, public-key
file, the diagnostics.crsh
file, and the args
file are in the same location.Delete any local kind
clusters other than the one that was created to deploy the workload cluster that you are troubleshooting:
docker ps
to identify the currently-running kind
clusterkind
clusters by running kind delete cluster --name CLUSTER-NAME
Navigate to the location where you downloaded and unpacked the Crashd bundle.
In a text editor, overwrite the existing arguments file args
with the code below. This file contains the key/value pairs to pass to the CrashD script:
# ######################################################
# Crashd script argument file
#
# This file defines CLI argument values that are passed
# Crashd when running scripts for troubleshooting TKG
# clusters.
# ######################################################
# target: specifies cluster to target.
# Valid targets are: {bootstrap, mgmt, workload}
target=mgmt
# infra: the underlying infrastructure used by the TKG cluster.
# Valid values are: {vsphere, aws, azure}
infra=vsphere
# workdir: a local directory where collected files are staged.
workdir=./workdir
# ssh_user: the user ID used for SSH connections to cluster nodes.
ssh_user=capv
# ssh_pk_file: the path to the private key file created to SSH
# into cluster nodes.
ssh_pk_file=./capv.pem
# ######################################################
# Management Cluster
# The following arguments are used to collect information
# from a management cluster or named workload clusters.
# ######################################################
# mgmt_cluster_config: the kubeconfig file path for the management cluster.
mgmt_cluster_config=./tkg_cluster_config
# ######################################################
# Workload Cluster
# The following arguments are used to collect information
# from one or more workload clusters that are managed
# by the management cluster configured above.
# ######################################################
# workload_clusters: a comma separated list of workload cluster names
# [uncomment below]
#workload_clusters=tkg-cluster-wc-498
# workload_cluster_ns: the namespace where the workload cluster
# is hosted in the management plane.
# Note: it's actually the namespace in which the secrets/${workload_cluster_name}-kubeconfig
# is created in the management cluster.
# [uncomment below]
#workload_cluster_ns=default
Record the local path to an SSH private key file. If you do not already have an SSH keypair, or want to generate a new one, run ssh-keygen
as described in Create an SSH Key Pair. For example:
ssh-keygen -t rsa -b 4096 -C "[email protected]"
When prompted, enter a local path for the file location. For information about creating the SSH key pairs.
Set the following arguments in the args
file:
target
: Set this value to:
bootstrap
to diagnose a local bootstrap standalone management clustermgmt
to diagnose a deployed standalone management clusterworkload
to diagnose one or more workload clustersinfra
: The underlying infrastructure for your cluster: aws
, azure
, or vsphere
.workdir
: The location where files are collected.ssh_user
: The SSH user used to access cluster machines. For clusters running on vSphere, the user name is capv
.ssh_pk_file
: The path to your SSH private key file.mgmt_cluster_config
The path of the kubeconfig file for the management cluster.To diagnose workload clusters, uncomment and set the following in addition to the arguments listed above:
workload_clusters
: A comma-separated list of workload cluster names from which to collect diagnostics information.workload_cluster_ns
: The namespace in which the secrets/WORKLOAD-CLUSTER-NAME-kubeconfig
is created in the management cluster.Create a Crashd script file diagnostics.crsh
containing the code in Diagnostics File diagnostics.crsh
below.
Run the crashd
command from the location where the script file diagnostics.crsh
and argument file args
are located.
crashd run --args-file args diagnostics.crsh
Optionally, monitor Crashd output. By default, the crashd
command runs silently until completion. However, you can use flag --debug
to view log messages on the screen similar to the following:
crashd run --debug --args-file args diagnostics.crsh
DEBU[0003] creating working directory ./workdir/tkg-kind-12345
DEBU[0003] kube_capture(what=objects)
DEBU[0003] Searching in 20 groups
...
DEBU[0015] Archiving [./workdir/tkg-kind-12345] in bootstrap.tkg-kind-12345.diagnostics.tar.gz
DEBU[0015] Archived workdir/tkg-kind-12345/kind-logs/docker-info.txt
DEBU[0015] Archived workdir/tkg-kind-12345/kind-logs/tkg-kind-12345-control-plane/alternatives.log
DEBU[0015] Archived workdir/tkg-kind-12345/kind-logs/tkg-kind-12345-control-plane/containerd.log
diagnostics.crsh
In the CrashD bundle download, overwrite the existing diagnostics.crsh
file with the following code, as the script to pass to the crashd run
command:
def capture_node_diagnostics(nodes):
capture(cmd="sudo df -i", resources=nodes)
capture(cmd="sudo crictl info", resources=nodes)
capture(cmd="df -h /var/lib/containerd", resources=nodes)
capture(cmd="sudo systemctl status kubelet", resources=nodes)
capture(cmd="sudo systemctl status containerd", resources=nodes)
capture(cmd="sudo journalctl -xeu kubelet", resources=nodes)
capture(cmd="sudo journalctl -xeu containerd", resources=nodes)
capture(cmd="sudo cat /var/log/cloud-init-output.log", resources=nodes)
capture(cmd="sudo cat /var/log/cloud-init.log", resources=nodes)
def capture_windows_node_diagnostics(nodes):
capture(cmd="Get-CimInstance -ClassName Win32_LogicalDisk", file_name="disk_info.out", resources=nodes)
capture(cmd="(Get-ItemProperty -Path c:\\windows\\system32\\hal.dll).VersionInfo.FileVersion",file_name="windows_version_info.out", resources=nodes)
capture(cmd="cat C:\\k\\StartKubelet.ps1 ; cat C:\\var\\lib\\kubelet\\kubeadm-flags.env", resources=nodes)
capture(cmd="Get-Service Kubelet | select * ", resources=nodes)
capture(cmd="Get-Service Containerd | select * ", resources=nodes)
capture(cmd="Get-Service ovs* | select * ", resources=nodes)
capture(cmd="Get-Service antrea-agent | select * ", resources=nodes)
capture(cmd="Get-Service kube-proxy | select * ", resources=nodes)
capture(cmd="Get-Service Kubelet | select * ", resources=nodes)
capture(cmd="Get-HNSNetwork", resources=nodes)
capture(cmd="& 'c:\\Program Files\\containerd\\crictl.exe' -r 'npipe:////./pipe/containerd-containerd' info", resources=nodes)
capture(cmd="Get-MpPreference | select ExclusionProcess", resources=nodes)
capture(cmd="cat c:\\var\\log\\kubelet\\kubelet.exe.INFO", resources=nodes)
capture(cmd="cat c:\\var\\log\\antrea\\antrea-agent.exe.INFO", resources=nodes)
capture(cmd="cat c:\\var\\log\\kube-proxy\\kube-proxy.exe.INFO", resources=nodes)
capture(cmd="cat 'c:\\Program Files\\Cloudbase Solutions\\Cloudbase-Init\\log\\cloudbase-init-unattend.log'", resources=nodes)
capture(cmd="cat 'c:\\Program Files\\Cloudbase Solutions\\Cloudbase-Init\\log\\cloudbase-init.log'", resources=nodes)
copy_from(path="C:\\Windows\\System32\\Winevt\\Logs\\System.evtx", resources=nodes)
copy_from(path="C:\\Windows\\System32\\Winevt\\Logs\\Application.evtx", resources=nodes)
copy_from(path="c:\\openvswitch\\var\\log\\openvswitch\\ovs-vswitchd.log", resources=nodes)
copy_from(path="c:\\openvswitch\\var\\log\\openvswitch\\ovsdb-server.log", resources=nodes)
# fetches a suitable capi provider, for either capa or others (capv/capz),
# to be used for enumerating cluster machines
def fetch_provider(iaas, workload_cluster_name, ssh_cfg, kube_cfg, namespace, filter_labels):
# workaround: vsphere and azure use same provider as they work similarly (see issue #162)
if iaas == "vsphere" or iaas == "azure":
provider = capv_provider(
workload_cluster=workload_cluster_name,
namespace=namespace,
ssh_config=ssh_cfg,
mgmt_kube_config=kube_cfg,
labels=filter_labels
)
else:
provider = capa_provider(
workload_cluster=workload_cluster_name,
namespace=namespace,
ssh_config=ssh_cfg,
mgmt_kube_config=kube_cfg,
labels=filter_labels
)
return provider
# retrieves linux management provider for linux nodes
def fetch_mgmt_provider_linux(infra, ssh_cfg, kube_cfg, ns):
return fetch_provider(infra, '', ssh_cfg, kube_cfg, ns, ["kubernetes.io/os=linux"])
# retrieves windows mgmt provider for windows nodes
def fetch_mgmt_provider_windows(infra, ssh_cfg, kube_cfg, ns):
return fetch_provider(infra, '', ssh_cfg, kube_cfg, ns, ["kubernetes.io/os=windows"])
# retrieves linux workload provider for linux nodes
def fetch_workload_provider_linux(infra, wc_cluster, ssh_cfg, kube_cfg, ns):
return fetch_provider(infra, wc_cluster, ssh_cfg, kube_cfg, ns, ["kubernetes.io/os=linux"])
# retrieves windows workload provider for windodws nodes
def fetch_workload_provider_windows(infra, wc_cluster, ssh_cfg, kube_cfg, ns):
return fetch_provider(infra, wc_cluster, ssh_cfg, kube_cfg, ns, ["kubernetes.io/os=windows"])
def diagnose_mgmt_cluster(infra):
# validation
args.ssh_user
args.ssh_pk_file
args.mgmt_cluster_config
if len(infra) == 0:
print("Infra argument not provided")
return
wd = "{}/tkg-mgmt-cluster".format(args.workdir)
conf = crashd_config(workdir=wd)
ssh_conf = ssh_config(username=args.ssh_user, private_key_path=args.ssh_pk_file)
kube_conf = kube_config(path=args.mgmt_cluster_config)
# fetch linux mgmt node diagnostics
mgmt_provider_linux = fetch_mgmt_provider_linux(infra, ssh_conf, kube_conf, '')
lin_nodes = resources(provider=mgmt_provider_linux)
capture_node_diagnostics(lin_nodes)
# fetch win mgmt node diagnostics
mgmt_provider_win = fetch_mgmt_provider_windows(infra, ssh_conf, kube_conf, '')
win_nodes = resources(provider=mgmt_provider_win)
if len(win_nodes) > 0:
capture_windows_node_diagnostics(win_nodes)
#add code to collect pod info from cluster
set_defaults(kube_config(capi_provider = mgmt_provider_linux))
pods_ns=[
"capi-kubeadm-bootstrap-system",
"capi-kubeadm-control-plane-system",
"capi-system",
"capi-webhook-system",
"cert-manager",
"tkg-system",
"kube-system",
"tkr-system",
"capa-system",
"capv-system",
"capz-system",
]
if infra == "vsphere":
pods_ns.append("tkg-system-networking")
pods_ns.append("avi-system")
kube_capture(what="logs", namespaces=pods_ns)
kube_capture(what="objects", kinds=["pods", "services"], namespaces=pods_ns)
kube_capture(what="objects", kinds=["deployments", "replicasets"], groups=["apps"], namespaces=pods_ns)
kube_capture(what="objects", kinds=["apps"], groups=["kappctrl.k14s.io"], namespaces=["tkg-system"])
kube_capture(what="objects", kinds=["tanzukubernetesreleases"], groups=["run.tanzu.vmware.com"])
kube_capture(what="objects", kinds=["configmaps"], namespaces=["tkr-system"])
kube_capture(what="objects", categories=["cluster-api"])
kube_capture(what="objects", groups=["ipam.cluster.x-k8s.io"])
if infra == "vsphere":
kube_capture(what="objects", kinds=["akodeploymentconfigs"])
archive(output_file="tkg-mgmt.diagnostics.tar.gz", source_paths=[conf.workdir])
def diagnose_workload_cluster(infra, name):
# validation
args.infra
args.ssh_user
args.ssh_pk_file
args.mgmt_cluster_config
workload_ns = args.workload_cluster_ns
if len(infra) == 0:
print("Infra argument not provided")
return
wd = "{}/{}".format(args.workdir, name)
conf = crashd_config(workdir=wd)
ssh_conf = ssh_config(username=args.ssh_user, private_key_path=args.ssh_pk_file)
kube_conf = kube_config(path=args.mgmt_cluster_config)
# fetch linux workload node diagnostics
wc_provider_linux = fetch_workload_provider_linux(infra, name, ssh_conf, kube_conf, workload_ns)
lin_nodes = resources(provider=wc_provider_linux)
capture_node_diagnostics(lin_nodes)
# fetch win workload node diagnostics
wc_provider_win = fetch_workload_provider_windows(infra, name, ssh_conf, kube_conf, workload_ns)
win_nodes = resources(provider=wc_provider_win)
if len(win_nodes) > 0:
capture_windows_node_diagnostics(win_nodes)
#add code to collect pod info from cluster
set_defaults(kube_config(capi_provider = wc_provider_linux))
pods_ns=["default", "kube-system", "tkg-system"]
if infra == "vsphere":
pods_ns.append("tkg-system-networking")
pods_ns.append("avi-system")
kube_capture(what="logs", namespaces=pods_ns)
kube_capture(what="objects", kinds=["pods", "services"], namespaces=pods_ns)
kube_capture(what="objects", kinds=["deployments", "replicasets"], groups=["apps"], namespaces=pods_ns)
kube_capture(what="objects", kinds=["apps"], groups=["kappctrl.k14s.io"], namespaces=["tkg-system"])
if infra == "vsphere":
kube_capture(what="objects", kinds=["akodeploymentconfigs"])
archive(output_file="{}.diagnostics.tar.gz".format(name), source_paths=[conf.workdir])
# extract diagnostic info from local kind boostrap cluster
def diagnose_bootstrap_cluster():
p = prog_avail_local("kind")
if p == "":
print("Error: kind is not available")
return
clusters=get_tkg_bootstrap_clusters()
if len(clusters) == 0:
print("No tkg-kind bootstrap cluster found")
return
pod_ns=[
"caip-in-cluster-system",
"capi-kubeadm-bootstrap-system",
"capi-kubeadm-control-plane-system",
"capi-system",
"capi-webhook-system",
"capv-system",
"capa-system",
"capz-system",
"cert-manager",
"tkg-system",
"tkg-system-networking",
"avi-system",
]
# for each tkg-kind cluster:
# - capture kind logs, export kubecfg, and api objects
for kind_cluster in clusters:
wd = "{}/{}".format(args.workdir, kind_cluster)
run_local("kind export logs --name {} {}/kind-logs".format(kind_cluster, wd))
kind_cfg = capture_local(
cmd="kind get kubeconfig --name {0}".format(kind_cluster),
workdir="./",
file_name="{}.kubecfg".format(kind_cluster)
)
conf = crashd_config(workdir=wd)
set_defaults(kube_config(path=kind_cfg))
kube_capture(what="objects", kinds=["pods", "services"], namespaces=pod_ns)
kube_capture(what="objects", kinds=["deployments", "replicasets"], groups=["apps"], namespaces=pod_ns)
kube_capture(what="objects", categories=["cluster-api"])
kube_capture(what="objects", kinds=["akodeploymentconfigs"])
archive(output_file="bootstrap.{}.diagnostics.tar.gz".format(kind_cluster), source_paths=[conf.workdir])
# return tkg clusters in kind (tkg-kind-xxxx)
def get_tkg_bootstrap_clusters():
clusters = run_local("kind get clusters").split('\n')
result = []
for cluster in clusters:
if cluster.startswith("tkg-kind"):
result.append(cluster)
return result
def check_prereqs():
# validate args
args.workdir
p = prog_avail_local("ssh")
if p == "":
print("Error: ssh is not available")
return False
p = prog_avail_local("scp")
if p == "":
print("Error: scp is not available")
return False
p = prog_avail_local("kubectl")
if p == "":
print("Error: kubectl is not available")
return False
return True
def diagnose(target, infra):
# validation
if not check_prereqs():
print("Error: One or more prerequisites are missing")
return
# run diagnostics
if target == "bootstrap":
diagnose_bootstrap_cluster()
elif target == "mgmt":
diagnose_mgmt_cluster(infra)
elif target == "workload":
for name in args.workload_clusters.split(","):
diagnose_workload_cluster(infra, name)
else:
print("Error: unknown target {}".format(target))
diagnose(args.target, args.infra)