Crash Recovery and Diagnostics for Kubernetes is an open source project that helps you to investigate and troubleshoot unhealthy or unresponsive Kubernetes clusters. It automates the diagnosis of problem clusters that might be in an unstable state, or even inoperable. Crash Recovery and Diagnostics provides you with the ability to automatically collect machine states and other information from each node in a cluster.
To specify the resources to collect from cluster machines, you declare a series of commands in a file called a diagnostics file. Like a Dockerfile, the diagnostics file is a collection of line-by-line directives with commands that are executed on each specified cluster machine. The output of the commands is then added to a tar file and saved for further analysis.
Tanzu Kubernetes Grid includes signed binaries for Crash Recovery and Diagnostics and a default diagnostics file for Photon OS Tanzu Kubernetes clusters.
Use either the tar command or the extraction tool of your choice to unpack the bundle:
tar -xzf crash-diagnostics-v0.2.2.tar.gz
Use the gunzip command to unpack the binary for your platform.
Move the binary into the /usr/local/bin folder.
mv ./crash-diagnostics /usr/local/bin/crash-diagnostics
Make the file executable.
chmod +x /usr/local/bin/crash-diagnostics
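The move and permission steps above can be sketched as one shell function. This is a minimal sketch, not something shipped in the Tanzu Kubernetes Grid bundle: the function name install_crash_diagnostics and its two arguments are hypothetical, and it assumes the binary has already been unpacked with gunzip. Writing to /usr/local/bin usually requires elevated privileges.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the bundle): move an unpacked binary
# into a bin directory and mark it executable.
#   $1 = path to the unpacked binary
#   $2 = destination directory (defaults to /usr/local/bin, as in the steps above)
install_crash_diagnostics() {
  local src="$1"
  local dest_dir="${2:-/usr/local/bin}"

  if [ ! -f "$src" ]; then
    echo "binary not found: $src" >&2
    return 1
  fi

  mkdir -p "$dest_dir" || return 1
  mv "$src" "$dest_dir/crash-diagnostics" || return 1
  chmod +x "$dest_dir/crash-diagnostics"
}
```

For example, install_crash_diagnostics ./crash-diagnostics performs the same two steps as the mv and chmod commands above.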
The Crash Recovery and Diagnostics bundle that Tanzu Kubernetes Grid provides includes a default Crash Recovery and Diagnostics file, Diagnostics.file, that you can use to diagnose problems on Photon OS management clusters and Tanzu Kubernetes clusters that you deploy on vSphere from Tanzu Kubernetes Grid.
Obtain the kubeconfig file from the management cluster by running tkg get credentials <management-cluster-name>.
Navigate to the location in which you downloaded and unpacked the Crash Recovery and Diagnostics bundle, and open Diagnostics.file in a text editor. For example, use vi to edit the file.
The file contains a series of commands that run sequentially on the cluster VMs:
# FROM specifies a space-separated host list used to get diagnostics data.
# Retries is the maximum number of connection attempts made to connect to hosts.
FROM hosts:"<space-separated list of ip:port>" retries:"20"

# Environment variables used later in the file
ENV SSH_USER=<ssh-user>
ENV SSH_KEY=<private-ssh-key>

# AUTHCONFIG configures authentication for remote connection to cluster machines
# specified in the FROM declaration above. Each remote connection
# will use the specified username and private-key.
AUTHCONFIG username:$SSH_USER private-key:$SSH_KEY

# WORKDIR specifies a location on disk where the tool stages files
# before they are bundled.
WORKDIR <directory/path/>

# OUTPUT specifies a path for the generated tar.gz file output bundle
OUTPUT <file/path/name>.tar.gz

# Capture run time info from each cluster machines
CAPTURE sudo df
CAPTURE sudo df -i
CAPTURE sudo ifconfig -a
CAPTURE sudo rpm -qa
CAPTURE sudo netstat -anp
CAPTURE sudo netstat -aens
CAPTURE sudo route
CAPTURE sudo mount
CAPTURE sudo dmesg
CAPTURE sudo free
CAPTURE sudo uptime
CAPTURE sudo date
CAPTURE sudo ps auwwx --sort -rss
CAPTURE sudo bash -c "ulimit -a"
CAPTURE sudo bash -c "umask"
CAPTURE sudo cat /proc/meminfo
CAPTURE sudo cat /proc/cpuinfo
CAPTURE sudo cat /proc/vmstat
CAPTURE sudo cat /proc/swaps
CAPTURE sudo cat /proc/mounts
CAPTURE sudo arp -a
CAPTURE sudo env
CAPTURE sudo top -d 5 -n 5 -b
CAPTURE sudo docker ps -a
CAPTURE sudo iptables -L -n
CAPTURE sudo systemctl status kubelet
CAPTURE sudo systemctl status docker
CAPTURE sudo journalctl -xeu kubelet
CAPTURE sudo cat /var/log/cloud-init-output.log
CAPTURE sudo cat /var/log/cloud-init.log

# KUBECONFIG specifies the location of a kubeconfig file
# used by subsequent Kubernetes commands below.
KUBECONFIG <file/path/to>/kubeconfig

# Retrieve API objects and logs
KUBEGET objects kinds:"pods"
KUBEGET logs
Update the following elements with information about your cluster.
FROM: Add a space-separated list of cluster node VM addresses, each in ip:port form, as described in the comment in the file.
ENV SSH_USER: The SSH user name for the cluster. For clusters running on vSphere, the user name is
ENV SSH_KEY: For information about creating SSH key pairs, see Create an SSH Key Pair in Prepare to Deploy the Management Cluster to vSphere.
WORKDIR: The location in which to prepare files before they are bundled into the tar.gz file.
OUTPUT: The location and name of the generated tar.gz file.
KUBECONFIG: The path to the configuration file for the cluster or clusters.
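As an illustration of these substitutions, the following sketch writes a filled-in diagnostics file from shell arguments. It is an assumption about how you might script the edit, not a feature of the tool: the function name write_diagnostics_file is hypothetical, every value is supplied by the caller, and the CAPTURE list is shortened to two entries from the default file. The host list is space-separated ip:port entries, as the FROM comment in the file describes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a diagnostics file with the placeholders filled in.
# None of the argument values are defaults of Crash Recovery and Diagnostics.
write_diagnostics_file() {
  local target="$1"      # file to write, for example ./Diagnostics.file
  local hosts="$2"       # space-separated ip:port list
  local ssh_user="$3"    # SSH user name for the cluster
  local ssh_key="$4"     # path to the private SSH key
  local workdir="$5"     # staging directory
  local output="$6"      # path of the generated tar.gz file
  local kubeconfig="$7"  # path to the cluster kubeconfig

  # \$ keeps $SSH_USER and $SSH_KEY literal so the tool expands them, not the shell.
  cat > "$target" <<EOF
FROM hosts:"${hosts}" retries:"20"
ENV SSH_USER=${ssh_user}
ENV SSH_KEY=${ssh_key}
AUTHCONFIG username:\$SSH_USER private-key:\$SSH_KEY
WORKDIR ${workdir}
OUTPUT ${output}
CAPTURE sudo df
CAPTURE sudo systemctl status kubelet
KUBECONFIG ${kubeconfig}
KUBEGET objects kinds:"pods"
KUBEGET logs
EOF
}
```

A call might look like write_diagnostics_file ./Diagnostics.file "10.0.0.11:22 10.0.0.12:22" my-ssh-user ~/.ssh/id_rsa /tmp/crashdir /tmp/my-cluster.tar.gz ~/.kube/config, where all of the addresses, names, and paths are made-up examples.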
Run Crash Recovery and Diagnostics.
Run the crash-diagnostics run command from the location in which Diagnostics.file is located.
When you run the crash-diagnostics run command, by default Crash Recovery and Diagnostics searches for and executes a diagnostics script file named Diagnostics.file.
You can specify --file to run Crash Recovery and Diagnostics from a different diagnostics file. For example, if you have Tanzu Kubernetes Grid Plus support and you can engage with Tanzu Kubernetes Grid Plus Customer Reliability Engineers to resolve a problem, they might provide you with a custom diagnostics file to run.
crash-diagnostics run --file my-diagnostics.file
You can specify the output file to be generated by the tool by using --output, which overrides the default value.
crash-diagnostics run --file my-diagnostics.file --output my-cluster.tar.gz
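The --file and --output options can be combined in a small wrapper that checks for the diagnostics file before starting a run. This is a sketch under stated assumptions: the function name run_diagnostics and the CRASH_DIAGNOSTICS_BIN variable are hypothetical, and it assumes that both options are passed to the run subcommand used elsewhere in this topic.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the crash-diagnostics run command.
#   $1 = diagnostics file (defaults to Diagnostics.file in the current directory)
#   $2 = optional output archive name, passed through --output
# CRASH_DIAGNOSTICS_BIN is an illustrative override for the binary location.
run_diagnostics() {
  local file="${1:-Diagnostics.file}"
  local output="${2:-}"
  local bin="${CRASH_DIAGNOSTICS_BIN:-crash-diagnostics}"

  if [ ! -f "$file" ]; then
    echo "diagnostics file not found: $file" >&2
    return 1
  fi

  # Build the argument list; --output is added only when a name was given.
  set -- run --file "$file"
  if [ -n "$output" ]; then
    set -- "$@" --output "$output"
  fi

  "$bin" "$@"
}
```

Calling run_diagnostics my-diagnostics.file my-cluster.tar.gz then corresponds to the command shown above, with an early error if the file is missing.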
If you specify the --debug flag, you see log messages on the screen similar to the following:
$> crash-diagnostics run --debug
DEBU Parsing script file
DEBU Parsing [1: FROM local]
DEBU FROM parsed OK
DEBU Parsing [2: WORKDIR /tmp/crasdir]
...
DEBU Archiving [/tmp/crashdir] in out.tar.gz
DEBU Archived /tmp/crashdir/local/df_-i.txt
DEBU Archived /tmp/crashdir/local/lsof_-i.txt
DEBU Archived /tmp/crashdir/local/netstat_-an.txt
DEBU Archived /tmp/crashdir/local/ps_-ef.txt
DEBU Archived /tmp/crashdir/local/var/log/syslog
INFO Created archive out.tar.gz
INFO Created archive out.tar.gz
INFO Output done
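After the archive is created, you can inspect it with standard tools before passing it on for analysis. The sketch below is one way to do that, not a feature of Crash Recovery and Diagnostics: the function names list_captured_files and extract_bundle are hypothetical, and out.tar.gz is simply the archive name from the sample log.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting an output bundle with tar.

# Print the paths stored in the bundle, one per line.
list_captured_files() {
  local bundle="$1"
  if [ ! -f "$bundle" ]; then
    echo "bundle not found: $bundle" >&2
    return 1
  fi
  tar -tzf "$bundle"
}

# Unpack the bundle into a directory, creating the directory if needed.
extract_bundle() {
  local bundle="$1"
  local dest="$2"
  mkdir -p "$dest" || return 1
  tar -xzf "$bundle" -C "$dest"
}
```

For example, list_captured_files out.tar.gz prints the archived paths, and extract_bundle out.tar.gz ./unpacked unpacks them for closer reading.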