Crash Recovery and Diagnostics for Kubernetes is an open source project that helps you to investigate and troubleshoot unhealthy or unresponsive Kubernetes clusters. It automates the diagnosis of problem clusters that might be in an unstable state, or even inoperable. Crash Recovery and Diagnostics provides you with the ability to automatically collect machine states and other information from each node in a cluster.

To specify the resources to collect from cluster machines, you declare a series of commands in a file called a diagnostics file. Like a Dockerfile, the diagnostics file is a collection of line-by-line directives, each of which runs a command on every specified cluster machine. The output of the commands is then added to a tar file and saved for further analysis.

Tanzu Kubernetes Grid includes signed binaries for Crash Recovery and Diagnostics and a default diagnostics file for Photon OS Tanzu Kubernetes clusters.

Install or Upgrade the Crash Recovery and Diagnostics Binary

This procedure assumes that you are using Tanzu Kubernetes Grid 1.1.2 or 1.1.3. Tanzu Kubernetes Grid 1.1.0 includes Crash Recovery and Diagnostics v0.2.2-vmware.2. To upgrade Crash Recovery and Diagnostics from v0.2.2-vmware.2 to v0.2.2-vmware.3, follow the installation procedure below, which replaces the existing binary with the new one.

  1. Go to https://www.vmware.com/go/get-tkg and log in with your My VMware credentials.
  2. Download the bundle for Crash Recovery and Diagnostics for your platform.

    • Linux: Crash Recovery and Diagnostics for Kubernetes 0.2.2 Linux
    • Mac OS: Crash Recovery and Diagnostics for Kubernetes 0.2.2 Mac
  3. Use either the tar command or the extraction tool of your choice to unpack the crash-diagnostics-v0.2.2.tar bundle.

  4. Use the gunzip command to unpack the binary for your platform.

    • Linux:
      gunzip crash-diagnostics-linux-v0.2.2-vmware.3.gz
      
    • Mac OS:
      gunzip crash-diagnostics-darwin-v0.2.2-vmware.3.gz
      
  5. Move the binary into the /usr/local/bin folder.

    • Linux:
      mv ./crash-diagnostics-linux-v0.2.2-vmware.3 /usr/local/bin/crash-diagnostics
      
    • Mac OS:
      mv ./crash-diagnostics-darwin-v0.2.2-vmware.3 /usr/local/bin/crash-diagnostics
      
  6. Make the file executable.

    chmod +x /usr/local/bin/crash-diagnostics
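
Steps 4 through 6 can be collapsed into a small script. The following is a minimal sketch, assuming that the compressed binaries from the bundle are in the current directory and that you have write access to /usr/local/bin (otherwise, run it with sudo). The helper names `binary_name` and `install_crash_diagnostics` are hypothetical and are not part of the bundle.

```shell
#!/bin/sh
# Sketch: pick the platform-specific binary name, then install it.
# binary_name and install_crash_diagnostics are hypothetical helpers.

# Prints the binary name for a platform string as reported by `uname -s`.
binary_name() {
  case "$1" in
    Linux)  echo "crash-diagnostics-linux-v0.2.2-vmware.3" ;;
    Darwin) echo "crash-diagnostics-darwin-v0.2.2-vmware.3" ;;
    *)      echo "unsupported platform: $1" >&2; return 1 ;;
  esac
}

# Unpacks the binary, moves it into place, and makes it executable.
install_crash_diagnostics() {
  name=$(binary_name "$(uname -s)") || return 1
  gunzip "${name}.gz" &&
    mv "./${name}" /usr/local/bin/crash-diagnostics &&
    chmod +x /usr/local/bin/crash-diagnostics
}

# Run the install only when the compressed binary is actually present.
if name=$(binary_name "$(uname -s)" 2>/dev/null) && [ -f "${name}.gz" ]; then
  install_crash_diagnostics
fi
```

The guard at the end means that the script does nothing if the expected .gz file is not in the current directory.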
    

Run Crash Recovery and Diagnostics on Photon OS Tanzu Kubernetes Grid Clusters

The Crash Recovery and Diagnostics bundle that Tanzu Kubernetes Grid provides includes a default Crash Recovery and Diagnostics file, Diagnostics.file, that you can use to diagnose problems on Photon OS management clusters and Tanzu Kubernetes clusters that you deploy on vSphere from Tanzu Kubernetes Grid.

Prerequisites

  • Crash Recovery and Diagnostics requires an SSH private/public key pair.
  • Ensure that your Tanzu Kubernetes Grid VMs are configured to use your SSH public key.
  • Collect the IP addresses of your workload Tanzu Kubernetes cluster VMs.
  • Extract the kubeconfig file from the management cluster by running tkg get credentials <management-cluster-name>.
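
Before you edit the diagnostics file, it can help to confirm that the key and the node VMs are usable. The following is a minimal sketch, assuming the standard OpenSSH client is installed. `check_key` and `check_ssh` are hypothetical helper names, and the key path and IP address in the example comment are placeholders. Because `BatchMode=yes` disables interactive prompts, the SSH check also fails if the node's host key has not yet been accepted.

```shell
#!/bin/sh
# Sketch: preflight checks before running crash-diagnostics.
# check_key and check_ssh are hypothetical helper names.

# Succeeds when the private key file exists and is readable.
# Arguments: <private-key-path>
check_key() {
  [ -r "$1" ]
}

# Succeeds when a non-interactive SSH login to the node works.
# Arguments: <user> <private-key-path> <node-ip>
check_ssh() {
  ssh -i "$2" -o BatchMode=yes -o ConnectTimeout=5 "$1@$3" true
}

# Example with placeholder values:
#   check_key "$HOME/.ssh/id_rsa" && check_ssh capv "$HOME/.ssh/id_rsa" 10.0.0.5
```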

Procedure

  1. Navigate to the location in which you downloaded and unpacked the Crash Recovery and Diagnostics bundle, and open Diagnostics.file in a text editor.

    For example, use vi to edit the file.

    vi Diagnostics.file
    

    The file contains a series of commands that run sequentially on the cluster VMs:

    # FROM specifies a space-separated host list used to get diagnostics data.
    # Retries is the maximum number of connection attempts made to connect to hosts.
    FROM hosts:"<space-separated list of ip:port>" retries:"20"
    
    # Environment variables used later in the file
    ENV SSH_USER=<ssh-user>
    ENV SSH_KEY=<private-ssh-key>
    
    # AUTHCONFIG configures authentication for remote connection to cluster machines
    # specified in the FROM declaration above. Each remote connection
    # will use the specified username and private-key.
    AUTHCONFIG username:$SSH_USER private-key:$SSH_KEY
    
    # WORKDIR specifies a location on disk where the tool stages files
    # before they are bundled.
    WORKDIR <directory/path/>
    
    # OUTPUT specifies a path for the generated tar.gz file output bundle
    OUTPUT <file/path/name>.tar.gz
    
    # Capture runtime info from each cluster machine
    CAPTURE sudo df
    CAPTURE sudo df -i
    CAPTURE sudo ifconfig -a
    CAPTURE sudo rpm -qa
    CAPTURE sudo netstat -anp
    CAPTURE sudo netstat -aens
    CAPTURE sudo route
    CAPTURE sudo mount
    CAPTURE sudo dmesg
    CAPTURE sudo free
    CAPTURE sudo uptime
    CAPTURE sudo date
    CAPTURE sudo ps auwwx --sort -rss
    CAPTURE sudo bash -c "ulimit -a"
    CAPTURE sudo bash -c "umask"
    CAPTURE sudo cat /proc/meminfo
    CAPTURE sudo cat /proc/cpuinfo
    CAPTURE sudo cat /proc/vmstat
    CAPTURE sudo cat /proc/swaps
    CAPTURE sudo cat /proc/mounts
    CAPTURE sudo arp -a
    CAPTURE sudo env
    CAPTURE sudo top -d 5 -n 5 -b
    CAPTURE sudo docker ps -a
    CAPTURE sudo iptables -L -n
    CAPTURE sudo systemctl status kubelet
    CAPTURE sudo systemctl status docker
    CAPTURE sudo journalctl -xeu kubelet
    CAPTURE sudo cat /var/log/cloud-init-output.log
    CAPTURE sudo cat /var/log/cloud-init.log
    
    # KUBECONFIG specifies the location of a kubeconfig file
    # used by subsequent Kubernetes commands below.
    KUBECONFIG <file/path/to>/kubeconfig
    
    # Retrieve API objects and logs
    KUBEGET objects kinds:"pods"
    KUBEGET logs 
    
  2. Update the following elements with information about your cluster.

    • FROM: A space-separated list of cluster node VM addresses, each in ip:port form, as the comment in the file describes.
    • ENV SSH_USER: The SSH user name for the cluster. For clusters running on vSphere, the user name is capv.
    • ENV SSH_KEY: The path to your SSH private key file. For information about creating the SSH key pairs, see Create an SSH Key Pair in Deploy a Management Cluster to vSphere.
    • WORKDIR: The location in which to prepare files before they are bundled into the tar output file.
    • OUTPUT: The location and name of the tar output file.
    • KUBECONFIG: The path to the configuration file for the cluster or clusters.
  3. Run Crash Recovery and Diagnostics.

    Run the crash-diagnostics run command from the directory that contains Diagnostics.file.

    crash-diagnostics run
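
To illustrate how the placeholders in step 2 fit together, the following sketch writes a trimmed diagnostics file from shell arguments. It uses only directives that appear in the default file above. `write_diagnostics_file` is a hypothetical helper, and every value in the example call (IP addresses, port 22, paths) is a placeholder chosen for illustration, not a default of the tool.

```shell
#!/bin/sh
# Sketch: generate a trimmed diagnostics file from shell arguments.
# write_diagnostics_file is a hypothetical helper.
# Arguments: <output-file> <hosts> <ssh-user> <ssh-key> <workdir> <bundle> <kubeconfig>
write_diagnostics_file() {
  # \$ keeps $SSH_USER and $SSH_KEY literal so the tool expands them itself.
  cat > "$1" <<EOF
FROM hosts:"$2" retries:"20"
ENV SSH_USER=$3
ENV SSH_KEY=$4
AUTHCONFIG username:\$SSH_USER private-key:\$SSH_KEY
WORKDIR $5
OUTPUT $6
CAPTURE sudo df
CAPTURE sudo systemctl status kubelet
CAPTURE sudo journalctl -xeu kubelet
KUBECONFIG $7
KUBEGET objects kinds:"pods"
KUBEGET logs
EOF
}

# Example with placeholder values:
#   write_diagnostics_file my-diagnostics.file "10.0.0.5:22 10.0.0.6:22" capv \
#     "$HOME/.ssh/id_rsa" /tmp/crashdir /tmp/my-cluster.tar.gz "$HOME/kubeconfig"
```

Writing to a separate file such as my-diagnostics.file leaves the default Diagnostics.file from the bundle untouched.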
    

Crash Recovery and Diagnostics Options

When you run the crash-diagnostics run command, by default Crash Recovery and Diagnostics searches for and executes a diagnostics script file named ./Diagnostics.file.

crash-diagnostics run

You can specify --file to run Crash Recovery and Diagnostics from a different diagnostics file.

crash-diagnostics run --file my-diagnostics.file 

You can specify the output file to be generated by the tool by using --output, which overrides the default value.

crash-diagnostics run --file my-diagnostics.file --output my-cluster.tar.gz
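
If you collect diagnostics repeatedly, a fixed output name is overwritten or becomes ambiguous. One possible approach, sketched here, is to wrap the command so that each run gets a timestamped bundle name. `bundle_name` and `run_diagnostics` are hypothetical helpers; only the --file and --output flags described above are used.

```shell
#!/bin/sh
# Sketch: give each run a unique, timestamped bundle name.
# bundle_name and run_diagnostics are hypothetical helpers.

# Prints <prefix>-YYYYmmdd-HHMMSS.tar.gz
# Arguments: <prefix>
bundle_name() {
  echo "$1-$(date +%Y%m%d-%H%M%S).tar.gz"
}

# Runs the tool with the given diagnostics file and a timestamped output.
# Arguments: <diagnostics-file> <bundle-prefix>
run_diagnostics() {
  crash-diagnostics run --file "$1" --output "$(bundle_name "$2")"
}

# Example:
#   run_diagnostics my-diagnostics.file my-cluster
```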

If you specify the --debug flag, you see log messages on the screen similar to the following:

$> crash-diagnostics run --debug

DEBU[0000] Parsing script file
DEBU[0000] Parsing [1: FROM local]
DEBU[0000] FROM parsed OK
DEBU[0000] Parsing [2: WORKDIR /tmp/crashdir]
...
DEBU[0000] Archiving [/tmp/crashdir] in out.tar.gz
DEBU[0000] Archived /tmp/crashdir/local/df_-i.txt
DEBU[0000] Archived /tmp/crashdir/local/lsof_-i.txt
DEBU[0000] Archived /tmp/crashdir/local/netstat_-an.txt
DEBU[0000] Archived /tmp/crashdir/local/ps_-ef.txt
DEBU[0000] Archived /tmp/crashdir/local/var/log/syslog
INFO[0000] Created archive out.tar.gz
INFO[0002] Created archive out.tar.gz
INFO[0002] Output done
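
After the run completes, you can inspect the bundle without unpacking it. The following is a minimal sketch that uses only standard tar options. `list_bundle` and `show_bundle_file` are hypothetical helpers, and the paths inside a real bundle depend on your diagnostics file, so list the archive first rather than assuming a layout.

```shell
#!/bin/sh
# Sketch: inspect a diagnostics bundle without unpacking it.
# list_bundle and show_bundle_file are hypothetical helpers built on tar.

# Lists every path stored in the archive.
# Arguments: <bundle>
list_bundle() {
  tar -tzf "$1"
}

# Prints one archived file to standard output.
# Arguments: <bundle> <path-inside-bundle>
show_bundle_file() {
  tar -xzOf "$1" "$2"
}

# Example: list the archive, then print one of the listed paths.
#   list_bundle out.tar.gz
#   show_bundle_file out.tar.gz <path-from-listing>
```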