This topic explains how to use Crash Diagnostics (Crashd) to diagnose unstable or unresponsive workload clusters based on Photon OS in Tanzu Kubernetes Grid with a standalone management cluster.
For how to use Crashd to diagnose workload clusters deployed by a vSphere with Tanzu Supervisor, see How to collect a diagnostic log bundle from Tanzu Kubernetes Guest Clusters on vSphere with Tanzu in the VMware Knowledge Base.
Crashd examines the bootstrap workload cluster that the tanzu cluster create process creates locally with kind before the cluster is deployed to cloud infrastructure.
Crashd is an open source project that makes it easy to troubleshoot Kubernetes cluster problems.
Crashd uses a script file written in Starlark, a Python-like language, that interacts with your management or workload clusters to collect infrastructure and cluster information.
Crashd can collect diagnostic information from supported infrastructures, including AWS, Azure, and vSphere.
Crashd takes the output from the commands that the script runs and adds it to a tar file. The tar file is then saved locally for further analysis.
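You can inspect the resulting archive with standard tar commands. A minimal sketch, assuming a management cluster run that produced the archive name used by the bundled script (tkg-mgmt.diagnostics.tar.gz); the exact file name and location depend on the target you diagnose:
# list the files captured in the archive
tar -tzf tkg-mgmt.diagnostics.tar.gz
# unpack it into a scratch directory for analysis
mkdir -p ./extracted
tar -xzf tkg-mgmt.diagnostics.tar.gz -C ./extracted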
Tanzu Kubernetes Grid includes signed binaries for Crashd and a diagnostics script file for Photon OS workload clusters.
To install or upgrade crashd, follow the instructions below.
Download Crashd for your platform.
Use the tar command to unpack the binary for your platform.
Linux:
tar -xvf crashd-linux-amd64-v0.3.7-vmware.7.tar.gz
MacOS:
tar -xvf crashd-darwin-amd64-v0.3.7-vmware.7.tar.gz
The previous step creates a directory named crashd that contains the following files:
crashd
crashd/args
crashd/diagnostics.crsh
crashd/crashd-PLATFORM-amd64-v0.3.7+vmware.7
Move the binary into the /usr/local/bin folder.
Linux:
mv ./crashd/crashd-linux-amd64-v0.3.7+vmware.7 /usr/local/bin/crashd
MacOS:
mv ./crashd/crashd-darwin-amd64-v0.3.7+vmware.7 /usr/local/bin/crashd
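To confirm that the binary is now discoverable, you can run a quick check such as the following (assuming /usr/local/bin is on your PATH):
which crashd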
When Crashd runs, it takes argument values from the args file and passes them to a script file, diagnostics.crsh. The script runs commands to extract information that can help diagnose problems on Photon OS workload clusters.
Before running the Crashd script diagnostics.crsh, your local machine must have the following programs on its execution path:
kubectl
scp
ssh
Note: To investigate problems with a bootstrap cluster, the kind command (v0.7.0 or later) must also be installed locally. A quick prerequisite check is sketched below.
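One way to verify these prerequisites is to check that each command resolves on your path; kind is only required when diagnosing a bootstrap cluster:
for cmd in kubectl scp ssh kind; do
  command -v "$cmd" >/dev/null 2>&1 || echo "missing: $cmd"
done
kind version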
In addition, you must complete the following steps before running Crashd (example commands follow this list):
- Run tanzu mc kubeconfig get <management-cluster-name> to obtain the kubeconfig file for the management cluster.
- Make sure that the kubeconfig file, the public-key file, the diagnostics.crsh file, and the args file are in the same location.
- Delete any local kind clusters other than the one that was created to deploy the workload cluster that you are troubleshooting:
  - Run docker ps to identify the currently running kind cluster.
  - Run kind delete cluster --name CLUSTER-NAME to delete any other kind clusters.
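For example, the preparation might look like the following. The cluster names and the --admin and --export-file flags are illustrative assumptions; adjust them to your environment and tanzu CLI version:
# save the management cluster kubeconfig next to the args file
tanzu mc kubeconfig get my-mgmt-cluster --admin --export-file ./tkg_cluster_config
# identify the currently running kind clusters
docker ps
kind get clusters
# delete any kind cluster other than the tkg-kind bootstrap cluster you are troubleshooting
kind delete cluster --name some-other-kind-cluster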
Navigate to the location where you downloaded and unpacked the Crashd bundle.
In a text editor, overwrite the existing arguments file args with the following code. This file contains key/value pairs to pass to the Crashd script:
# ######################################################
# Crashd script argument file
#
# This file defines CLI argument values that are passed
# to Crashd when running scripts for troubleshooting TKG
# clusters.
# ######################################################
# target: specifies cluster to target.
# Valid targets are: {bootstrap, mgmt, workload}
target=mgmt
# infra: the underlying infrastructure used by the TKG cluster.
# Valid values are: {vsphere, aws, azure}
infra=vsphere
# workdir: a local directory where collected files are staged.
workdir=./workdir
# ssh_user: the user ID used for SSH connections to cluster nodes.
ssh_user=capv
# ssh_pk_file: the path to the private key file created to SSH
# into cluster nodes.
ssh_pk_file=./capv.pem
# ######################################################
# Management Cluster
# The following arguments are used to collect information
# from a management cluster or named workload clusters.
# ######################################################
# mgmt_cluster_config: the kubeconfig file path for the management cluster.
mgmt_cluster_config=./tkg_cluster_config
# ######################################################
# Workload Cluster
# The following arguments are used to collect information
# from one or more workload clusters that are managed
# by the management cluster configured above.
# ######################################################
# workload_clusters: a comma separated list of workload cluster names
# [uncomment below]
#workload_clusters=tkg-cluster-wc-498
# workload_cluster_ns: the namespace where the workload cluster
# is hosted in the management plane.
# Note: it's actually the namespace in which the secrets/${workload_cluster_name}-kubeconfig
# is created in the management cluster.
# [uncomment below]
#workload_cluster_ns=default
Record the local path of your SSH private key file. If you do not already have an SSH key pair, or if you want to generate a new one, run ssh-keygen as described in Create an SSH Key Pair. For example:
ssh-keygen -t rsa -b 4096 -C "[email protected]"
When prompted, enter a local path for the file location. For information about creating SSH key pairs, see Create an SSH Key Pair.
In the args file, set the following parameters:
- target: Set this value to one of the following:
  - bootstrap to diagnose a local bootstrap standalone management cluster
  - mgmt to diagnose a deployed standalone management cluster
  - workload to diagnose one or more workload clusters
- infra: The infrastructure underlying your cluster: aws, azure, or vsphere.
- workdir: The location where files are collected.
- ssh_user: The SSH user used to access cluster machines. For clusters running on vSphere, the user name is capv.
- ssh_pk_file: The path to your SSH private key file.
- mgmt_cluster_config: The path to the kubeconfig file for the management cluster.
To diagnose workload clusters, uncomment and set the following in addition to the parameters listed above (see the example snippet after this list):
- workload_clusters: A comma-separated list of the names of the workload clusters to collect diagnostic information from.
- workload_cluster_ns: The namespace in which secrets/WORKLOAD-CLUSTER-NAME-kubeconfig is created in the management cluster.
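For example, to collect diagnostics from a single workload cluster on vSphere, the relevant lines of the args file might look like the following; the cluster name tkg-cluster-wc-498 and namespace default are the placeholder values from the bundled file, so substitute your own:
target=workload
infra=vsphere
workdir=./workdir
ssh_user=capv
ssh_pk_file=./capv.pem
mgmt_cluster_config=./tkg_cluster_config
workload_clusters=tkg-cluster-wc-498
workload_cluster_ns=default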
Create the Crashd script file diagnostics.crsh, containing the code from the diagnostics.crsh listing below.
From the location where the script file diagnostics.crsh and the argument file args are located, run the crashd command:
crashd run --args-file args diagnostics.crsh
Optionally, monitor the Crashd output. By default, the crashd command runs silently until it completes. However, you can use the --debug flag to display log messages on screen, similar to the following:
crashd run --debug --args-file args diagnostics.crsh
DEBU[0003] creating working directory ./workdir/tkg-kind-12345
DEBU[0003] kube_capture(what=objects)
DEBU[0003] Searching in 20 groups
...
DEBU[0015] Archiving [./workdir/tkg-kind-12345] in bootstrap.tkg-kind-12345.diagnostics.tar.gz
DEBU[0015] Archived workdir/tkg-kind-12345/kind-logs/docker-info.txt
DEBU[0015] Archived workdir/tkg-kind-12345/kind-logs/tkg-kind-12345-control-plane/alternatives.log
DEBU[0015] Archived workdir/tkg-kind-12345/kind-logs/tkg-kind-12345-control-plane/containerd.log
diagnostics.crsh
In the Crashd bundle download, overwrite the existing diagnostics.crsh file with the following code, the script to pass to the crashd run command:
def capture_node_diagnostics(nodes):
    capture(cmd="sudo df -i", resources=nodes)
    capture(cmd="sudo crictl info", resources=nodes)
    capture(cmd="df -h /var/lib/containerd", resources=nodes)
    capture(cmd="sudo systemctl status kubelet", resources=nodes)
    capture(cmd="sudo systemctl status containerd", resources=nodes)
    capture(cmd="sudo journalctl -xeu kubelet", resources=nodes)
    capture(cmd="sudo journalctl -xeu containerd", resources=nodes)
    capture(cmd="sudo cat /var/log/cloud-init-output.log", resources=nodes)
    capture(cmd="sudo cat /var/log/cloud-init.log", resources=nodes)
def capture_windows_node_diagnostics(nodes):
    capture(cmd="Get-CimInstance -ClassName Win32_LogicalDisk", file_name="disk_info.out", resources=nodes)
    capture(cmd="(Get-ItemProperty -Path c:\\windows\\system32\\hal.dll).VersionInfo.FileVersion",file_name="windows_version_info.out", resources=nodes)
    capture(cmd="cat C:\\k\\StartKubelet.ps1 ; cat C:\\var\\lib\\kubelet\\kubeadm-flags.env", resources=nodes)
    capture(cmd="Get-Service Kubelet | select * ", resources=nodes)
    capture(cmd="Get-Service Containerd | select * ", resources=nodes)
    capture(cmd="Get-Service ovs* | select * ", resources=nodes)
    capture(cmd="Get-Service antrea-agent | select * ", resources=nodes)
    capture(cmd="Get-Service kube-proxy | select * ", resources=nodes)
    capture(cmd="Get-Service Kubelet | select * ", resources=nodes)
    capture(cmd="Get-HNSNetwork", resources=nodes)
    capture(cmd="& 'c:\\Program Files\\containerd\\crictl.exe' -r 'npipe:////./pipe/containerd-containerd' info", resources=nodes)
    capture(cmd="Get-MpPreference | select ExclusionProcess", resources=nodes)
    capture(cmd="cat c:\\var\\log\\kubelet\\kubelet.exe.INFO", resources=nodes)
    capture(cmd="cat c:\\var\\log\\antrea\\antrea-agent.exe.INFO", resources=nodes)
    capture(cmd="cat c:\\var\\log\\kube-proxy\\kube-proxy.exe.INFO", resources=nodes)
    capture(cmd="cat 'c:\\Program Files\\Cloudbase Solutions\\Cloudbase-Init\\log\\cloudbase-init-unattend.log'", resources=nodes)
    capture(cmd="cat 'c:\\Program Files\\Cloudbase Solutions\\Cloudbase-Init\\log\\cloudbase-init.log'", resources=nodes)
    copy_from(path="C:\\Windows\\System32\\Winevt\\Logs\\System.evtx", resources=nodes)
    copy_from(path="C:\\Windows\\System32\\Winevt\\Logs\\Application.evtx", resources=nodes)
    copy_from(path="c:\\openvswitch\\var\\log\\openvswitch\\ovs-vswitchd.log", resources=nodes)
    copy_from(path="c:\\openvswitch\\var\\log\\openvswitch\\ovsdb-server.log", resources=nodes)
# fetches a suitable capi provider, for either capa or others (capv/capz),
# to be used for enumerating cluster machines
def fetch_provider(iaas, workload_cluster_name, ssh_cfg, kube_cfg, namespace, filter_labels):
    # workaround: vsphere and azure use same provider as they work similarly (see issue #162)
    if iaas == "vsphere" or iaas == "azure":
        provider = capv_provider(
            workload_cluster=workload_cluster_name,
            namespace=namespace,
            ssh_config=ssh_cfg,
            mgmt_kube_config=kube_cfg,
            labels=filter_labels
        )
    else:
        provider = capa_provider(
            workload_cluster=workload_cluster_name,
            namespace=namespace,
            ssh_config=ssh_cfg,
            mgmt_kube_config=kube_cfg,
            labels=filter_labels
        )
    return provider
# retrieves linux management provider for linux nodes
def fetch_mgmt_provider_linux(infra, ssh_cfg, kube_cfg, ns):
    return fetch_provider(infra, '', ssh_cfg, kube_cfg, ns, ["kubernetes.io/os=linux"])
# retrieves windows mgmt provider for windows nodes
def fetch_mgmt_provider_windows(infra, ssh_cfg, kube_cfg, ns):
    return fetch_provider(infra, '', ssh_cfg, kube_cfg, ns, ["kubernetes.io/os=windows"])
# retrieves linux workload provider for linux nodes
def fetch_workload_provider_linux(infra, wc_cluster, ssh_cfg, kube_cfg, ns):
    return fetch_provider(infra, wc_cluster, ssh_cfg, kube_cfg, ns, ["kubernetes.io/os=linux"])
# retrieves windows workload provider for windows nodes
def fetch_workload_provider_windows(infra, wc_cluster, ssh_cfg, kube_cfg, ns):
    return fetch_provider(infra, wc_cluster, ssh_cfg, kube_cfg, ns, ["kubernetes.io/os=windows"])
def diagnose_mgmt_cluster(infra):
    # validation
    args.ssh_user
    args.ssh_pk_file
    args.mgmt_cluster_config
    if len(infra) == 0:
        print("Infra argument not provided")
        return
    wd = "{}/tkg-mgmt-cluster".format(args.workdir)
    conf = crashd_config(workdir=wd)
    ssh_conf = ssh_config(username=args.ssh_user, private_key_path=args.ssh_pk_file)
    kube_conf = kube_config(path=args.mgmt_cluster_config)
    # fetch linux mgmt node diagnostics
    mgmt_provider_linux = fetch_mgmt_provider_linux(infra, ssh_conf, kube_conf, '')
    lin_nodes = resources(provider=mgmt_provider_linux)
    capture_node_diagnostics(lin_nodes)
    # fetch win mgmt node diagnostics
    mgmt_provider_win = fetch_mgmt_provider_windows(infra, ssh_conf, kube_conf, '')
    win_nodes = resources(provider=mgmt_provider_win)
    if len(win_nodes) > 0:
        capture_windows_node_diagnostics(win_nodes)
    # add code to collect pod info from cluster
    set_defaults(kube_config(capi_provider = mgmt_provider_linux))
    pods_ns=[
        "capi-kubeadm-bootstrap-system",
        "capi-kubeadm-control-plane-system",
        "capi-system",
        "capi-webhook-system",
        "cert-manager",
        "tkg-system",
        "kube-system",
        "tkr-system",
        "capa-system",
        "capv-system",
        "capz-system",
    ]
    if infra == "vsphere":
        pods_ns.append("tkg-system-networking")
        pods_ns.append("avi-system")
    kube_capture(what="logs", namespaces=pods_ns)
    kube_capture(what="objects", kinds=["pods", "services"], namespaces=pods_ns)
    kube_capture(what="objects", kinds=["deployments", "replicasets"], groups=["apps"], namespaces=pods_ns)
    kube_capture(what="objects", kinds=["apps"], groups=["kappctrl.k14s.io"], namespaces=["tkg-system"])
    kube_capture(what="objects", kinds=["tanzukubernetesreleases"], groups=["run.tanzu.vmware.com"])
    kube_capture(what="objects", kinds=["configmaps"], namespaces=["tkr-system"])
    kube_capture(what="objects", categories=["cluster-api"])
    kube_capture(what="objects", groups=["ipam.cluster.x-k8s.io"])
    if infra == "vsphere":
        kube_capture(what="objects", kinds=["akodeploymentconfigs"])
    archive(output_file="tkg-mgmt.diagnostics.tar.gz", source_paths=[conf.workdir])
def diagnose_workload_cluster(infra, name):
    # validation
    args.infra
    args.ssh_user
    args.ssh_pk_file
    args.mgmt_cluster_config
    workload_ns = args.workload_cluster_ns
    if len(infra) == 0:
        print("Infra argument not provided")
        return
    wd = "{}/{}".format(args.workdir, name)
    conf = crashd_config(workdir=wd)
    ssh_conf = ssh_config(username=args.ssh_user, private_key_path=args.ssh_pk_file)
    kube_conf = kube_config(path=args.mgmt_cluster_config)
    # fetch linux workload node diagnostics
    wc_provider_linux = fetch_workload_provider_linux(infra, name, ssh_conf, kube_conf, workload_ns)
    lin_nodes = resources(provider=wc_provider_linux)
    capture_node_diagnostics(lin_nodes)
    # fetch win workload node diagnostics
    wc_provider_win = fetch_workload_provider_windows(infra, name, ssh_conf, kube_conf, workload_ns)
    win_nodes = resources(provider=wc_provider_win)
    if len(win_nodes) > 0:
        capture_windows_node_diagnostics(win_nodes)
    # add code to collect pod info from cluster
    set_defaults(kube_config(capi_provider = wc_provider_linux))
    pods_ns=["default", "kube-system", "tkg-system"]
    if infra == "vsphere":
        pods_ns.append("tkg-system-networking")
        pods_ns.append("avi-system")
    kube_capture(what="logs", namespaces=pods_ns)
    kube_capture(what="objects", kinds=["pods", "services"], namespaces=pods_ns)
    kube_capture(what="objects", kinds=["deployments", "replicasets"], groups=["apps"], namespaces=pods_ns)
    kube_capture(what="objects", kinds=["apps"], groups=["kappctrl.k14s.io"], namespaces=["tkg-system"])
    if infra == "vsphere":
        kube_capture(what="objects", kinds=["akodeploymentconfigs"])
    archive(output_file="{}.diagnostics.tar.gz".format(name), source_paths=[conf.workdir])
# extract diagnostic info from local kind bootstrap cluster
def diagnose_bootstrap_cluster():
    p = prog_avail_local("kind")
    if p == "":
        print("Error: kind is not available")
        return
    clusters=get_tkg_bootstrap_clusters()
    if len(clusters) == 0:
        print("No tkg-kind bootstrap cluster found")
        return
    pod_ns=[
        "caip-in-cluster-system",
        "capi-kubeadm-bootstrap-system",
        "capi-kubeadm-control-plane-system",
        "capi-system",
        "capi-webhook-system",
        "capv-system",
        "capa-system",
        "capz-system",
        "cert-manager",
        "tkg-system",
        "tkg-system-networking",
        "avi-system",
    ]
    # for each tkg-kind cluster:
    #  - capture kind logs, export kubecfg, and api objects
    for kind_cluster in clusters:
        wd = "{}/{}".format(args.workdir, kind_cluster)
        run_local("kind export logs --name {} {}/kind-logs".format(kind_cluster, wd))
        kind_cfg = capture_local(
            cmd="kind get kubeconfig --name {0}".format(kind_cluster),
            workdir="./",
            file_name="{}.kubecfg".format(kind_cluster)
        )
        conf = crashd_config(workdir=wd)
        set_defaults(kube_config(path=kind_cfg))
        kube_capture(what="objects", kinds=["pods", "services"], namespaces=pod_ns)
        kube_capture(what="objects", kinds=["deployments", "replicasets"], groups=["apps"], namespaces=pod_ns)
        kube_capture(what="objects", categories=["cluster-api"])
        kube_capture(what="objects", kinds=["akodeploymentconfigs"])
        archive(output_file="bootstrap.{}.diagnostics.tar.gz".format(kind_cluster), source_paths=[conf.workdir])
# return tkg clusters in kind (tkg-kind-xxxx)
def get_tkg_bootstrap_clusters():
    clusters = run_local("kind get clusters").split('\n')
    result = []
    for cluster in clusters:
        if cluster.startswith("tkg-kind"):
            result.append(cluster)
    return result
def check_prereqs():
    # validate args
    args.workdir
    p = prog_avail_local("ssh")
    if p == "":
        print("Error: ssh is not available")
        return False
    p = prog_avail_local("scp")
    if p == "":
        print("Error: scp is not available")
        return False
    p = prog_avail_local("kubectl")
    if p == "":
        print("Error: kubectl is not available")
        return False
    return True
def diagnose(target, infra):
    # validation
    if not check_prereqs():
        print("Error: One or more prerequisites are missing")
        return
    # run diagnostics
    if target == "bootstrap":
        diagnose_bootstrap_cluster()
    elif target == "mgmt":
        diagnose_mgmt_cluster(infra)
    elif target == "workload":
        for name in args.workload_clusters.split(","):
            diagnose_workload_cluster(infra, name)
    else:
        print("Error: unknown target {}".format(target))
diagnose(args.target, args.infra)