This topic explains how to use Crash Diagnostics (Crashd) to diagnose unstable or unresponsive workload clusters that are based on Photon OS in Tanzu Kubernetes Grid with a standalone management cluster.
For how to use Crashd to diagnose workload clusters deployed by a vSphere with Tanzu Supervisor, see How to collect a diagnostic log bundle from a Tanzu Kubernetes Guest Cluster on vSphere with Tanzu in the VMware Knowledge Base.
Before the cluster is deployed to cloud infrastructure, Crashd examines the bootstrap workload cluster that the tanzu cluster create process creates locally by using kind.
Crashd is an open-source project that makes it easy to troubleshoot problems with Kubernetes clusters.
Crashd uses a script file written in Starlark, a Python-like language, that interacts with your management or workload clusters to collect infrastructure and cluster information.
Crashd can collect diagnostic information from supported infrastructures, including vSphere, Amazon Web Services (AWS), and Microsoft Azure.
Crashd takes the output of the commands that the script runs and adds it to a tar file. The tar file is then saved locally for further analysis.
Tanzu Kubernetes Grid includes a signed binary for Crashd, along with a diagnostics script file for Photon OS workload clusters.
To install or upgrade crashd, follow the steps below.
Download Crashd for your platform.
Use the tar command to unpack the binary for your platform.
Linux:
tar -xvf crashd-linux-amd64-v0.3.7-vmware.6.tar.gz
macOS:
tar -xvf crashd-darwin-amd64-v0.3.7-vmware.6.tar.gz
The previous step creates a directory named crashd that contains the following files:
crashd
crashd/args
crashd/diagnostics.crsh
crashd/crashd-PLATFORM-amd64-v0.3.7+vmware.6
Move the binary into the /usr/local/bin folder.
Linux:
mv ./crashd/crashd-linux-amd64-v0.3.7+vmware.6 /usr/local/bin/crashd
macOS:
mv ./crashd/crashd-darwin-amd64-v0.3.7+vmware.6 /usr/local/bin/crashd
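Optionally, confirm that the binary is executable and resolves on your path. This quick check is not part of the official procedure, just a sanity test:
chmod +x /usr/local/bin/crashd
which crashd    # should print /usr/local/bin/crashd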
When Crashd runs, it takes argument values from an args file and passes them to the script file diagnostics.crsh. The script runs commands to extract information that can help you diagnose problems on Photon OS workload clusters.
Before you run the Crashd script diagnostics.crsh, your local machine must have the following programs on its execution path:
kubectl
scp
ssh
Note: When investigating problems with a bootstrap cluster, you also need the kind command (v0.7.0 or newer) installed locally.
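You can verify all of these prerequisites in one pass before proceeding. This is an optional sketch, assuming a POSIX shell; kind is only required when diagnosing a bootstrap cluster:
for cmd in kubectl scp ssh kind; do
  command -v "$cmd" >/dev/null || echo "missing: $cmd"
done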
In addition, before you can run Crashd, you must complete the following steps:
Retrieve the kubeconfig file for the management cluster by running tanzu mc kubeconfig get <management-cluster-name>.
Make sure that the kubeconfig, public-key, diagnostics.crsh, and args files are in the same location.
Delete any local kind clusters other than the one that was created to deploy the workload cluster that you are troubleshooting:
Run docker ps to identify the currently running kind cluster.
Delete any other kind clusters by running kind delete cluster --name CLUSTER-NAME.
Navigate to the location where you downloaded and unpacked the Crashd bundle.
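For example, the kind cleanup step above might look like the following; the cluster name tkg-kind-67890 is hypothetical:
docker ps                                    # identify the containers of the running kind cluster
kind get clusters                            # the bootstrap cluster in use is named tkg-kind-...
kind delete cluster --name tkg-kind-67890    # delete only the clusters you are not troubleshooting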
In a text editor, overwrite the existing argument file args with the following code. This file contains the key/value pairs to pass to the Crashd script:
# ######################################################
# Crashd script argument file
#
# This file defines CLI argument values that are passed
# to Crashd when running scripts for troubleshooting TKG
# clusters.
# ######################################################
# target: specifies cluster to target.
# Valid targets are: {bootstrap, mgmt, workload}
target=mgmt
# infra: the underlying infrastructure used by the TKG cluster.
# Valid values are: {vsphere, aws, azure}
infra=vsphere
# workdir: a local directory where collected files are staged.
workdir=./workdir
# ssh_user: the user ID used for SSH connections to cluster nodes.
ssh_user=capv
# ssh_pk_file: the path to the private key file created to SSH
# into cluster nodes.
ssh_pk_file=./capv.pem
# ######################################################
# Management Cluster
# The following arguments are used to collect information
# from a management cluster or named workload clusters.
# ######################################################
# mgmt_cluster_config: the kubeconfig file path for the management cluster.
mgmt_cluster_config=./tkg_cluster_config
# ######################################################
# Workload Cluster
# The following arguments are used to collect information
# from one or more workload clusters that are managed
# by the management cluster configured above.
# ######################################################
# workload_clusters: a comma separated list of workload cluster names
# [uncomment below]
#workload_clusters=tkg-cluster-wc-498
# workload_cluster_ns: the namespace where the workload cluster
# is hosted in the management plane.
# Note: it's actually the namespace in which the secrets/${workload_cluster_name}-kubeconfig
# is created in the management cluster.
# [uncomment below]
#workload_cluster_ns=default
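For example, to collect diagnostics from a workload cluster, you might set the file's key/value pairs as follows; the cluster name tkg-cluster-wc-498 is just the placeholder from the commented sample above:
target=workload
infra=vsphere
workload_clusters=tkg-cluster-wc-498
workload_cluster_ns=default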
Record the local path of your SSH private key file. If you do not already have an SSH key pair, or if you want to generate a new one, run ssh-keygen as described in Create an SSH Key Pair. For example:
ssh-keygen -t rsa -b 4096 -C "[email protected]"
When prompted, enter the local path for the file location. For more information about creating SSH key pairs, see Create an SSH Key Pair.
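Alternatively, ssh-keygen accepts the output path directly with its -f flag, which skips the prompt. The path ./capv.pem below is only an example, chosen to match the ssh_pk_file value in the args file:
ssh-keygen -t rsa -b 4096 -C "[email protected]" -f ./capv.pem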
Set the following arguments in the args file:
target: Set this value to one of the following:
bootstrap, to diagnose a local bootstrap standalone management cluster
mgmt, to diagnose a deployed standalone management cluster
workload, to diagnose one or more workload clusters
infra: The underlying infrastructure for your cluster: aws, azure, or vsphere.
workdir: The location where files are collected.
ssh_user: The SSH user used to access cluster machines. For clusters running on vSphere, the user name is capv.
ssh_pk_file: The path to your SSH private key file.
mgmt_cluster_config: The path of the kubeconfig file for the management cluster.
To diagnose workload clusters, uncomment and set the following arguments in addition to those listed above:
workload_clusters: A comma-separated list of the names of the workload clusters to collect diagnostic information from.
workload_cluster_ns: The namespace in which secrets/WORKLOAD-CLUSTER-NAME-kubeconfig is created in the management cluster.
Create the Crashd script file diagnostics.crsh to contain the code in Diagnostics File diagnostics.crsh, below.
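Before you run the command in the next step, it can help to confirm that all of the required files are in one directory. The listing below is hypothetical; the kubeconfig file name comes from the mgmt_cluster_config value in the args file:
ls
args  capv.pem  diagnostics.crsh  tkg_cluster_config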
Run the crashd command from the location where the script file diagnostics.crsh and the argument file args are located.
crashd run --args-file args diagnostics.crsh
Optionally, monitor the Crashd output. By default, the crashd command runs silently until it completes. However, you can use the --debug flag to view log messages on screen, similar to the following:
crashd run --debug --args-file args diagnostics.crsh
DEBU[0003] creating working directory ./workdir/tkg-kind-12345
DEBU[0003] kube_capture(what=objects)
DEBU[0003] Searching in 20 groups
...
DEBU[0015] Archiving [./workdir/tkg-kind-12345] in bootstrap.tkg-kind-12345.diagnostics.tar.gz
DEBU[0015] Archived workdir/tkg-kind-12345/kind-logs/docker-info.txt
DEBU[0015] Archived workdir/tkg-kind-12345/kind-logs/tkg-kind-12345-control-plane/alternatives.log
DEBU[0015] Archived workdir/tkg-kind-12345/kind-logs/tkg-kind-12345-control-plane/containerd.log
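When the run finishes, Crashd packages the collected files into a tar.gz archive, as shown in the log above. You can then inspect and unpack it locally; the archive name below matches the sample output and will differ in your environment:
tar -tzf bootstrap.tkg-kind-12345.diagnostics.tar.gz   # list contents
tar -xzf bootstrap.tkg-kind-12345.diagnostics.tar.gz   # unpack for analysis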
Diagnostics File diagnostics.crsh
In the Crashd bundle download, overwrite the existing diagnostics.crsh file with the following code, to use as the script that you pass to the crashd run command:
def capture_node_diagnostics(nodes):
    capture(cmd="sudo df -i", resources=nodes)
    capture(cmd="sudo crictl info", resources=nodes)
    capture(cmd="df -h /var/lib/containerd", resources=nodes)
    capture(cmd="sudo systemctl status kubelet", resources=nodes)
    capture(cmd="sudo systemctl status containerd", resources=nodes)
    capture(cmd="sudo journalctl -xeu kubelet", resources=nodes)
    capture(cmd="sudo journalctl -xeu containerd", resources=nodes)
    capture(cmd="sudo cat /var/log/cloud-init-output.log", resources=nodes)
    capture(cmd="sudo cat /var/log/cloud-init.log", resources=nodes)
def capture_windows_node_diagnostics(nodes):
    capture(cmd="Get-CimInstance -ClassName Win32_LogicalDisk", file_name="disk_info.out", resources=nodes)
    capture(cmd="(Get-ItemProperty -Path c:\\windows\\system32\\hal.dll).VersionInfo.FileVersion", file_name="windows_version_info.out", resources=nodes)
    capture(cmd="cat C:\\k\\StartKubelet.ps1 ; cat C:\\var\\lib\\kubelet\\kubeadm-flags.env", resources=nodes)
    capture(cmd="Get-Service Kubelet | select * ", resources=nodes)
    capture(cmd="Get-Service Containerd | select * ", resources=nodes)
    capture(cmd="Get-Service ovs* | select * ", resources=nodes)
    capture(cmd="Get-Service antrea-agent | select * ", resources=nodes)
    capture(cmd="Get-Service kube-proxy | select * ", resources=nodes)
    capture(cmd="Get-Service Kubelet | select * ", resources=nodes)
    capture(cmd="Get-HNSNetwork", resources=nodes)
    capture(cmd="& 'c:\\Program Files\\containerd\\crictl.exe' -r 'npipe:////./pipe/containerd-containerd' info", resources=nodes)
    capture(cmd="Get-MpPreference | select ExclusionProcess", resources=nodes)
    capture(cmd="cat c:\\var\\log\\kubelet\\kubelet.exe.INFO", resources=nodes)
    capture(cmd="cat c:\\var\\log\\antrea\\antrea-agent.exe.INFO", resources=nodes)
    capture(cmd="cat c:\\var\\log\\kube-proxy\\kube-proxy.exe.INFO", resources=nodes)
    capture(cmd="cat 'c:\\Program Files\\Cloudbase Solutions\\Cloudbase-Init\\log\\cloudbase-init-unattend.log'", resources=nodes)
    capture(cmd="cat 'c:\\Program Files\\Cloudbase Solutions\\Cloudbase-Init\\log\\cloudbase-init.log'", resources=nodes)
    copy_from(path="C:\\Windows\\System32\\Winevt\\Logs\\System.evtx", resources=nodes)
    copy_from(path="C:\\Windows\\System32\\Winevt\\Logs\\Application.evtx", resources=nodes)
    copy_from(path="c:\\openvswitch\\var\\log\\openvswitch\\ovs-vswitchd.log", resources=nodes)
    copy_from(path="c:\\openvswitch\\var\\log\\openvswitch\\ovsdb-server.log", resources=nodes)
# fetches a suitable capi provider, for either capa or others (capv/capz),
# to be used for enumerating cluster machines
def fetch_provider(iaas, workload_cluster_name, ssh_cfg, kube_cfg, namespace, filter_labels):
    # workaround: vsphere and azure use same provider as they work similarly (see issue #162)
    if iaas == "vsphere" or iaas == "azure":
        provider = capv_provider(
            workload_cluster=workload_cluster_name,
            namespace=namespace,
            ssh_config=ssh_cfg,
            mgmt_kube_config=kube_cfg,
            labels=filter_labels
        )
    else:
        provider = capa_provider(
            workload_cluster=workload_cluster_name,
            namespace=namespace,
            ssh_config=ssh_cfg,
            mgmt_kube_config=kube_cfg,
            labels=filter_labels
        )
    return provider
# retrieves linux management provider for linux nodes
def fetch_mgmt_provider_linux(infra, ssh_cfg, kube_cfg, ns):
    return fetch_provider(infra, '', ssh_cfg, kube_cfg, ns, ["kubernetes.io/os=linux"])

# retrieves windows mgmt provider for windows nodes
def fetch_mgmt_provider_windows(infra, ssh_cfg, kube_cfg, ns):
    return fetch_provider(infra, '', ssh_cfg, kube_cfg, ns, ["kubernetes.io/os=windows"])

# retrieves linux workload provider for linux nodes
def fetch_workload_provider_linux(infra, wc_cluster, ssh_cfg, kube_cfg, ns):
    return fetch_provider(infra, wc_cluster, ssh_cfg, kube_cfg, ns, ["kubernetes.io/os=linux"])

# retrieves windows workload provider for windows nodes
def fetch_workload_provider_windows(infra, wc_cluster, ssh_cfg, kube_cfg, ns):
    return fetch_provider(infra, wc_cluster, ssh_cfg, kube_cfg, ns, ["kubernetes.io/os=windows"])
def diagnose_mgmt_cluster(infra):
    # validation
    args.ssh_user
    args.ssh_pk_file
    args.mgmt_cluster_config
    if len(infra) == 0:
        print("Infra argument not provided")
        return
    wd = "{}/tkg-mgmt-cluster".format(args.workdir)
    conf = crashd_config(workdir=wd)
    ssh_conf = ssh_config(username=args.ssh_user, private_key_path=args.ssh_pk_file)
    kube_conf = kube_config(path=args.mgmt_cluster_config)
    # fetch linux mgmt node diagnostics
    mgmt_provider_linux = fetch_mgmt_provider_linux(infra, ssh_conf, kube_conf, '')
    lin_nodes = resources(provider=mgmt_provider_linux)
    capture_node_diagnostics(lin_nodes)
    # fetch win mgmt node diagnostics
    mgmt_provider_win = fetch_mgmt_provider_windows(infra, ssh_conf, kube_conf, '')
    win_nodes = resources(provider=mgmt_provider_win)
    if len(win_nodes) > 0:
        capture_windows_node_diagnostics(win_nodes)
    # add code to collect pod info from cluster
    set_defaults(kube_config(capi_provider = mgmt_provider_linux))
    pods_ns=[
        "capi-kubeadm-bootstrap-system",
        "capi-kubeadm-control-plane-system",
        "capi-system",
        "capi-webhook-system",
        "cert-manager",
        "tkg-system",
        "kube-system",
        "tkr-system",
        "capa-system",
        "capv-system",
        "capz-system",
    ]
    if infra == "vsphere":
        pods_ns.append("tkg-system-networking")
        pods_ns.append("avi-system")
    kube_capture(what="logs", namespaces=pods_ns)
    kube_capture(what="objects", kinds=["pods", "services"], namespaces=pods_ns)
    kube_capture(what="objects", kinds=["deployments", "replicasets"], groups=["apps"], namespaces=pods_ns)
    kube_capture(what="objects", kinds=["apps"], groups=["kappctrl.k14s.io"], namespaces=["tkg-system"])
    kube_capture(what="objects", kinds=["tanzukubernetesreleases"], groups=["run.tanzu.vmware.com"])
    kube_capture(what="objects", kinds=["configmaps"], namespaces=["tkr-system"])
    kube_capture(what="objects", categories=["cluster-api"])
    kube_capture(what="objects", groups=["ipam.cluster.x-k8s.io"])
    if infra == "vsphere":
        kube_capture(what="objects", kinds=["akodeploymentconfigs"])
    archive(output_file="tkg-mgmt.diagnostics.tar.gz", source_paths=[conf.workdir])
def diagnose_workload_cluster(infra, name):
    # validation
    args.infra
    args.ssh_user
    args.ssh_pk_file
    args.mgmt_cluster_config
    workload_ns = args.workload_cluster_ns
    if len(infra) == 0:
        print("Infra argument not provided")
        return
    wd = "{}/{}".format(args.workdir, name)
    conf = crashd_config(workdir=wd)
    ssh_conf = ssh_config(username=args.ssh_user, private_key_path=args.ssh_pk_file)
    kube_conf = kube_config(path=args.mgmt_cluster_config)
    # fetch linux workload node diagnostics
    wc_provider_linux = fetch_workload_provider_linux(infra, name, ssh_conf, kube_conf, workload_ns)
    lin_nodes = resources(provider=wc_provider_linux)
    capture_node_diagnostics(lin_nodes)
    # fetch win workload node diagnostics
    wc_provider_win = fetch_workload_provider_windows(infra, name, ssh_conf, kube_conf, workload_ns)
    win_nodes = resources(provider=wc_provider_win)
    if len(win_nodes) > 0:
        capture_windows_node_diagnostics(win_nodes)
    # add code to collect pod info from cluster
    set_defaults(kube_config(capi_provider = wc_provider_linux))
    pods_ns=["default", "kube-system", "tkg-system"]
    if infra == "vsphere":
        pods_ns.append("tkg-system-networking")
        pods_ns.append("avi-system")
    kube_capture(what="logs", namespaces=pods_ns)
    kube_capture(what="objects", kinds=["pods", "services"], namespaces=pods_ns)
    kube_capture(what="objects", kinds=["deployments", "replicasets"], groups=["apps"], namespaces=pods_ns)
    kube_capture(what="objects", kinds=["apps"], groups=["kappctrl.k14s.io"], namespaces=["tkg-system"])
    if infra == "vsphere":
        kube_capture(what="objects", kinds=["akodeploymentconfigs"])
    archive(output_file="{}.diagnostics.tar.gz".format(name), source_paths=[conf.workdir])
# extract diagnostic info from local kind bootstrap cluster
def diagnose_bootstrap_cluster():
    p = prog_avail_local("kind")
    if p == "":
        print("Error: kind is not available")
        return
    clusters = get_tkg_bootstrap_clusters()
    if len(clusters) == 0:
        print("No tkg-kind bootstrap cluster found")
        return
    pod_ns=[
        "caip-in-cluster-system",
        "capi-kubeadm-bootstrap-system",
        "capi-kubeadm-control-plane-system",
        "capi-system",
        "capi-webhook-system",
        "capv-system",
        "capa-system",
        "capz-system",
        "cert-manager",
        "tkg-system",
        "tkg-system-networking",
        "avi-system",
    ]
    # for each tkg-kind cluster:
    # - capture kind logs, export kubecfg, and api objects
    for kind_cluster in clusters:
        wd = "{}/{}".format(args.workdir, kind_cluster)
        run_local("kind export logs --name {} {}/kind-logs".format(kind_cluster, wd))
        kind_cfg = capture_local(
            cmd="kind get kubeconfig --name {0}".format(kind_cluster),
            workdir="./",
            file_name="{}.kubecfg".format(kind_cluster)
        )
        conf = crashd_config(workdir=wd)
        set_defaults(kube_config(path=kind_cfg))
        kube_capture(what="objects", kinds=["pods", "services"], namespaces=pod_ns)
        kube_capture(what="objects", kinds=["deployments", "replicasets"], groups=["apps"], namespaces=pod_ns)
        kube_capture(what="objects", categories=["cluster-api"])
        kube_capture(what="objects", kinds=["akodeploymentconfigs"])
        archive(output_file="bootstrap.{}.diagnostics.tar.gz".format(kind_cluster), source_paths=[conf.workdir])
# return tkg clusters in kind (tkg-kind-xxxx)
def get_tkg_bootstrap_clusters():
    clusters = run_local("kind get clusters").split('\n')
    result = []
    for cluster in clusters:
        if cluster.startswith("tkg-kind"):
            result.append(cluster)
    return result
def check_prereqs():
    # validate args
    args.workdir
    p = prog_avail_local("ssh")
    if p == "":
        print("Error: ssh is not available")
        return False
    p = prog_avail_local("scp")
    if p == "":
        print("Error: scp is not available")
        return False
    p = prog_avail_local("kubectl")
    if p == "":
        print("Error: kubectl is not available")
        return False
    return True
def diagnose(target, infra):
    # validation
    if not check_prereqs():
        print("Error: One or more prerequisites are missing")
        return
    # run diagnostics
    if target == "bootstrap":
        diagnose_bootstrap_cluster()
    elif target == "mgmt":
        diagnose_mgmt_cluster(infra)
    elif target == "workload":
        for name in args.workload_clusters.split(","):
            diagnose_workload_cluster(infra, name)
    else:
        print("Error: unknown target {}".format(target))

diagnose(args.target, args.infra)