The Online Diagnostic System (ODS) feature automates debugging of NSX at runtime. ODS is implemented through runbooks that are built into NSX. Runbooks contain debugging procedures and have full observability of NSX components. A runbook generates a debugging report and runtime artifacts of an issue, such as packet captures, live core dumps of user processes, and the output of scripts and tools. You can collect these artifacts later for offline analysis and debugging. Note that you cannot modify a predefined runbook.
You can use APIs to perform the following runbook operations:
- Invoke a runbook to initiate the runtime debugging
- Check the debugging report
- Download artifacts generated by the invoked runbook
- Manage the lifecycle of a runbook
Predefined runbooks
Starting with NSX 4.1.1, ODS is also supported for Unified Appliance (UA). The following table lists the runbooks and also the nodes on which you can run them.
Runbook | Description | Node | Version
---|---|---|---
PnicPerf | This runbook can identify pNIC TX and RX performance issues. This runbook takes the ID of the logical port to be diagnosed as an input argument. | ESXi | 4.1
OverlayTunnel | This runbook can identify overlay tunnel failures, such as gateway configuration error, tunnel missing, or tunnel down. This runbook takes the IP address of the source VTEP and the IP address of the destination VTEP as input arguments. | ESXi | 4.1
PortBlock | This runbook can identify causes of a port's blockage. For example, a DVPort could be blocked because of an incorrect LSP/LS configuration. | ESXi | 4.1
AdfCollect | This runbook gathers ESXi datapath performance data through the ADF collector. This runbook takes the following input arguments: advanced, cycle, and interval. | ESXi | 4.1
ControllerConn | This runbook can identify controller connectivity issues caused by controller failure, proxy failure, underlay network outage, or FQDN resolution failure. This runbook takes the IP address of the ESXi host and the IP address of the controller as input arguments. | ESXi | 4.1
NxgiPlatform | This runbook diagnoses issues in the NXGI platform datapath that powers various features, such as Distributed MPS, Endpoint Protection, IDFW, IDS, and Intelligence. This runbook takes functionality as an input argument. This is the specific NXGI-dependent functionality (one of MPS, IDFW, IDS, EPP, and Intelligence, with EPP being the default) that needs to be checked or debugged. | ESXi | 4.1.1
VifInfo | This runbook gets detailed information about a virtual interface, which can be used as input to other runbooks. This runbook takes VIF as an input argument. | ESXi | 4.1.1
EdgeHealth | This runbook performs the following functions: This runbook does not have any input argument. | Edge | 4.1.1
MacAddressInfo | This runbook diagnoses issues related to a specific MAC address and checks and remediates performance issues on a host-switch. It performs the following functions: This runbook takes the host-switch name and mac-address as input arguments. | ESXi | 4.1.1
EdgeDpBfd | This runbook triages NSX Edge BFD issues. This runbook takes the source IP and destination IP of the BFD session as input arguments. It also takes capture as an optional argument to ensure that only BFD packets specific to the session are filtered and captured. | Edge | 4.1.1
DistributedMps | This runbook verifies the health of the Malware Prevention Service (MPS) pipeline and diagnoses any issues encountered with MPS, such as protection not working on a particular VM or certain files not being scanned. The input argument for this runbook is the VM UUID of the VM to be diagnosed for protection. | ESXi | 4.1.1
DuplicateVtepDetectorProvider | This runbook detects any duplicate IP or label in VTEPs. This runbook does not have any input argument. | UA | 4.1.1
LspStaleInfo | This runbook fetches the stale logical ports. This runbook does not have any input argument. | UA |
BgpNeighborState | This runbook diagnoses various flows that can cause the BGP neighbor to be down. The runbook also collects the following supporting artifacts for offline debugging. This runbook takes the following input arguments: | Edge | 4.1.1
PimMrouteState | This runbook triages multicast traffic loss, which can happen for various reasons. Given the source IP of the multicast traffic sender and the group IP of the multicast traffic, this runbook helps to identify the root cause of the problem and collects supporting artifacts for offline debugging. This runbook takes source IP, group IP, and traffic direction as input arguments. | Edge | 4.1.1
OspfNbrState | This runbook triages OSPF neighbor state issues by diagnosing various logical flows. This runbook takes the neighbor IP address as an input argument. | Edge | 4.1.1
IdpsDpStatus | This runbook checks the status of both the IDPS/IDS module (in the context engine) and the IDPS engine. It also compares loaded profiles, rules, signatures, and captured events in both the IDPS module and the IDPS engine module. This runbook does not have any input argument. | ESXi | 4.1.1
NxgiPlatformUA | This runbook diagnoses issues in the management plane for the NXGI platform, which powers various features such as Distributed MPS, Endpoint Protection, IDFW, IDS, and Intelligence. This runbook can be used in conjunction with the NxgiPlatform runbook, which runs on transport nodes. This runbook takes the transport node ID as an optional input argument. | UA | 4.1.1
CorfuServer | This runbook checks the Corfu server layout stability, infra ping and disk IO condition, and compactor health. To override the default lookback days or hours of log events, this runbook takes lookback days and lookback hours as input arguments. | UA | 4.1.1
EdgeIDPS | This runbook checks and retrieves stats for an IDS/IPS signature present in the edge. This runbook takes the signature ID as an input argument. | Edge | 4.1.1
EdgeRouting | This runbook checks multiple aspects of a given logical router's health and provides a list of edge processes that are not in good health. The runbook provides troubleshooting directions for common north/south routing issues. The runbook gathers various pieces of information and reports the status of the routing protocols, tunnels, and ports. It performs a health check of the routing stack and the various daemons used by routing on the Edge. For the provided destination IP, it checks the RIB, FIB, and ARP tables. It also runs a ping and traceroute using the source and destination IP. This runbook takes the following input arguments: | Edge | 4.1.1
PimConfigCheck | This runbook provides more details about what exactly failed when the edge throws routing_config_error. It also helps in triaging the issue. This runbook does not have any input argument. | Edge | 4.1.1
NCPPendingPod | This runbook debugs a pod stuck in a pending state. This runbook takes the namespace and name of the pending pod as input arguments. | ESXi | 4.1.2
Steps to debug at runtime
To debug at runtime, perform the following steps:
- Step 1: Fetch a list of predefined runbooks.
Run the following API to fetch a list of predefined runbooks.
GET https://<nsx-mgr>/policy/api/v1/infra/sha/pre-defined-runbooks
This API returns a list of predefined runbooks along with the following information:
- Configuration details.
- Node type on which the runbook is supported.
- General details of the runbook such as id, name, and path.
- Parameter details if any required at the time of invoking the runbook.
Example Response:
{
  "results": [
    {
      "version": { "major": 1, "minor": 0 },
      "default_config": { "enabled": true, "timeout": 300, "threshold_number": 5, "throttle_cycle": 10 },
      "supported_node_types": [ "nsx-esx" ],
      "parameters": [
        { "name": "advanced", "optional": true, "parameter_type": "BOOLEAN", "default_value": "False" },
        { "name": "cycle", "optional": false, "parameter_type": "INTEGER", "max": "20", "min": "1" },
        { "name": "interval", "optional": true, "parameter_type": "INTEGER", "max": "300", "min": "1" }
      ],
      "resource_type": "OdsPredefinedRunbook",
      "id": "00000000-0000-4164-6643-6f6c6c656374",
      "display_name": "AdfCollect",
      "path": "/infra/sha/pre-defined-runbooks/00000000-0000-4164-6643-6f6c6c656374",
      "relative_path": "00000000-0000-4164-6643-6f6c6c656374",
      "parent_path": "/infra",
      "remote_path": "",
      "unique_id": "069a4c7b-532a-4926-a402-6a7986a306b2",
      "realization_id": "069a4c7b-532a-4926-a402-6a7986a306b2",
      "owner_id": "f2a2a9e1-2578-435d-877d-47e01eb04954",
      "origin_site_id": "f2a2a9e1-2578-435d-877d-47e01eb04954",
      "marked_for_delete": false,
      "overridden": false,
      "_create_time": 1669600880454,
      "_create_user": "system",
      "_last_modified_time": 1669600880454,
      "_last_modified_user": "system",
      "_system_owned": false,
      "_revision": 0
    },
    {
      "version": { "major": 1, "minor": 0 },
      "default_config": { "enabled": true, "timeout": 60, "threshold_number": 5, "throttle_cycle": 3 },
      "supported_node_types": [ "nsx-esx" ],
      "resource_type": "OdsPredefinedRunbook",
      "id": "0000436f-6e74-726f-6c6c-6572436f6e6e",
      "display_name": "ControllerConn",
      "path": "/infra/sha/pre-defined-runbooks/0000436f-6e74-726f-6c6c-6572436f6e6e",
      "relative_path": "0000436f-6e74-726f-6c6c-6572436f6e6e",
      "parent_path": "/infra",
      "remote_path": "",
      "unique_id": "3914cbe4-41b4-45f1-9dad-2bd96a2de0d8",
      "realization_id": "3914cbe4-41b4-45f1-9dad-2bd96a2de0d8",
      "owner_id": "f2a2a9e1-2578-435d-877d-47e01eb04954",
      "origin_site_id": "f2a2a9e1-2578-435d-877d-47e01eb04954",
      "marked_for_delete": false,
      "overridden": false,
      "_create_time": 1669600880493,
      "_create_user": "system",
      "_last_modified_time": 1669600880493,
      "_last_modified_user": "system",
      "_system_owned": false,
      "_revision": 0
    },
    {
      "version": { "major": 1, "minor": 0 },
      "default_config": { "enabled": true, "timeout": 120, "threshold_number": 5, "throttle_cycle": 5 },
      "supported_node_types": [ "nsx-esx" ],
      "parameters": [
        { "name": "src", "optional": false, "parameter_type": "COMPOUND" },
        { "name": "dst", "optional": false, "parameter_type": "COMPOUND" }
      ],
      "resource_type": "OdsPredefinedRunbook",
      "id": "0000004f-7665-726c-6179-54756e6e656c",
      "display_name": "OverlayTunnel",
      "path": "/infra/sha/pre-defined-runbooks/0000004f-7665-726c-6179-54756e6e656c",
      "relative_path": "0000004f-7665-726c-6179-54756e6e656c",
      "parent_path": "/infra",
      "remote_path": "",
      "unique_id": "3597af28-e670-456e-8347-4d1a53a5cb90",
      "realization_id": "3597af28-e670-456e-8347-4d1a53a5cb90",
      "owner_id": "f2a2a9e1-2578-435d-877d-47e01eb04954",
      "origin_site_id": "f2a2a9e1-2578-435d-877d-47e01eb04954",
      "marked_for_delete": false,
      "overridden": false,
      "_create_time": 1669600880518,
      "_create_user": "system",
      "_last_modified_time": 1669600880518,
      "_last_modified_user": "system",
      "_system_owned": false,
      "_revision": 0
    },
    {
      "version": { "major": 1, "minor": 0 },
      "default_config": { "enabled": true, "timeout": 120, "threshold_number": 5, "throttle_cycle": 5 },
      "supported_node_types": [ "nsx-esx" ],
      "parameters": [
        { "name": "lsp", "optional": false, "parameter_type": "STRING" }
      ],
      "resource_type": "OdsPredefinedRunbook",
      "id": "00000000-0000-0000-506e-696350657266",
      "display_name": "PnicPerf",
      "path": "/infra/sha/pre-defined-runbooks/00000000-0000-0000-506e-696350657266",
      "relative_path": "00000000-0000-0000-506e-696350657266",
      "parent_path": "/infra",
      "remote_path": "",
      "unique_id": "53f29b77-dcf5-4561-85ec-8f35280e3f3a",
      "realization_id": "53f29b77-dcf5-4561-85ec-8f35280e3f3a",
      "owner_id": "f2a2a9e1-2578-435d-877d-47e01eb04954",
      "origin_site_id": "f2a2a9e1-2578-435d-877d-47e01eb04954",
      "marked_for_delete": false,
      "overridden": false,
      "_create_time": 1669600880553,
      "_create_user": "system",
      "_last_modified_time": 1669600880553,
      "_last_modified_user": "system",
      "_system_owned": false,
      "_revision": 0
    },
    {
      "version": { "major": 1, "minor": 0 },
      "default_config": { "enabled": true, "timeout": 60, "threshold_number": 5, "throttle_cycle": 3 },
      "supported_node_types": [ "nsx-esx" ],
      "parameters": [
        { "name": "vif", "optional": false, "parameter_type": "STRING" }
      ],
      "resource_type": "OdsPredefinedRunbook",
      "id": "00000000-0000-0050-6f72-74426c6f636b",
      "display_name": "PortBlock",
      "path": "/infra/sha/pre-defined-runbooks/00000000-0000-0050-6f72-74426c6f636b",
      "relative_path": "00000000-0000-0050-6f72-74426c6f636b",
      "parent_path": "/infra",
      "remote_path": "",
      "unique_id": "6f411a92-30c0-4838-9758-c00220cb5fab",
      "realization_id": "6f411a92-30c0-4838-9758-c00220cb5fab",
      "owner_id": "f2a2a9e1-2578-435d-877d-47e01eb04954",
      "origin_site_id": "f2a2a9e1-2578-435d-877d-47e01eb04954",
      "marked_for_delete": false,
      "overridden": false,
      "_create_time": 1669600880572,
      "_create_user": "system",
      "_last_modified_time": 1669600880572,
      "_last_modified_user": "system",
      "_system_owned": false,
      "_revision": 0
    }
  ],
  "result_count": 5,
  "sort_by": "display_name",
  "sort_ascending": true
}
In the above response, the following predefined runbooks are returned:
- AdfCollect
- ControllerConn
- OverlayTunnel
- PnicPerf
- PortBlock
If a runbook requires a parameter at the time of invocation, the parameter details are specified in the parameters key. For example, the OverlayTunnel runbook requires two parameters, src and dst, which are the local and remote VTEP IPs of the tunnel to be diagnosed.
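The lookup described above can be sketched in Python. This is a minimal illustration, not part of the NSX tooling: the helper name `required_parameters` is hypothetical, and the sample is a trimmed copy of the example response that keeps only the fields the sketch reads.

```python
import json

# Trimmed copy of the example response: only the fields read below are kept.
SAMPLE_RESPONSE = """
{
  "results": [
    {"display_name": "AdfCollect",
     "path": "/infra/sha/pre-defined-runbooks/00000000-0000-4164-6643-6f6c6c656374",
     "parameters": [
       {"name": "advanced", "optional": true, "parameter_type": "BOOLEAN"},
       {"name": "cycle", "optional": false, "parameter_type": "INTEGER"},
       {"name": "interval", "optional": true, "parameter_type": "INTEGER"}]},
    {"display_name": "ControllerConn",
     "path": "/infra/sha/pre-defined-runbooks/0000436f-6e74-726f-6c6c-6572436f6e6e"},
    {"display_name": "OverlayTunnel",
     "path": "/infra/sha/pre-defined-runbooks/0000004f-7665-726c-6179-54756e6e656c",
     "parameters": [
       {"name": "src", "optional": false, "parameter_type": "COMPOUND"},
       {"name": "dst", "optional": false, "parameter_type": "COMPOUND"}]}
  ]
}
"""


def required_parameters(list_response):
    """Map each runbook's display_name to the names of its non-optional parameters."""
    required = {}
    for runbook in list_response.get("results", []):
        # Some entries in the example response (ControllerConn) carry no
        # "parameters" key, so a missing key is treated as an empty list.
        params = runbook.get("parameters", [])
        required[runbook["display_name"]] = [
            p["name"] for p in params if not p.get("optional", False)
        ]
    return required


if __name__ == "__main__":
    for name, params in required_parameters(json.loads(SAMPLE_RESPONSE)).items():
        print(name, params)
```

Reading the required parameters from the list response before invoking a runbook avoids hard-coding argument names that may differ between NSX versions.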
- Step 2: Invoke the runbook.
Run the following API to invoke a runbook.
POST https://<nsx-mgr>/policy/api/v1/infra/sha/runbook-invocations/<invoke-name>
Example Request: Invoking the OverlayTunnel runbook.
POST https://<nsx-mgr>/policy/api/v1/infra/sha/runbook-invocations/OverlayTunnel
{
  "runbook_path": "/infra/sha/pre-defined-runbooks/0000004f-7665-726c-6179-54756e6e656c",
  "target_node": "6c7a9374-459d-46b2-9ea6-c63b37c7cc38",
  "arguments": [
    { "key": "src", "value": "192.168.0.11" },
    { "key": "dst", "value": "192.168.0.10" }
  ]
}
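As a rough sketch, the request above can be assembled programmatically. The function `build_invocation` and the host name `nsx.example.com` are hypothetical; the URL pattern and body fields are taken from the example request. Sending the request and authenticating against NSX Manager are left out.

```python
import json


def build_invocation(nsx_mgr, invoke_name, runbook_path, target_node, arguments):
    """Return the (url, body) pair for a runbook invocation request."""
    url = (
        f"https://{nsx_mgr}/policy/api/v1/infra/sha/"
        f"runbook-invocations/{invoke_name}"
    )
    body = {
        "runbook_path": runbook_path,
        "target_node": target_node,
        # The example request passes arguments as a list of key/value pairs,
        # so the plain dict given by the caller is converted to that shape.
        "arguments": [{"key": k, "value": v} for k, v in arguments.items()],
    }
    return url, body


if __name__ == "__main__":
    url, body = build_invocation(
        "nsx.example.com",  # hypothetical manager address
        "OverlayTunnel",
        "/infra/sha/pre-defined-runbooks/0000004f-7665-726c-6179-54756e6e656c",
        "6c7a9374-459d-46b2-9ea6-c63b37c7cc38",
        {"src": "192.168.0.11", "dst": "192.168.0.10"},
    )
    print("POST", url)
    print(json.dumps(body, indent=2))
```

The returned body can be serialized with `json.dumps` and sent with any HTTP client.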
- Step 3: Check the report of the invoked runbook.
Run the following API to get the report of the invoked runbook.
GET https://<nsx-mgr>/policy/api/v1/infra/sha/runbook-invocations/<invoke-name>/report
Example Request:
GET https://<nsx-mgr>/policy/api/v1/infra/sha/runbook-invocations/OverlayTunnel/report
Example Response:
{
  "invocation_id": "70527fed-1e5e-4fed-a880-28cde04a66b1",
  "target_node": "6c7a9374-459d-46b2-9ea6-c63b37c7cc38",
  "timestamp": 1662469099,
  "sys_info": {
    "host_name": "sc2-10-185-4-158.eng.vmware.com",
    "os_name": "VMkernel",
    "os_version": "7.0.3"
  },
  "result_message": "Tunnel 192.168.0.11 -> 192.168.0.10 is in up state",
  "recommendation_message": "No changes required as the tunnel is in up state.",
  "step_details": [
    {
      "step_id": 1,
      "action_summary": "Check the status of tunnel for the given source/destination VTEPs.",
      "action_result": "Tunnel 192.168.0.11 -> 192.168.0.10 is in state up"
    }
  ],
  "status": {
    "request_status": "SUCCESS",
    "operation_state": "FINISHED"
  }
}
The response returns metadata, such as the timestamp and system details like host name and operating system. The report also returns the debugging result, a remediation suggestion if any, and the steps executed for debugging, with the action summary and action result of each step. If debugging is interrupted for any reason, the operation_state field holds a value that identifies the reason for the interruption. If the runbook invocation does not succeed, the report provides error details and does not show debugging-related fields.
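A report like the one above can be condensed into a few lines for quick review. The following is a minimal sketch with a hypothetical helper, `summarize_report`. Because a failed invocation omits the debugging-related fields, every field is read defensively. The layout of the error details is not shown in this document, so the sketch does not try to parse them.

```python
def summarize_report(report):
    """Return a list of human-readable lines describing an invocation report."""
    status = report.get("status", {})
    lines = [
        f"request_status={status.get('request_status')} "
        f"operation_state={status.get('operation_state')}"
    ]
    # Debugging fields may be absent when the invocation did not succeed.
    if "result_message" in report:
        lines.append(f"result: {report['result_message']}")
    if "recommendation_message" in report:
        lines.append(f"recommendation: {report['recommendation_message']}")
    for step in report.get("step_details", []):
        lines.append(
            f"step {step.get('step_id')}: {step.get('action_summary')} "
            f"-> {step.get('action_result')}"
        )
    return lines


if __name__ == "__main__":
    example = {
        "result_message": "Tunnel 192.168.0.11 -> 192.168.0.10 is in up state",
        "recommendation_message": "No changes required as the tunnel is in up state.",
        "step_details": [
            {
                "step_id": 1,
                "action_summary": "Check the status of tunnel for the given source/destination VTEPs.",
                "action_result": "Tunnel 192.168.0.11 -> 192.168.0.10 is in state up",
            }
        ],
        "status": {"request_status": "SUCCESS", "operation_state": "FINISHED"},
    }
    print("\n".join(summarize_report(example)))
```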
- Step 4: Download artifacts.
- Run the following API to gather artifacts.
POST https://{{mgr-ip}}/policy/api/v1/infra/sha/runbook-invocations/{{invocation-name}}
Example Request:
POST https://{{mgr-ip}}/policy/api/v1/infra/sha/runbook-invocations/{{invocation-name}}
{
  "runbook_path": "/infra/sha/pre-defined-runbooks/00000000-0000-4164-6643-6f6c6c656374",
  "target_node": "{{target-node}}",
  "arguments": [
    { "key": "advanced", "value": "{{advanced-mode}}" },
    { "key": "cycle", "value": "{{cycle-count}}" },
    { "key": "interval", "value": "{{interval-in-sec}}" }
  ]
}
If the advanced parameter is set to false at the time of invocation, the runbook will collect topology info, net-stats, NSX DP stats and uplink info. If the advanced parameter is set to true, the runbook will additionally run advanced performance tools such as vmkstats (available only on the physical machine).
The cycle parameter defines the number of times the ADF collector is executed in an invocation.
The interval parameter defines the waiting interval between consecutive ADF collector executions. It takes effect only when the cycle parameter is set to greater than 1.
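The constraints on these arguments can be checked before invoking the runbook. The sketch below assumes the parameter metadata from the AdfCollect entry of the example list response (cycle between 1 and 20, interval between 1 and 300). The helpers `validate_arguments` and `total_wait_seconds` are hypothetical, and the wait estimate follows from reading the interval as a gap between consecutive executions. It does not include the time each collection itself takes.

```python
# Parameter metadata copied from the AdfCollect entry in the example list response.
ADF_PARAMETERS = [
    {"name": "advanced", "optional": True, "parameter_type": "BOOLEAN", "default_value": "False"},
    {"name": "cycle", "optional": False, "parameter_type": "INTEGER", "max": "20", "min": "1"},
    {"name": "interval", "optional": True, "parameter_type": "INTEGER", "max": "300", "min": "1"},
]


def validate_arguments(parameters, arguments):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    known = {p["name"] for p in parameters}
    for name in arguments:
        if name not in known:
            problems.append(f"unknown argument: {name}")
    for p in parameters:
        name = p["name"]
        if name not in arguments:
            if not p["optional"]:
                problems.append(f"missing required argument: {name}")
            continue
        if p["parameter_type"] == "INTEGER":
            try:
                value = int(arguments[name])
            except (TypeError, ValueError):
                problems.append(f"{name} must be an integer")
                continue
            # Bounds are strings in the response, so convert before comparing.
            low = int(p.get("min", value))
            high = int(p.get("max", value))
            if not low <= value <= high:
                problems.append(f"{name}={value} is outside {low}..{high}")
    return problems


def total_wait_seconds(cycle, interval):
    """Waiting time between executions: (cycle - 1) gaps of `interval` seconds."""
    return max(cycle - 1, 0) * interval


if __name__ == "__main__":
    print(validate_arguments(ADF_PARAMETERS, {"cycle": 3, "interval": 10}))
    print(total_wait_seconds(3, 10))
```

With cycle set to 1 the wait estimate is zero, which matches the statement that the interval takes effect only when cycle is greater than 1.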
Example Response:
{
  "invocation_id": "80a0037a-52e1-48d7-b28e-c3bfb8475e8c",
  "target_node": "b794f78f-7eb0-433f-8f11-63e6b3121c28",
  "timestamp": 1668674073,
  "sys_info": {
    "host_name": "sc2-rdops-vm06-dhcp-204-101.eng.vmware.com",
    "os_name": "VMkernel",
    "os_version": "7.0.3"
  },
  "result_message": "ADF data collection runbook completes.",
  "recommendation_message": "No action needs to be taken.",
  "step_details": [
    {
      "step_id": 1,
      "action_summary": "Run ADF collector.",
      "action_result": "ADF data collection is successfully performed in the following time point(s) along with the following artifact(s): [(2022-11-17 08:35:30, a44f0446-0ac3-4e7f-8513-fb1248985d9e.tar)]",
      "artifacts": [
        "a44f0446-0ac3-4e7f-8513-fb1248985d9e"
      ]
    }
  ],
  "status": {
    "request_status": "SUCCESS",
    "operation_state": "FINISHED"
  }
}
- Run the following API to download the artifacts.
GET https://<nsx-mgr>/policy/api/v1/infra/sha/runbook-invocations/<invoke-name>/artifact
If the runbook generated artifacts, the API returns a bundled file; otherwise, it returns an error message. Save the binary response to a tar.gz file. This file contains the runbook invocation report (in JSON) as well as a tar file for the performed ADF collection.
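Handling the downloaded bundle can be sketched as follows. Both helpers are hypothetical: `artifact_ids` reads the artifacts lists from a report such as the ADF example, and `save_bundle` writes an already-downloaded binary response body to disk and lists what the archive contains. The member names inside a real bundle are not documented here, so the sketch only enumerates them.

```python
import tarfile


def artifact_ids(report):
    """Collect the artifact IDs listed under each step of an invocation report."""
    ids = []
    for step in report.get("step_details", []):
        # Steps that produced no artifacts have no "artifacts" key.
        ids.extend(step.get("artifacts", []))
    return ids


def save_bundle(payload, destination):
    """Write the binary response body to `destination` and return member names."""
    with open(destination, "wb") as handle:
        handle.write(payload)
    # The response is saved as a gzip-compressed tar file (tar.gz).
    with tarfile.open(destination, "r:gz") as bundle:
        return bundle.getnames()


if __name__ == "__main__":
    report = {
        "step_details": [
            {"step_id": 1, "artifacts": ["a44f0446-0ac3-4e7f-8513-fb1248985d9e"]}
        ]
    }
    print(artifact_ids(report))
```

Checking `artifact_ids` first is one way to tell whether the artifact download is likely to return a bundle or an error message.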
Changing a runbook configuration
A transport node (TN) group is first created and then a runbook profile is bound to it. If you change the configuration of a runbook, the change applies to all nodes to which the runbook profile is bound. You can configure whether a runbook is enabled, the debugging timeout, and the frequency at which the runbook can be invoked using the throttle cycle mechanism. The throttle cycle mechanism specifies the number of times a runbook can be executed within a specific time period.
Note that a runbook can have only one profile, but a node might have multiple runbook profiles based on the TN groups to which it belongs. In this case, the profile with the highest priority is applied on the node.
To change a runbook configuration, perform the following steps:
- Step 1: Create a TN group of ESXi.
Run the following API to create a TN group.
PATCH https://<nsx-mgr>/policy/api/v1/infra/domains/default/groups/<group-name>
Example Request:
PATCH https://<nsx-mgr>/policy/api/v1/infra/domains/default/groups/<group-name>
{
  "expression": [
    {
      "paths": [
        "/infra/sites/default/enforcement-points/default/host-transport-nodes/TN1"
      ],
      "resource_type": "PathExpression"
    }
  ],
  "extended_expression": [],
  "reference": false,
  "group_type": [],
  "resource_type": "Group"
}
- Step 2: Change the configuration of the runbook profile and bind the TN group with the profile.
Run the following API to change the configuration.
PATCH https://<nsx-mgr>/policy/api/v1/infra/sha/runbook-profiles/<profile-name>
Example Request:
PATCH https://<nsx-mgr>/policy/api/v1/infra/sha/runbook-profiles/<profile-name>
{
  "runbook_path": "/infra/sha/pre-defined-runbooks/0000004f-7665-726c-6179-54756e6e656c",
  "applied_to_group_path": "/infra/domains/default/groups/tngroup2",
  "config": {
    "enabled": true,
    "timeout": 120,
    "threshold_number": 2,
    "throttle_cycle": 6
  }
}
In this example, the throttle_cycle is 6 minutes and the threshold_number is 2, which means that within 6 minutes, the runbook can be invoked no more than two times.
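The throttle rule in this example can be modelled with a short sketch. It assumes a sliding window that counts invocations in the last throttle_cycle minutes. Whether NSX uses a sliding window or fixed time slots is not stated in this document, and `invocation_allowed` is a hypothetical helper, not an NSX API.

```python
def invocation_allowed(previous_invocations, now, threshold_number, throttle_cycle):
    """Decide whether one more invocation fits within the throttle limits.

    previous_invocations: timestamps (in seconds) of earlier invocations.
    now: current timestamp in seconds.
    threshold_number: maximum invocations allowed per window.
    throttle_cycle: window length in minutes.
    """
    window_start = now - throttle_cycle * 60
    # Count only invocations that fall inside the current window (assumed sliding).
    recent = [t for t in previous_invocations if t > window_start]
    return len(recent) < threshold_number


if __name__ == "__main__":
    history = [0, 100]  # two invocations, at t=0s and t=100s
    print(invocation_allowed(history, now=200, threshold_number=2, throttle_cycle=6))
    print(invocation_allowed(history, now=361, threshold_number=2, throttle_cycle=6))
```

With threshold_number set to 2 and throttle_cycle set to 6, a third invocation at 200 seconds is rejected in this model, while one at 361 seconds is accepted because the invocation at 0 seconds has left the six-minute window.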
For complete information about ODS APIs, see the NSX Intelligence & NSX Application Platform API Guide.