The Online Diagnostic System (ODS) feature automates debugging of NSX at runtime. ODS is implemented through runbooks that are built into NSX. Runbooks contain debugging procedures and have full observability of NSX components. A runbook generates a debugging report. It also generates runtime artifacts of an issue, such as packet captures, live core dumps of user processes, and the output of scripts and tools. These artifacts can be collected later and used for offline analysis and debugging. Note that you cannot modify a predefined runbook.

You can use APIs to perform the following runbook operations:

  • Invoke a runbook to initiate the runtime debugging
  • Check the debugging report
  • Download artifacts generated by the invoked runbook
  • Manage the lifecycle of a runbook

Predefined runbooks

Starting with NSX 4.1.1, ODS is also supported for Unified Appliance (UA). The following table lists the runbooks and also the nodes on which you can run them.

Table 1.
Runbook Description Node Version
PnicPerf This runbook can identify pNIC TX and RX performance issues.

This runbook takes the ID of the logical port to be diagnosed as an input argument.

ESXi 4.1
OverlayTunnel This runbook can identify overlay tunnel failures, such as a gateway configuration error, a missing tunnel, or a tunnel that is down.

This runbook takes the IP address of the source VTEP and the IP address of the destination VTEP as input arguments.

ESXi 4.1
PortBlock This runbook can identify the causes of a port's blockage. For example, a DVPort could be blocked because of an incorrect LSP/LS configuration.

ESXi 4.1
AdfCollect This runbook gathers ESXi datapath performance data through ADF collector.

This runbook takes the following input arguments:

  • advanced: Whether advanced performance tools are enabled.
  • cycle: Number of ADF collection cycles.
  • interval: Waiting interval between consecutive ADF collection cycles in seconds.
ESXi 4.1
ControllerConn This runbook can identify controller connectivity issues caused by controller failure, proxy failure, underlay network outage, or FQDN resolution failure.

This runbook takes the IP address of the ESXi host and the IP address of the controller as input arguments.

ESXi 4.1
NxgiPlatform

This runbook diagnoses issues in the NXGI platform datapath that powers various features, such as Distributed MPS, Endpoint Protection, IDFW, IDS, and Intelligence.

This runbook takes functionality as an input argument. This is the specific NXGI-dependent functionality that needs to be checked or debugged. It must be one of MPS, IDFW, IDS, EPP, or Intelligence. EPP is the default.

ESXi 4.1.1
VifInfo

This runbook gets detailed information about a virtual interface, which can be used as input to other runbooks.

This runbook takes VIF as an input argument.

ESXi 4.1.1
EdgeHealth
This runbook performs the following functions:
  • Triages common edge down issues occurring during deployments and upgrades.
  • Suggests workarounds to return the edge node to a healthy state and resume normal functioning.

This runbook does not have any input argument.

Edge 4.1.1
MacAddressInfo
This runbook diagnoses issues related to a specific MAC address and checks and remediates performance issues on a host-switch. It performs the following functions:
  • For a given MAC address and host-switch, this runbook obtains MAC address details (such as MAC, VLAN, VNI, and portID) from the host-switch MAC address table and provides a diagnosis in case a MAC is not present.
  • For a given static MAC address used by a VNIC, it also retrieves the VLAN and VNI information of the VNIC port.

This runbook takes the host-switch name and mac-address as input arguments.

ESXi 4.1.1
EdgeDpBfd

This runbook triages NSX Edge BFD issues.

This runbook takes the source IP and destination IP of the BFD session as input arguments. It also takes capture as an optional argument to ensure that only BFD packets specific to the session are filtered and captured.

Edge 4.1.1
DistributedMps

This runbook verifies the health of the Malware Prevention Service (MPS) pipeline and diagnoses any issues encountered with MPS, such as protection not working on a particular VM or certain files not being scanned.

Input argument for this runbook is the VM UUID of the VM to be diagnosed for protection.

ESXi 4.1.1
DuplicateVtepDetectorProvider

This runbook detects any duplicate IP or label in VTEPs.

This runbook does not have any input argument.

UA 4.1.1
LspStaleInfo

This runbook fetches the stale logical ports.

This runbook does not have any input argument.

UA
BgpNeighborState

This runbook diagnoses various flows that can cause the BGP neighbor to be down. The runbook also collects the following supporting artifacts for offline debugging:

  • BGP and BFD packet capture.
  • Ping and traceroute results.

This runbook takes the following input arguments:

  • Peer IP: This is a mandatory argument that takes the BGP neighbor IP address.
  • Logical Router name: This is a mandatory argument that takes the logical router name configured by the user.
  • Packet capture: This is an optional argument. If it is set to True, the runbook captures BGP/BFD packets. By default, it is set to False.
Edge 4.1.1
PimMrouteState

This runbook triages multicast traffic loss, which can have several causes.

Given the source IP of the multicast traffic sender and the group IP of the multicast traffic, this runbook helps identify the root cause of the problem and collects supporting artifacts for offline debugging.

This runbook takes source IP, group IP, and traffic direction as input arguments.

Edge 4.1.1
OspfNbrState

This runbook triages OSPF neighbor state issues by diagnosing various logical flows.

This runbook takes the neighbor IP address as an input argument.

Edge 4.1.1
IdpsDpStatus

This runbook checks the status of both the IDPS/IDS module (in the context engine) and the IDPS engine. It also compares the loaded profiles, rules, signatures, and captured events in the IDPS module and the IDPS engine module.

This runbook does not have any input argument.

ESXi 4.1.1
NxgiPlatformUA

This runbook diagnoses issues in the management plane for the NXGI platform, which powers various features such as Distributed MPS, Endpoint Protection, IDFW, IDS, and Intelligence. This runbook can be used in conjunction with the NxgiPlatform runbook, which runs on transport nodes.

This runbook takes transport node ID as an optional input argument.

UA 4.1.1
CorfuServer

This runbook checks the Corfu server layout stability, infra ping and disk IO condition, and compactor health.

This runbook takes lookback days and lookback hours as input arguments. Use them to override the default lookback period for log events.

UA 4.1.1
EdgeIDPS

This runbook checks and retrieves statistics for an IDS/IPS signature present on the edge.

This runbook takes signature ID as an input argument.

Edge 4.1.1
EdgeRouting

This runbook checks multiple aspects of a given logical router's health and provides a list of edge processes that are not in good health. The runbook provides troubleshooting directions for the common north/south routing issues.

The runbook gathers various pieces of information and reports the status of the routing protocols, tunnels, and ports. It will do a health check for the routing stack and various daemons used by routing on the Edge. For the provided destination IP it will check the RIB, FIB, and ARP tables. It will also run a ping and traceroute using the source and destination IP.

This runbook takes the following input arguments:
  • Destination/peer IP: This is a mandatory argument.
  • Source IP: This is a mandatory argument. The source IP is used for the ping and traceroute and must be an IP from the interface the destination IP is expected to be directed to.
  • Logical Router (LR) name: This is a mandatory argument. It is the logical router name as configured in the system.
Edge 4.1.1

PimConfigCheck

This runbook provides more details about what exactly failed when the edge reports routing_config_error. It also helps in triaging the issue.

This runbook does not have any input argument.

Edge 4.1.1
NCPPendingPod This runbook debugs a pod that is stuck in a pending state.

This runbook takes namespace and name of the pending pod as input arguments.

ESXi 4.1.2
PortStatusInfo This runbook identifies issues on a port that is connected to vSwitch.

This runbook takes the VIF ID as an input argument to retrieve port details.

ESXi 4.2.1

Steps to debug at runtime

To debug at runtime, perform the following steps:

Step 1: Fetch a list of predefined runbooks.

Run the following API to fetch a list of predefined runbooks.

GET https://<nsx-mgr>/policy/api/v1/infra/sha/pre-defined-runbooks

This API returns a list of predefined runbooks along with the following information:

  • Configuration details.
  • Node type on which the runbook is supported.
  • General details of the runbook such as id, name, and path.
  • Parameter details, if any are required when invoking the runbook.

Example Response:

{
    "results": [
        {
            "version": {
                "major": 1,
                "minor": 0
            },
            "default_config": {
                "enabled": true,
                "timeout": 300,
                "threshold_number": 5,
                "throttle_cycle": 10
            },
            "supported_node_types": [
                "nsx-esx"
            ],
            "parameters": [
                {
                    "name": "advanced",
                    "optional": true,
                    "parameter_type": "BOOLEAN",
                    "default_value": "False"
                },
                {
                    "name": "cycle",
                    "optional": false,
                    "parameter_type": "INTEGER",
                    "max": "20",
                    "min": "1"
                },
                {
                    "name": "interval",
                    "optional": true,
                    "parameter_type": "INTEGER",
                    "max": "300",
                    "min": "1"
                }
            ],
            "resource_type": "OdsPredefinedRunbook",
            "id": "00000000-0000-4164-6643-6f6c6c656374",
            "display_name": "AdfCollect",
            "path": "/infra/sha/pre-defined-runbooks/00000000-0000-4164-6643-6f6c6c656374",
            "relative_path": "00000000-0000-4164-6643-6f6c6c656374",
            "parent_path": "/infra",
            "remote_path": "",
            "unique_id": "069a4c7b-532a-4926-a402-6a7986a306b2",
            "realization_id": "069a4c7b-532a-4926-a402-6a7986a306b2",
            "owner_id": "f2a2a9e1-2578-435d-877d-47e01eb04954",
            "origin_site_id": "f2a2a9e1-2578-435d-877d-47e01eb04954",
            "marked_for_delete": false,
            "overridden": false,
            "_create_time": 1669600880454,
            "_create_user": "system",
            "_last_modified_time": 1669600880454,
            "_last_modified_user": "system",
            "_system_owned": false,
            "_revision": 0
        },
        {
            "version": {
                "major": 1,
                "minor": 0
            },
            "default_config": {
                "enabled": true,
                "timeout": 60,
                "threshold_number": 5,
                "throttle_cycle": 3
            },
            "supported_node_types": [
                "nsx-esx"
            ],
            "resource_type": "OdsPredefinedRunbook",
            "id": "0000436f-6e74-726f-6c6c-6572436f6e6e",
            "display_name": "ControllerConn",
            "path": "/infra/sha/pre-defined-runbooks/0000436f-6e74-726f-6c6c-6572436f6e6e",
            "relative_path": "0000436f-6e74-726f-6c6c-6572436f6e6e",
            "parent_path": "/infra",
            "remote_path": "",
            "unique_id": "3914cbe4-41b4-45f1-9dad-2bd96a2de0d8",
            "realization_id": "3914cbe4-41b4-45f1-9dad-2bd96a2de0d8",
            "owner_id": "f2a2a9e1-2578-435d-877d-47e01eb04954",
            "origin_site_id": "f2a2a9e1-2578-435d-877d-47e01eb04954",
            "marked_for_delete": false,
            "overridden": false,
            "_create_time": 1669600880493,
            "_create_user": "system",
            "_last_modified_time": 1669600880493,
            "_last_modified_user": "system",
            "_system_owned": false,
            "_revision": 0
        },
        {
            "version": {
                "major": 1,
                "minor": 0
            },
            "default_config": {
                "enabled": true,
                "timeout": 120,
                "threshold_number": 5,
                "throttle_cycle": 5
            },
            "supported_node_types": [
                "nsx-esx"
            ],
            "parameters": [
                {
                    "name": "src",
                    "optional": false,
                    "parameter_type": "COMPOUND"
                },
                {
                    "name": "dst",
                    "optional": false,
                    "parameter_type": "COMPOUND"
                }
            ],
            "resource_type": "OdsPredefinedRunbook",
            "id": "0000004f-7665-726c-6179-54756e6e656c",
            "display_name": "OverlayTunnel",
            "path": "/infra/sha/pre-defined-runbooks/0000004f-7665-726c-6179-54756e6e656c",
            "relative_path": "0000004f-7665-726c-6179-54756e6e656c",
            "parent_path": "/infra",
            "remote_path": "",
            "unique_id": "3597af28-e670-456e-8347-4d1a53a5cb90",
            "realization_id": "3597af28-e670-456e-8347-4d1a53a5cb90",
            "owner_id": "f2a2a9e1-2578-435d-877d-47e01eb04954",
            "origin_site_id": "f2a2a9e1-2578-435d-877d-47e01eb04954",
            "marked_for_delete": false,
            "overridden": false,
            "_create_time": 1669600880518,
            "_create_user": "system",
            "_last_modified_time": 1669600880518,
            "_last_modified_user": "system",
            "_system_owned": false,
            "_revision": 0
        },
        {
            "version": {
                "major": 1,
                "minor": 0
            },
            "default_config": {
                "enabled": true,
                "timeout": 120,
                "threshold_number": 5,
                "throttle_cycle": 5
            },
            "supported_node_types": [
                "nsx-esx"
            ],
            "parameters": [
                {
                    "name": "lsp",
                    "optional": false,
                    "parameter_type": "STRING"
                }
            ],
            "resource_type": "OdsPredefinedRunbook",
            "id": "00000000-0000-0000-506e-696350657266",
            "display_name": "PnicPerf",
            "path": "/infra/sha/pre-defined-runbooks/00000000-0000-0000-506e-696350657266",
            "relative_path": "00000000-0000-0000-506e-696350657266",
            "parent_path": "/infra",
            "remote_path": "",
            "unique_id": "53f29b77-dcf5-4561-85ec-8f35280e3f3a",
            "realization_id": "53f29b77-dcf5-4561-85ec-8f35280e3f3a",
            "owner_id": "f2a2a9e1-2578-435d-877d-47e01eb04954",
            "origin_site_id": "f2a2a9e1-2578-435d-877d-47e01eb04954",
            "marked_for_delete": false,
            "overridden": false,
            "_create_time": 1669600880553,
            "_create_user": "system",
            "_last_modified_time": 1669600880553,
            "_last_modified_user": "system",
            "_system_owned": false,
            "_revision": 0
        },
        {
            "version": {
                "major": 1,
                "minor": 0
            },
            "default_config": {
                "enabled": true,
                "timeout": 60,
                "threshold_number": 5,
                "throttle_cycle": 3
            },
            "supported_node_types": [
                "nsx-esx"
            ],
            "parameters": [
                {
                    "name": "vif",
                    "optional": false,
                    "parameter_type": "STRING"
                }
            ],
            "resource_type": "OdsPredefinedRunbook",
            "id": "00000000-0000-0050-6f72-74426c6f636b",
            "display_name": "PortBlock",
            "path": "/infra/sha/pre-defined-runbooks/00000000-0000-0050-6f72-74426c6f636b",
            "relative_path": "00000000-0000-0050-6f72-74426c6f636b",
            "parent_path": "/infra",
            "remote_path": "",
            "unique_id": "6f411a92-30c0-4838-9758-c00220cb5fab",
            "realization_id": "6f411a92-30c0-4838-9758-c00220cb5fab",
            "owner_id": "f2a2a9e1-2578-435d-877d-47e01eb04954",
            "origin_site_id": "f2a2a9e1-2578-435d-877d-47e01eb04954",
            "marked_for_delete": false,
            "overridden": false,
            "_create_time": 1669600880572,
            "_create_user": "system",
            "_last_modified_time": 1669600880572,
            "_last_modified_user": "system",
            "_system_owned": false,
            "_revision": 0
        }
    ],
    "result_count": 5,
    "sort_by": "display_name",
    "sort_ascending": true
}
In the above response, the following predefined runbooks are returned.
  • AdfCollect
  • ControllerConn
  • OverlayTunnel
  • PnicPerf
  • PortBlock

If a runbook requires parameters at the time of invocation, the parameter details are specified in the parameters key. For example, the OverlayTunnel runbook requires two parameters, src and dst, which are the local and remote VTEP IPs of the tunnel to be diagnosed.
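The parameters key can also be used to check argument values before a runbook is invoked. The following Python sketch is illustrative only: the function name validate_arguments is hypothetical, and it assumes that min and max are inclusive bounds for INTEGER parameters, which the response does not state explicitly.

```python
def validate_arguments(parameters, values):
    """Return a list of problems found in `values` for a runbook's `parameters`.

    `parameters` is the list under the "parameters" key of a predefined
    runbook; `values` maps parameter names to the values you plan to send.
    Assumption: "min" and "max" are inclusive bounds for INTEGER parameters.
    """
    problems = []
    known = {p["name"] for p in parameters}
    for name in values:
        if name not in known:
            problems.append(f"unknown parameter: {name}")
    for p in parameters:
        name = p["name"]
        if name not in values:
            # Only non-optional parameters must be present.
            if not p.get("optional", False):
                problems.append(f"missing required parameter: {name}")
            continue
        if p.get("parameter_type") == "INTEGER":
            try:
                number = int(values[name])
            except (TypeError, ValueError):
                problems.append(f"{name} must be an integer")
                continue
            if "min" in p and number < int(p["min"]):
                problems.append(f"{name} is below the minimum {p['min']}")
            if "max" in p and number > int(p["max"]):
                problems.append(f"{name} is above the maximum {p['max']}")
    return problems


# Parameter definitions copied from the AdfCollect entry in the example response.
ADF_PARAMETERS = [
    {"name": "advanced", "optional": True, "parameter_type": "BOOLEAN", "default_value": "False"},
    {"name": "cycle", "optional": False, "parameter_type": "INTEGER", "max": "20", "min": "1"},
    {"name": "interval", "optional": True, "parameter_type": "INTEGER", "max": "300", "min": "1"},
]

print(validate_arguments(ADF_PARAMETERS, {"cycle": "3", "interval": "10"}))
print(validate_arguments(ADF_PARAMETERS, {"interval": "500"}))
```

The first call reports no problems. The second call reports that cycle is missing and that interval exceeds its maximum.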

Step 2: Get parameter details of the runbook.

Run the following API to fetch the parameter details of a runbook that you want to invoke.

GET https://{{MANAGER_IP}}/policy/api/v1/infra/sha/pre-defined-runbooks/0000004f-7665-726c-6179-54756e6e656c/help

Example Response:

{
    "summary": "Runbook to diagnose overlay tunnel issues.",
    "parameter_info": [
        {
            "summary": "IP address of source VTEP",
            "parameter": {
                "name": "src",
                "optional": false,
                "parameter_type": "COMPOUND"
            }
        },
        {
            "summary": "IP address of destination VTEP",
            "parameter": {
                "name": "dst",
                "optional": false,
                "parameter_type": "COMPOUND"
            }
        }
    ]
}
Note that in a runbook's invocation API, each parameter name is used as the key of an argument, and the value you supply must match the parameter type.
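As an illustration of this mapping, the following Python sketch turns a help response into the arguments list of an invocation request. The function name to_invocation_arguments is hypothetical, and the sketch assumes that values are passed as strings, as they are in the example requests in this topic.

```python
def to_invocation_arguments(help_response, values):
    """Build the "arguments" list of an invocation request.

    Each parameter name from the help response becomes a "key", and the
    caller-supplied string becomes the "value". Optional parameters that
    are not supplied are left out; a missing required parameter raises.
    """
    arguments = []
    for info in help_response.get("parameter_info", []):
        param = info["parameter"]
        name = param["name"]
        if name in values:
            arguments.append({"key": name, "value": values[name]})
        elif not param.get("optional", False):
            raise ValueError(f"missing required parameter: {name}")
    return arguments


# Help response for the OverlayTunnel runbook, copied from the example above.
OVERLAY_TUNNEL_HELP = {
    "summary": "Runbook to diagnose overlay tunnel issues.",
    "parameter_info": [
        {
            "summary": "IP address of source VTEP",
            "parameter": {"name": "src", "optional": False, "parameter_type": "COMPOUND"},
        },
        {
            "summary": "IP address of destination VTEP",
            "parameter": {"name": "dst", "optional": False, "parameter_type": "COMPOUND"},
        },
    ],
}

print(to_invocation_arguments(
    OVERLAY_TUNNEL_HELP, {"src": "192.168.0.11", "dst": "192.168.0.10"}
))
```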
Step 3: Invoke the runbook.

Run the following API to invoke a runbook.

POST https://<nsx-mgr>/policy/api/v1/infra/sha/runbook-invocations/<invoke-name>

Example Request: Invoking the OverlayTunnel runbook.

POST https://<nsx-mgr>/policy/api/v1/infra/sha/runbook-invocations/OverlayTunnel
{
    "runbook_path": "/infra/sha/pre-defined-runbooks/0000004f-7665-726c-6179-54756e6e656c",
    "target_node": "6c7a9374-459d-46b2-9ea6-c63b37c7cc38",
    "arguments": [
        {
            "key": "src",
            "value": "192.168.0.11"
        },
        {
            "key": "dst",
            "value": "192.168.0.10"
        }
    ]
}
The target_node can be a host node, an edge node, or a Unified Appliance node.

The ID of a host or an edge can be obtained through the following API.

GET https://{{MANAGER_IP}}/api/v1/transport-nodes

The list of appliance IDs can be obtained from the cluster API. The valid ID is the external ID of an appliance with the manager role.

GET https://{{MANAGER_IP}}/api/v1/cluster/nodes
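
The invocation request in this step could be prepared from a script roughly as follows. This is a minimal Python sketch that only builds the request and does not send it. The function name build_invocation_request and the host name nsx-mgr.example.com are hypothetical, and authentication and TLS settings are deliberately left out because they depend on your environment.

```python
import json
import urllib.request


def build_invocation_request(manager, invoke_name, runbook_path, target_node, arguments):
    """Return a prepared (unsent) POST request for a runbook invocation."""
    body = {
        "runbook_path": runbook_path,
        "target_node": target_node,
        "arguments": arguments,
    }
    url = f"https://{manager}/policy/api/v1/infra/sha/runbook-invocations/{invoke_name}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Values taken from the OverlayTunnel example request; the host name is a placeholder.
request = build_invocation_request(
    manager="nsx-mgr.example.com",
    invoke_name="OverlayTunnel",
    runbook_path="/infra/sha/pre-defined-runbooks/0000004f-7665-726c-6179-54756e6e656c",
    target_node="6c7a9374-459d-46b2-9ea6-c63b37c7cc38",
    arguments=[
        {"key": "src", "value": "192.168.0.11"},
        {"key": "dst", "value": "192.168.0.10"},
    ],
)
print(request.get_method(), request.full_url)
```

Sending the request, for example with urllib.request.urlopen, additionally requires credentials that are valid for your NSX Manager.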
Step 4: Check the report of the invoked runbook.

Run the following API to get the report of the invoked runbook.

GET https://<nsx-mgr>/policy/api/v1/infra/sha/runbook-invocations/<invoke-name>/report

Example Request:

GET https://<nsx-mgr>/policy/api/v1/infra/sha/runbook-invocations/OverlayTunnel/report

Example Response:

{
    "invocation_id": "70527fed-1e5e-4fed-a880-28cde04a66b1",
    "target_node": "6c7a9374-459d-46b2-9ea6-c63b37c7cc38",
    "timestamp": 1662469099,
    "sys_info": {
        "host_name": "sc2-10-185-4-158.eng.vmware.com",
        "os_name": "VMkernel",
        "os_version": "7.0.3"
    },
    "result_message": "Tunnel 192.168.0.11 -> 192.168.0.10 is in up state",
    "recommendation_message": "No changes required as the tunnel is in up state.",
    "step_details": [
        {
            "step_id": 1,
            "action_summary": "Check the status of tunnel for the given source/destination VTEPs.",
            "action_result": "Tunnel 192.168.0.11 -> 192.168.0.10 is in state up"
        }
    ],
    "status": {
        "request_status": "SUCCESS",
        "operation_state": "FINISHED"
    }
}

The response returns metadata, such as the timestamp and system details like the host name and operating system. The report also returns the debugging result, a remediation suggestion if any, and the steps executed for debugging, with the action summary and action result of each step. If debugging is interrupted for any reason, the operation_state field holds a value that indicates the reason for the interruption. If the runbook invocation does not succeed, the report provides error details and does not show debugging-related fields.
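
A script that consumes the report might condense it as in the following Python sketch. The function name summarize_report is hypothetical. Because SUCCESS and FINISHED are the only status values shown in this topic, the sketch treats any other combination as not finished, and it reads every field defensively so that a report without debugging-related fields does not cause an error.

```python
def summarize_report(report):
    """Condense a runbook invocation report into a short dictionary."""
    status = report.get("status", {})
    # Assumption: only SUCCESS + FINISHED means the debugging ran to completion.
    finished = (
        status.get("request_status") == "SUCCESS"
        and status.get("operation_state") == "FINISHED"
    )
    return {
        "finished": finished,
        "operation_state": status.get("operation_state"),
        "result": report.get("result_message"),
        "recommendation": report.get("recommendation_message"),
        "steps": [
            (step.get("step_id"), step.get("action_result"))
            for step in report.get("step_details", [])
        ],
    }


# Abbreviated copy of the example report above.
EXAMPLE_REPORT = {
    "result_message": "Tunnel 192.168.0.11 -> 192.168.0.10 is in up state",
    "recommendation_message": "No changes required as the tunnel is in up state.",
    "step_details": [
        {
            "step_id": 1,
            "action_summary": "Check the status of tunnel for the given source/destination VTEPs.",
            "action_result": "Tunnel 192.168.0.11 -> 192.168.0.10 is in state up",
        }
    ],
    "status": {"request_status": "SUCCESS", "operation_state": "FINISHED"},
}

summary = summarize_report(EXAMPLE_REPORT)
print(summary["finished"], summary["steps"])
```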

Step 5: Download artifacts.
  1. Run the following API to gather artifacts.

    POST https://{{mgr-ip}}/policy/api/v1/infra/sha/runbook-invocations/{{invocation-name}}

    Example Request:
    POST https://{{mgr-ip}}/policy/api/v1/infra/sha/runbook-invocations/{{invocation-name}}
    {
        "runbook_path": "/infra/sha/pre-defined-runbooks/00000000-0000-4164-6643-6f6c6c656374",
        "target_node": "{{target-node}}",
        "arguments": [
            {
                "key": "advanced",
                "value": "{{advanced-mode}}"
            },
            {
                "key": "cycle",
                "value": "{{cycle-count}}"
            },
            {
                "key": "interval",
                "value": "{{interval-in-sec}}"
            }
        ]
    }

    If the advanced parameter is set to false at the time of invocation, the runbook collects topology information, net-stats, NSX DP stats, and uplink information. If the advanced parameter is set to true, the runbook additionally runs advanced performance tools such as vmkstats (available only on physical machines).

    The cycle parameter defines the number of times the ADF collector is executed in an invocation.

    The interval parameter defines the waiting interval between consecutive ADF collector executions. It takes effect only when the cycle parameter is set to a value greater than 1.

    Example Response:

    {
        "invocation_id": "80a0037a-52e1-48d7-b28e-c3bfb8475e8c",
        "target_node": "b794f78f-7eb0-433f-8f11-63e6b3121c28",
        "timestamp": 1668674073,
        "sys_info": {
            "host_name": "sc2-rdops-vm06-dhcp-204-101.eng.vmware.com",
            "os_name": "VMkernel",
            "os_version": "7.0.3"
        },
        "result_message": "ADF data collection runbook completes.",
        "recommendation_message": "No action needs to be taken.",
        "step_details": [
            {
                "step_id": 1,
                "action_summary": "Run ADF collector.",
                "action_result": "ADF data collection is successfully performed in the following time point(s) along with the following artifact(s): [(2022-11-17 08:35:30, a44f0446-0ac3-4e7f-8513-fb1248985d9e.tar)]",
                "artifacts": [
                    "a44f0446-0ac3-4e7f-8513-fb1248985d9e"
                ]
            }
        ],
        "status": {
            "request_status": "SUCCESS",
            "operation_state": "FINISHED"
        }
    }
  2. Run the following API to download the artifacts.

    GET https://<nsx-mgr>/policy/api/v1/infra/sha/runbook-invocations/<invoke-name>/artifact

If the runbook generated artifacts, the API returns a bundled file. Otherwise, it returns an error message. Save the binary response to a tar.gz file. This file contains a runbook invocation report (in JSON) as well as a tar file for the performed ADF collection.
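
Saving and inspecting the bundle can be sketched in Python as follows. The helper names are hypothetical, and because no real bundle is available here, the example builds a small stand-in archive in memory. The member names report.json and adf.tar are placeholders and not the names that NSX uses.

```python
import io
import os
import tarfile
import tempfile


def save_bundle(data, destination):
    """Write the binary response body to `destination` and return member names."""
    with open(destination, "wb") as handle:
        handle.write(data)
    with tarfile.open(destination, mode="r:gz") as bundle:
        return bundle.getnames()


def make_stand_in_bundle():
    """Build a small tar.gz in memory that stands in for a real API response."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as bundle:
        # Placeholder members: a JSON report and an (empty) inner tar file.
        for name, payload in (("report.json", b"{}"), ("adf.tar", b"")):
            member = tarfile.TarInfo(name=name)
            member.size = len(payload)
            bundle.addfile(member, io.BytesIO(payload))
    return buffer.getvalue()


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "artifacts.tar.gz")
    print(save_bundle(make_stand_in_bundle(), path))
```

With a real response, pass the raw response bytes to save_bundle in place of the stand-in archive.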

Changing a runbook configuration

To change a runbook configuration, first create a transport node (TN) group and then bind a runbook profile to it. If you change the configuration of a runbook, the change applies to all nodes to which the runbook profile is bound. You can configure whether a runbook is enabled, the debugging timeout, and the frequency at which the runbook can be invoked by using the throttle cycle mechanism. The throttle cycle mechanism specifies the number of times a runbook can be executed within a specific time period.

Note that a runbook can have only one profile, but a node might have multiple runbook profiles based on the TN groups to which it belongs. In this case, the profile with the highest priority is applied on the node.

To change a runbook configuration, perform the following steps:

Step 1: Create a TN group of ESXi hosts.

Run the following API to create a TN group.

PATCH https://<nsx-mgr>/policy/api/v1/infra/domains/default/groups/<group-name>

Example Request:

PATCH https://<nsx-mgr>/policy/api/v1/infra/domains/default/groups/<group-name>
{
    "expression": [
        {
            "paths": [
                "/infra/sites/default/enforcement-points/default/host-transport-nodes/TN1"
            ],
            "resource_type": "PathExpression"
        }
    ],
    "extended_expression": [],
    "reference": false,
    "group_type": [],
    "resource_type": "Group"
}
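
When a group should contain several hosts, the request body can be generated instead of written by hand. The following Python sketch uses a hypothetical function name, build_tn_group, and assumes that one PathExpression can list several paths and that all hosts belong to the default site and enforcement point, as in the example request.

```python
import json


def build_tn_group(transport_node_ids):
    """Return a Group body whose members are the given host transport nodes."""
    # Path prefix taken from the example request (default site and enforcement point).
    prefix = "/infra/sites/default/enforcement-points/default/host-transport-nodes/"
    return {
        "expression": [
            {
                "paths": [prefix + node_id for node_id in transport_node_ids],
                "resource_type": "PathExpression",
            }
        ],
        "extended_expression": [],
        "reference": False,
        "group_type": [],
        "resource_type": "Group",
    }


# With a single node ID, the result matches the example request body.
print(json.dumps(build_tn_group(["TN1"]), indent=4))
```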
Step 2: Change the configuration of the runbook profile and bind the TN group with the profile.

Run the following API to change the configuration.

PATCH https://<nsx-mgr>/policy/api/v1/infra/sha/runbook-profiles/<profile-name>

Example Request:

PATCH https://<nsx-mgr>/policy/api/v1/infra/sha/runbook-profiles/<profile-name>
{
    "runbook_path": "/infra/sha/pre-defined-runbooks/0000004f-7665-726c-6179-54756e6e656c",
    "applied_to_group_path": "/infra/domains/default/groups/tngroup2",
    "config": {
        "enabled": true,
        "timeout": 120,
        "threshold_number": 2,
        "throttle_cycle": 6
    }
}

In this example, the throttle_cycle is 6 minutes and the threshold_number is 2, which means that within 6 minutes, the runbook can be invoked no more than two times.
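
The effect of these two settings can be modeled with a few lines of Python. This topic does not describe how NSX evaluates the time window, so the sketch below assumes a sliding window that ends at the current time. The function name may_invoke is hypothetical.

```python
def may_invoke(previous_invocations, now, throttle_cycle, threshold_number):
    """Return True if another invocation at minute `now` stays within the limit.

    `previous_invocations` holds the times (in minutes) of earlier invocations.
    Assumption: the limit is evaluated over a sliding window of
    `throttle_cycle` minutes that ends at `now`.
    """
    window_start = now - throttle_cycle
    recent = [t for t in previous_invocations if window_start < t <= now]
    return len(recent) < threshold_number


# throttle_cycle = 6 minutes, threshold_number = 2, as in the example request.
print(may_invoke([0], now=1, throttle_cycle=6, threshold_number=2))     # True
print(may_invoke([0, 1], now=3, throttle_cycle=6, threshold_number=2))  # False
print(may_invoke([0, 1], now=7, throttle_cycle=6, threshold_number=2))  # True
```

In this model, a third invocation at minute 3 is rejected because two invocations already fall within the last 6 minutes. By minute 7, both earlier invocations have left the window.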

For complete information about ODS APIs, see the NSX Intelligence & NSX Application Platform API Guide.