ESXi hosts that are prepared for NSX 6.4.5 or 6.4.6 display a Purple Screen of Death (PSOD) diagnostic screen when the virtual infrastructure latency feature is enabled in vRNI 4.2 or later.

Problem

A PSOD diagnostic screen is displayed when the number of BFD tunnels exceeds 900.

Cause

The virtual infrastructure latency feature in vRNI uses BFD monitoring on NSX-prepared hosts to establish tunnels between hosts. PSOD occurs when the NSX kernel module maintains the state of the BFD sessions while responding to a detailed BFD tunnel query from the control plane agent.

PSOD is not observed when the number of BFD tunnels is in the low hundreds. When the number of BFD tunnels exceeds 900, the host experiences a critical error and becomes inoperative. The number of hosts at which the tunnel count exceeds 900 depends on the number of VTEPs per host in your environment.

To determine the number of BFD tunnels that each host in your environment sees, use the following formula: (N-1)*(T^2)

Where:
  • N is the number of hosts.
  • T is the number of VTEPs per host.

For example, in a cluster of four hosts with two VTEPs each, the number of BFD tunnels that each host can see is:

(4-1)*(2^2)=12
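The formula can also be evaluated with a short script. The following is a minimal sketch in Python; the function names are hypothetical and are not part of any NSX or vRNI tooling. It applies (N-1)*(T^2) and, for a given number of VTEPs per host, estimates the smallest number of hosts at which the per-host tunnel count exceeds the 900 threshold described above.

```python
def bfd_tunnels_per_host(hosts: int, vteps_per_host: int) -> int:
    """Apply the formula (N-1)*(T^2) from this article."""
    if hosts < 1 or vteps_per_host < 1:
        raise ValueError("hosts and vteps_per_host must be at least 1")
    return (hosts - 1) * vteps_per_host ** 2


def min_hosts_exceeding(limit: int, vteps_per_host: int) -> int:
    """Smallest N for which (N-1)*(T^2) is strictly greater than limit."""
    if limit < 0 or vteps_per_host < 1:
        raise ValueError("limit must be >= 0 and vteps_per_host at least 1")
    # (N-1)*T^2 > limit  <=>  N-1 >= floor(limit / T^2) + 1
    return limit // vteps_per_host ** 2 + 2


# The example from this article: four hosts with two VTEPs each.
print(bfd_tunnels_per_host(4, 2))
# Host count at which a two-VTEP design crosses the 900-tunnel threshold.
print(min_hosts_exceeding(900, 2))
```

With two VTEPs per host, the threshold is crossed at 227 hosts, where (227-1)*(2^2)=904.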
In the stack trace of the PSOD, observe entries similar to the following:
#0 DLM_free (msp=0x431a455dcca0, mem=mem@entry=0x431a458cbd10, allowTrim=allowTrim@entry=1 '\001') at bora/vmkernel/main/dlmalloc.c:4924
#1 0x0000418012343ffa in Heap_Free (heap=0x431a455dc000, mem=<optimized out>, mem@entry=0x431a458cbd10) at bora/vmkernel/main/heap.c:4314
#2 0x000041801222db25 in vmk_HeapFree (heap=<optimized out>, mem=mem@entry=0x431a458cbd10) at bora/vmkernel/core/vmkapi_heap.c:250
#3 0x000041801393ca61 in __VDL2_Free (heapID=<optimized out>, data=data@entry=0x431a458cbd10) at /build/mts/release/bora-13168956/esx-datapath/modules/vdl2/vdl2.c:152
#4 0x0000418013950caf in VDL2_CPTaskFree (task=task@entry=0x431a458cbd10) at /build/mts/release/bora-13168956/esx-datapath/modules/vdl2/vdl2_ctlplane.c:164
#5 0x0000418013949415 in VDL2CPWorldProcessTask (task=0x431a458cbd10) at /build/mts/release/bora-13168956/esx-datapath/modules/vdl2/vdl2_cpworld.c:283
#6 VDL2CPWorldFunc (data=data@entry=0x0) at /build/mts/release/bora-13168956/esx-datapath/modules/vdl2/vdl2_cpworld.c:335
#7 0x0000418012308adf in vmkWorldFunc (data=<optimized out>) at bora/vmkernel/main/vmkapi_world.c:528
#8 0x00004180124c91f5 in CpuSched_StartWorld (destWorld=<optimized out>, previous=<optimized out>) at bora/vmkernel/sched/cpusched.c:10792
#9 0x0000000000000000 in ?? ()
In the /var/log/vmkernel.log file of the host, observe the following entries, which indicate that BFD was enabled on the host:
# cpu75:68603 opID=6616a61a)vxlan: VDL2PortsetPropSet:1036: Updating BFD VTEP config to : enable
# cpu75:68603 opID=6616a61a)BFD: BFD_CreateNewSession ENTER: localIP: a.b.c.d , remoteIP: w.x.y.z , probeInterval (in milli seconds): 12000
# cpu75:68603 opID=6616a61a)WARNING: BFD: Inserted new session: Discriminator 1471713223, localIP: a.b.c.d remoteIP: w.x.y.z
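To check a log file for these entries without reading it manually, a small script can search for the same message text. The following is a minimal sketch in Python; the function name and the returned keys are hypothetical, and the two marker strings are copied from the sample entries above. The session count only reflects how many BFD_CreateNewSession entries appear in the lines that are scanned; it is not a measure of the currently active tunnels.

```python
from typing import Dict, Iterable, Union

# Message fragments taken from the sample vmkernel.log entries in this article.
ENABLE_MARKER = "Updating BFD VTEP config to : enable"
SESSION_MARKER = "BFD_CreateNewSession ENTER"


def summarize_bfd_log(lines: Iterable[str]) -> Dict[str, Union[bool, int]]:
    """Report whether BFD was enabled and how many session-creation entries appear."""
    enabled_seen = False
    sessions_created = 0
    for line in lines:
        if ENABLE_MARKER in line:
            enabled_seen = True
        if SESSION_MARKER in line:
            sessions_created += 1
    return {"bfd_enabled_seen": enabled_seen, "sessions_created": sessions_created}
```

For example, open a copy of the log with open("vmkernel.log", errors="replace") and pass the file object to summarize_bfd_log.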
In the ESXi core dump or memory dump, observe the following BFD messages (BFD state change: init -> up):
less vmkernel-zdump.1
    vers:1 diag:"No Diagnostic" state:up mult:3 length:24
    flags: pol
    my_disc:0x50c322ca your_disc:0x39f2436f
    min_tx:300000us (300ms)
    min_rx:12000000us (12000ms)
    min_rx_echo:0us (0ms)(null): BFD state change: init->up "No Diagnostic"->"No Diagnostic".(null): New remote min_rx.
    vers:1 diag:"No Diagnostic" state:up mult:3 length:24
    flags: pol
    my_disc:0x5a566ae8 your_disc:0x16f3890c
    min_tx:300000us (300ms)
    min_rx:12000000us (12000ms)
    min_rx_echo:0us (0ms)(null): BFD state change: init->up "No Diagnostic"->"No Diagnostic".(null): New remote min_rx.

Solution

  1. If you have used vRNI to enable collection of latency metrics from NSX-prepared hosts, disable the virtual infrastructure latency feature.
    1. In vRNI, navigate to Settings > Accounts and Data Sources.
    2. Edit the NSX Manager data source, and deselect the Enable Virtual Infrastructure Latency check box.
    3. Click Submit to confirm the change.
  2. If you have used NSX APIs to enable collection of latency metrics, or if the vRNI appliance is inaccessible, disable BFD by running an API request.
    1. Retrieve the BFD global configuration details by running the following GET API request, and observe that BFD is enabled:
      GET /api/2.0/vdn/bfd/configuration/global
      Example API response:
      <bfdGlobalConfiguration>
          <enabled>true</enabled>
          <pollingIntervalSecondsForHost>180</pollingIntervalSecondsForHost>
          <bfdIntervalMillSecondsForHost>120000</bfdIntervalMillSecondsForHost>
      </bfdGlobalConfiguration>
    2. Disable BFD by running the following PUT API request:
      PUT /api/2.0/vdn/bfd/configuration/global
      Example request body:
      <bfdGlobalConfiguration>
          <enabled>false</enabled>
          <pollingIntervalSecondsForHost>180</pollingIntervalSecondsForHost>
          <bfdIntervalMillSecondsForHost>120000</bfdIntervalMillSecondsForHost>
      </bfdGlobalConfiguration>
    For detailed information about the BFD configuration parameters, see the NSX API Guide.
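The PUT request body can be derived from the GET response so that the existing interval values are carried over unchanged. The following is a minimal sketch in Python that uses only the standard library; the function name is hypothetical. It covers only the construction of the request body. Sending the request to NSX Manager, including authentication and the manager address, depends on your environment and is not shown.

```python
import xml.etree.ElementTree as ET


def build_disable_body(get_response_xml: str) -> str:
    """Return the GET response with <enabled> set to false, for use as the PUT body."""
    root = ET.fromstring(get_response_xml)
    if root.tag != "bfdGlobalConfiguration":
        raise ValueError("unexpected root element: " + root.tag)
    enabled = root.find("enabled")
    if enabled is None:
        raise ValueError("response has no <enabled> element")
    # Only the enabled flag changes; the interval elements are left as returned.
    enabled.text = "false"
    return ET.tostring(root, encoding="unicode")


# Example API response from this article.
response = """<bfdGlobalConfiguration>
    <enabled>true</enabled>
    <pollingIntervalSecondsForHost>180</pollingIntervalSecondsForHost>
    <bfdIntervalMillSecondsForHost>120000</bfdIntervalMillSecondsForHost>
</bfdGlobalConfiguration>"""

print(build_disable_body(response))
```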