The SD-WAN Gateway uses a pipeline architecture that processes bursts of traffic, so bursts of high CPU usage are expected. Monitor the Gateway for CPU cores that are spinning at 100%. Note, however, that the DPDK cores run in poll mode for performance reasons and are expected to use close to 100% CPU at high throughput.

You can monitor a Gateway against warning and critical thresholds, which indicate potential issues before they impact services. The following table lists the threshold values and recommended actions.

Threshold State: Warning
Threshold Value: 80%
Recommended Corrective Action:

If the threshold value is crossed consistently for 5 minutes:

  • Check per-process CPU usage.
  • Monitor for 10 more minutes.

If the threshold value is crossed consistently for 5 minutes:

  • Collect Gateway diagnostic bundle.
  • Open a support case with VMware.

Threshold State: Critical
Threshold Value: 90%
Recommended Corrective Action:

If the threshold value is crossed consistently for 5 minutes:

  • Monitor for possible critical packet drop which can indicate over capacity.

If the issue is observed for one hour:

  • If over capacity is observed over a 5 minute interval, add Gateway capacity and rebalance to avoid capacity related service impact.
    Note: Before rebalancing the Gateway, confirm that the capacity metrics are within the recommended limit. For more information on capacity metrics, see Capacity of Gateway Components.

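The threshold logic above can be sketched in Python. This is a minimal illustration, not part of the product: the function names (parse_proc_stat, utilization, classify) are hypothetical, utilization is computed from the standard Linux /proc/stat counters, and "crossed consistently" is assumed to mean that every reading collected during the window is at or above the threshold.

```python
import collections

# Threshold values from the table above.
WARNING_PCT = 80.0
CRITICAL_PCT = 90.0

CoreSample = collections.namedtuple("CoreSample", ["busy", "total"])


def parse_proc_stat(text):
    """Parse /proc/stat text into {core: CoreSample} for the per-core lines."""
    samples = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep "cpu0", "cpu1", ... and skip the aggregate "cpu" line.
        if not fields or not fields[0].startswith("cpu"):
            continue
        if not fields[0][3:].isdigit():
            continue
        values = [int(v) for v in fields[1:]]
        # Fields: user nice system idle iowait irq softirq steal ...
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        total = sum(values[:8])
        samples[fields[0]] = CoreSample(busy=total - idle, total=total)
    return samples


def utilization(before, after):
    """Return {core: percent busy} between two samples."""
    result = {}
    for core, first in before.items():
        second = after.get(core)
        if second is None:
            continue
        delta_total = second.total - first.total
        if delta_total <= 0:
            continue
        result[core] = 100.0 * (second.busy - first.busy) / delta_total
    return result


def classify(history):
    """Classify the readings collected over one window (for example, 5 minutes).

    Assumption: a state applies only if every reading in the window is at or
    above the corresponding threshold.
    """
    if not history:
        return "OK"
    lowest = min(history)
    if lowest >= CRITICAL_PCT:
        return "CRITICAL"
    if lowest >= WARNING_PCT:
        return "WARNING"
    return "OK"
```

To use the sketch, read /proc/stat at a fixed interval, pass consecutive readings to utilization, keep the per-core percentages for the last 5 minutes, and pass them to classify. DPDK poll-mode cores are expected to read close to 100% and would need to be judged separately.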
Note: You can also use Telegraf to monitor the CPU usage. For more information, see Monitor Gateways using Telegraf.
The following is an example Python script for monitoring the CPU usage:
#!/usr/bin/env python
"""
Check for CPUs spinning at 100%.

Exit codes: 0 = OK, 2 = CRITICAL, 3 = UNKNOWN.
"""
import collections
import json
import os
import re
import subprocess
import sys
import time

# Matches the per-core lines in /proc/stat (cpu0, cpu1, ...),
# not the aggregate "cpu" line.
re_cpu = re.compile(r"^cpu\d+\s")
CPUStat = collections.namedtuple('CPUStat', ['user', 'nice', 'sys', 'idle'])


def get_stats():
    """Return a dict mapping each core name to its CPUStat counters."""
    with open("/proc/stat") as f:
        stats = f.readlines()
    ret = {}
    for s in stats:
        if not re_cpu.search(s):
            continue
        s = s.split()
        ret[s[0]] = CPUStat(*[int(v) for v in s[1:5]])
    return ret


def verify_dpdk_support():
    """Return True if /opt/vc/etc/dpdk.json reports DPDK as supported."""
    if not os.path.isfile('/opt/vc/etc/dpdk.json'):
        return False
    with open('/opt/vc/etc/dpdk.json') as data:
        d = json.load(data)
    # Compare by value; "is" tests object identity and is unreliable for strings.
    return d.get('status') == "Supported"


def another_verify_dpdk_support():
    """Return True if debug.py --dpdk_ports_dump prints more than one line."""
    if not os.path.isfile('/opt/vc/bin/debug.py'):
        return False
    # universal_newlines=True returns text rather than bytes on Python 3.
    f = subprocess.check_output(["/opt/vc/bin/debug.py", "--dpdk_ports_dump"],
                                universal_newlines=True)
    x = [r.split() for r in f.split('\n')]
    return len(x) > 1


if __name__ == "__main__":
    dpdk_status = verify_dpdk_support() or another_verify_dpdk_support()
    try:
        stat1 = get_stats()
        time.sleep(3)
        stat2 = get_stats()
    except Exception:
        print("UNKNOWN - failed to get CPU stat: %s" % str(sys.exc_info()[1]))
        sys.exit(3)
    # A core whose idle counter did not advance during the 3-second
    # sample is treated as spinning at 100%.
    busy_cpu_set = [cpu for cpu in stat1
                    if (stat2[cpu].idle - stat1[cpu].idle) == 0]
    # With DPDK, this script treats cpu1 as the poll-mode core, which is
    # expected to spin, so it is not reported.
    if dpdk_status and "cpu1" in busy_cpu_set:
        busy_cpu_set.remove("cpu1")
    if not busy_cpu_set:
        print("OK - no spinning CPUs")
        sys.exit(0)
    print("CRITICAL - %s is at 100%%" % ",".join(busy_cpu_set))
    sys.exit(2)
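
The script reports its result through its exit status (0 for OK, 2 for CRITICAL, 3 for UNKNOWN) and a one-line message. The following is a minimal sketch of how a caller might consume that result. The names run_check and check_gateway_cpu are hypothetical, and the location where you save the script is up to you.

```python
import subprocess
import sys

# Exit statuses used by the script above.
STATES = {0: "OK", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(command):
    """Run a check command and return (state, message)."""
    proc = subprocess.Popen(command, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            universal_newlines=True)
    output, _ = proc.communicate()
    # Any status the script does not define (for example an uncaught
    # exception, which makes Python exit with 1) is reported as UNKNOWN.
    state = STATES.get(proc.returncode, "UNKNOWN")
    lines = output.strip().splitlines()
    message = lines[-1] if lines else ""
    return state, message


def check_gateway_cpu(script_path):
    """Run the CPU check script (path supplied by the caller)."""
    return run_check([sys.executable, script_path])
```

Because the state is derived from the exit status rather than from the message text, a monitoring system can alert on the status alone and show the message as detail.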