Significant and sustained drops in the critical queues are common in over-capacity situations and are likely to impact customers.

The counters in the following critical queues should be monitored closely.

vc_queue_net_sch, vc_queue_link_select and vc_queue_link_sch – These schedulers handle link selection, QoS, and link scheduling for packets destined for Edges. The Gateway takes packets from the global packet scheduler (vc_queue_net_sch), chooses the appropriate link to send them on (vc_queue_link_select), and enqueues them to the link-level packet scheduler (vc_queue_link_sch).

The drops here indicate that the Gateway cannot send VCMP traffic (for example, the return packets from the internet) to Edges fast enough.
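To illustrate why a slow stage shows up as drops in its queue, the following is a toy model of three chained, fixed-capacity queues. It is only a sketch: the BoundedStage class, the capacity of 8, and the per-tick rates are hypothetical values chosen for illustration, and the code does not represent how the Gateway implements its schedulers.

```python
from collections import deque


class BoundedStage:
    """Toy fixed-capacity queue that counts the packets it has to drop."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.queue = deque()
        self.drops = 0  # cumulative counter, never reset

    def enqueue(self, packet):
        # Tail drop: a full queue rejects the packet and counts it.
        if len(self.queue) >= self.capacity:
            self.drops += 1
            return False
        self.queue.append(packet)
        return True

    def dequeue(self):
        return self.queue.popleft() if self.queue else None


def run(stages, arrivals_per_tick, service_per_tick, ticks):
    """Feed packets into the first stage; stage i forwards at most
    service_per_tick[i] packets per tick to the next stage (or sends them)."""
    seq = 0
    for _ in range(ticks):
        for _ in range(arrivals_per_tick):
            stages[0].enqueue(seq)
            seq += 1
        # Serve the last stage first so a packet advances one stage per tick.
        for i in reversed(range(len(stages))):
            for _ in range(service_per_tick[i]):
                packet = stages[i].dequeue()
                if packet is None:
                    break
                if i + 1 < len(stages):
                    stages[i + 1].enqueue(packet)
    return {stage.name: stage.drops for stage in stages}


stages = [
    BoundedStage(name, capacity=8)
    for name in ("vc_queue_net_sch", "vc_queue_link_select", "vc_queue_link_sch")
]
# Hypothetical load: 4 packets arrive per tick, but the last stage only
# sends 2 per tick, so its queue fills up and starts dropping.
drops = run(stages, arrivals_per_tick=4, service_per_tick=[4, 4, 2], ticks=20)
print(drops)
```

In this model only the stage that cannot keep up accumulates drops, which is why the drop counters help locate the bottleneck.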

vc_queue_link_encrypt_0 and vc_queue_link_encrypt_1 – These queues perform software-based packet encryption, along with encapsulation and similar processing, before packet transmission.

In a non-DPDK enabled Gateway, capacity issues are first observed as drops in this queue. Drops in the queue indicate that the Gateway cannot encrypt traffic fast enough.
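Because the queue that shows drops first depends on whether DPDK is in use, it can help to check the DPDK state before interpreting the counters. The sketch below reads the same file and field that the sample script later in this section reads (/opt/vc/etc/dpdk-override.json, "enabled"). The function name dpdk_state is hypothetical, and treating a non-zero "enabled" value as "DPDK enabled" and a missing file as "unknown" are assumptions, not documented behavior.

```python
import json
import os

# Path taken from the sample script in this section.
DPDK_OVERRIDE = "/opt/vc/etc/dpdk-override.json"


def dpdk_state(path=DPDK_OVERRIDE):
    """Return True or False from the file's "enabled" field.

    Returns None when the file or the field is missing, because the
    default in that case is not stated here (assumption: unknown).
    """
    if not os.path.exists(path):
        return None
    with open(path) as json_file:
        data = json.load(json_file)
    if "enabled" not in data:
        return None
    return bool(data["enabled"])


if __name__ == "__main__":
    labels = {
        True: "DPDK enabled",
        False: "DPDK disabled",
        None: "unknown (no override file or field)",
    }
    print(labels[dpdk_state()])
```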

vc_queue_vcmp_tx – This is packet transmission for VCMP packets. The queue handles transmission of packets on interfaces, queueing for potential retransmission if needed, and freeing of packets. The drops in the queue indicate that the Gateway cannot send VCMP traffic (for example, the return packets from the internet) to Edges fast enough.

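The following is a sample Python script to check vcmp_tx drops. The script is written for Python 2 (it uses the commands module and print statements). Run it on a Gateway that shows drops in the queue, passing a warning threshold (-w) and a critical threshold (-c), to view the details of the problem.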

#!/usr/bin/env python
"""
Check the VCG VCMP handoff queue for packet drops.
"""
import os
import sys
import subprocess
import commands
import re
import json
from optparse import OptionParser

# Parse commandline options:
parser = OptionParser(usage="%prog -w <warning threshold> -c <critical threshold> [ -h ]")
parser.add_option("-w", "--warning", action="store", type="string", dest="warn_threshold", help="Count Warning threshold should be in <value>")
parser.add_option("-c", "--critical",action="store", type="string", dest="crit_threshold", help="Count Critical threshold should be in <value>")
(options, args) = parser.parse_args()

json_data= {"vcg_handoff_data": {"count":0,"drops":[]}}
vcg_handoff_file="/tmp/vcg_handoff_drop_check"
bw_threshold=768000

def find_bw_throughput():
    total_bw = 0
    value = 2
    flag = os.path.exists("/opt/vc/etc/dpdk-override.json")
    if flag is True:
       with open("/opt/vc/etc/dpdk-override.json") as json_file:
            data = json.load(json_file)
       value = data["enabled"]
    if value == 0:
      status,output = commands.getstatusoutput("ifstat  -bqTn 1 1")
      if status == 0:
         total_bw = float(output.splitlines()[-1].split()[-1]) + float(output.splitlines()[-1].split()[-2])
    return int(total_bw)

def store_vcg_hanoff_queue_qlength():
    samples_count = 5
    if not os.path.isfile(vcg_handoff_file):
       with open(vcg_handoff_file, 'w') as outfile:
           json.dump(json_data, outfile )

    # The state file always exists at this point (it is created above if missing),
    # so load the previously stored samples.
    with open(vcg_handoff_file) as vcg_handoff_data:
        handoff_data = json.load(vcg_handoff_data)

    if os.path.isfile('/opt/vc/bin/debug.py'):
       L=[]
       f=subprocess.check_output(["/opt/vc/bin/debug.py","--handoff"])
       x=[r.split() for r in f.split('\n')]
       res = list(filter(None, x))

       if handoff_data["vcg_handoff_data"]["count"] == 0:
           for item in res:
              if not item[0] == "name":
                 #handoff_data["vcg_handoff_data"]["drops"].append({item[0]:[item[13]]})
                 handoff_data["vcg_handoff_data"]["drops"].append({item[0]:["1"] * samples_count})
                 handoff_data["vcg_handoff_data"]["count"] = 1
           with open(vcg_handoff_file, 'w') as outfile:
                json.dump(handoff_data, outfile )

       if not handoff_data["vcg_handoff_data"]["count"] == 0:
        count_status = handoff_data["vcg_handoff_data"]["count"] - 1
        for item in res:
            if not item[0] == "name":
               for key in  handoff_data["vcg_handoff_data"]["drops"]:
                   field_num = len(item) - 6
                   if item[0] == key.keys()[0]:
                      key[key.keys()[0]].append(item[field_num])
                      del key[key.keys()[0]][0]
                      #key[key.keys()[0]][count_status]=item[field_num]
        handoff_data["vcg_handoff_data"]["count"] += 1
        if handoff_data["vcg_handoff_data"]["count"] == samples_count + 1:
           handoff_data["vcg_handoff_data"]["count"] = 1

        with open(vcg_handoff_file, 'w') as outfile:
           json.dump(handoff_data, outfile )

def get_status_of_drops(warn_value,crit_value):
    result = []
    diff_drops_data = []
    both_value = False
    total_bw = find_bw_throughput()
    if os.path.isfile(vcg_handoff_file):
       with open(vcg_handoff_file) as vcg_handoff_data:
           handoff_data = json.load(vcg_handoff_data)

    for drops in handoff_data["vcg_handoff_data"]["drops"]:
       for key,value in drops.items():
           if key == "vc_queue_vcmp_tx_0":
               # Compare each stored sample with the next one to get per-interval drops.
               for sample_idx in range(len(value)):
                  if sample_idx < len(value) - 1:
                     drop_value = int(value[sample_idx+1]) - int(value[sample_idx])
                     diff_drops_data.append(drop_value)
                     warn_counter = 0
                     crit_counter = 0
                     for data in diff_drops_data:
                        if int(data) >= int(crit_value) and int(total_bw) > int(bw_threshold):
                           both_value = True
                        if int(data) >= int(crit_value):
                          crit_counter +=1
                        if int(data) > int(warn_value) and  int(data) < int(crit_value):
                          warn_counter +=1
                     if both_value == True:
                        if not result:
                           result.append({key:diff_drops_data,"current_bw":total_bw,"status":"critical"})
                     if crit_counter  == 4:
                        result.append({key:diff_drops_data,"current_bw":total_bw,"status":"critical"})
                     if warn_counter == 4:
                        result.append({key:diff_drops_data,"status":"warning"})
    if not result:
       return "OK",diff_drops_data,total_bw
    else:
       return result

if __name__ == '__main__':

   crit_threshold = options.crit_threshold
   warn_threshold = options.warn_threshold
   crit = 0
   if not options.crit_threshold:
      print "CRITICAL: Missing critical threshold value."
      sys.exit(2)
   if not options.warn_threshold:
      print "CRITICAL: Missing warning threshold value."
      sys.exit(2)

   store_vcg_hanoff_queue_qlength()
   result = get_status_of_drops(warn_threshold,crit_threshold)
   result_type = isinstance(result, tuple)
   if result_type != True:
      for item in result:
         if item["status"] == "critical":
            crit = 1

   if result[0] == "OK":
      print "OK: Handoff drops with 5 samples are good and Current drop values %s and Current BW is %s Kbps" %(result[1],result[2])
      sys.exit(0)
   elif crit == 1:
      print "Critical: List of drops which are above %s packets: %s. Current_bw value is in kbps" %(crit_threshold, result)
      sys.exit(2)
   else:
      print "Warning: List of drops which are above %s packets: %s. Current_bw value is in kbps" %(warn_threshold,result)
      sys.exit(1)
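The core idea of the script is that the handoff drop counters are cumulative, so it compares consecutive samples and alerts only when the per-interval increase stays above a threshold. The following is a minimal Python 3 sketch of that idea, not a replacement for the script. The function names drop_deltas and classify are hypothetical, the sample values are made up, and the boundary handling is simplified: a delta equal to the critical threshold is treated as critical.

```python
def drop_deltas(samples):
    """Turn cumulative drop counters into per-interval drop counts."""
    values = [int(sample) for sample in samples]
    return [later - earlier for earlier, later in zip(values, values[1:])]


def classify(samples, warn, crit, bw_kbps=0, bw_threshold_kbps=768000):
    """Return (status, deltas) for one queue's stored samples.

    bw_threshold_kbps defaults to the bw_threshold value used in the script.
    """
    deltas = drop_deltas(samples)
    if not deltas:
        return "ok", deltas
    # A single interval at or above crit is critical when throughput is high.
    if bw_kbps > bw_threshold_kbps and any(d >= crit for d in deltas):
        return "critical", deltas
    # Otherwise every interval must be at or above crit (sustained drops).
    if all(d >= crit for d in deltas):
        return "critical", deltas
    # Sustained drops between the two thresholds raise a warning.
    if all(warn < d < crit for d in deltas):
        return "warning", deltas
    return "ok", deltas


# Made-up cumulative counters for five consecutive runs.
print(classify(["100", "160", "230", "300", "380"], warn=50, crit=100))
print(classify(["0", "200", "400", "700", "900"], warn=50, crit=100))
print(classify(["5", "5", "5", "5", "5"], warn=50, crit=100))
```

Requiring every interval to exceed the threshold keeps a single burst from raising an alert, which matches the focus on significant and sustained drops.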

vc_queue_vcmp_data_0 and vc_queue_vcmp_data_1 – This is the first stage of processing for VCMP data packets received over VCMP tunnels. The queue handles packet reordering and missing packets.

The drops in the queue indicate that the Gateway cannot receive traffic from Edges fast enough. This may be an indirect indication of packet loss on the Gateway, which requires substantial reordering of packets.

vc_queue_natt_0, vc_queue_natt_1, vc_queue_esp_0 and vc_queue_esp_1 – This is decryption of NATT/ESP encrypted traffic. Traffic that arrives on an encrypted tunnel goes here to set up the state needed for decryption and is then handed to a decryption processing queue.

The drops in the queue indicate that the Gateway cannot decrypt Non SD-WAN Destination traffic fast enough.

ipv4_bh – This is IPv4 data packet processing, like routing, QoS, flow and peer association, for return packets received from the Internet for NAT traffic or from the PE router for VLAN/VRF traffic.

In DPDK-enabled Gateways, capacity issues are first observed as drops in this queue. Drops in the queue indicate that the Gateway cannot receive packets fast enough.
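For alerting, the queue descriptions in this section can be collected into a small lookup table so that an alert names the likely cause. The sketch below (Python 3) is only an illustration: the names QUEUE_MEANING, base_queue_name, and explain_drop are hypothetical, and the assumption is that per-instance queues differ only by a trailing numeric suffix such as _0 or _1.

```python
import re

# What drops in each critical queue indicate, as described in this section.
QUEUE_MEANING = {
    "vc_queue_net_sch": "cannot send VCMP traffic to Edges fast enough",
    "vc_queue_link_select": "cannot send VCMP traffic to Edges fast enough",
    "vc_queue_link_sch": "cannot send VCMP traffic to Edges fast enough",
    "vc_queue_link_encrypt": "cannot encrypt traffic fast enough",
    "vc_queue_vcmp_tx": "cannot send VCMP traffic to Edges fast enough",
    "vc_queue_vcmp_data": "cannot receive traffic from Edges fast enough",
    "vc_queue_natt": "cannot decrypt Non SD-WAN Destination traffic fast enough",
    "vc_queue_esp": "cannot decrypt Non SD-WAN Destination traffic fast enough",
    "ipv4_bh": "cannot receive packets fast enough",
}


def base_queue_name(name):
    """Strip a trailing instance suffix such as _0 or _1 (assumed convention)."""
    return re.sub(r"_\d+$", "", name)


def explain_drop(name):
    """Return a one-line explanation, or None if the queue is not listed here."""
    meaning = QUEUE_MEANING.get(base_queue_name(name))
    if meaning is None:
        return None
    return "Drops in %s: the Gateway %s." % (name, meaning)


for queue in ("vc_queue_vcmp_tx_0", "vc_queue_esp_1", "ipv4_bh"):
    print(explain_drop(queue))
```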