KPI Definition Example

Network Reliability KPI

We would like to measure the reliability of our network by calculating the packet drop rate for some of our devices. The packet drop rate is one sign that a device might be overloaded or that the network is congested. We would like to monitor this KPI to keep historical measurements and to act proactively when problems with the devices become evident.
  • kpi name: packet-drop-rate
  • input metrics: packetCount, packetDropCount
  • time window: 60 seconds, tumbling window
  • filter expression: metadata['dataSource'] matches '10\\.118\\.7[23]\\.[1-9]0?'
  • calculation expression: percent(sum(packetDropCount), sum(packetCount))
  • groupBy expression: metadata['dataSource']

This KPI calculates the packet drop rate, as a percentage, for devices whose IP address matches the given regular expression. Metrics observed by the streaming system for devices that fall outside of the filter range are filtered out. Assuming the dataSource attribute is present in the metric event, the groupBy expression ensures that the packet-drop-rate KPI is calculated separately for each unique value of the attribute. All packetCount and packetDropCount metrics received during the time window are added together (the sum() operations), and the percent operation is then applied to the two sums.
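The calculation described above can be sketched in Python. This is only an illustration of the filter, groupBy, tumbling window, and percent steps, not the product's implementation. The event layout (a dict with timestamp, dataSource, name, and value keys) and the function name are hypothetical. The sketch also assumes that "matches" requires the whole dataSource value to match, and that the doubled backslashes in the filter expression are string escaping for a single backslash.

```python
import re
from collections import defaultdict

# Assumption: '\\.' in the KPI definition is an escaped '\.', and "matches"
# means the entire dataSource value must match the pattern.
DEVICE_FILTER = re.compile(r"10\.118\.7[23]\.[1-9]0?")
WINDOW_SECONDS = 60  # tumbling window: fixed, non-overlapping 60-second buckets


def packet_drop_rate(events):
    """Return {(window_start, data_source): drop percentage}.

    Each event is a hypothetical dict with the keys
    'timestamp' (seconds), 'dataSource', 'name', and 'value'.
    """
    sums = defaultdict(lambda: {"packetCount": 0.0, "packetDropCount": 0.0})
    for event in events:
        source = event.get("dataSource")
        # Filter expression: drop events from devices outside the range.
        if source is None or not DEVICE_FILTER.fullmatch(source):
            continue
        if event["name"] not in ("packetCount", "packetDropCount"):
            continue
        # Assign the event to its tumbling window and group by dataSource.
        window_start = int(event["timestamp"] // WINDOW_SECONDS) * WINDOW_SECONDS
        sums[(window_start, source)][event["name"]] += event["value"]

    result = {}
    for key, totals in sums.items():
        # percent(sum(packetDropCount), sum(packetCount)); skip empty denominators.
        if totals["packetCount"] > 0:
            result[key] = 100.0 * totals["packetDropCount"] / totals["packetCount"]
    return result
```

With this sketch, a device that reports 2000 packets and 50 drops within one window yields a value of 2.5 for that window, and each device and window pair produces its own value.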

Threshold Definition Example

Network Reliability Threshold

We would like to set a threshold for unacceptable values of our network reliability KPI packet-drop-rate. It is desired that the packet drop rate remains under 10%, since that is the understood tolerance of the devices and network. Anything above this level is unacceptable. A packet drop rate above 20% is a sign that something more critical is occurring, and anything above that value is certainly severe and undesirable. This threshold piggybacks on the packet-drop-rate KPI and extends its definition with a set of threshold ranges that capture the behavior described above.

threshold name: packet-drop-rate

Table 1. Threshold range settings:
Label      Value >=   Value <
normal     0          10
warning    10         20
critical   20         max

This threshold configuration augments the calculated KPI packet-drop-rate by adding a tag to the output event that identifies the threshold crossing condition specified in the configuration.
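As a rough illustration, the range settings in Table 1 can be read as half-open intervals (lower bound inclusive, upper bound exclusive). The following Python sketch maps a KPI value to its label under that reading. The function and variable names are hypothetical, and "max" is modeled here as positive infinity, which is an assumption.

```python
import math

# Table 1 as (label, lower bound inclusive, upper bound exclusive).
# Assumption: "max" means there is no upper bound.
PACKET_DROP_RANGES = [
    ("normal", 0, 10),
    ("warning", 10, 20),
    ("critical", 20, math.inf),
]


def threshold_label(value, ranges=PACKET_DROP_RANGES):
    """Return the label of the first range containing value, or None."""
    for label, low, high in ranges:
        if low <= value < high:
            return label
    return None
```

Under this reading, a packet drop rate of exactly 10 is tagged warning and a rate of exactly 20 is tagged critical.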

Non-windowed Threshold Definition Example

Device Availability Threshold

The VMware Telco Cloud Operations Streaming system receives device availability (heartbeat) events, which convey the availability of a certain device or system. The events are received periodically, and we would like to detect when an availability event indicates that a system is down or not available. The availability event uses a value > 0 for a successful acknowledgement from a device, and 0 when the device cannot be reached or no response was received from it. As such, we want to identify these no-availability events in real time, as soon as they are received.
  • threshold name: device-status-threshold
  • filter expression: metadata['dataSource'] matches '10\\.118\\.7[23]\\.[1-9]0?'
  • catalog: device catalog
  • catalog item: device status
Table 2. Threshold range settings:
Label         Value >=   Value <
device.down   0          1
device.up     1          max

This threshold configuration generates an event for a device that falls within the filter criteria when a received availability event indicates that the device is not responding or is offline.
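Because this threshold has no time window, each availability event can be evaluated on its own as soon as it arrives. A minimal Python sketch of that per-event check follows. The event layout (a dict with dataSource and value keys) and the function name are hypothetical, and the filter is assumed to require a match of the whole dataSource value.

```python
import re

# Assumption: same filter semantics as the KPI example (whole-value match).
DEVICE_FILTER = re.compile(r"10\.118\.7[23]\.[1-9]0?")


def device_status(event):
    """Classify one availability event using the ranges in Table 2.

    Returns 'device.down', 'device.up', or None when the event is
    filtered out or its value falls outside every range.
    """
    source = event.get("dataSource")
    if source is None or not DEVICE_FILTER.fullmatch(source):
        return None
    value = event["value"]
    if 0 <= value < 1:
        return "device.down"  # no acknowledgement from the device
    if value >= 1:
        return "device.up"    # successful acknowledgement
    return None
```

A caller would emit a threshold event whenever the function returns device.down, without waiting for other events from the same device.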

Composite KPI Example

Device Availability KPI

We have defined the device availability threshold device-status-threshold for events that report a failure to handshake or communicate with a device. However, these events are sometimes spurious and transient: the device has not actually failed, and a momentary hiccup triggered the misidentification. For this reason, we would like to create a KPI that counts the threshold crossings over a time period to determine with better accuracy whether there is a problem. If the availability event is received every 10 seconds, an event count of > 5 over a 1-minute window is probably a good indication that there is a problem with the availability of the device.
  • kpi name: device-down-status-count
  • input threshold: device-status-threshold:device.down
  • time window: 60 seconds, tumbling window
  • filter expression: metadata['dataSource'] matches '10\\.118\\.7[23]\\.[1-9]0?'
  • calculation expression: count(device-status-threshold:device.down)
  • groupBy expression: metadata['dataSource']
Table 3. Threshold range settings
Label         Value >=   Value <
device.down   5          max
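
The composite KPI can be sketched in Python as a count of device.down threshold events per device and per 60-second tumbling window, followed by the range check from Table 3. The input layout (tuples of timestamp, dataSource, and label) and the function names are hypothetical. The sketch assumes the input events have already passed the device filter, and it uses the lower bound from Table 3 (a count of 5 or more).

```python
from collections import Counter

WINDOW_SECONDS = 60   # tumbling window from the KPI definition
DOWN_COUNT_MIN = 5    # Table 3: device.down when the count is >= 5


def device_down_counts(threshold_events):
    """Count device.down events per (window_start, data_source).

    threshold_events is a hypothetical iterable of
    (timestamp_seconds, data_source, label) tuples.
    """
    counts = Counter()
    for timestamp, source, label in threshold_events:
        if label != "device.down":
            continue
        window_start = int(timestamp // WINDOW_SECONDS) * WINDOW_SECONDS
        counts[(window_start, source)] += 1
    return counts


def devices_down(threshold_events):
    """Return only the windows whose count reaches the Table 3 lower bound."""
    counts = device_down_counts(threshold_events)
    return {key: n for key, n in counts.items() if n >= DOWN_COUNT_MIN}
```

Counting over a window in this way means that a single spurious device.down event does not mark the device as down by itself.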