The system consists of hardware, a software-defined layer, and applications. Communications and decision-making delays must be monitored end-to-end and between each layer, where possible. Additionally, performance aspects of both the system and the virtual appliance software must be baselined and then continuously supervised.

System Performance – Cyclictest

It is a good practice to first test a system using a general tool that can verify the performance of a repeating, inter-layer timing mechanism. Cyclictest accurately and repeatedly measures the difference between a thread's intended wake-up time and the time at which it actually wakes up, providing statistics about the latencies in a real-time system. This latency can be introduced by the hardware, the firmware, or the operating system (hypervisor).

Figure 1. Network Architectures

This tool is unlikely to be offered as part of a commercial application, but it can still be installed in a separate VM for testing. It is highly recommended to run this test for at least 24 hours; good results show a maximum latency under 120 µs, and values over this amount must be investigated.
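As a minimal sketch (the cyclictest options and summary format shown are typical of rt-tests builds and should be verified against the version installed in the test VM), the test can be launched and its worst-case latency compared against the 120 µs ceiling:

```python
#!/usr/bin/env python3
"""Run cyclictest for an extended period and flag worst-case wake-up latency."""
import re
import subprocess

MAX_ALLOWED_US = 120          # investigate anything above this ceiling
DURATION = "24h"              # run for at least 24 hours

# -m: lock memory, -p: RT priority, -i: interval in us, -q: print only the summary
cmd = ["cyclictest", "-m", "-p", "80", "-i", "1000", "-D", DURATION, "-q"]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)

# Summary lines typically look like:
# "T: 0 ( 1234) P:80 I:1000 C: 100000 Min: 2 Act: 5 Avg: 4 Max: 57"
max_latencies = [int(m) for m in re.findall(r"Max:\s*(\d+)", result.stdout)]
worst = max(max_latencies) if max_latencies else None

print(f"Worst-case wake-up latency: {worst} us")
if worst is None or worst > MAX_ALLOWED_US:
    print("FAIL: latency exceeds the recommended 120 us ceiling -- investigate.")
```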

System Performance – End-to-end and Single App

Decision-making delays within vendor-provided functions (such as overcurrent protection) or custom end-user logic (combinations of timers and latches) must initially be verified for specific architectures. The overall timing of the network between the I/O (MU) and the application or workload is expected to be comparable to that of a legacy equivalent device. The legacy equivalent in this case is the digital relay architecture shown in Network Architectures, which might not be available to benchmark for some end users (those who have not yet invested in digitalization).

Alternatively, a traditional relay architecture can be tested; typical results are indicated in the following table (excluding any intentionally added delays).

Table 1. Example Protection Elements

ANSI # / Prot. Element | Description | Decision Time (typical, cycles) | Decision Time (ms)
21 | Distance | 1.0 | 16.67
24 | Volts/hertz element | 1.0 | 16.67
27 | Under voltage | 0.5 | 8.33
46 | Negative sequence | 1.0 | 16.67
50/51 | Overcurrent | 0.2 | 3.33
59 | Overvoltage | 0.5 | 8.33
60 | Loss of potential | 0.5 | 8.33
81 | Under/over frequency | 1.0 | 16.67
87 | Differential | 1.5 | 25.00

These values are only examples used for reference and do not include the duration of actions required external to the devices (such as output contact or HV circuit breaker operation). The timing varies based on the manufacturer and can also vary based on the triggering (for example, fault criteria) or environmental conditions.
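The millisecond column in Table 1 follows directly from a 60 Hz nominal frequency (one cycle = 1/60 s ≈ 16.67 ms); a quick conversion check:

```python
# One power system cycle at 60 Hz nominal frequency, in milliseconds.
CYCLE_MS = 1000 / 60          # ~16.67 ms

def cycles_to_ms(cycles: float) -> float:
    """Convert a decision time expressed in cycles to milliseconds."""
    return cycles * CYCLE_MS

# Values from Table 1, e.g. overcurrent (0.2 cycles) and differential (1.5 cycles).
print(round(cycles_to_ms(0.2), 2))   # 3.33
print(round(cycles_to_ms(1.5), 2))   # 25.0
```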

Within a controlled environment, however, when like-for-like applications are tested, the differences in test conditions across these technologies must be minimal.

An example test plan consists of choosing protection elements or custom logic groups to exercise (for example, through playback of COMTRADE files, which are recordings of actual system events), determining the application inputs required for activation, and executing highly accelerated, repeatable tests under controlled states using test equipment located at the Equipment Interface.

The desired report is the aggregated round-trip timing from each signal execution to receipt of the expected return signal, measured at the Equipment Interface. As a simpler starting point, more rudimentary results can be drawn from a GOOSE ping-pong test. This test can be implemented with the equipment listed in Test Signal Generation, with the application custom programmed to return a GOOSE signal as soon as one is received, without any intentional delay.
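A minimal sketch of the aggregation step, assuming the test set has already exported paired send/receive timestamps for each trial to a hypothetical CSV file (the capture itself is performed by the test equipment at the Equipment Interface):

```python
#!/usr/bin/env python3
"""Aggregate GOOSE ping-pong round-trip results.

Assumes a hypothetical export 'goose_pingpong.csv' with one row per trial
and columns 'sent_s' and 'received_s' (epoch seconds).
"""
import csv
import statistics

round_trips_us = []
with open("goose_pingpong.csv", newline="") as f:
    for row in csv.DictReader(f):
        rt = (float(row["received_s"]) - float(row["sent_s"])) * 1e6
        round_trips_us.append(rt)

print(f"trials : {len(round_trips_us)}")
print(f"min    : {min(round_trips_us):.1f} us")
print(f"mean   : {statistics.mean(round_trips_us):.1f} us")
print(f"max    : {max(round_trips_us):.1f} us")
# Compare the maximum against the IEC 61850-5 transfer time class (TT#)
# applicable to the tested function.
```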

Overall, the virtual relay architecture highlights the requirement to keep network latency to a consistent minimum. Standard requirements for timing and latency, for both critical and non-critical applications, are outlined within IEC 61850-5 (transfer time classes, TT#).

System Performance – End-to-end, App-to-App

The path demonstrated above is between an external device and a single virtual appliance. Latency-sensitive traffic may also need to be passed from one virtual appliance to another. The recommended network path is through the physical switch.

Figure 2. VM to VM Process Bus Communication

Latency testing for this architecture can be accomplished similarly to the GOOSE ping-pong test. However, instead of using external testing hardware or software, the vPR appliances themselves can be used to measure the latency, either as a one-way trip (for example, vPR 1 to vPR 2) or as a round trip sent and received at the same end.

With both devices synchronized through PTP, they can record high-accuracy, time-stamped events for comparison. This testing can be repeated consistently for extended durations to discover the maximum latency.

Alternatively, an identical path can be tested using custom-built Linux VMs (with real-time kernel builds) to host network analysis software (such as iperf). While the aggregate results for bandwidth limits, latency, jitter, and packet loss may be easier to collect, the traffic is not representative of actual power system signals (IEC 61850). However, this approach can still offer a simpler method to troubleshoot the proposed network path.
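A minimal sketch of driving such a test from the client VM, assuming iperf3 with a server already listening on the peer VM (the JSON field names below are typical of iperf3 output and should be verified against the installed version):

```python
#!/usr/bin/env python3
"""Drive an iperf3 UDP test between two Linux VMs and summarize the results."""
import json
import subprocess

PEER = "192.0.2.10"   # placeholder address of the second VM (running `iperf3 -s`)

# -u: UDP, -b: offered bandwidth, -t: duration in seconds, --json: parseable output
cmd = ["iperf3", "-c", PEER, "-u", "-b", "10M", "-t", "60", "--json"]
report = json.loads(subprocess.run(cmd, capture_output=True, text=True, check=True).stdout)

summary = report["end"]["sum"]
print(f"throughput : {summary['bits_per_second'] / 1e6:.2f} Mbps")
print(f"jitter     : {summary['jitter_ms']:.3f} ms")
print(f"packet loss: {summary['lost_packets']}/{summary['packets']} "
      f"({summary['lost_percent']:.2f} %)")
```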

Either method described here for application-to-application traffic is also valid for testing a future path through the virtual distributed switch, which is recommended for the vPAC Ready Infrastructure reference architecture once it can guarantee steady-state, real-time performance.

Stress tooling adjacent to network performance is discussed in the Scalability section.

System Performance – In-Appliance

An appliance incorporating IEC 61850 subscriptions has built-in capabilities for monitoring. These capabilities can be leveraged to detect poor signal quality or packet loss. Supervision of the validity of the time source is available in all consuming end devices, with some providing more extensive statistics regarding clock quality, offset from master, and path delay.

Of MMS, GOOSE, and SV traffic, SV traffic is the most sensitive to packet loss because of its high sampling rate and bandwidth usage. Therefore, a mechanism is built into the standard messaging packet in the form of a sample counter. This counter increments at the publishing rate (for example, 4.8 kHz) and resets every second.
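A minimal sketch of this counter-based supervision (how the counter values are extracted from received SV frames is device- and library-specific, so only the gap detection on an already-decoded sequence is shown):

```python
"""Detect lost SV packets from the sample counter carried in each message."""

SAMPLE_RATE = 4800   # publishing rate; the counter wraps back to 0 every second

def count_lost(sample_counts):
    """Return the number of samples missing from a stream of counter values."""
    lost = 0
    prev = None
    for cnt in sample_counts:
        if prev is not None:
            expected = (prev + 1) % SAMPLE_RATE
            # Number of increments skipped, accounting for the once-per-second wrap.
            lost += (cnt - expected) % SAMPLE_RATE
        prev = cnt
    return lost

# Example: one sample (4601) missing from the received sequence.
print(count_lost([4599, 4600, 4602, 4603]))   # -> 1
```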

OEMs of critical applications (such as protection relaying) include an alarm that activates when a set number of consecutive packets are lost. This alarm must be continuously monitored to discover network problems that can be related to traffic congestion, device failure, or cabling problems.

GOOSE subscriptions must include a quality bit, which provides an indication of the reliability of the information being transmitted. This quality bit can be verified at an interval less than or equal to the maximum message transmission time (configured in the IEC 61850 configuration). The maximum time is the heartbeat interval at which the message is continuously sent until a change in state occurs, after which it is multicast at a minimum interval (burst), repeating at increasing intervals until settling again at the maximum. Therefore, an alarm can be set to monitor the GOOSE message quality.
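A minimal sketch of such an alarm, assuming the subscribing application exposes the last arrival time and decoded quality flag per subscription (the attribute names and the 4 s heartbeat value are illustrative, not taken from any particular product):

```python
"""Supervise GOOSE subscriptions for stale heartbeats or bad quality."""
import time

MAX_TRANSMISSION_TIME_S = 4.0   # heartbeat interval from the IEC 61850 configuration
MARGIN = 1.5                    # tolerate modest delivery delay before alarming

def check_subscription(last_arrival_s: float, quality_good: bool) -> list[str]:
    """Return the alarms raised for one GOOSE subscription."""
    alarms = []
    if time.time() - last_arrival_s > MAX_TRANSMISSION_TIME_S * MARGIN:
        alarms.append("GOOSE heartbeat lost (no message within max transmission time)")
    if not quality_good:
        alarms.append("GOOSE quality flagged invalid by publisher")
    return alarms

# Example: a subscription last heard from 10 s ago, with good quality.
print(check_subscription(time.time() - 10, quality_good=True))
```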

MMS is a client-server (unicast) protocol operating with less critical latency requirements than SV or GOOSE. However, when used, it is monitored within an application for successful or failed client connections, aborted or rejected associations, failed requests, or failed reads or writes. Typically, the aspects of the protocol that are in use by the appliance can be monitored within internal or external event recorders and leveraged to generate additional system alarms.

Supervisory Monitoring

There are multiple levels requiring monitoring within the vPAC Ready Infrastructure. At the physical level, status points, warnings, alarms, and other indications from MUs or other I/O translating devices, network switches, and satellite clocks are typically available through several common modes:

  • Hardwired contacts.

  • Persistent client and server communication protocols (for example, IEC 61850 MMS, DNP3, SNMP).

  • Monitoring and logging (for example, Syslog).

Different vendors offer different capabilities. However, it is advantageous to use fail-safe and common methods across these physical components, the virtual components, and their associated host.

Beginning with a host server’s physical components, many virtual environment conditions are monitored by default through the preconfigured vSphere alarms.

Table 2. Preconfigured vSphere Alarms Summary

Alarm Group | Description
Host | Monitors the power state, network connection, CPUs, memory, fan status, voltage, temperature, system boards, battery, other objects, storage connectivity, capacity of IPMI log, BMC connection, errors and warnings, status.
Datastore | Monitors disk usage, disk capacity, vSAN licensing, disk errors, thin provisioning threshold limits, changes in capabilities for volumes, API integration, flash resources, DRS recommendations, cluster space, visibility of datastore to multiple datacenters.
GPU | ECC memory status, thermal conditions.
Virtual Network | Distributed vSwitch VLAN trunk status, MTU status/support, teaming status, changes in network adapters, vSwitch connectivity, redundancy status, degradation, and incorrect configuration.
VMs | CPU usage, memory usage, CPU ready time, total disk latency, number of disk commands cancelled, bus resets, VM errors/warnings, snapshot consolidation needed, inability to migrate or relocate, orphaned state.
Redundancy Mechanisms | Fault tolerance (FT) - starting secondary VM timeout, no available compatible host to run secondary VM, VM state change, change in secondary VM lockstep; High availability (HA) - insufficient cluster resources for failover, failover in progress, status of primary HA agent, host health status, VM failover failure, VM HA restart/reset, HA VM cannot be powered off or restarted for component protection.
Certificate Management | Host certificate status, failure, changes, pre-expiration warning, update failure.
Licensing | Licensing inventory, monitoring of user-defined thresholds, monitoring capacity limits, compatibility, errors.
Services Health Monitoring | Control Agent, Identity service, vSphere Client, ESX Agent Manager, Message Bus Config, License service, Inventory service, vCenter Server, Database, Data service, vService Manager, Performance Charts, Content Library, Transfer service, Dump Collector, vAPI Endpoint service, System and Hardware Manager.

This information can be accessed from the vSphere Client, under Monitor, at all levels (vCenter, datacenter, individual host, or specific VM), along with real-time dashboards that display many types of metrics.

Figure 3. Snippet of VM performance overview in vSphere Client GUI

The amount of information can be overwhelming; therefore, a recommended starting point for hardware monitoring is:

  • Power supply voltage (for each supply)

  • System battery health

  • Fan status (for each fan)

  • Temperatures

  • Hard drive health (for each drive)

  • Datastore limits (for each space)

  • PCI card health (for each slot)

  • Licensing status

And for software monitoring within the guest OS:

  • CPU usage

  • Memory usage

  • Storage/disk usage

  • Communications dropped packets

  • VMware tools status (if applicable)

These alarms can be used for automation within vSphere to act on a host or VM and to send notifications. Notifications can be sent through email or SNMP traps, which can provide the initial indication of a problem.

Troubleshooting can then be augmented with Syslog data. Both SNMP and Syslog data can be forwarded to an external server for aggregation. Commercial software and hardware options are available that can act as protocol converters, allowing the data to be translated into an accepted, existing protocol (such as DNP3) for use by a utility.
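As a minimal illustration of the aggregation step, a bare UDP syslog collector is sketched below; a production deployment would normally rely on a hardened commercial syslog/SNMP aggregation product rather than a script like this, and it assumes the hosts and network devices are configured to forward syslog to this machine:

```python
#!/usr/bin/env python3
"""Minimal syslog (UDP/514) collector for aggregating host and switch logs in a lab."""
import socket

LISTEN_ADDR = ("0.0.0.0", 514)   # syslog over UDP; binding to 514 requires privileges

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(LISTEN_ADDR)

with open("aggregated_syslog.log", "a") as log:
    while True:
        data, (src_ip, _) = sock.recvfrom(8192)
        line = data.decode(errors="replace").rstrip()
        # Prefix each record with the source address so messages stay attributable.
        log.write(f"{src_ip} {line}\n")
        log.flush()
```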

Scalability

The various layers of the system must be easily scaled up and down in size. These layers include the external networked devices, the pool of compute resources (which can be clustered together), and the functionality that is built into each application.

Networking

Beginning with the external network, the physical devices supporting the system have different scaling mechanisms. These include:

  • Time Synchronization / PTP – Scaling through participation of network switches as transparent or boundary clocks to deliver timing to multiple ordinary clocks (end devices).

  • Signal Translation / Digitalization – Scaling through adding physical sensors, merging units, and other I/O devices. Higher efficiency can be obtained based on implementation and protocols chosen (for details, see Considerations for Signal Translation Devices).

  • Physical Network Switches – Scaling can occur through adding physical ports (which can be accomplished with additional switches) or by increasing existing channel bandwidth (for example, upgrading link capabilities from lower levels, such as 100 Mbps – 1 Gbps, to higher levels of 10, 25, 40 Gbps and more). Compatibility issues might be encountered as bandwidth capabilities are mixed.

The physical network interfaces with the virtual network at the server NICs. Depending on the architecture, this can be a bottleneck for network traffic, especially as the ratio of physical to virtual ports decreases (see Physical to Virtual Network Connections in vSphere, for example). In the vPAC Ready Infrastructure reference architecture, PCI passthrough is used, which establishes a 1:1 relationship between a port and VM.

Figure 4. Physical to Virtual Network Connections in vSphere

A process bus consists of multicast traffic (such as SV, GOOSE, and PTP). This means packets must be duplicated by a switch (physical or virtual, where used) for delivery to all subscribers. Therefore, to minimize network loading, best practices of traffic shaping with managed switching must be applied in the form of at least VLANs or Quality of Service (QoS) settings, to prevent unnecessary processing within the switch.

The network ends at the virtual application. Here the network limitations are based on the virtual interface bandwidth, the processing or compute available to it, and its built-in capabilities. Typically, IEC 61850 capable devices are provided with a Protocol Implementation eXtra Information for Testing (PIXIT) document. This document states limitations and expected test results from the original equipment manufacturer, such as GOOSE subscription or publication maximums and situational behavior with irregular signals. Practical boundaries can be set from this information, in terms of how far to scale up or scale out in testing.

The scalability of the system is highly dependent on SV requirements. The typical process bus traffic parameters (estimated) in the following table provide the reasoning, as there is an order-of-magnitude difference in bandwidth consumed between SV and the adjacent message types. The bandwidth used by GOOSE is variable, based on the frequency of state changes in the binary or analog values being conveyed, but it is still only a minor fraction of the SV bandwidth, and traffic bursts can be accounted for by providing channel margin.

Table 3. Typical process bus traffic parameters (estimated)

Parameters | PTP | Sampled Values (61850-9-2LE) | Sampled Values (61869-9) | GOOSE
Sampling Rate (Hz) | 4 | 4800 | 4800 | 40
Message Size (bytes) | 120 | 150 | 220 | 235
Bandwidth (Mbps) | 0.004 | 5.760 | 4.224 | 0.075

Notes:

  • PTP – Messages per second include sync message, announcement, peer delay request, and peer delay response.

  • Sampled Values (61850-9-2LE) – Includes Ethernet frame overheads (Ethernet header, SVID, AppID); contains only one ASDU with the prescribed 4 currents and 4 voltages.

  • Sampled Values (61869-9) – Frame rate is 1/2 of the sampling rate (2x Application Specific Data Units, or ASDUs). Also permits mixed signal content (currents and voltages).

  • GOOSE – Example message rate provided (8x messages with 10 total binary values changing state 1x/s).

Looking at the reference architecture in VVS Reference Architecture, scaling exercises can be proposed to verify that all components are sized and configured properly to ensure adequate network construction. It is unrealistic to assume that any given physical network port can pass the maximum number of SV streams for two of the associated vPR workloads; the example calculations (61850-9-2LE SV packet x 60) are shown in Example SV Scaling Exercise.

Therefore, the NIC ports between the physical switches and the server NICs (NIC 1 and 2) are recommended to be 1 Gbps. Direct connections between the merging units and the switches can still be 100 Mbps, if required.

Table 4. Example SV Scaling Exercise

Example VM (ABB SSC600 SW) maximum number of SV streams supported | 30
Number of VMs | 2
Bandwidth utilized (Mbps) | 345.6
% channel loading allowed | 60%
Minimum common data transfer rate | 1 Gbps
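The arithmetic behind Table 3 and Table 4 can be checked directly (per-stream bandwidth from the estimated frame sizes, then the 60-stream aggregate against the 60 % loading limit):

```python
# Per-stream SV bandwidth from the estimated frame sizes in Table 3.
le_stream_mbps = 4800 * 150 * 8 / 1e6          # 61850-9-2LE: 5.76 Mbps
t9_stream_mbps = (4800 / 2) * 220 * 8 / 1e6    # 61869-9 (2 ASDUs per frame): 4.224 Mbps

# Table 4 exercise: 2 VMs x 30 9-2LE streams each, 60 % channel loading allowed.
streams = 2 * 30
aggregate_mbps = streams * le_stream_mbps       # 345.6 Mbps
required_channel_mbps = aggregate_mbps / 0.60   # 576 Mbps of link capacity needed

print(le_stream_mbps, t9_stream_mbps)           # 5.76 4.224
print(aggregate_mbps, required_channel_mbps)    # 345.6 576.0
# 576 Mbps exceeds a 100 Mbps link but fits within a 1 Gbps port,
# hence the 1 Gbps recommendation for the switch-to-server connections.
```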

Another practical approach to network bandwidth testing is to determine the largest possible, ultimate substation deployment, allow additional headroom for percentage growth, and test the architecture against these requirements.

With either method, simulating the large number of SV streams required dictates the use of specialized equipment or a large number of production merging units. The in-appliance monitoring of SV receipt must be verified with the largest number of streams expected, for an extended period (multiple weeks). No errors are expected in SV or the adjacent traffic (GOOSE, PTP) for subscribers. Externally, it is recommended that the traffic be monitored to measure packet sizes for SV messages and other traffic types, as well as the total process bus throughput experienced.

Compute

Scaling of the compute environment is possible when the underlying capabilities of individual server components are not maximized during the initial build. For example, memory, storage, and interfaces can be added by upgrading or augmenting RAM modules, hard drives, or PCIe cards for various functions (networking or communications, hardware storage redundancy, specialized interfaces). Avoid maximizing a host at the time of initial installation.

For testing compute scalability, it might be prudent to discover how components can be scaled up or scaled out within a production environment scenario. In many cases, the server must be powered off before components are replaced or additional equipment is introduced. Both active-active and active-passive topologies allow workloads to fail over safely during this maintenance.

The limiting factor for component scaling might be the 1:1 or 2:1 relationship established between a vPR application and the NIC through network PCI passthrough (based on the reference architecture). In this case, a PRP NIC can only support one VM (two ports, one PRP DAN) or two VMs (four ports, two PRP DANs), and there must be enough PCIe slots to accommodate the required NICs. A method for supporting a 2:1 relationship is discussed in the Redundancy and Failover Mechanism section.

Application

The scaling of virtual appliances is discussed in the Physical Scale section. Given the behavior of the commercial real-time vPR appliances available at present (which is to consume the maximum amount of required resources to accomplish their stated capabilities), the maximum number of supportable VMs is simple to calculate.

If the requirements for vPR and vAC workloads exceed the capacity of a single server, additional servers can be clustered within the ‘A’ and ‘B’ server groups. This is a simple task within vCSA (as indicated at the start of the Configuring vSAN section). It is recommended that an active-active topology be implemented for latency-sensitive, critical applications (vPR), that is, operating them in parallel on separate hardware with no common points of failure.

Active-standby mechanisms can be used for remaining applications (vAC), where High Availability can instantiate a copy of a failed VM on capable hardware.

Testing the scaling capabilities of an environment to host the maximum required number of applications can be relatively simple. For vPR, VM workloads can be installed up to the maximum allowed quantity (vSphere prevents going beyond compute capabilities) and, again, these VMs consistently consume the maximum amount of required resources.

For vAC, simulated workloads can be used (a simple Linux OS VM) with a variable workload generator test tool (Linux stress or stress-ng), which allows configurable amounts of CPU, memory, I/O, and storage resources to be utilized, as sketched below.
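A minimal sketch of driving such a simulated vAC load from inside a Linux guest (the stress-ng option names shown are typical but should be verified against the installed version):

```python
#!/usr/bin/env python3
"""Generate a configurable synthetic load inside a guest OS using stress-ng."""
import subprocess

profile = {
    "cpu_workers": 2,       # busy-loop CPU workers
    "vm_workers": 1,        # memory workers
    "vm_bytes": "1G",       # memory touched per vm worker
    "io_workers": 1,        # sync/IO workers
    "duration": "10m",      # length of the stress run
}

cmd = [
    "stress-ng",
    "--cpu", str(profile["cpu_workers"]),
    "--vm", str(profile["vm_workers"]),
    "--vm-bytes", profile["vm_bytes"],
    "--io", str(profile["io_workers"]),
    "--timeout", profile["duration"],
    "--metrics-brief",      # print a short per-stressor summary at the end
]
subprocess.run(cmd, check=True)
```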

As new commercial vPR applications are made available, they might have different resource-consumption characteristics from those described in this guide. These can be in the form of VMs or modern applications, and they might not have the pre-programmed, persistent resource allocation of the ABB SSC600 SW. Therefore, scaling with new applications requires retesting and considerations beyond those mentioned in this section. As described in the Physical Scale section, overcommitment can be considered for vAC applications, depending upon organizational philosophies.