This section provides an in-depth overview of Dynamic Multipath Optimization (DMPO) as used by the VMware SD-WAN service.

Overview

VMware SD-WAN™ is a solution that lets enterprise and service providers use multiple WAN transports at the same time. This way, they can increase bandwidth and ensure application performance. The solution works for both on-premise and cloud applications (SaaS/IaaS). It uses a Cloud-Delivered architecture that builds an overlay network with multiple tunnels. It monitors and adapts to the changes in the WAN transports in real time. Dynamic Multipath Optimization (DMPO) is a technology that VMware SD-WAN has developed to make the overlay network more resilient. It considers the real time performance of the WAN links. This document explains the key features and benefits of DMPO.

The following diagram depicts a typical SD-WAN deployment with Multi Cloud Non SDWAN Destinations.

Key Functionalities

DMPO is a technology that VMware SD-WAN uses for data traffic processing and forwarding. It works between the VMware SD-WAN Edge and VMware SD-WAN Gateway devices. These devices are the DMPO endpoints.
  • For enterprise locations (Branch to Branch or Branch to Hub), the Edges create DMPO tunnels with each other.
  • For cloud applications, each Edge creates DMPO tunnels with one or more Gateways.
DMPO has three key features that are discussed below.

Continuous Monitoring

Automated Bandwidth Discovery: Once the WAN link is detected by the VMware SD-WAN Edge, it first establishes DMPO tunnels with one or more VMware SD-WAN Gateways and runs bandwidth test with the closest Gateway. The bandwidth test is performed by sending short bursts of bi-directional traffic and measuring the received rate at each end. Since the Gateway is deployed at the Internet PoPs, it can also identify the real public IP address of the WAN link in case the Edge interface is behind a NAT or PAT device. A similar process applies for a private link. For the Edges acting as the Hub or headend, the WAN bandwidth is statically defined. However, when the branch Edge establishes a DMPO tunnel with the Hub Edges, they follow the same bandwidth test procedures similar to how it is done between an Edge and a Gateway on the public link.

Continuous Path Monitoring: Dynamic Multipath Optimization (DMPO) performs continuous, uni-directional measurements of performance metrics - loss, latency and jitter of every packet, on every tunnel between any two DMPO endpoints, Edge or Gateway. VMware SD-WAN’s per-packet steering allows independent decisions in both uplink and downlink directions without introducing any asymmetric routing. DMPO uses both passive and active monitoring approaches. While there is user-traffic, DMPO tunnel header contains additional performance metrics, including sequence number and timestamp. This enables the DMPO endpoints to identify lost and out-of-order packets, and calculate jitter and latency in each direction. The DMPO endpoints communicate the performance metrics of the path between each other every 100 ms.

While there is no user traffic, an active probe is sent every 100 ms and, after 5 minutes of no high priority user-traffic, the probe frequency is reduced to 500 ms. This comprehensive measurement enables the DMPO to react very quickly to the change in the underlying WAN condition, resulting in the ability to deliver sub-second protection against sudden drops in bandwidth capacity and outages in the WAN.

MPLS Class of Service (CoS): For a private link that has CoS agreement, DMPO can be configured to take CoS into account for both monitoring and application steering decisions.

Dynamic Application Steering

Application-aware Per-packet Steering: Dynamic Multipath Optimization (DMPO) identifies traffic using layer 2 to 7 attributes, e.g., VLAN, IP address, protocol, and applications. VMware SD-WAN performs application-aware per-packet steering based on business policy configuration and real-time link conditions. The business policy contains out of the box Smart Defaults that specifies the default steering behavior and priority of more than 2500 applications: Customers can immediately use dynamic packet steering and application-aware prioritization without having to define any policy.

Throughout its lifetime, any traffic flow is steered onto one or more DMPO tunnels, in the middle of the communication, with no impact to the flow. A link that is completely down is referred to as having an outage condition. A link that is unable to deliver SLA for a given application is referred to as having a brownout condition. VMware SD-WAN offers sub-second outage and sudden drops in bandwidth capacity protection. With the continuous monitoring of all the WAN links, DMPO detects sudden loss of SLA or outage condition within 300-500 ms and immediately steers traffic flow to protect the application performance, while ensuring no impact to the active flow and user experience. There is one minute hold time from the time that the brownout or outage condition on the link is cleared before DMPO steers the traffic flow back onto the preferred link if specified in the business policy.

Intelligent learning enables application steering based on the first packet of the application by caching classification results. This is necessary for application-based redirection, e.g., redirect Netflix onto the branch Internet link, bypassing the DMPO tunnel, while backhauling Office 365 to the enterprise regional hub or data center.

Example: Smart Defaults specifies that Microsoft Skype for Business is High Priority and is Real-Time application. Assuming there are 2 links with latency of 50 ms and 60ms respectively. Assume all other SLAs are equal or met. DMPO will chose the link the better latency, i.e. link with 50ms latency. If the current link to which the Skype for Business traffic is steered experiences high latency of 200 ms, within less than a second the packets for the Skype for Business flow is steered on to another link which has better latency of 60 ms.

Bandwidth Aggregation for Single Flow: For the type of applications that can benefit from more bandwidth, e.g. file transfer, DMPO performs per-packet load balancing, utilizing all available links to deliver all packets of a single flow to the destination. DMPO takes into account the real time WAN performance and decides which paths should be use for sending the packets of the flow. It also performs resequencing at the receiving end to ensure there is no out-of-order packets introduced as a result of per-packet load balancing.

Example: Two 50 Mbps links deliver 100Mbps of aggregated capacity for a single traffic flow. QoS is applied at both the aggregate and individual link level.

On-demand Remediation

Error and Jitter Correction: In a scenario where it may not be possible to steer the traffic flow onto the better link, e.g., single link deployment, or multiple links having issues at the same time, Dynamic Multipath Optimization (DMPO) can enable error corrections for the duration the WAN links have issues. The type of error corrections used depends on the type of applications and the type of errors.

Real-time applications such as voice and video flows can benefit from Forward Error Correction (FEC) when there is packet loss. DMPO automatically enables FEC on single or multiple links. When there are multiple links, DMPO will select up to two of the best links at any given time for FEC. Duplicated packets are discarded and out-of-order packets are re-ordered at the receiving end before delivering to the final destination.

DMPO enables jitter buffer for the real-time applications when the WAN links experience jitter. TCP applications such as file transfer benefit from Negative Acknowledgement (NACK). Upon the detection of a missing packet, the receiving DMPO endpoint informs the sending DMPO endpoint to retransmit the missing packet. Doing so protects the end applications from detecting packet loss and, as a result, maximizes TCP window and delivers high TCP throughput even during lossy condition.

When the packet loss surpasses a specific threshold, it prompts the initiation of Adaptive Forward Error Correction (FEC) through packet duplication. The error-correction applied is based on the traffic class:

  • Transactional/Bulk traffic: In this case, we apply a NACK based retransmit algorithm, which is done at the VCMP protocol level where we attempt to correct the error condition before handing over the packet to the application.
  • Realtime traffic: In this case, we apply adaptive FEC to replicate packets (activate/deactivate upon loss SLA violation) and/or jitter buffer correction (upon jitter SLA violation – this one can only be activated and will persist for the life of the flow).

The link SLA (loss, latency, jitter) is continually being monitored and measured on a periodic basis and FEC (packet duplication) will be activated upon threshold violation for real-time traffic (different values for voice vs. video applications).

In a single WAN link scenario, duplicate packets are transmitted on the same link adjacent to one another. Since packet drops due to congestion are random, it is statistically unlikely that two adjacent packets will be dropped, greatly increasing the likelihood that one of the packets will make it through to the destination. The replicated packets are sent on separate links in the case of two or more WAN links.

Adaptive FEC is triggered on a per-flow basis in real-time based on measured packet loss thresholds, and disabled in real-time once packet loss no longer exceeds the activation threshold. This ensures that available bandwidth is used as efficiently as possible, unnecessary packet duplication is avoided, and resource overhead is reduced. Another significant benefit of VMware’s Adaptive FEC approach is that the effect of packet loss in the transport network on end-user devices is minimized or eliminated.When end-user devices do not see packet drops, they avoid retransmissions and TCP congestion avoidance mechanisms like slow start, which can negatively impact overall throughput, application performance, and end-user experience.

DMPO Real World Results

Scenario 1: Branch-to-branch VoIP call on a single link. The results in the below figure demonstrate the benefits of on-demand remediation using FEC and jitter remediation on a single Internet link with traditional WAN and VMware SD-WAN. A mean opinion score (MOS) of less than 3.5 is unacceptable quality for a voice or video call.

Scenario 2: TCP Performance with and without VMware SD-WAN for Single and Multiple Links. These results show how NACK enables per-packet load balancing.

Scenario 3: Hybrid WAN scenario with an outage on the MPLS link and both jitter and loss on the Internet (Comcast) link. These results show how DMPO protects applications from sub-second outages by steering them to Internet links and enabling on-demand remediation on the Internet link.

Business Policy Framework and Smart Defaults

The business policy lets the IT administrator control QoS, steering, and services for the application traffic. Smart Defaults provides a ready-made business policy that supports over 2500 applications. DMPO makes steering decisions based on the type of application, real time link condition (congestion, latency, jitter, and packet loss), and the business policy. Here is an example of a business policy.

Each application has a category. Each category has a default action, which is a combination of Business Priority, Network Service, Link Steering, and Service Class. You can also define custom applications.

Each application has a Service Class: Real Time, Transactional, or Bulk. The Service Class determines how DMPO handles the application traffic. You cannot change the Service Class for the default applications, but you can specify it for your own custom applications.

Each application also has a Business Priority: High, Normal, or Low. The Business Priority determines how DMPO prioritizes and applies QoS to the application traffic. You can change the Business Priority for any application.

There There are three types of Network Services: Direct, MultiPath, and Internet Backhaul. By default, an application is assigned one of the default Network Services, which can be modified by the customers.

  • Direct: This action is typically used for non-critical, trusted Internet applications that should be sent directly, bypassing DMPO tunnel. An example is Netflix. Netflix is considered a non-business, high-bandwidth application and should not be sent over the DMPO tunnels. The traffic sent directly can be load balanced at the flow level. By default, all the low priority applications are given the Direct action for Network Service.
  • MultiPath: This action is typically given for important applications. By inserting the Multipath service, the Internet-based traffic is sent to the VMware SD-WAN Gateway. The table below shows the default link steering and on-demand remediation technique for a given Service Class. By default, high and normal priority applications are given the Multipath action for Network Service.
  • Internet Backhaul: This action redirects the Internet applications to an enterprise location that may or may not have the VMware SD-WAN Edge. The typical use case is to force important Internet applications through a site that has security devices such as firewall, IPS, and content filtering before the traffic is allowed to exit to the Internet.

Link Steering Abstraction With Transport Group

Across different branch and hub locations, there may be different models of the VMware SD-WAN Edge with different WAN interfaces and carriers. In order to enforce the centralized link steering policy using Profile, it is important that the interfaces and carries are abstracted. Transport Group provides the abstraction of the actual interfaces of the devices and carriers used at various locations. The business policy at the Profile level can be applied to the Transport Group instead, while the business policy at the individual Edge level can be applied to Transport Group, WAN Link (carrier), and Interfaces.

Link Steering by Transport Group

Different locations may have different WAN transports, e.g. WAN carrier name, WAN interface name, DMPO uses the concept of transport group to abstract the underlying WAN carriers or interfaces from the business policy configuration. The business policy configuration can specify the transport group (public wired, public wireless, private wired, etc.) in the steering policy so that the same business policy configuration can be applied across different device types or locations, which may have completely different WAN carriers and WAN interfaces, etc. When the DMPO performs the WAN link discovery, it also assigns the transport group to the WAN link. This is the most desirable option for specifying the links in the business policy because it eliminates the need for IT administrators to know the physical connectivity or WAN carrier.

Link Steering by Interface

The link steering policy can be applied to the interface, e.g. GE2, GE3, which will be different depending on the Edge model and the location. This is the least desirable option to use in the business policy because IT administrators have to be fully aware of how the Edge is connected to be able to specify which interface to use.

Link Steering and On-demand Remediation

There are four possible options for Link Steering – Auto, Preferred, Mandatory, and Available.

Link Selection: Mandatory– Pin the traffic to the link or the transport group. The traffic is never steered away regardless of the condition of the link including outage. On-demand remediation is triggered to mitigate brownout condition such as packet loss and jitter.

Example: Netflix is a low priority application and is required to stay on the public wired links at all times.

Link Selection: Preferred– Select the link to be marked as "preferred". Depending on the type of WAN links available on the Edge, there are three possible scenarios:

  • Where the preferred Internet link has multiple public WAN link alternatives: Application traffic stays on the preferred link as long as it meets SLA for that application, and steers to other public links once the preferred link cannot deliver the SLA needed by the application. In the situation that there is no link to steer to, meaning all public links fail to deliver the SLA needed by the application, on-demand remediation is enabled. Alternatively, instead of steering the application away as soon as the current link cannot deliver the SLA needed by the application, DMPO can enable the on-demand remediation until the degradation is too severe to be remediated, then DMPO will steer the application to the better link.
    • Example: Prefer the video collaboration application on the Internet link until it fails to deliver the SLA needed by video, then steer to a public link that meets this application's SLA.
  • Where the preferred Internet link has multiple public WAN link and private WAN link alternatives: Application traffic stays on the preferred link as long as it meets SLA for that application, and steers to another public link once the preferred link cannot deliver the SLA needed by the application. The preferred link will NOT steer to a private link in the event of an SLA failure, and would only steer to that private link in the event both the preferred link and another public link were both either unstable or down completely. In the situation that there is no link to steer to, meaning another public links failed to deliver the SLA needed by the application, on-demand remediation is enabled. Alternatively, instead of steering the application away as soon as the current link cannot deliver the SLA needed by the application, DMPO can enable the on-demand remediation until the degradation is too severe to be remediated, then DMPO will steer the application to a better link.
    • Example A: Prefer the video collaboration application on the Internet link until it fails to deliver the SLA needed by video, then steer to a public link that meets this application's SLA.
    • Example B: Prefer the video collaboration application on the Internet link until it goes unstable or drops completely, other public links are also unstable or have also dropped completely, then steer to an available private link.
  • Where the preferred Internet link has only private WAN link alternatives: Application traffic stays on the preferred link regardless of the SLA status for that application, and will not steer to another private links even if the preferred link cannot deliver the SLA needed by the application. In place of steering to the private links on an SLA failure for that application, on-demand remediation is enabled. The preferred link would steer to the private link(s) would only steer to another private link(s) in the event that the preferred link was either unstable or down completely.
    • Example: Prefer the video collaboration application on the Internet link until the link goes unstable or drops completely, and then steer to an available private link.
Note: The default manner in which a private link is treated with reference to a preferred link (in other words, that a preferred link will only steer to a private link if the preferred link is unstable or offline) is configurable through a setting on the Orchestrator UI.

Link Selection: Available– This option picks the available link as long as it is up. DMPO enables on-demand remediation if the link fails to meet the SLA. DMPO will not steer the application flows to another link unless the link is down.

Example: Web traffic is backhauled over the Internet link to the hub site using the Internet link as long as it is active, regardless of SLA.

Link Selection: Auto– This is the default option for all applications. DMPO automatically picks the best links based on the type of application and enables on-demand remediation when needed. There are four possible combinations of Link steering and On-demand Remediation for Internet applications. Traffic within the enterprise (VPN) always goes through the DMPO tunnels, so it always gets the benefits of on-demand remediation.

The below examples explain the default DMPO behavior for different type of applications and link conditions. Please see the appendix section for the default SLA for different application types.

Example: Real-Time applications.

  1. Scenario: There is one link that meets the SLA for the application.

    Expected DMPO behavior: It picks the best available link.

  2. Scenario: There is one link with packet loss above the SLA for the application.

    Expected DMPO behavior: It enables FEC for the real-time applications on this link.

  3. Scenario: There are two links with loss on only one link.

    Expected DMPO behavior: It enables FEC on both links.

  4. Scenario: There are multiple links with loss on multiple links.

    Expected DMPO behavior: It enables FEC on the two best links.

  5. Scenario: There are two links but one link is unstable, i.e. it misses three consecutive heartbeats.

    Expected DMPO behavior: It marks the link as unusable and steers the flow to the next best available link.

  6. Scenario: There are two links with both jitter and loss.

    Expected DMPO behavior: It enables FEC and jitter buffer on both links. Jitter buffer is enabled when jitter is more than 7 ms for voice and more than 5 ms for video. The sending DMPO endpoint tells the receiving DMPO endpoint to enable jitter buffer. The receiving DMPO endpoint buffers up to 10 packets or 200 ms of traffic, whichever is first. It uses the original timestamp in the DMPO header to calculate the flow rate for de-jitter buffer. If the flow is not constant, it disables jitter buffering.

Example: Transactional and bulk applications.Enables NACK if packet loss exceeds the threshold that is acceptable per application type (see the appendix for this value).

Secure Traffic Transmission

DMPO encrypts both the payload and the tunnel header with IPsec transport mode end-to-end for private or internal traffic. The payload contains the user traffic. DMPO supports AES128 and AES256 for encryption. It uses the PKI and IKEv2 protocols for IPsec key management and authentication.

Protocols and Ports Used

DMPO uses the following ports:

  • UDP/2426 – UDP/2426: This port is for overlay tunnel management and information exchange between the two DMPO endpoints (Edges and Gateways). It is also for data traffic that is already secured or not important, such as SFDC traffic from branch to the cloud between Edge and Gateway. SFDC traffic is encrypted with TLS.
  • UDP/500 and UDP/4500 – These ports are for IKEv2 negotiation and for IPSec NAT transparency.
  • IP/50 – This protocol is for IPSec over native IP protocol 50 (ESP) when there is no NAT between the two DMPO endpoints.

Appendix: QoE threshold and Application SLA

DMPO uses the SLA threshold below for different types of applications. It will immediately take action to steer the affected application flows or perform on-demand remediation when the WAN link condition exceeds one or more thresholds. Packet loss is calculated by dividing the number of lost packets by the total packets in the last 1-minute interval. The DMPO endpoints communicate the number of lost packets every second. The QoE report also reflects this threshold.

DMPO will also take action immediately when it loses communications (no user data or probes) within 300 ms.

Note: Beginning in Release 5.2.0, users have the capability to modify the threshold values for latency for video, voice, and transactional traffic types through a Customizable QoE feature. This means that customers can include high latency links as part of the selection process and the Orchestrator applies the new values to the QoE monitoring page.