The Carbon Black Cloud Data Forwarder is a distributed, horizontally scalable service for processing large, highly variable volumes of streaming data.

The Data Forwarder is built with both performance and cost management as key architectural goals, as is common for any massively distributed data processing engine in a commercial setting.

One of the key challenges in processing streaming data at scale and at reasonable cost is that several goals are fundamentally in tension:

  • Ensuring no data is lost
  • Ensuring all data is processed
  • Ensuring all data is processed in a reasonable time
  • Minimizing record duplication

As is typical for multi-tenant data processing, this requires horizontal scale, i.e., many data processing nodes working in parallel. When two or more nodes in a data processing system read from the same queue, logic is needed to arbitrate which node is responsible for processing each record (or group of records) in that queue.
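
To make this concrete, the following is a minimal sketch of one common arbitration approach: each node claims a queue partition by taking an expiring lease on it, so that only one live node owns a partition at a time. The lease table, partition names, and node IDs are illustrative and not the Data Forwarder's actual implementation.

    // lease_sketch.go: an illustrative sketch of lease-based arbitration
    // between parallel nodes. All names here are hypothetical.
    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    type lease struct {
        owner   string
        expires time.Time
    }

    type leaseTable struct {
        mu     sync.Mutex
        leases map[string]lease // partition ID -> current lease
    }

    // claim grants the partition to nodeID if it is unowned or its lease
    // has expired; otherwise the current owner keeps it.
    func (t *leaseTable) claim(partition, nodeID string, ttl time.Duration) bool {
        t.mu.Lock()
        defer t.mu.Unlock()
        cur, ok := t.leases[partition]
        if ok && time.Now().Before(cur.expires) && cur.owner != nodeID {
            return false // another live node already owns this partition
        }
        t.leases[partition] = lease{owner: nodeID, expires: time.Now().Add(ttl)}
        return true
    }

    func main() {
        t := &leaseTable{leases: make(map[string]lease)}
        fmt.Println("node-A claims p0:", t.claim("p0", "node-A", time.Minute)) // true
        fmt.Println("node-B claims p0:", t.claim("p0", "node-B", time.Minute)) // false: node-A holds the lease
    }

In a real deployment the lease state lives in a shared store rather than in one process's memory, but the arbitration rule is the same: a partition has at most one live owner at any moment.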

Further, when processing high-volume data, it is generally more efficient and cost-effective to assign data records in batches, rather than assigning and processing one record at a time. The assignee then commits a checkpoint after it has completed work on its batch, indicating to the system as a whole that the records in that batch have been successfully processed.
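
A sketch of that pattern, again with hypothetical names rather than the Data Forwarder's real components, might look like this: the node forwards every record in its batch and only then returns an offset to checkpoint; if any record fails, no checkpoint is committed and the batch remains eligible for reassignment.

    // checkpoint_sketch.go: an illustrative sketch of whole-batch processing
    // followed by a single checkpoint commit. The queue, forwarder, and
    // checkpoint store are simplified stand-ins.
    package main

    import "fmt"

    type record struct {
        offset int
        body   string
    }

    // processBatch forwards every record in the batch and returns the offset
    // to checkpoint. If any record fails, no checkpoint is returned and the
    // whole batch stays eligible for reassignment.
    func processBatch(batch []record, forward func(record) error) (int, error) {
        if len(batch) == 0 {
            return 0, fmt.Errorf("empty batch")
        }
        for _, r := range batch {
            if err := forward(r); err != nil {
                return 0, fmt.Errorf("batch incomplete at offset %d: %w", r.offset, err)
            }
        }
        return batch[len(batch)-1].offset, nil
    }

    func main() {
        batch := []record{{offset: 100, body: "event-a"}, {offset: 101, body: "event-b"}}
        checkpoint, err := processBatch(batch, func(r record) error {
            fmt.Println("forwarded", r.body)
            return nil // a real forwarder would ship r to its destination here
        })
        if err != nil {
            fmt.Println("not checkpointing:", err)
            return
        }
        fmt.Println("committing checkpoint at offset", checkpoint) // marks the batch done
    }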

Handling Failure Modes

Failures happen in all computing, and measures must be put in place to recover from them. One failure mode is particularly relevant here: what happens when a node that has been assigned a batch of data, but has not yet committed a checkpoint for that batch, dies before finishing and checkpointing?

Because the exact state of completion cannot be known without a committed checkpoint to affirmatively verify it, such systems must assume that at least some of the batch's data has not been processed, effectively leaving the system no choice but to assign the entire batch to another node for processing.
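
The consequence can be sketched as follows: because the last committed checkpoint is the only durable progress marker, the replacement node is handed everything after that checkpoint, regardless of how far the failed node actually got. The offsets and helper below are hypothetical.

    // reassign_sketch.go: an illustrative sketch of recovery after a node dies
    // mid-batch. The only durable progress marker is the last committed
    // checkpoint, so the replacement node re-reads everything after it,
    // including records the failed node may already have forwarded.
    package main

    import "fmt"

    // nextAssignment returns the offset range a replacement node should process:
    // everything after the last committed checkpoint, up to the end of the batch
    // the failed node had claimed.
    func nextAssignment(lastCheckpoint, batchEnd int) (from, to int) {
        return lastCheckpoint + 1, batchEnd
    }

    func main() {
        // Suppose the failed node held offsets 101-200 but forwarded only
        // 101-150 before dying; it never committed a checkpoint for the batch.
        lastCheckpoint := 100
        from, to := nextAssignment(lastCheckpoint, 200)
        fmt.Printf("reassigning offsets %d-%d to another node\n", from, to)
        // Offsets 101-150 will be forwarded a second time: at-least-once delivery.
    }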

In the case of the Carbon Black Cloud Data Forwarder, this means it is possible for some, but not all, events or alerts in a batch to have been successfully forwarded before the system reassigns that batch of records to another node. When that happens, the already-forwarded events are sent again, together with the events that had not previously been forwarded.

In rare circumstances this can happen more than once: several data-processing nodes in succession may fail to complete and checkpoint the task of forwarding a particular batch of events, so the same data can be re-processed multiple times.
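
For downstream consumers that cannot tolerate duplicates, a common mitigation is to deduplicate on a unique record identifier. The sketch below assumes each forwarded record carries such an identifier (the names used are hypothetical) and simply drops IDs that have already been seen.

    // dedup_sketch.go: an illustrative sketch of consumer-side deduplication
    // under at-least-once delivery. Assumes each record carries a unique ID;
    // the field and function names are hypothetical.
    package main

    import "fmt"

    type seenSet map[string]struct{}

    // dedupe returns only the records whose IDs have not been seen before,
    // recording new IDs as it goes.
    func dedupe(seen seenSet, ids []string) []string {
        var fresh []string
        for _, id := range ids {
            if _, dup := seen[id]; dup {
                continue // duplicate delivery of an already-processed record
            }
            seen[id] = struct{}{}
            fresh = append(fresh, id)
        }
        return fresh
    }

    func main() {
        seen := make(seenSet)
        fmt.Println(dedupe(seen, []string{"evt-1", "evt-2"})) // [evt-1 evt-2]
        fmt.Println(dedupe(seen, []string{"evt-2", "evt-3"})) // [evt-3]: evt-2 was a duplicate
    }

A production consumer would bound the set of seen identifiers, for example with a time-to-live, rather than growing it indefinitely.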

We observe that the Carbon Black Cloud Data Forwarder typically duplicates no more than 1% of all events, and no more than 0.1% of all alerts. These frequencies can and do vary (up or down) by customer, by time of day, and by the activity level of an individual customer's endpoints.