When a process launches on an endpoint, the command line for that process is sent to the Carbon Black EDR server.

If the server stored the whole command line as a single item and allowed open-ended queries against it, query performance would be poor enough to make search unusable. Instead, the server breaks each command line into smaller component “tokens”, which are stored and matched when you enter a command line query.

Tokenization requires that decisions be made about which components of a command become their own token and which components are treated as delimiters between tokens. These decisions involve trade-offs since the same character may be used in different ways in a command. This topic describes how tokenization is done for Carbon Black Hosted EDR instances and Carbon Black EDR 6.3.0 servers (and later). If you are upgrading, see also Tokenization Changes on Server Upgrade.

Tokenization Rules

With enhanced tokenization, the following characters are converted to white space, and so removed, before the command line is tokenized.

Characters Removed Before Tokenization

\ " ' ( ) [ ] { } , = < > & | ;

Several frequently used characters are intentionally not removed before tokenization. These include:

  • Percent ( % ) and dollar ( $ ), often used for variables
  • Dash ( - ), period ( . ), and underscore ( _ ), often found as parts of file names
  • These additional characters: ^ @ # ! ?
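The removal step can be sketched in a few lines of Python. This is only an illustration of the rule stated above, not the server's implementation: the helper name strip_delimiters is hypothetical, and the two quote characters in the list are assumed to be the plain ASCII double and single quotes.

```python
# Characters the topic lists as converted to white space before tokenization.
# Assumption: the two quote characters are the ASCII " and ' characters.
REMOVED = '\\"\'()[]{},=<>&|;'

# Translation table that maps every listed character to a single space.
_TABLE = str.maketrans({ch: " " for ch in REMOVED})


def strip_delimiters(cmdline: str) -> str:
    """Replace each listed character with a space (hypothetical helper)."""
    return cmdline.translate(_TABLE)


# Characters such as % . - _ are left alone, so %PATH% and cmd.exe survive intact.
print(strip_delimiters('cmd.exe /c "echo %PATH% && dir"').split())
```

Splitting the result on white space then gives the raw words that the slash, colon, and file extension rules operate on.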

Parsing Forward Slashes

The forward slash ( / ) character is handled differently depending upon its position. If it is the first character of the entire command line, it is assumed to be part of a path. If it is at the start of any other token in the command line, it is assumed to introduce a command line switch.

There is one situation in which this parsing rule may not produce the results you want. It is not efficient for the command line parser to distinguish between a command line switch and a Unix-style absolute path. Therefore, Linux and macOS absolute paths passed on the command line are tokenized as if the beginning of the path were a command line switch. For example, a command line of /bin/ls /tmp/somefile produces the tokens bin, ls, /tmp, and somefile, incorrectly treating /tmp as a command line switch.
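The following Python sketch mimics the forward slash rule so you can see why /tmp ends up as a switch-style token. It is an approximation, not the server's parser. The function name split_slashes is hypothetical, and treating slashes inside a word as token separators is an assumption inferred from the /bin/ls example above.

```python
def split_slashes(cmdline: str) -> list[str]:
    """Approximate the forward slash rule (hypothetical helper)."""
    tokens = []
    for index, word in enumerate(cmdline.split()):
        # A leading slash on any word other than the first marks a switch.
        is_switch = index > 0 and word.startswith("/")
        # Assumption: every other slash acts as a separator between tokens.
        parts = [part for part in word.split("/") if part]
        if is_switch and parts:
            parts[0] = "/" + parts[0]
        tokens.extend(parts)
    return tokens


print(split_slashes("/bin/ls /tmp/somefile"))
print(split_slashes("rundll32.exe /d srrstr.dll"))
```

The first call returns bin, ls, /tmp, and somefile, which matches the tokens listed above. The second call keeps /d as a single switch token.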

Parsing Colons

The colon ( : ) character is handled differently depending upon its position and whether it is repeated. If a single colon is at the end of a token, it is assumed to be something you would want to search for, such as a drive letter, so it is kept as part of the token. If there are multiple colons at the end of a token, or if the colons are not at the end of a token, they are converted to white space for tokenization purposes.
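A short Python sketch of the colon rule, applied to one white-space-delimited word at a time, may make the cases easier to compare. The helper name split_colons is hypothetical, and the behavior for inputs the topic does not describe (for example, a word with both an interior and a trailing colon) is not covered here.

```python
def split_colons(word: str) -> list[str]:
    """Approximate the colon rule for a single word (hypothetical helper)."""
    # Keep the colon only when exactly one colon ends the word, as in "c:".
    keep = len(word) > 1 and word.endswith(":") and not word.endswith("::")
    body = word[:-1] if keep else word
    # Every other colon is converted to white space, which splits the word.
    parts = body.replace(":", " ").split()
    if keep and parts:
        parts[-1] += ":"
    return parts


print(split_colons("c:"))           # single trailing colon is kept
print(split_colons("host:8080"))    # interior colon becomes a separator
print(split_colons("label::"))      # repeated trailing colons are dropped
```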

File Extension Tokens

File extension tokens allow searching for either just the file extension or the entire command or file name. For example, “word.exe” in a command line becomes two tokens: “.exe” and “word.exe”.
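As a rough Python illustration, the extra token can be derived from the text after the last period. The helper name with_extension is hypothetical, and the choice to skip tokens that begin or end with a period is an assumption, since the topic does not say how those are treated.

```python
def with_extension(token: str) -> list[str]:
    """Return the token plus a separate extension token, if it has one."""
    stem, dot, ext = token.rpartition(".")
    # Assumption: an extension needs text on both sides of the last period.
    if dot and stem and ext:
        return [token, "." + ext]
    return [token]


print(with_extension("word.exe"))
print(with_extension("windows"))
```

With this sketch, “word.exe” yields both “word.exe” and “.exe”, while “windows” yields only itself.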

Wildcards

The "?" and "*" characters are supported as wildcards in any non-leading position in a query. Within a token, "?" matches any single character and "*" matches multiple variable characters.

Note: Do not use wildcards as leading characters in a search.
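To show how a wildcard term relates to a stored token, the following Python sketch translates a term into a regular expression. It is not how the server evaluates queries. The function name wildcard_match is hypothetical, "*" is assumed to match zero or more characters, and the leading-wildcard restriction is modeled by raising an error.

```python
import re


def wildcard_match(pattern: str, token: str) -> bool:
    """Test one token against a wildcard term (hypothetical helper)."""
    if pattern[:1] in ("?", "*"):
        # Mirrors the note above: wildcards must not lead the term.
        raise ValueError("wildcards are not supported as leading characters")
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    # The whole token must match, not just a prefix of it.
    return re.fullmatch(regex, token) is not None


print(wildcard_match("execute*", "executescheduledsppc"))
print(wildcard_match("rundll3?.exe", "rundll32.exe"))
print(wildcard_match("rundll3?.exe", "rundll322.exe"))
```

The first two calls match. The third does not, because "?" stands for exactly one character.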

Tokenization Changes on Server Upgrade

This section is relevant to users upgrading from a pre-6.3.0 version of Carbon Black EDR. If 6.3.0 is your first version of Carbon Black EDR or if you are using a Carbon Black Hosted EDR instance, you do not need to review this section.

Beginning with version 6.1.0, Carbon Black EDR included a tokenization option that improved command-line searches. This option is standard for Carbon Black Hosted EDR instances, and beginning with version 6.3.0, it is also standard for Carbon Black EDR installations. It adds the following specific improvements, which are described in more detail in Tokenization Rules:

  • More special characters are removed before tokenization.
  • Forward slash “/” is interpreted as a command line switch or a path character depending upon position.
  • Colon “:” is interpreted as part of a drive letter token or converted to white space depending upon position and repetition.
  • File extensions are stored as a separate token as well as part of a file or path name.
  • Wildcards are supported in non-leading positions within a query.

These changes result in simpler queries, better and faster search results, and reduced storage requirements for tokenized command lines.

Note: If you upgraded from a pre-6.3.0 Carbon Black EDR release and configured Watchlists that use command line queries, these queries might need to be rewritten to take advantage of the new tokenization. Review your Watchlist entries to make sure they return the intended results.

Example: Enhanced vs. Legacy Tokenization

The following example shows how the enhanced tokenization in Carbon Black EDR version 6.3.0 differs from the previous version. It can help you convert some older queries to the new standard:

"C:\Windows\system32\rundll32.exe" /d srrstr.dll,ExecuteScheduledSPPC

Using legacy tokenization, the command was broken into the following tokens:

"c:

windows

system32

rundll32.exe"

d

srrstr.dll,executescheduledsppc

The enhanced tokenization in Carbon Black EDR version 6.3.0 breaks the same command into the following tokens:

c:

windows

system32

rundll32.exe

.exe

/d

srrstr.dll

.dll

executescheduledsppc
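The enhanced token list above can be reproduced with a small Python sketch that chains the documented rules together. This is an approximation for experimenting with your own command lines, not the server's tokenizer. The function names are hypothetical, and three behaviors are assumptions inferred from the examples in this topic: tokens are lowercased, slashes inside a word act as separators, and the extension is the text after the last period.

```python
# Characters converted to white space (ASCII quotes assumed).
REMOVED = '\\"\'()[]{},=<>&|;'
TABLE = str.maketrans({ch: " " for ch in REMOVED})


def split_colons(word):
    """Keep one trailing colon; treat all other colons as white space."""
    keep = len(word) > 1 and word.endswith(":") and not word.endswith("::")
    parts = (word[:-1] if keep else word).replace(":", " ").split()
    if keep and parts:
        parts[-1] += ":"
    return parts


def tokenize(cmdline):
    """Approximate enhanced tokenization (hypothetical helper)."""
    tokens = []
    words = cmdline.lower().translate(TABLE).split()
    for index, word in enumerate(words):
        # A leading slash on any word but the first marks a switch.
        is_switch = index > 0 and word.startswith("/")
        parts = [part for part in word.split("/") if part]
        if is_switch and parts:
            parts[0] = "/" + parts[0]
        for part in parts:
            for token in split_colons(part):
                tokens.append(token)
                # Add a separate file extension token when one is present.
                stem, dot, ext = token.rpartition(".")
                if dot and stem and ext:
                    tokens.append("." + ext)
    return tokens


example = r'"C:\Windows\system32\rundll32.exe" /d srrstr.dll,ExecuteScheduledSPPC'
for token in tokenize(example):
    print(token)
```

Run against the example command, the sketch prints the same nine tokens in the same order as the list above.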

Examples of new search capabilities due to this tokenization include:

  • You can search for .exe or .dll as part of the command line query.
  • Because of more complex parsing of the forward slash, you can explicitly search for a ‘/d’ command line argument and not worry about false positives from just searching for the letter ‘d’.
  • You can use a wildcard and search for ‘execute*’ if you want to find a specific term passed to the command line.
  • You do not have to include extraneous single or double quote marks to find a drive letter or command path.

Retention Maximization and cmdline Searches

On the Edit Group page for a sensor group, you can specify Retention Maximization options that help control the information that is recorded on the server to manage bandwidth and processing costs.

See Advanced Settings.

As part of this feature, the process cmdline field for a parent process also stores the cmdlines of its child processes (childprocs) that are affected by a retention setting. This is done because these childprocs do not have process documents of their own to store this information, so the expanded parent cmdline provides a way to search the cmdlines of processes that are no longer recorded separately.

A side effect of including the cmdlines of these childprocs in the parent’s cmdline information is that a cmdline search intended to match only the parent process’s cmdline also matches against the children. This can result in the parent process being falsely tagged as a feed hit because the search matched a childproc that was not judged interesting enough to justify creating a complete process document. Keep this in mind when choosing Retention Maximization settings.
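The effect can be illustrated with a toy Python model. It is not the server's data model: the function name searchable_terms, the sample command lines, and the use of a plain white-space split in place of real tokenization are all assumptions made for the illustration.

```python
def searchable_terms(parent_cmdline, suppressed_child_cmdlines):
    """Collect the terms searchable on a parent process (toy model)."""
    terms = set(parent_cmdline.lower().split())
    # Childprocs without their own process document contribute their
    # cmdlines to the parent's searchable cmdline.
    for child in suppressed_child_cmdlines:
        terms.update(child.lower().split())
    return terms


parent = "cmd.exe /c build.bat"        # hypothetical parent command line
children = ["whoami.exe /all"]         # hypothetical suppressed childproc

own_terms = set(parent.lower().split())
expanded_terms = searchable_terms(parent, children)

print("whoami.exe" in own_terms)       # parent's own cmdline does not contain it
print("whoami.exe" in expanded_terms)  # expanded cmdline matches anyway
```

In this model, a search for whoami.exe matches the parent even though the term only ever appeared on the child's command line.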