This topic provides answers to frequently asked questions about tokenization.

Question 1

If you have the following filemod:

c:\users\myusername\appdata\local\temp\{1f73cc2c-c826-414e-8d07-457bed7d2ad2} - oprocsessid.dat

where the GUID portion changes but oprocsessid.dat stays the same, how can you search for filemod paths that end with oprocsessid.dat when the variable GUID is part of the .dat filename itself?

Answer: Platform Search has no special handling of GUIDs in any field other than regmod_name. Because the search index tokenizes the filename only as a whole (in this example, the filename is {1f73cc2c-c826-414e-8d07-457bed7d2ad2} - oprocsessid.dat), a search on filemod_name:oprocsessid.dat fails.

However, a wildcard in place of the GUID works. Although a wildcard at the start of the queried value is not ideal, a query such as filemod_name:appdata/local/temp/*-oprocsessid.dat can help you focus on any filemods whose filename ends with oprocsessid.dat.
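The difference between the failing and the working search can be illustrated with a toy model. The sketch below is not Platform Search itself: it assumes the whole filename is one lowercase token and uses Python's fnmatch pattern syntax as a stand-in for the wildcard. The helper name token_matches is hypothetical.

```python
from fnmatch import fnmatchcase

def token_matches(filename_token: str, pattern: str) -> bool:
    """Toy model: a query term must match the ENTIRE filename token.

    Assumption: the index stores the whole filename as one lowercase token.
    The '*' wildcard here follows Python's fnmatch rules, not the product's.
    """
    return fnmatchcase(filename_token.lower(), pattern.lower())

filename = "{1f73cc2c-c826-414e-8d07-457bed7d2ad2} - oprocsessid.dat"

# A bare term is compared against the whole token, so it does not match.
exact = token_matches(filename, "oprocsessid.dat")

# A wildcard absorbs the variable GUID prefix, so the fixed suffix matches.
wildcard = token_matches(filename, "*oprocsessid.dat")
```

In this model the exact term fails because the token also contains the GUID, while the wildcard pattern succeeds regardless of which GUID is present.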

Question 2

Regarding command line tokenization, why does the following Platform Search query return no results?

Search:
fileless_scriptload_cmdline:net.webclient

Expected results:

"iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1')); choco upgrade -y python2 visualstudio2017-workload-vctools; Read-Host 'Type ENTER to exit'"

Answer: Platform Search does not convert a period (.) character to whitespace, so net.webclient is not a token on its own. You can search for the whole string, a wildcard version, or for tokens that end with .xxxx or .yyyy.xxxx. The index creates those extra tokens on the assumption that they could be file extensions or double file extensions.

In the preceding example, you could search for system.net.webclient or .net.webclient or .webclient or any of those tokens with wildcards in them.

In general, the cmdline fields (process_cmdline, childproc_cmdline, parent_cmdline, and fileless_scriptload_cmdline) tokenize on spaces and the characters \ ( ) [ ] { } ; " ' < > & | , =

If any of those characters are in the command line, the search backend converts them to spaces. API response data still returns the command line with its original characters. A search that spans one of these characters becomes a phrase search.

For example, if you are interested in searching for cmd /c "echo LINE1 > bad.vbs&&echo LINE2 >> bad.vbs", the tokens you can search for in this command line are:

cmd /c
echo line1 bad.vbs .vbs
echo line2 bad.vbs .vbs

You can also combine these tokens in double quotes to query on phrases such as process_cmdline:"cmd /c".

If you include any of the delimiter characters listed above in a query (properly escaped if necessary), they are treated as whitespace.
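The tokenization rules described in this answer can be approximated in a few lines. This is a minimal sketch based only on the behavior described above, not the actual backend implementation. The function name tokenize_cmdline is hypothetical, and lowercasing is an assumption inferred from the lowercase tokens in the example.

```python
# Characters the FAQ lists as delimiters for cmdline fields.
CMDLINE_DELIMITERS = "\\()[]{};\"'<>&|,="

def tokenize_cmdline(cmdline: str) -> list[str]:
    """Approximate cmdline tokenization as described in this FAQ."""
    # Each delimiter becomes a space; periods are left untouched.
    table = str.maketrans(CMDLINE_DELIMITERS, " " * len(CMDLINE_DELIMITERS))
    words = cmdline.lower().translate(table).split()

    tokens = []
    for word in words:
        tokens.append(word)
        parts = word.split(".")
        # Extra ".xxxx" token (possible file extension).
        if len(parts) >= 2 and parts[-1]:
            suffix = "." + parts[-1]
            if suffix != word:
                tokens.append(suffix)
        # Extra ".yyyy.xxxx" token (possible double file extension).
        if len(parts) >= 3 and parts[-2] and parts[-1]:
            double = "." + parts[-2] + "." + parts[-1]
            if double != word:
                tokens.append(double)
    return tokens

tokens = tokenize_cmdline('cmd /c "echo LINE1 > bad.vbs&&echo LINE2 >> bad.vbs"')
# tokens == ['cmd', '/c', 'echo', 'line1', 'bad.vbs', '.vbs',
#            'echo', 'line2', 'bad.vbs', '.vbs']
```

Applied to the PowerShell example earlier in this answer, the same sketch yields system.net.webclient, .webclient, and .net.webclient, but never net.webclient, which is why that search returns nothing.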

Question 3

What is the maximum length of a token I can search on?

Answer: When a field's string data contains more than 32K characters, you can search only the first 32K characters of that field.

For example, this search works for any subset of the first 32K characters in a process_cmdline:
process_name:powershell.exe AND process_cmdline:WwBCAHkAdAB*

Question 4

How can I search for substrings in a tokenized text field like watchlist_name?

Answer: Fields like watchlist_name, event_description, device_os_version, and many binary headers like process_publisher are tokenized into individual words. For example, a watchlist name of "Carbon Black Endpoint Visibility Take Action" has tokens for "Carbon", "Black", "Endpoint", "Visibility", "Take", and "Action". To find results that match the watchlist named "Carbon Black Endpoint Visibility Take Action", you can either wildcard an individual token or search for a phrase, but you cannot combine the two:

Works
watchlist_name:Carbon*
Works
watchlist_name:Carbon\ Black
Works
watchlist_name:"Carbon Black"
Does not work
watchlist_name:Carbon\ Black*
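The following sketch models why a wildcard and a phrase cannot be combined. It is an illustration under stated assumptions, not the product's query engine: text is split on whitespace, matching is treated as case-insensitive, and the helper names (tokens, matches_wildcard, matches_phrase) are hypothetical.

```python
from fnmatch import fnmatchcase

def tokens(text: str) -> list[str]:
    # Assumption: word tokens are whitespace-separated and case-insensitive.
    return text.lower().split()

def matches_wildcard(name: str, pattern: str) -> bool:
    """A wildcard term is evaluated against one token at a time."""
    return any(fnmatchcase(tok, pattern.lower()) for tok in tokens(name))

def matches_phrase(name: str, phrase: str) -> bool:
    """A phrase needs consecutive tokens that match exactly (no wildcards)."""
    words, toks = tokens(phrase), tokens(name)
    return any(
        toks[i:i + len(words)] == words
        for i in range(len(toks) - len(words) + 1)
    )

name = "Carbon Black Endpoint Visibility Take Action"

wildcard_ok = matches_wildcard(name, "Carbon*")      # single token + wildcard
phrase_ok = matches_phrase(name, "Carbon Black")     # exact consecutive tokens
combined_ok = matches_phrase(name, "Carbon Black*")  # '*' is taken literally
```

In this model the combined query fails because the phrase comparison looks for a literal token "black*", which does not exist in the name.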

Question 5

What can I do with regex that is compatible with Platform Search tokenization?

Answer: You can only search for a single token using a regular expression. The token must be in lowercase without special characters.

Question 6

How can I use trailing whitespace in queries?

Answer: Do not use a trailing space (\\ or \\ \\) at the end of a field query or filter. Instead, use a trailing wildcard or use tokenization to create a query or filter by field name. For example, to match c:\windows\system32\cacls.exe:

Works
c\\:\\\\windows\\\\system32\\\\*
Does not work
c\\:\\\\windows\\\\system32

See also Searching cmdline Fields using Wildcards.