This topic provides answers to frequently asked questions about tokenization.
Question 1
If you have the following filemod:
c:\users\myusername\appdata\local\temp\{1f73cc2c-c826-414e-8d07-457bed7d2ad2} - oprocsessid.dat
where the GUID portion changes but oprocsessid.dat stays the same, how can you search for filemod paths that end with oprocsessid.dat, given that the variable GUID is part of the .dat filename?
Answer: Platform Search has no special handling of GUIDs in any field other than regmod_name. Because the search index tokenizes the filename only as a whole (in this example, the filename is {1f73cc2c-c826-414e-8d07-457bed7d2ad2} - oprocsessid.dat), a search on filemod_name:oprocsessid.dat fails.
However, a wildcard in place of the GUID works. Although not ideal at the start of the queried value, a wildcard used like this, filemod_name:appdata/local/temp/*-oprocsessid.dat, can help you focus on any filemods whose filename ends with oprocsessid.dat.
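The difference between an exact-term search and a wildcard search on a whole-filename token can be illustrated with a small sketch. It assumes the filename is indexed as a single token, as described above, and uses Python's fnmatch module only as a stand-in for wildcard matching; the search backend's actual matching rules may differ, and both helper names are hypothetical.

```python
from fnmatch import fnmatchcase

# Assumption: the whole filename is indexed as one token.
filename_token = "{1f73cc2c-c826-414e-8d07-457bed7d2ad2} - oprocsessid.dat"


def exact_token_match(token: str, term: str) -> bool:
    """An exact-term search succeeds only if the term equals the whole token."""
    return token == term


def glob_token_match(token: str, pattern: str) -> bool:
    """Stand-in for a wildcard search: '*' matches any run of characters."""
    return fnmatchcase(token, pattern)


# The bare filename is not a token on its own, so the exact search misses.
print(exact_token_match(filename_token, "oprocsessid.dat"))
# A wildcard covering the variable GUID portion matches the token.
print(glob_token_match(filename_token, "*oprocsessid.dat"))
```

The first call prints False and the second prints True, which mirrors why the plain query fails while the wildcard query can succeed.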
Question 2
Regarding command line tokenization, why does the following Platform Search not provide any search results?
fileless_scriptload_cmdline:net.webclient
Expected results:
"iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1')); choco upgrade -y python2 visualstudio2017-workload-vctools; Read-Host 'Type ENTER to exit'"
Answer: Platform Search does not convert a period (.) character to whitespace, so net.webclient is not a token on its own. You can search for the whole string, a wildcard version, or for suffix tokens of the form .xxxx or .yyyy.xxxx. Platform Search creates those tokens on the assumption that they could be file extensions or double file extensions.
In the preceding example, you could search for system.net.webclient, .net.webclient, or .webclient, or for any of those tokens with wildcards in them.
In general, the cmdline fields (process_cmdline, childproc_cmdline, parent_cmdline, and fileless_scriptload_cmdline) tokenize on spaces and the characters \ ( ) [ ] { } ; " ' < > & | , =
If any of those characters are in the command line, they are converted to spaces in the search backend. API response data still returns the original characters. If a query value contains any of these characters, they are likewise converted to spaces and the search becomes a phrase search.
For example, if you are interested in searching for cmd /c "echo LINE1 > bad.vbs&&echo LINE2 >> bad.vbs", the tokens you can search for in this command line are:
cmd | /c | echo |
line1 | bad.vbs | .vbs |
echo | line2 | bad.vbs |
.vbs |
You can also combine these tokens in double quotes to query on phrases such as process_cmdline:"cmd /c".
If you include any of the other characters (properly escaped if necessary), they become whitespace.
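The cmdline tokenization rules described in this answer can be approximated with a short sketch. This is not the search backend's implementation: the function name cmdline_tokens is hypothetical, lowercasing is inferred from the example tokens above, and restricting the .xxxx and .yyyy.xxxx suffix tokens to alphanumeric parts is an assumption made for illustration.

```python
# Delimiters listed in the answer above; each one is treated as a space.
CMDLINE_DELIMITERS = "\\()[]{};\"'<>&|,="
_TO_SPACE = str.maketrans({ch: " " for ch in CMDLINE_DELIMITERS})


def cmdline_tokens(cmdline: str) -> list[str]:
    """Approximate the searchable tokens of a command line (illustrative only)."""
    tokens = []
    for word in cmdline.lower().translate(_TO_SPACE).split():
        tokens.append(word)
        parts = word.split(".")
        # Assumption: suffix tokens are built only from alphanumeric parts.
        if len(parts) >= 3 and parts[-2].isalnum() and parts[-1].isalnum():
            double_ext = "." + parts[-2] + "." + parts[-1]
            if double_ext != word:
                tokens.append(double_ext)  # ".yyyy.xxxx"
        if len(parts) >= 2 and parts[-1].isalnum():
            single_ext = "." + parts[-1]
            if single_ext != word:
                tokens.append(single_ext)  # ".xxxx"
    return tokens


print(cmdline_tokens('cmd /c "echo LINE1 > bad.vbs&&echo LINE2 >> bad.vbs"'))
# ['cmd', '/c', 'echo', 'line1', 'bad.vbs', '.vbs', 'echo', 'line2', 'bad.vbs', '.vbs']
```

Applied to the PowerShell command line in this question, the sketch yields system.net.webclient, .net.webclient, and .webclient but never net.webclient, which is consistent with the original query returning no results.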
Question 3
What is the maximum length of a token I can search on?
Answer: If a field's string data exceeds 32K characters, you can search only the first 32K characters of that field.
Example for process_cmdline:
process_name:powershell.exe AND process_cmdline:WwBCAHkAdAB*
Question 4
How can I search for substrings in a tokenized text field like watchlist_name?
Answer: Fields like watchlist_name, event_description, device_os_version, and many binary headers like process_publisher are tokenized into individual words. For example, a watchlist name of "Carbon Black Endpoint Visibility Take Action" has tokens for "Carbon", "Black", "Endpoint", "Visibility", "Take", and "Action". You can either wildcard individual tokens or search for a phrase, but not both, to find results that match the watchlist named "Carbon Black Endpoint Visibility Take Action":
Works | watchlist_name:Carbon* |
Works | watchlist_name:Carbon\ Black |
Works | watchlist_name:"Carbon Black" |
Does not Work | watchlist_name:Carbon\ Black* |
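The table above can be modeled with a minimal sketch. It assumes word tokens are produced by splitting on whitespace, that matching is case-insensitive, and that a wildcard inside a phrase is treated as a literal character; these are assumptions made for illustration, not documented backend behavior, and the function names are hypothetical.

```python
from fnmatch import fnmatchcase


def word_tokens(text: str) -> list[str]:
    # Assumption: split on whitespace and compare case-insensitively.
    return text.lower().split()


def wildcard_token_match(field_value: str, pattern: str) -> bool:
    """A wildcard pattern is tested against each token individually."""
    return any(fnmatchcase(tok, pattern.lower()) for tok in word_tokens(field_value))


def phrase_match(field_value: str, phrase: str) -> bool:
    """A phrase matches when its words appear as consecutive tokens.

    Assumption: '*' inside a phrase is literal, so it never matches here.
    """
    tokens, words = word_tokens(field_value), word_tokens(phrase)
    n = len(words)
    return n > 0 and any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


name = "Carbon Black Endpoint Visibility Take Action"
print(wildcard_token_match(name, "Carbon*"))  # wildcard on a single token
print(phrase_match(name, "Carbon Black"))     # phrase of consecutive tokens
print(phrase_match(name, "Carbon Black*"))    # phrase combined with a wildcard
```

Under these assumptions the first two calls print True and the third prints False, matching the Works and Does not Work rows.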
Question 5
What can I do with regex that is compatible with Platform Search tokenization?
Answer: You can only search for a single token using a regular expression. The token must be in lowercase without special characters.
Question 6
How can I use trailing whitespace in queries?
Answer: Do not use a trailing space (\\ or \\ \\) at the end of a field query or filter. Instead, use a trailing wildcard or use tokenization to create a query or filter by field name. For example, to match c:\windows\system32\cacls.exe:
Works | c\\:\\\\windows\\\\system32\\\\* |
Does not Work | c\\:\\\\windows\\\\system32 |
See also Searching cmdline Fields using Wildcards.