Learning to Rank Strings Output for Speedier Malware Analysis
Reverse engineers, forensic investigators, and incident responders have an arsenal of tools at their disposal to dissect malicious software binaries. When performing malware analysis, they successively apply these tools in order to gradually gather clues about a binary’s function, design detection methods, and ascertain how to contain its damage. One of the most useful initial steps is to inspect its printable characters via the Strings program. A binary will often contain strings if it performs operations like printing an error message, connecting to a URL, creating a registry key, or copying a file to a specific location – each of which provide crucial hints that can help drive future analysis.
Manually filtering out these relevant strings can be time consuming and error prone, especially considering that:
- Relevant strings occur disproportionately less often than irrelevant strings.
- Larger binaries can output upwards of tens of thousands of individual strings.
- The definition of "relevant” can vary significantly across individual human analysts.
Investigators would never want to miss an important clue that could have reduced their time spent performing the malware analysis, or even worse, led them to draw incomplete or incorrect conclusions. In this blog post, we will demonstrate how the FireEye Data Science (FDS) and FireEye Labs Reverse Engineering (FLARE) teams recently collaborated to streamline this analyst pain point using machine learning.
Each string returned by the Strings program is represented by sequences of 3 characters or more ending with a null terminator, independent of any surrounding context and file formatting. These loose criteria mean that Strings may identify sequences of characters as strings when they are not human-interpretable. For example, if consecutive bytes 0x31, 0x33, 0x33, 0x37, 0x00 appear within a binary, Strings will interpret this as “1337.” However, those ASCII characters may not actually represent that string per se; they could instead represent a memory address, CPU instructions, or even data utilized by the program. Strings leaves it up to the analyst to filter out such irrelevant strings that appear within its output. For instance, only a handful of the strings listed in Figure 1 that originate from an example malicious binary are relevant from a malware analyst’s point of view.
Figure 1: An example Strings output containing 44 strings for a toy sample with a SHA-256 value of eb84360ca4e33b8bb60df47ab5ce962501ef3420bc7aab90655fd507d2ffcedd.
Ranking strings in terms of descending relevance would make an analyst’s life much easier. They would then only need to focus their attention on the most relevant strings located towards the top of the list, and simply disregard everything below. However, solving the task of automatically ranking strings is not trivial. The space of relevant strings is unstructured and vast, and devising finely tuned rules to robustly account for all the possible variations among them would be a tall order.
Learning to Rank Strings Output
This task can instead be formulated in a machine learning (ML) framework called learning to rank (LTR), which has been historically applied to problems like information retrieval, machine translation, web search, and collaborative filtering. One way to tackle LTR problems is by using Gradient Boosted Decision Trees (GBDTs). GBDTs successively learn individual decision trees that reduce the loss using a gradient descent procedure, and ultimately use a weighted sum of every trees’ prediction as an ensemble. GBDTs with an LTR objective function can learn class probabilities to compute each string’s expected relevance, which can then be used to rank a given Strings output. We provide a high-level overview of how this works in Figure 2.
In the initial train() step of Figure 2, over 25 thousand binaries are run through the Strings program to generate training data consisting of over 18 million total strings. Each training sample then corresponds to the concatenated list of ASCII and Unicode strings output by the Strings program on that input file. To train the model, these raw strings are transformed into numerical vectors containing natural language processing features like Shannon entropy and character co-occurrence frequencies, together with domain-specific signals like the presence of indicators of compromise (e.g. file paths, IP addresses, URLs, etc.), format strings, imports, and other relevant landmarks.
Figure 2: The ML-based LTR framework ranks strings based on their relevance for malware analysis. This figure illustrates different steps of the machine learning modeling process: the initial train() step is denoted by solid arrows and boxes, and the subsequent predict() and sort() steps are denoted by dotted arrows and boxes.
Each transformed string’s feature vector is associated with a non-negative integer label that represents their relevance for malware analysis. Labels range from 0 to 7, with higher numbers indicating increased relevance. To generate these labels, we leverage the subject matter knowledge of FLARE analysts to apply heuristics and impose high-level constraints on the resulting label distributions. While this weak supervision approach may generate noise and spurious errors compared to an ideal case where every string is manually labeled, it also provides an inexpensive and model-agnostic way to integrate domain expertise directly into our GBDT model.
Next during the predict() step of Figure 2, we use the trained GBDT model to predict ranks for the strings belonging to an input file that was not originally part of the training data, and in this example query we use the Strings output shown in Figure 1. The model predicts ranks for each string in the query as floating-point numbers that represent expected relevance scores, and in the final sort() step of Figure 2, strings are sorted in descending order by these scores. Figure 3 illustrates how this resulting prediction achieves the desired goal of ranking strings according to their relevance for malware analysis.
Figure 3: The resulting ranking on the strings depicted in both Figure 1 and in the truncated query of Figure 2. Contrast the relative ordering of the strings shown here to those otherwise identical lists.
The predicted and sorted string rankings in Figure 3 show network-based indicators on top of the list, followed by registry paths and entries. These reveal the potential C2 server and malicious behavior on the host. The subsequent output consisting of user-related information is more likely to be benign, but still worthy of investigation. Rounding out the list are common strings like Windows API functions and PE artifacts that tend to raise no red flags for the malware analyst.
While it seems like the model qualitatively ranks the above strings as expected, we would like some quantitative way to assess the model’s performance more holistically. What evaluation criteria can we use to convince ourselves that the model generalizes beyond the coverage of our weak supervision sources, and to compare models that are trained with different parameters?
We turn to the recommender systems literature, which uses the Normalized Discounted Cumulative Gain (NDCG) score to evaluate ranking of items (i.e. individual strings) in a collection (i.e. a Strings output). NDCG sounds complicated, but let’s boil it down one letter at a time:
- “G” is for gain, which corresponds to the magnitude of each string’s relevance.
- “C” is for cumulative, which refers to the cumulative gain or summed total of every string’s relevance.
- “D” is for discounted, which divides each string’s predicted relevance by a monotonically increasing function like the logarithm of its ranked position, reflecting the goal of having the most relevant strings ranked towards the top of our predictions.
- “N” is for normalized, which means dividing DCG scores by ideal DCG scores calculated for a ground truth holdout dataset, which we obtain from FLARE-identified relevant strings contained within historical malware reports. Normalization makes it possible to compare scores across samples since the number of strings within different Strings outputs can vary widely.
Figure 4: Kernel Density Estimate of NDCG@100 scores for Strings outputs from the holdout dataset. Scores are calculated for the original ordering after simply running the Strings program on each binary (gray) versus the predicted ordering from the trained GBDT model (red).
In practice, we take the first k strings indexed by their ranks within a single Strings output, where the k parameter is chosen based on how many strings a malware analyst will attend to or deem relevant on average. For our purposes we set k = 100 based on the approximate average number of relevant strings per Strings output. NDCG@k scores are bounded between 0 and 1, with scores closer to 1 indicating better prediction quality in which more relevant strings surface towards the top. This measurement allows us to evaluate the predictions from a given model versus those generated by other models and ranked with different algorithms.
To quantitatively assess model performance, we run the strings from each sample that have ground truth FLARE reports though the predict() step of Figure 2, and compare their predicted ranks with a baseline of the original ranking of strings output by Strings. The divergence in distributions of NDCG@100 scores between these two approaches demonstrates that the trained GBDT model learns a useful structure that generalizes well to the independent holdout set (Figure 4).
In this blog post, we introduced an ML model that learns to rank strings based on their relevance for malware analysis. Our results illustrate that it can rank Strings output based both on qualitative inspection (Figure 3) and quantitative evaluation of NDCG@k (Figure 4). Since Strings is so commonly applied during malware analysis at FireEye and elsewhere, this model could significantly reduce the overall time required to investigate suspected malicious binaries at scale. We plan on continuing to improve its NDCG@k scores by training it with more high fidelity labeled data, incorporating more sophisticated modeling and featurization techniques, and soliciting further analyst feedback from field testing.
It’s well known that malware authors go through great lengths to conceal useful strings from analysts, and a potential blind spot to consider for this model is that the utility of Strings itself can be thwarted by obfuscation. However, open source tools like the FireEye Labs Obfuscated Strings Solver (FLOSS) can be used as an in-line replacement for Strings. FLOSS automatically extracts printable strings just as Strings does, but additionally reveals obfuscated strings that have been encoded, packed, or manually constructed on the stack. The model can be readily trained on FLOSS outputs to rank even obfuscated strings. Furthermore, since it can be applied to arbitrary lists of strings, the model could also be used to rank strings extracted from live memory dumps and sandbox runs.
This work represents a collaboration between the FDS and FLARE teams, which together build predictive models to help find evil and improve outcomes for FireEye’s customers and products. If you are interested in this mission, please consider joining the team by applying to one of our job openings.