Log Parsing With AI: Faster and With Greater Accuracy

Network security logs are a ubiquitous record of system runtime states, activities, and events. They are often the primary source of information about system behavior and are critical when triaging deviations from otherwise normal system execution. Logs are usually unstructured textual messages that are difficult to review manually because of the ever-increasing rate at which they are created. Because the raw log data is unstructured, noisy, and inconsistent, some preprocessing and parsing is essential.

Parsing logs with regular expressions is the most widely used method for network log analysis. A regular expression (regex) is a sequence of characters that defines a search pattern for matching text. Outside of one-off parsing, regular expressions are typically used to repeatedly parse and normalize log files as part of the analysis infrastructure. However, when a log file's format changes, regular expressions fail, which can break downstream processing and evaluation of the log data. This happens often, as log structures vary by source, format, and time. As the number of sources increases, the number of custom regex parsers increases as well.
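To make the brittleness concrete, here is a minimal sketch of a regex parser for a syslog-style line. The pattern and both sample lines are invented for illustration, not taken from any specific product:

```python
import re

# A typical named-group regex for a syslog-style line (illustrative only).
PATTERN = re.compile(
    r"^(?P<timestamp>\w{3} +\d+ [\d:]+) "
    r"(?P<host>\S+) "
    r"(?P<process>\w+)\[(?P<pid>\d+)\]: "
    r"(?P<message>.*)$"
)

def parse_line(line):
    """Return a dict of named fields, or None if the format doesn't match."""
    m = PATTERN.match(line)
    return m.groupdict() if m else None

ok = parse_line("Oct  9 14:21:07 fw01 sshd[2114]: Accepted password for admin")
print(ok["host"], ok["pid"])  # fw01 2114

# A minor format change (ISO timestamp, no pid) silently breaks the parser.
broken = parse_line("2023-10-09T14:21:07Z fw01 sshd: Accepted password for admin")
print(broken)  # None
```

The second call shows the failure mode described above: a small, reasonable change in the log format makes the parser return nothing, and every such variation demands another hand-written pattern.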

Advances in NLP

To mitigate the need to create hundreds of custom parsers, natural language processing (NLP) methods are now used to automate the task of parsing network security logs. Early NLP techniques applied to this task included N-gram analysis, string-distance measures (Jaccard, Levenshtein), and word embeddings (word2vec). These methods attempt to evaluate the raw log data, extract the necessary features from it (source, time, action), and restructure the log so it can be analyzed using common techniques. NLP methods are particularly useful when the features of the logs are not known in advance.
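The distance measures mentioned above can be sketched in a few lines. The sample log lines below are invented; the point is that similar log messages score as similar without any parser knowing their format:

```python
def jaccard(a, b):
    """Jaccard similarity between the token sets of two log lines."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

line1 = "Failed password for root from 10.0.0.5"
line2 = "Failed password for admin from 10.0.0.9"
print(jaccard(line1, line2))        # 0.5 -- half the tokens are shared
print(levenshtein("root", "admin"))  # 5
```

Grouping log lines by such similarity scores is how these earlier methods clustered messages from the same template before extracting fields.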

The last few years have yielded advances in NLP that take advantage of more complex neural network word representations than were seen in word2vec. Bidirectional Encoder Representations from Transformers (BERT), introduced by Google researchers, is one such innovation. Rather than reading text strictly left to right or right to left, BERT's Transformer encoder attends to the tokens on both sides of each position at once, and the model is pretrained by predicting randomly masked words from their full surrounding context. This bidirectional training gives the language model deeper insight into the context of the text.

Enter cyBERT

While BERT has achieved state-of-the-art results in a variety of NLP tasks involving written human language, applying its pretrained base model directly to network security logs required additional experimentation and training, as well as adjustments to the size of the input sequences that can be fed into a BERT model. The result is cyBERT (https://github.com/rapidsai/clx/tree/branch-0.11/notebooks/cybert).

The cyBERT project is an ongoing experiment to train and optimize transformer networks to provide flexible and robust parsing of logs of heterogeneous network security data. It is part of the Cyber Log Accelerators (CLX) library, used to bring the GPU acceleration of RAPIDS to real-world cybersecurity use cases. The goal of cyBERT and CLX is to allow network security personnel, cyber data scientists, digital forensic analysts, and threat hunters to develop network security log data workflows that do not require custom regex parsing processes to get the data into a format for evaluation and diagnosis.

Network security logs contain file paths, IP addresses, port numbers, and hexadecimal values arranged in rigid orders unlike anything found in a typical string of words. The combinations of these log inputs can lead to complex regex patterns that change depending on the source or the time of creation. cyBERT removes the need to create regex parsers: it identifies each field in a log intuitively, without having to account for every possible combination of characters.
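Conceptually, this style of parsing labels each log token with the field it belongs to, then reassembles the fields from the labels. The sketch below is a hypothetical illustration of that post-processing step; the label names and the sample prediction are invented, and in practice the labels would come from the trained model rather than being hard-coded:

```python
# Hypothetical token-level labels of the kind a cyBERT-style model predicts.
# No regex is involved: each token carries its own field label.
tokens = ["Oct", "9", "14:21:07", "fw01", "sshd", "2114",
          "Accepted", "password", "for", "admin"]
labels = ["time", "time", "time", "host", "process", "pid",
          "message", "message", "message", "message"]

def assemble_fields(tokens, labels):
    """Group consecutive tokens that share a label into named fields."""
    fields = {}
    for tok, lab in zip(tokens, labels):
        fields[lab] = (fields[lab] + " " + tok) if lab in fields else tok
    return fields

print(assemble_fields(tokens, labels))
# {'time': 'Oct 9 14:21:07', 'host': 'fw01', 'process': 'sshd',
#  'pid': '2114', 'message': 'Accepted password for admin'}
```

Because the labels are predicted per token from context, a new timestamp format or a missing field changes the predictions, not the code, which is what makes the approach robust to format drift.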

A Game Changer

cyBERT is built to be general enough that an organization can take it and train it for its custom network behavior. Instead of using the default corpus of English-language words in BERT, cyBERT is developed using a custom tokenizer and representation trained from scratch on a large corpus of diverse cyber logs. Providing a toolset powered by NLP to perform log parsing is a game changer in the critical and time-sensitive area of cybersecurity.
