Securing information systems and data is foundational for any organization. Detecting insider threats is a considerable challenge for threat detection systems and security analysts alike, because it is difficult to separate non-normal actions from the rest of the internal system behavior data.
Each organization’s internal network behavior is very complex. Access controls, data permissions, credentials, tokens, and application integrations are applied to many business units, cadres, and teams according to the role each plays in fulfilling the organization’s goals. In most cases, users in each of these groups access different utilities within the organization multiple times a day. Any time data or proprietary information is accessed, a log is generated to account for the action. The activities carried out by system users therefore generate massive volumes of data over time.
Most engineers consolidate logs into a single location, and cloud storage makes this even easier to accomplish. The logs are stored in their “raw” format and also pass through a transformation and enrichment workflow that structures logs from heterogeneous sources into datasets with specific schemas. From here, models of user behavior and additional methods of threat categorization can be developed.
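As a minimal sketch of such a workflow (the field names user, action, resource, and ts are hypothetical, and a real pipeline would handle many source formats), raw JSON log lines might be normalized into a tabular schema with pandas:

```python
import json
import pandas as pd

# Hypothetical raw log lines from heterogeneous sources (field names are assumptions).
raw_logs = [
    '{"user": "alice", "action": "file_read", "resource": "/finance/q3.xlsx", "ts": "2023-05-01T09:12:03"}',
    '{"user": "bob", "action": "login", "resource": "vpn", "ts": "2023-05-01T09:14:47"}',
]

# Parse the raw records and enforce a consistent schema.
records = [json.loads(line) for line in raw_logs]
df = pd.DataFrame.from_records(records)
df["ts"] = pd.to_datetime(df["ts"])          # normalize timestamps
df = df.rename(columns={"ts": "timestamp"})  # consistent column names
df = df[["timestamp", "user", "action", "resource"]]

print(df.dtypes)
```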
Rules-based threat categorization looks for specific behaviors in this data and returns results that meet a defined condition. Statistics-based threat categorization looks for meaningful risk outliers in a distribution generated by sampling from the overall population.
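For illustration only (the columns, thresholds, and sample values below are assumptions, not recommended settings), a rules-based check and a simple statistics-based check over per-user activity summaries might look like this:

```python
import pandas as pd

# Hypothetical daily per-user activity counts derived from the structured logs.
activity = pd.DataFrame({
    "user": ["alice", "bob", "carol", "dave"],
    "after_hours_logins": [0, 1, 0, 7],
    "files_downloaded": [12, 9, 15, 340],
})

# Rules-based: flag any user matching a fixed condition.
rule_hits = activity[
    (activity["after_hours_logins"] > 5) | (activity["files_downloaded"] > 200)
]

# Statistics-based: flag users whose download volume is a meaningful outlier
# relative to the population distribution (here, a simple z-score cutoff).
mean = activity["files_downloaded"].mean()
std = activity["files_downloaded"].std()
activity["z_score"] = (activity["files_downloaded"] - mean) / std
stat_hits = activity[activity["z_score"].abs() > 1.0]

print(rule_hits["user"].tolist(), stat_hits["user"].tolist())
```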
Threat Categorization
Many researchers have devoted their efforts to building threat categorization models with machine learning. One traditional approach to the insider threat detection problem is supervised learning, which builds classification models from training data captured in the logs. Unfortunately, the training process for supervised learning methods tends to be time-consuming and expensive when dealing with highly imbalanced data.
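A sketch of this traditional approach, using synthetic data with hypothetical features and scikit-learn's class weighting to compensate for the imbalance:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, highly imbalanced data: roughly 1% of samples labeled as insider threats.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))
y = (rng.random(10_000) < 0.01).astype(int)
X[y == 1] += 2.0  # shift the rare class so it is learnable in this toy setup

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" compensates for the skewed label distribution.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), digits=3))
```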
Attempts to overcome these challenges of training time and data have led to new approaches that leverage deep learning and graph analytics on GPU hardware. Over the last decade, machine learning frameworks written in Python have harnessed the power of GPUs. These frameworks make it possible to explore, transform, and apply algorithms to the data on the GPU at speeds a CPU cannot match.
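A rough sketch of this kind of GPU-resident log aggregation using the RAPIDS cuDF library (assuming cuDF is installed and an NVIDIA GPU is available; the columns are hypothetical); the API intentionally mirrors pandas, so code moves from CPU to GPU with few changes:

```python
import cudf  # RAPIDS GPU DataFrame library; requires an NVIDIA GPU

# Hypothetical structured log table (columns are assumptions).
logs = cudf.DataFrame({
    "user": ["alice", "alice", "bob", "bob", "bob"],
    "action": ["file_read", "file_write", "login", "file_read", "file_read"],
    "bytes": [1024, 2048, 0, 4096, 8192],
})

# Explore and transform on the GPU with a pandas-like API.
per_user = logs.groupby("user").agg({"bytes": "sum", "action": "count"})
print(per_user)
```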
This reduces training time and expense and opens the door to advanced analytical techniques that are not subject to the same constraints as traditional machine learning methods. For example, a deep autoencoder is a deep learning algorithm that can represent nonlinear relationships in the data and does not necessarily require a label on each log to learn a feature representation of baseline user behavior on a network. The autoencoder learns to encode normal behavior so that, when decoded, the reconstruction error is minimal. If new feature vectors containing non-normal behavior are passed through, the decoded output produces a larger error that can trigger an alert for further investigation.
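A minimal sketch of this idea in PyTorch, where the architecture, feature dimensionality, synthetic data, and alert threshold are all assumptions rather than a prescribed design:

```python
import torch
import torch.nn as nn

# Toy feature vectors summarizing per-user activity (e.g., counts of actions per day).
FEATURES = 16
latent = torch.randn(512, 4)
mixing = torch.randn(4, FEATURES)
normal_batch = latent @ mixing                 # "normal" behavior on a low-dimensional manifold
anomalous = torch.randn(8, FEATURES) * 4.0     # vectors that do not follow that structure

# A small autoencoder: encode to a bottleneck, then reconstruct the input.
model = nn.Sequential(
    nn.Linear(FEATURES, 8), nn.ReLU(),
    nn.Linear(8, 4), nn.ReLU(),                # bottleneck
    nn.Linear(4, 8), nn.ReLU(),
    nn.Linear(8, FEATURES),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Train only on normal behavior so the model learns to reconstruct it well.
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(normal_batch), normal_batch)
    loss.backward()
    optimizer.step()

# Reconstruction error should be lower for normal vectors than for anomalous ones.
with torch.no_grad():
    err_normal = ((model(normal_batch) - normal_batch) ** 2).mean(dim=1)
    err_anom = ((model(anomalous) - anomalous) ** 2).mean(dim=1)
threshold = err_normal.mean() + 3 * err_normal.std()
print("alerts:", int((err_anom > threshold).sum()))
```

In practice, the threshold would be calibrated against historical data and wired into the alerting workflow rather than chosen from a single batch as above.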
Graph Analytics
Graph analytics is a method of analysis concerned with the structure of networks and the relationships (edges) between entities (vertices). In a dynamic network of many entities, graph analytics can reveal their behavior and the relationships between them. The challenge is the scale to which many of these networks grow: if the network is derived from logs of system user interactions, the number of edges can reach into the billions. Recent innovations in graph analytics on GPUs offer the capability to process networks of this size using standard graph algorithms such as PageRank and Louvain 1,000x faster than comparable CPU methods, resulting in quicker insights and applied responses.
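A rough sketch using the RAPIDS cuGraph library (assuming it is installed and a GPU is available; the integer vertex IDs and edge list are hypothetical stand-ins for a user-resource graph built from the logs):

```python
import cudf
import cugraph

# Hypothetical edge list: an edge whenever a user (src) touches a resource (dst).
edges = cudf.DataFrame({
    "src": [0, 0, 1, 1, 2],       # user vertex IDs (assumed mapping from usernames)
    "dst": [10, 11, 11, 12, 12],  # resource vertex IDs
})

# Build the graph on the GPU and run standard graph algorithms.
G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")

pagerank_scores = cugraph.pagerank(G)        # relative influence of each vertex
partitions, modularity = cugraph.louvain(G)  # community assignments and modularity

print(pagerank_scores.sort_values("pagerank", ascending=False).head())
print(partitions.head(), modularity)
```

The resulting PageRank scores and Louvain communities can then feed back into the rules-based and statistics-based categorizations described earlier.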
The goal of insider threat detection is to distinguish non-normal behavior from a mountain of recorded actions. To identify potential threats, the organization must select appropriate algorithms and automate these processes efficiently. The application of GPUs is no longer limited to graphics and display; it has expanded to scientific computation, engineering simulation, and AI. Leveraging these algorithms on advanced hardware can greatly boost the performance of computationally intensive programs for both machine learning and deep learning.