Cybersecurity Is a Data Problem

Cybersecurity can be summarized as a data problem. The volume of data that networks generate is massive, especially compared with the small fraction of traffic that is actually malicious. There is too much data to analyze in a single day, and the backlog grows daily. Threat tactics change constantly and events occur more frequently, forcing the network security industry to prepare for and react to any questionable situation. To make matters worse, cybersecurity specialists are in very high demand, and the pool of talent to draw from is limited.

Traditionally, a cyber-event would be identified and escalated via static rules built from known signatures, which were themselves identified through a security specialist’s triage, review, and response. Technology is now helping to fill the gap, assisting the limited resource pool in executing these processes every day with greater efficiency.

Machine Learning for Cybersecurity

Machine learning-driven approaches to cybersecurity are dynamic and focus on prevention in order to reduce risk and minimize the impact of a successful attack. Components and steps are automated to reduce the time and effort security teams spend addressing attacks. Over the last decade, security analytics processes have evolved from basic offline batch analysis using statistical metrics to real-time machine learning (and deep learning) techniques.

The next evolution of machine learning approaches is logistical: data pipelines that combine data ingestion, processing, and inference, followed by an action. One recently announced open source framework that provides these facilities is Morpheus.

Morpheus provides a streamlined and customizable pipeline that combines the data preprocessing, inferencing, post-processing, and decision-making steps, all of which may be tailored to the environment. With such a framework, cybersecurity developers can more easily create their own tools to support their specific business environments. Starting with pretrained models, developers can customize and optimize them with their own datasets and adapt the AI pipeline to their network and data sources.
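
To make the four stages concrete, here is a minimal sketch in plain Python. It does not use the Morpheus API: the Event record, the stage functions, and the hand-written scoring rule are all hypothetical stand-ins, chosen only to show how preprocessing, inference, post-processing, and a decision step can be chained and swapped out.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List


@dataclass
class Event:
    """Hypothetical record passed between stages (not a Morpheus class)."""
    raw: str
    features: dict = field(default_factory=dict)
    score: float = 0.0
    action: str = "allow"


Stage = Callable[[Event], Event]


def preprocess(event: Event) -> Event:
    # Toy feature extraction: message length and number of "failed" tokens.
    text = event.raw.lower()
    event.features = {"length": len(text), "failures": text.count("failed")}
    return event


def infer(event: Event) -> Event:
    # Stand-in for a trained model: a hand-written score capped at 1.0.
    event.score = min(1.0, event.features["failures"] / 3)
    return event


def postprocess(event: Event) -> Event:
    # Tidy the model output before it is acted on.
    event.score = round(event.score, 2)
    return event


def decide(event: Event) -> Event:
    # Assumed policy: block anything scoring 0.5 or higher.
    event.action = "block" if event.score >= 0.5 else "allow"
    return event


STAGES: List[Stage] = [preprocess, infer, postprocess, decide]


def run_pipeline(lines: Iterable[str], stages: List[Stage]) -> Iterator[Event]:
    """Push each raw line through every stage, in order."""
    for line in lines:
        event = Event(raw=line)
        for stage in stages:
            event = stage(event)
        yield event


if __name__ == "__main__":
    logs = ["login ok user=alice", "Failed login; failed login"]
    for ev in run_pipeline(logs, STAGES):
        print(ev.action, ev.score, ev.raw)
```

Because each stage is just a function from Event to Event, tailoring the pipeline to an environment amounts to replacing one entry in the list, for example swapping the toy infer step for a call to a real model.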

The applications can then be deployed in an on-prem data center, cloud, or hybrid cloud scenario, where they inspect traffic in real time, make decisions, and automatically update security rules for newly found threats across all the data feeds to stop attacks at the front door. Every packet, every application log, and every network flow is part of the heterogeneous data that can be ingested and processed.

Considering All Data Feeds

From a data source perspective, it is critically important to consider all the different data feeds. These include publish/subscribe systems (e.g., Kafka and Pulsar), data files (application logs), data taken directly from a security information and event management (SIEM) system or a security orchestration, automation, and response (SOAR) system, and other sources of threat intelligence. After data from these myriad sources is processed, the findings can be fed back into systems such as the SIEM or SOAR as insights, policies, or actions, as determined by the customized pipeline.
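
Handling many feeds usually starts with mapping each one onto a common record shape. The sketch below illustrates the idea with two sources: a JSON payload such as one read from a publish/subscribe topic, and a line from a log file. The field names (ts, hostname, msg), the log layout, and the common schema are assumptions made for illustration; real feeds will differ.

```python
import json
import re
from typing import Iterable, List, Optional

# Assumed log layout: "<ISO timestamp> <host> <free-text message>".
LOG_LINE = re.compile(r"^(?P<timestamp>\S+)\s+(?P<host>\S+)\s+(?P<message>.*)$")


def from_json_message(payload: str) -> dict:
    """Normalize a JSON payload (hypothetical keys: ts, hostname, msg)."""
    data = json.loads(payload)
    return {
        "source": "stream",
        "timestamp": data.get("ts"),
        "host": data.get("hostname"),
        "message": data.get("msg", ""),
    }


def from_log_line(line: str) -> Optional[dict]:
    """Normalize one log-file line; return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return {"source": "file", **match.groupdict()}


def merge_feeds(payloads: Iterable[str], log_lines: Iterable[str]) -> List[dict]:
    """Combine both feeds into one list ordered by timestamp string."""
    records = [from_json_message(p) for p in payloads]
    records += [r for r in map(from_log_line, log_lines) if r is not None]
    # ISO 8601 timestamps in the same format and zone sort correctly as text.
    return sorted(records, key=lambda r: r["timestamp"] or "")


if __name__ == "__main__":
    stream = ['{"ts": "2021-06-01T12:00:00Z", "hostname": "web01", "msg": "GET /admin"}']
    files = ["2021-06-01T11:59:58Z db02 failed login for root"]
    for record in merge_feeds(stream, files):
        print(record)
```

Once every source produces the same keys, downstream stages no longer need to know where a record came from, and findings can be written back to a SIEM or SOAR in a single format.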

All of this is accomplished using machine learning rather than a rules-based approach. Pretrained models provided by Morpheus can be leveraged to handle a variety of use cases, including distributed denial-of-service (DDoS) situations, leaked sensitive information, anomalous behavior profiling, phishing detection, predictive maintenance, network mapping, asset classification, domain generation algorithm (DGA) detection, and generic anomaly detection with the Lightweight On-line Detector of Anomalies (LODA). In the case of LODA, Morpheus can be deployed across many different telemetry streams in parallel to monitor for anomalies. All of this is just the tip of the iceberg.
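
For readers unfamiliar with LODA, the sketch below shows the core idea in plain Python: project each record onto several sparse random directions, build a one-dimensional histogram per direction from normal data, and score a new record by its average negative log-probability. This is a simplified batch version written for illustration, not the Morpheus implementation; the class name, defaults, and smoothing choices are assumptions.

```python
import math
import random


class LodaSketch:
    """Simplified LODA: sparse random projections plus 1-D histograms."""

    def __init__(self, n_features, n_projections=50, n_bins=10, seed=0):
        rng = random.Random(seed)
        # Each projection touches roughly sqrt(d) features.
        n_nonzero = max(1, round(math.sqrt(n_features)))
        self.n_bins = n_bins
        self.projections = []
        for _ in range(n_projections):
            w = [0.0] * n_features
            for j in rng.sample(range(n_features), n_nonzero):
                w[j] = rng.gauss(0.0, 1.0)
            self.projections.append(w)
        self.hists = []  # one (low, high, width, probs, floor) per projection

    @staticmethod
    def _dot(w, x):
        return sum(wi * xi for wi, xi in zip(w, x))

    def fit(self, rows):
        """Build one histogram per projection from (mostly normal) data."""
        self.hists = []
        for w in self.projections:
            z = [self._dot(w, x) for x in rows]
            low, high = min(z), max(z)
            width = (high - low) / self.n_bins or 1.0
            counts = [0] * self.n_bins
            for v in z:
                counts[min(int((v - low) / width), self.n_bins - 1)] += 1
            # Add-one smoothing keeps every bin probability above zero.
            total = len(z) + self.n_bins
            probs = [(c + 1) / total for c in counts]
            # Assumed floor for values outside the range seen in training.
            self.hists.append((low, high, width, probs, 0.5 / total))
        return self

    def score(self, x):
        """Mean negative log-probability; higher means more anomalous."""
        total = 0.0
        for w, (low, high, width, probs, floor) in zip(self.projections, self.hists):
            v = self._dot(w, x)
            if v < low or v > high:
                p = floor
            else:
                p = probs[min(int((v - low) / width), self.n_bins - 1)]
            total -= math.log(p)
        return total / len(self.projections)


if __name__ == "__main__":
    rng = random.Random(1)
    normal = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]
    model = LodaSketch(n_features=4).fit(normal)
    print("typical record:", model.score([0.0, 0.0, 0.0, 0.0]))
    print("unusual record:", model.score([8.0, 8.0, 8.0, 8.0]))
```

The published algorithm also updates its histograms online as data arrives, which this sketch omits. Because each detector is only a handful of projections and histograms, running one per telemetry stream in parallel is inexpensive.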

Limiting the Threat of Attacks

Morpheus has models such as cyBERT (natural language processing) to handle automatic parsing of new and unknown log formats, and it can also leverage XGBoost tree-based models to perform anomaly detection. To keep things simple while supporting this wide range of models, Morpheus integrates with MLflow, an open source platform for managing the machine learning life cycle that includes a model registry, so that models can be managed, trained, and tested offline on historical data before they are deployed into the production network security workflow.

With Morpheus, the foundation is in place to ingest data and then process, analyze, classify, react, and repeat in an automated manner. Morpheus models also run on GPUs, which accelerates inference. That speed is critical, because effective security depends on time-to-action.

With the help of machine learning models and a customizable pipeline framework in Morpheus, security developers can more easily manage the massive influx of data and limit the threat that malware, ransomware, phishing, and other malicious attacks pose to organizations, both large and small.


