Cybersecurity attacks have been increasing at an exponential rate. In 2018 alone, more than 2,000 data breaches were reported. The impact of these attacks has been calculated at more than $6 trillion. Given these statistics, the security of data lakes is of paramount importance. We all understand the value of a cloud data lake.
Modern cloud technologies make cloud data lakes easy to set up and maintain. Besides offering virtually limitless capacity, they separate compute from storage, allowing users to run any engine on top of their data. By nature, cloud data lakes are the first place where data lands, which has made them an especially attractive target for cybercrime. For these reasons, organizations need to adopt especially stringent security controls.
Data Lake Security—Understanding the Requirements
Industries have developed standards and regulations to better protect data. Examples include CCPA, to enhance privacy rights and consumer protection for residents of the state of California; FISMA, to ensure the security of data in the federal government; GDPR, to protect the data privacy of EU citizens; and HIPAA standards, for managing healthcare information. While each regulation addresses a different concern, these ever-evolving rules have several requirements in common: access control, auditing, and encryption.
Cloud vendors such as Azure and AWS offer several features that help industries implement security best practices on their cloud data lakes to meet these requirements. These built-in controls go all the way from identity security to security management.
Locking Down the Data Layers
Cloud data lakes are the place where all data lands. A more granular view, however, shows that a cloud data lake consists of several layers: network interfaces; data stored in file formats such as Parquet and JSON; technologies such as Hive Metastore and AWS Glue that group these files into tables; a semantic layer, where technologies such as Dremio, Spark, and Hive enable data analysis directly on the data lake; and, most importantly, the protocols and client interfaces that allow users to consume the data.
Understanding the difference between the layers of a cloud data lake is important, because each layer requires a different kind of security depending on the access its users need. The best way to implement security at this level is to decide who is allowed to access each layer.
Following the principle of least privilege at each layer ensures that every user has just enough permissions to complete their tasks without compromising the integrity of the data. Examples of permissions around the data layers include storage buckets accessible only to compute engines and data engineers, with permissions configured through resource-based identity and access management (IAM) policies; data tables accessible to data engineers and data scientists, with permissions configured through users and roles; and semantic layers accessible to business analysts, with access policies defined in Active Directory or another authentication system.
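The layer-by-layer permissions described above can be sketched as a deny-by-default lookup. This is only an illustration: the layer names, role names, and the can_access helper are hypothetical labels based on the examples in the text, not part of any vendor's API. In practice these rules would live in IAM policies, database roles, or a directory service.

```python
# Hypothetical layer and role names, taken from the examples in the text.
LAYER_ACCESS = {
    "storage": {"compute_engine", "data_engineer"},
    "tables": {"data_engineer", "data_scientist"},
    "semantic": {"business_analyst"},
}


def can_access(role: str, layer: str) -> bool:
    """Deny by default: a role gets in only if it is listed for the layer."""
    return role in LAYER_ACCESS.get(layer, set())
```

With this mapping, can_access("data_engineer", "storage") is allowed, while a business analyst asking for the storage layer, or any role asking for an unknown layer, is refused.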
The complexity of a data pipeline directly affects the security of a cloud data lake. There have been cases where policies were not implemented correctly and millions of rows of sensitive data (e.g., voting records, medical records, and credit card information) were left in unsecured public storage buckets.
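One way to catch this kind of misconfiguration early is to scan bucket policies for statements that grant access to everyone. The sketch below is a simplified heuristic, assuming policies written in the AWS JSON policy format; the function names, bucket name, and account ID are hypothetical. It only flags Allow statements with a wildcard principal and no Condition block, so it is not a substitute for a vendor's own policy analysis tools.

```python
import json


def _is_wildcard(principal) -> bool:
    """True if the Principal element matches any caller."""
    if principal == "*":
        return True
    if isinstance(principal, dict):
        aws = principal.get("AWS")
        return aws == "*" or (isinstance(aws, list) and "*" in aws)
    return False


def public_statements(policy_json: str) -> list:
    """Return Allow statements that apply to any principal unconditionally."""
    statements = json.loads(policy_json).get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]
    return [
        stmt for stmt in statements
        if stmt.get("Effect") == "Allow"
        and _is_wildcard(stmt.get("Principal"))
        and "Condition" not in stmt
    ]


# Hypothetical policy: one public statement, one restricted to a single role.
EXAMPLE_POLICY = json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "PublicRead", "Effect": "Allow", "Principal": "*",
         "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::example-lake-bucket/*"},
        {"Sid": "EngineersOnly", "Effect": "Allow",
         "Principal": {"AWS": "arn:aws:iam::111122223333:role/data-engineer"},
         "Action": "s3:*",
         "Resource": "arn:aws:s3:::example-lake-bucket/*"},
    ],
})
```

Calling public_statements(EXAMPLE_POLICY) returns only the statement with the Sid "PublicRead".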
The following are some guidelines that will help avoid these kinds of mishaps:
• Design your semantic layer around secure zones: Identify who needs access to which assets. For example, administrators and data engineers will need access to physical data sources, while access to virtual datasets and curated data will be sufficient for analysts and data scientists.
• Apply column- and row-level permissions in the semantic layer: This reduces complexity, because you do not have to make changes at the application level or create multiple protected versions of the same dataset. Data consumers simply receive the data they need, with security applied so that they see only what they are entitled to see.
• Apply permissions based on the capabilities of the service: Cloud vendors such as AWS and Azure provide a variety of mechanisms, including IAM policies, role-based access, encryption at rest and in transit, and key management, to name just a few.
• Secure and govern user access: This ensures that the enterprise's data is not open to the public. It also provides the opportunity to identify who can access the data, as well as what actions they can take.
• Secure and govern users' rights: This allows companies to control what privileges an authenticated entity can have within the system. It is imperative to have a security plan laid out before the data lake is created, as this provides an opportunity to define roles and privileges.
• Leverage metadata governance: Securing your data is only part of the story; securing metadata is just as important. Armed with metadata, an attacker can target users as well as applications within your organization and gain access to data. Metadata management systems such as AWS Glue can help mitigate that risk through IAM-based policies. Similarly, Azure Data Catalog allows you to specify who can access the data catalog and what operations they can perform.
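To make the column- and row-level guideline more concrete, here is a minimal sketch of how a semantic layer might serve different views of one dataset without creating protected copies of it. Everything in it is hypothetical: the role names, the columns, the sample rows, and the secure_view function are illustrative assumptions, and real engines typically express such rules as views or policies rather than application code.

```python
from typing import Callable, Dict, List, NamedTuple, Set

Row = Dict[str, object]


class AccessPolicy(NamedTuple):
    visible_columns: Set[str]          # column-level permission
    row_filter: Callable[[Row], bool]  # row-level permission


# Hypothetical roles: analysts see only EMEA rows and no card numbers.
POLICIES = {
    "analyst_emea": AccessPolicy(
        {"order_id", "region", "amount"},
        lambda row: row["region"] == "EMEA",
    ),
    "data_engineer": AccessPolicy(
        {"order_id", "region", "amount", "card_number"},
        lambda row: True,
    ),
}


def secure_view(rows: List[Row], role: str) -> List[Row]:
    """Return only the rows and columns the role may see; unknown roles get nothing."""
    policy = POLICIES.get(role)
    if policy is None:
        return []
    return [
        {col: val for col, val in row.items() if col in policy.visible_columns}
        for row in rows
        if policy.row_filter(row)
    ]


# Hypothetical sample data.
ROWS = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0, "card_number": "4111-XXXX"},
    {"order_id": 2, "region": "APAC", "amount": 75.5, "card_number": "5500-XXXX"},
]
```

Here secure_view(ROWS, "analyst_emea") yields a single EMEA row without the card_number column, while the same call for "data_engineer" returns both rows in full. The underlying dataset is never duplicated; only the policy differs per role.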