The concept of the data lake has become a hot topic. A data lake retains data in its original format, allowing the data to remain flexible for everyone involved. While this sounds fine in theory, it is more complicated in practice due to the need for governance and security. A recent DBTA webinar covered the topic of the data lake with Emma McGrattan, SVP of engineering at Actian; Pete Aven, principal sales engineer at MarkLogic; and Reiner Kappenberger, global product management at HP Security Voltage.
If an enterprise is going to work with big data, McGrattan said, its platform must meet a list of requirements: full ANSI SQL-92 support, full ACID compliance, Hadoop distribution agnosticism, update capability, the highest concurrency, high performance, native DBMS security, a mature and proven planner and optimizer, native operation in Hadoop YARN, a collaborative architecture, and open APIs.
“Security is incredibly important to people when it comes to unlocking data in a data lake,” said McGrattan. When moving data to a Hadoop environment, she noted, one should consider access control, to be sure of who can access what; role separation, to ensure the database administrator cannot access all of the data in the database; security auditing, to know who issued which query, from where, and when; security alarms, in case someone tries to access something they shouldn’t; and the ability to encrypt data at rest.
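The interplay of role separation, auditing, and alarms that McGrattan describes can be sketched in a few lines of Python. This is a minimal illustration, not any product's API; the role names, permission strings, and audit-record format are invented for the example.

```python
# Minimal sketch of role separation, security auditing, and alarms.
# Roles, permissions, and the audit format are illustrative assumptions.
from datetime import datetime, timezone

ROLE_PERMISSIONS = {
    "analyst": {"sales.read"},
    "dba": {"schema.alter"},  # role separation: the DBA can manage schemas
                              # but holds no permission to read the data itself
}

audit_log = []  # security auditing: every attempt is recorded, allowed or not

def check_access(user, role, permission):
    """Allow or deny an action, logging who asked for what and when."""
    allowed = permission in ROLE_PERMISSIONS.get(role, set())
    audit_log.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "who": user,
        "permission": permission,
        "allowed": allowed,
    })
    if not allowed:
        # security alarm: someone tried to access something they shouldn't
        raise PermissionError(f"alarm: {user} denied {permission}")
    return True

check_access("alice", "analyst", "sales.read")   # permitted
try:
    check_access("bob", "dba", "sales.read")     # DBA cannot read the data
except PermissionError as alarm:
    print(alarm)
```

In a real deployment these checks live in the database engine, not application code, but the sketch shows why role separation and auditing are distinct requirements: the first limits what each role can do, while the second records every attempt regardless of outcome.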
MarkLogic looks at unlocking the power of the data lake through a NoSQL approach, said Aven. He noted that when organizations want to begin organizing their data, they have a few options: gather all of the parts that will make their data actionable themselves, get a model kit, or get a pre-built model with accessories. “MarkLogic is used as a complement to Hadoop. HDFS is great for high-latency batch-processing applications, your MapReduce, and inexpensive storage, and then MarkLogic is the real-time low latency,” stated Aven.
Kappenberger discussed big data and the data lake from the viewpoint of security. Location data is an aspect of security that not everyone fully understands, he observed. “If you have geographic data and a person understands that you were shopping at a particular place at a particular time, the person only needs four of those records to find you in a dataset that contains those four reference points,” explained Kappenberger. This makes it easy to locate individual customers in a data lake. Kappenberger suggested that organizations look to the Health Insurance Portability and Accountability Act of 1996 (HIPAA), because HIPAA provides a good outline of which data fields should be considered to hold sensitive information.
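Kappenberger's point about reference points can be demonstrated with a toy Python sketch: in a table of (place, time) visit records, fewer points leave several matching people, while a handful of points can single one out. The customers and visits below are invented for illustration.

```python
# Toy re-identification sketch: each "anonymous" customer is a set of
# (place, hour) visit records. The data is invented for illustration.

visits = {
    "cust_1": {("mall", 9), ("cafe", 12), ("gym", 18), ("bank", 15)},
    "cust_2": {("mall", 9), ("cafe", 12), ("gym", 18), ("park", 16)},
    "cust_3": {("mall", 9), ("cafe", 12), ("bank", 15), ("park", 16)},
}

def match(points):
    """Return every customer whose history contains all the given points."""
    return [person for person, seen in visits.items() if points <= seen]

# Two reference points still leave ambiguity: all three customers match.
print(match({("mall", 9), ("cafe", 12)}))

# Four reference points pin down exactly one customer.
print(match({("mall", 9), ("cafe", 12), ("gym", 18), ("bank", 15)}))
```

Even though no names or IDs appear in the location records themselves, enough observed points act as a fingerprint, which is why location fields deserve the same treatment as the explicitly sensitive fields HIPAA enumerates.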
To watch a replay of this webinar, go here.