What CIOs Should Look for in a Big Data Database

New database vendors focusing on big data and cloud have emerged to help companies make the transition to new computing models and to cope with the exploding volumes of data in the enterprise.

Here are the five key characteristics to look for when evaluating a big data database:

  1. Speed and Agility - The infrastructure required for analyzing massive data stores must deliver rapid responses to multiple query types, sometimes faster than a relational database, and across a wider variety of data types. Rapid query response times rely on sophisticated algorithms in the database that learn data patterns and store them as single instances. Because the database recognizes these patterns, queries can run against a smaller slice of data; this is where the largest performance gains occur. Modern enterprise-class databases should also support a variety of operating systems and platforms in order to be truly flexible and keep up with ever-changing business requirements. This includes running natively on the Apache Hadoop Distributed File System (HDFS) and delivering standard SQL query access, in addition to supporting open-source access methods such as Hive queries and Pig scripts. Fast query access and the flexibility to work with different types of data maximize the investment in storage, hardware and skilled resources and deliver more rapid time to value.
  2. Data Volume Crunching - As data volumes grow and companies increasingly store both unstructured and structured data on Hadoop, they need to consider technologies that can cost-effectively manage the growth of the cluster and so contain the operational costs of running it. Because Hadoop is open source, the consensus often is that the costs are low. However, that does not hold true once you consider the data triplication that must take place for high availability; after hundreds of terabytes are stored, data center costs begin to creep up. Data compression technologies in some systems offer up to a 40X reduction on structured and semi-structured data types. The highly compressed data in partition files is easily managed using standard file systems and storage platforms and requires minimal IT resources to set up and maintain, further reducing overall total cost of ownership (TCO). Advanced compression techniques help IT administrators manage data growth while supporting a range of data warehousing workloads.
  3. Archiving Chops - Cost-effective archiving is becoming a more challenging problem to solve, especially as organizations experience rapid data growth and more stringent data retention requirements. As companies aim to exploit the large volumes of data stored across many different systems, the expense and management overhead of storing that data for future access and analysis are spinning out of control. Turnkey archiving solutions make it easier for large enterprises that wish to offer archiving as an internal IT service. Ultimately it is the business group that decides which data is historical and can be moved to a dedicated archive store, but IT needs to provide the technology stack that makes this a straightforward service, one that can be automated based on configurable business rules.
  4. Multi-Tenancy - The cloud is an ideal platform for an archive solution because of its cost efficiencies at scale and its elastic nature: essentially, pay-as-you-go. For the most part, this is an attractive proposition for large enterprises, with costs of approximately 10 cents per gigabyte per month compared with $1.50 per gigabyte per month in an on-premises network attached storage (NAS) environment. Multi-tenancy gives IT departments extreme flexibility in providing database services to support new initiatives. Look for databases that support multi-tenancy in a cloud environment such as Amazon Web Services (AWS), using S3 for storage or EC2 for query. Databases that are optimized for cloud environments should also be able to provide stringent security when required, for example when hosted in a private cloud such as EMC Atmos.
  5. Enterprise-Grade Security for Big Data - Hadoop is a popular big data platform because it enables low-cost, scalable data management on commodity hardware. However, when it comes to handling high-value and sensitive data such as Social Security numbers, credit card information or other personally identifiable data, Hadoop does not have the necessary robust security features. A database running natively on HDFS should cover the five layers of security, privacy and integrity: authentication; access controls and policies; auditing, which records all activity of an authorized user in fine detail; privacy of data through encryption; and finally, the immutability layer, which ensures that data can never be changed once entered.
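
To make the storage figures in the list above concrete, here is a rough back-of-the-envelope sketch in Python. The per-gigabyte prices, the threefold replication and the 40X best-case compression ratio come from the article; the 100 TB dataset, the decimal conversion of 1 TB to 1,000 GB and the function names are assumptions made purely for illustration. It is not a pricing model for any particular vendor or cloud.

```python
# Illustrative storage arithmetic only. The 100 TB dataset size and the
# function names are hypothetical; prices and ratios are the article's figures.

GB_PER_TB = 1_000  # assumption: decimal terabytes


def stored_gb(raw_gb, replication_factor=1, compression_ratio=1.0):
    """Physical footprint after replication and compression."""
    return raw_gb * replication_factor / compression_ratio


def monthly_cost(gb, price_per_gb_month):
    """Monthly storage cost in dollars for a given footprint."""
    return gb * price_per_gb_month


raw = 100 * GB_PER_TB  # hypothetical 100 TB of raw data

# Footprint on a cluster that keeps three copies for high availability,
# without compression and with the article's best-case 40X compression.
triplicated = stored_gb(raw, replication_factor=3)
triplicated_compressed = stored_gb(raw, replication_factor=3, compression_ratio=40)

# Monthly cost of one copy of the raw data at the article's two price points.
cloud = monthly_cost(raw, 0.10)  # approx. 10 cents per GB per month
nas = monthly_cost(raw, 1.50)    # $1.50 per GB per month on on-premises NAS

print(f"Triplicated footprint: {triplicated:,.0f} GB")
print(f"Triplicated, 40X compressed: {triplicated_compressed:,.0f} GB")
print(f"Cloud: ${cloud:,.0f}/month, NAS: ${nas:,.0f}/month")
```

Under these assumptions, a single copy of 100 TB costs about $10,000 per month in the cloud versus about $150,000 per month on NAS, and 40X compression shrinks a triplicated 300,000 GB footprint to roughly 7,500 GB. Actual compression ratios and prices will vary with the data and the provider.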

About the author:

Jyothi Swaroop is director of product marketing at RainStor.