How Businesses Are Driving Big Data Transformation


The explanation goes like this. Early developers and adopters were driven to solve truly big data challenges. In the simplest terms, big data meant big hardware costs. To solve that economic challenge, big data needed to run on the lowest-cost “commodity” hardware, with software designed to be fault-tolerant so that it could cope with high failure rates without disrupting service. This is the purpose of HDFS. However, HDFS is indifferent to how a “data node” is configured, and this is where IT’s standard order list differs.

Enterprise infrastructure organizations have been maintaining companies’ data center needs for years and have efficiently standardized their orders with chosen vendors. Under this definition of commodity servers, the emphasis is on industry-standard parts, with no proprietary hardware that could limit the use of these servers as data nodes (or for any other server need in the data center). Big data implementations with hundreds to thousands of servers per cluster strive for the lowest-cost “white box” servers, built by less recognized vendors from the lowest-cost components; these commodity servers can cost as little as $2,000 each. Similar servers from recognized big-name vendors, built with their own or industry “best of breed” components and touting stringent integration and quality testing, have averaged $25,000 per server in several recent Hadoop implementations that we have been involved with. We have started calling these servers “commodity-plus” for mainstream companies operationalizing Hadoop clusters, and they don’t seem to mind.
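To make the price gap concrete, here is a minimal back-of-the-envelope sketch. Only the two per-server prices come from the figures above; the node counts are hypothetical values chosen for illustration, and the calculation ignores networking, power, support contracts, and staffing.

```python
# Per-server prices quoted in the article (USD).
WHITE_BOX_PRICE = 2_000
COMMODITY_PLUS_PRICE = 25_000


def cluster_hardware_cost(node_count: int, price_per_server: int) -> int:
    """Return the server purchase cost only; all other costs are ignored."""
    return node_count * price_per_server


# Hypothetical cluster sizes (assumptions, not figures from the article).
white_box_total = cluster_hardware_cost(500, WHITE_BOX_PRICE)
commodity_plus_total = cluster_hardware_cost(32, COMMODITY_PLUS_PRICE)

# How many white-box servers one commodity-plus server buys.
price_ratio = COMMODITY_PLUS_PRICE / WHITE_BOX_PRICE

print(f"500 white-box nodes:      ${white_box_total:,}")
print(f"32 commodity-plus nodes:  ${commodity_plus_total:,}")
print(f"Price ratio per server:   {price_ratio}x")
```

Under these assumed node counts, a small cluster of expensive servers and a large cluster of cheap ones land in the same budget range, which is one reason the per-server price alone may not settle the decision.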


Another discussion that continues from the early adopters is how a data node should be configured. Some implementations concerned with truly big data configure data nodes with 25 front-loading bays and slower multi-terabyte SATA drives to get the highest capacity within their cluster. Other implementations are more concerned with performance and opt for faster, lower-capacity SAS drives, balanced by more servers in the cluster to gain further performance from parallelism. Some hyper-performance-oriented clusters even opt for faster SSDs. This also leads to discussions about multi-core CPUs and how much memory a data node should have, and equations have been proposed that relate the number of cores to the amount of memory and the number of drives for optimal data node performance. We have seen enterprise infrastructure lean toward fewer nodes in a production cluster (8–32 data nodes) rather than 100-plus nodes. Their reasoning is twofold. First, more powerful data nodes are more interchangeable with the data center’s converging data virtualization and private cloud strategies. Second, ordering more of the powerful servers can yield larger volume discounts and maintain standardization of IT servers in the data center.
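The capacity-versus-performance trade-off can be sketched with simple sizing arithmetic. The example below is an illustration under stated assumptions, not a sizing method from the article: the drive sizes and the 500 TB target are hypothetical, the replication factor of 3 is the HDFS default, and space reserved for the operating system and temporary data is ignored.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DataNodeProfile:
    """A hypothetical data node hardware profile."""
    name: str
    drive_bays: int
    drive_tb: float  # capacity of each drive in terabytes

    def raw_tb(self) -> float:
        """Raw disk capacity of one node."""
        return self.drive_bays * self.drive_tb


def usable_cluster_tb(profile: DataNodeProfile, node_count: int,
                      replication: int = 3) -> float:
    """Usable capacity once every block is stored `replication` times."""
    return profile.raw_tb() * node_count / replication


def nodes_needed(profile: DataNodeProfile, target_usable_tb: float,
                 replication: int = 3) -> int:
    """Smallest node count that reaches the target usable capacity."""
    return math.ceil(target_usable_tb * replication / profile.raw_tb())


# Assumed drive sizes: 4 TB SATA for capacity, 1 TB SAS for performance.
capacity_node = DataNodeProfile("capacity (SATA)", drive_bays=25, drive_tb=4.0)
performance_node = DataNodeProfile("performance (SAS)", drive_bays=25, drive_tb=1.0)

target_tb = 500.0  # hypothetical usable capacity target
for profile in (capacity_node, performance_node):
    count = nodes_needed(profile, target_tb)
    print(f"{profile.name}: {count} nodes for {target_tb} TB usable")
```

With these assumed numbers, the capacity-oriented profile reaches the target with a node count inside the 8–32 range mentioned above, while the performance-oriented profile needs several times as many nodes, which is where the extra parallelism comes from.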

The Data Lake Gains Traction

In 2014, we saw more acceptance of the term “data lake” as an enterprise data architecture concept, pushed by Hortonworks and its modern data architecture approach. The “enterprise data hub” is a similar concept promoted by Cloudera, and it also holds some of the industry mindshare. Informally, we saw the data lake term used most often by companies seeking an approach to enterprise data strategy and roadmaps. However, we also saw backlash from industry pundits who called the data lake a “fallacy” or “murky.” Terms such as “data swamp” and “data dump” were also thrown around to describe how things could go wrong without a good strategy and governance in place. Like the term “big data,” the data lake has started out as a high-level concept that will drive further definition and patterns going forward.

Throughout 2014, we worked with companies ready to define a clear, detailed enterprise data strategy based on the data lake concept. While this is a profound change, it is very achievable with data management principles, though it requires answers to new questions raised by a new approach to data architecture. Some issues are simple and more technical, such as keeping archived historical data warehouse data online and easily accessible to users under revised service-level agreements. Some issues are more fundamental, such as the data lake serving as a single repository of all data, including acting as a staging area for the enterprise data warehouse (with lower-cost historical persistence for other uses, since data scientists are more interested in raw, unaltered data). Other concerns are a bit more complex, such as persisting customer or other privacy-compliant data in the data lake for analysis purposes. Data governance is concerned with who has access to privacy-controlled data and how it is used. Data management questions the duplication of enterprise data and its consistency.
