The big data revolution is allowing large enterprises to leverage data to achieve competitive advantage and deliver products and services that would have been almost inconceivable just a few years ago. Governments and other noncommercial organizations are also using data assets to improve accountability, fine-tune service delivery, and (mostly) improve our communities.
These capabilities do not come easily, however. While the new data stores and other software components are generally open source and incur little or no licensing costs, the architecture of the new stacks grows ever more complex, and this complexity is creating a barrier to adoption for more modestly sized organizations.
Hadoop is arguably the most ubiquitous component in today’s big data architecture. Hadoop’s HDFS distributed file system remains the most economical form of storage for the massive amounts of fine-grained data underlying big data initiatives, and MapReduce remains the default batch processing paradigm.
However, MapReduce is gradually being overtaken by more flexible and efficient approaches for complex distributed processing, including YARN-based approaches such as Tez. Meanwhile, Hadoop’s processing layer is being supplemented (in some cases, even replaced) by the memory-based Spark framework. Spark can run MapReduce-style workloads many times faster than Hadoop MapReduce, and it provides higher-level programming abstractions.
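To make the difference in abstraction level concrete, here is a minimal sketch in plain Python. It does not use the Hadoop or Spark APIs. The function names and sample data are hypothetical, chosen only for illustration. The first version spells out the map, shuffle, and reduce phases of a word count by hand. The second expresses the same computation in a single step, in the spirit of the higher-level abstractions described above.

```python
from collections import Counter, defaultdict

# Hypothetical sample input; a real job would read from distributed storage.
SAMPLE_LINES = [
    "big data needs big storage",
    "data drives decisions",
]

def map_phase(lines):
    """Emit a (word, 1) pair for every word, as a mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values collected for each key, as a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count_mapreduce(lines):
    """Word count with the three phases written out explicitly."""
    return reduce_phase(shuffle_phase(map_phase(lines)))

def word_count_high_level(lines):
    """The same word count as one expression over the data."""
    return dict(Counter(word for line in lines for word in line.lower().split()))

if __name__ == "__main__":
    print(word_count_mapreduce(SAMPLE_LINES))
    print(word_count_high_level(SAMPLE_LINES))
```

Both functions return the same counts. The point of the sketch is only that the second form leaves the grouping and aggregation mechanics to the library, which is roughly the convenience a higher-level framework offers over hand-written map and reduce steps.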
At even higher levels of programming abstraction, there are many options for SQL processing (Hive, Impala, Spark SQL, and others) as well as non-SQL data processing languages such as Pig.
Big data applications rarely work in complete isolation from more traditional data sources, and open source tools such as Flume, Sqoop, and Kafka provide options for ingesting data from files or relational databases, or for efficient processing of data streams. Apache Oozie provides a mechanism for tying these various participants into higher-level workflows.
Complex applications built on these foundations often need to manage datasets requiring more real-time updates or more predictable structure, so you often see operational databases alongside Hadoop or Spark. These can be nonrelational systems such as MongoDB or Cassandra, but open source relational systems such as MySQL and PostgreSQL are also commonplace.
Finally, for our purposes, a big data project needs machine-learning and statistical analysis capabilities. Open source statistical and machine-learning frameworks such as R, Spark’s MLlib, RapidMiner, Weka, and Mahout provide a diverse set of options for deriving meaning from the data.
The above represent the most common open source options for a modern big data project. In addition, virtually all major software and hardware vendors attempt to offer consolidated big data solutions, usually leveraging many of these technologies. For instance, Oracle, Microsoft, and IBM all offer solutions that combine Hadoop, Spark, R, and the other key open source components.
Enterprise adopters are gradually shifting from reluctantly accepting open source software to actively demanding it. Open source solutions, even when provided by a commercial vendor such as Cloudera, help prevent vendor lock-in and contain overall software licensing costs. However, the operational cost of implementing an open source solution can be considerably higher than that of an integrated commercial stack.
The pioneers of big data and leading application vendors exploiting these new technologies generally hire the best and brightest to overcome the inherent complexities involved in integrating these sometimes disparate technologies. It’s not uncommon for them to employ the inventors and key contributors of the various technologies. This strategy obviously cannot scale down to the mid-market. So, how do smaller companies benefit from the increasingly complex big data stack?
The answer may lie in another technology megatrend: the cloud. If vendors of Hadoop as a service and analytics as a service can hide the complexity of the underlying technologies on a cloud platform, the benefits of the complex big data stack might become available to all.