
This paper provides a high-level overview of how Apache Cassandra™ can be used to replace HDFS, with no programming changes required from a developer perspective, and how a number of compelling benefits can be realized in the process.
Hadoop uses a scale-out architecture built on commodity servers configured as a cluster, with each server holding inexpensive internal disk drives. As the Apache Project's site states, data in Hadoop is broken down into blocks and spread throughout the cluster. MapReduce tasks can then be carried out in parallel on these smaller subsets of what may be a very large dataset overall, providing the scalability needed for big data processing.
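As a rough illustration of this model (a standard word-count sketch using the public Hadoop MapReduce API, not code taken from this paper), the job below shows how map tasks run independently against each block of input while reduce tasks combine the partial results; the input and output paths are placeholders.

// Minimal, illustrative Hadoop MapReduce word-count job.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Each map task processes one input split (roughly one storage block).
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce tasks aggregate the per-block partial counts into final totals.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory (placeholder)
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (placeholder)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The point of the sketch is that the application code never addresses individual blocks or servers; the framework distributes the work, which is why the underlying storage layer can be swapped without programming changes.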
This divide-and-conquer strategy for processing data is nothing new in itself, but the combination of HDFS being open source software (which eliminates the need for high-priced, specialized storage solutions) and its ability to provide a degree of automatic redundancy and failover makes it popular with modern businesses looking for batch analytics solutions.
However, what these businesses are most interested in is not Hadoop's underlying storage structure, but what that structure ultimately delivers: a cost-effective means of analyzing and processing vast amounts of data.