RainStor Provides Big Data Retention on Cloudera’s Distribution including Apache Hadoop


RainStor, an infrastructure software company specializing in online data retention (OLDR), today announced that RainStor 4.5 can be deployed using Cloudera's Distribution including Apache Hadoop.

RainStor says that its technology can be used to retain and access data sets on the Hadoop Distributed File System (HDFS) with a physical footprint at least 97% smaller. The approach combines Hadoop's big data processing, management and analytics with RainStor's compliant data retention on existing, low-cost servers and storage.

"We are focused on storing and retaining data in its original form but at a much lower priced footprint than you would do with a normal relational database or a data warehouse or, in this particular case, even if you ran it on low-cost commodity hardware in Hadoop," Ramon Chen, RainStor's vice president of product management, tells 5 Minute Briefing. Hadoop has provided a new way of processing data at a lower cost "but really it is not making the big data problem smaller," he says. While lower cost commodity hardware can be used to process the data, "in our mind, and in our customers' minds, there is lower cost and then there is really affordable cost," Chen states.

At large scale, Hadoop still requires a lot of machines, Chen says. "Not everybody can run Google-scale or Facebook-scale server farms to solve their business problems," he adds.

According to the company, RainStor on HDFS, using locally attached commodity storage, offers a low initial capital investment and a low ongoing total cost of ownership for retaining petabytes of data. RainStor's repository compresses the data using a patented value and pattern de-duplication technique and stores it in immutable form on HDFS. RainStor has built-in security, audit trails, and granular retention and expiry policies for managing the lifecycle of stored data. Data within RainStor can be accessed through standard Structured Query Language (SQL), RDBMS-specific SQL dialects, and standard BI tools via ODBC/JDBC.
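To give a rough sense of what value-level de-duplication means, the sketch below uses plain dictionary encoding in Python. This is a generic, textbook technique swapped in purely for illustration; it is not RainStor's patented method, and the function names and sample column are hypothetical.

```python
def dedupe_column(values):
    """Store each distinct value once; rows keep only small integer ids."""
    dictionary = []   # distinct values, in first-seen order
    index = {}        # value -> position in `dictionary`
    ids = []          # one id per original row
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        ids.append(index[value])
    return dictionary, ids


def rows_matching(dictionary, ids, wanted):
    """Find matching row numbers by comparing ids, without rebuilding the column."""
    if wanted not in dictionary:
        return []
    target = dictionary.index(wanted)
    return [row for row, value_id in enumerate(ids) if value_id == target]


def restore_column(dictionary, ids):
    """Rebuild the original column exactly, so no source data is lost."""
    return [dictionary[value_id] for value_id in ids]


# Hypothetical sample column with many repeated values.
column = ["NYSE", "NASDAQ", "NYSE", "NYSE", "LSE", "NASDAQ"]
dictionary, ids = dedupe_column(column)
nyse_rows = rows_matching(dictionary, ids, "NYSE")
```

In this toy, the lookup for "NYSE" works directly on the encoded ids, and `restore_column` returns the data in its original form. A real product would handle patterns across columns, persistence and SQL access, none of which is attempted here.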

Depending on the Hadoop replication factor, the stored data can be a significant multiple of the size of the raw data loaded. To counteract this, most Hadoop deployments rely on binary compression (such as LZO), which yields about 5-to-1 compression on average and comes with a re-inflation performance penalty upon access, according to RainStor. In contrast, the company says, RainStor achieves compression ratios of 40 to 1 or greater and allows data access without re-inflation.
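The arithmetic behind these figures can be sketched in a few lines of Python. The 100 TB raw data size is a made-up example, and the replication factor of 3 is the HDFS default rather than a number from RainStor; the 5-to-1 and 40-to-1 ratios are the ones the company quotes.

```python
def stored_tb(raw_tb, replication_factor, compression_ratio):
    """Disk space used once data is compressed and then replicated."""
    return raw_tb * replication_factor / compression_ratio


def footprint_reduction(compression_ratio):
    """Fraction of space saved by a given compression ratio (e.g. 40 for 40:1)."""
    return 1 - 1 / compression_ratio


RAW_TB = 100       # assumption: hypothetical raw data volume
REPLICATION = 3    # assumption: HDFS default replication factor

uncompressed = stored_tb(RAW_TB, REPLICATION, 1)
with_lzo = stored_tb(RAW_TB, REPLICATION, 5)        # ratio quoted for LZO
with_rainstor = stored_tb(RAW_TB, REPLICATION, 40)  # ratio quoted for RainStor

print(f"Uncompressed:  {uncompressed:.1f} TB")
print(f"5:1 (LZO):     {with_lzo:.1f} TB")
print(f"40:1:          {with_rainstor:.1f} TB")
print(f"Saving at 40:1: {footprint_reduction(40):.1%}")
```

Under these assumptions, 100 TB of raw data occupies 300 TB uncompressed, 60 TB at 5 to 1, and 7.5 TB at 40 to 1. A 40-to-1 ratio corresponds to a 97.5% reduction, which is consistent with the "at least 97% smaller" footprint claim.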

According to RainStor, its compression, lifecycle management and compliant retention features, combined with the low-cost commodity disk and scale-out benefits of HDFS, deliver value for organizations that require both big data analysis and retention.

Cloudera's Distribution including Apache Hadoop is becoming "the gold standard in enterprise Hadoop deployments," notes Chen.

"What we are saying is that RainStor's ability to run and store the data once it is compressed on the Hadoop system is complementary to Hadoop," he adds. There are tasks better suited to Hadoop that RainStor is not optimized for, such as high-performance analytics and rapid number crunching, but, Chen asserts, "if you need to hold the information and keep it for compliance purposes, and show the original source data that was the result of the calculation, or store the calculation results themselves, then RainStor provides much better economics."