Statistical Analysis and R in the world of Big Data

The first computer program I ever wrote (in 1979, if you must know) was in the statistical package SPSS (Statistical Package for the Social Sciences), and the second computer platform I used was SAS (Statistical Analysis System). Both of these systems are still around today—SPSS was acquired by IBM as part of its BI portfolio, and SAS is now the world’s largest privately held software company. The longevity of these platforms—they have essentially outlived almost all contemporary software packages—speaks to the perennial importance of data analysis to computing.

Packages such as SAS and SPSS gained traction in academic settings because they allowed scientists and researchers to analyze experimental and research data without the tedium of coding in general-purpose languages such as FORTRAN and COBOL. As computing moved into the mainstream of business processes, these statistical packages became an important part of the decision support systems that seeded the current massive market for business intelligence tools. Not surprisingly, SAS and SPSS rode this wave to commercial success.

Ironically, the success of these academically spawned packages made them less attractive for academia. Price tags increased, while the focus on business intelligence did not always align with academic desires. 

As a result, professional statisticians sought alternatives to commercial packages. The “S” language, which was designed for statistical programming, seemed an attractive foundation technology. Eventually, an open source implementation of S, called “R,” was released in the mid-1990s.

Bo Cowgill from Google summed up R nicely when he said, “The best thing about R is that it was developed by statisticians. The worst thing about R is that ... it was developed by statisticians.” R has a syntax that is idiosyncratic and disconnected from most other languages. However, R makes up for this in extensibility. Anyone can add a statistical routine to R, and thousands of such routines are available in the CRAN package repository. This repository probably represents the most significant open collection of statistical computer algorithms ever assembled.

Possibly the greatest current weakness of R is scalability. R was originally designed to process in-memory data sets on single-processor machines, so it does not automatically exploit multi-core hardware, and data sets too large to fit in memory pose a real problem.

Revolution Analytics has released a commercial distribution of R, based on the open source core, that addresses some of these multithreading and memory issues by linking R to multi-threaded math libraries and adding packages for large data set processing.

Last year, Oracle released a version of R integrated within its database and big data appliance. The Oracle distribution of R also attempts to provide better threading and memory handling in the base product. In addition, Oracle has included versions of R packages in which the core processing is offloaded into the database. These packages allow the database engine to parallelize the core number crunching (sums, sums of squares, etc.) that is at the foundation of many statistical techniques.
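To see why sums and sums of squares parallelize so well, here is a minimal sketch in Python (chosen only for illustration; it is not Oracle's implementation, and the function names are my own). Each chunk of data is reduced to three numbers, and those triples can be added together in any order, so each chunk could in principle be handled by a separate worker or database process.

```python
def partial_sums(chunk):
    """Reduce one chunk of values to (count, sum, sum of squares)."""
    n = len(chunk)
    s = sum(chunk)
    ss = sum(x * x for x in chunk)
    return n, s, ss


def merge(a, b):
    """Combine two partial results; the order of merging does not matter."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def mean_and_variance(chunks):
    """Compute the mean and sample variance from per-chunk partial sums."""
    total = (0, 0.0, 0.0)
    for chunk in chunks:  # each chunk could be processed by a different worker
        total = merge(total, partial_sums(chunk))
    n, s, ss = total
    mean = s / n
    variance = (ss - n * mean * mean) / (n - 1)  # sample variance
    return mean, variance


# Hypothetical data, split into three chunks as if stored in three partitions.
chunks = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
mean, variance = mean_and_variance(chunks)
print(mean, variance)
```

One caveat: the sum-of-squares formula used here can lose precision when the mean is large relative to the spread of the data, so a production system would likely use a more numerically stable merging scheme.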

If the term “big data analytics” has any concrete meaning today, it is in the analytics of fine-grained, massively large data sets in Hadoop and similar systems such as Cassandra. So, it’s not surprising that R and Hadoop are two of the key technologies that form the big data analytic stack. Unfortunately, R’s in-memory and threading limitations don’t align well with Hadoop’s massive parallelism and data scale. As a result, there are significant efforts underway to tie the two together: projects such as RHadoop, RHIPE, and RHive are all worth a look.
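The general shape of the problem can be sketched without any Hadoop machinery. The Python example below is an illustration only: it does not use or represent the RHadoop, RHIPE, or RHive APIs, and the function names and data are hypothetical. It fits a simple linear regression in a map/reduce style, where each partition is summarized by five sums in a "map" step and the summaries are added together in a "reduce" step. No single process ever needs to hold the full data set in memory.

```python
from functools import reduce


def map_partition(rows):
    """Map step: summarize one partition of (x, y) pairs as five sums."""
    n = sx = sy = sxy = sxx = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxy += x * y
        sxx += x * x
    return (n, sx, sy, sxy, sxx)


def combine(a, b):
    """Reduce step: add two partition summaries element by element."""
    return tuple(p + q for p, q in zip(a, b))


def fit_line(partitions):
    """Least-squares slope and intercept from the combined sums."""
    n, sx, sy, sxy, sxx = reduce(combine, map(map_partition, partitions))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1, spread over three partitions.
partitions = [[(0, 1), (1, 3)], [(2, 5)], [(3, 7), (4, 9)]]
slope, intercept = fit_line(partitions)
print(slope, intercept)
```

Techniques that reduce to additive summaries like these map naturally onto Hadoop. Routines that need repeated passes over the data or random access to individual rows are harder to distribute.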

R arguably represents the most accessible and feature-rich set of statistical routines available. Despite some limitations, it seems poised to be a key technology in big data.