Lifting the Limits of Analytics Using R

Statisticians and data miners have been using the R language for analytics in research and academia for some time. But over the last two years, adoption of R has taken off in business analytics, and R is now the most popular analytic language according to a KDnuggets poll.

This community of analysts prides itself on great results and on producing high-quality code and analytics. The open source project commands the attention of students, professionals, data miners, and scientists alike, who are drawn to the flexibility and power of the R language, its rich set of data manipulation and analytic functions, and its extensibility. The open source community has contributed thousands of packages to meet diverse needs across research and business analytics. And to top off these perks, R is free.

But even with all of these benefits, programmers still face fundamental challenges: R runs in-memory and is not thread-safe. Analysts must therefore work within the data and processing limits of a single system, which makes it difficult to build enterprise-class applications that scale. If you’re analyzing data stored in a spreadsheet or a few CSV files, single-threaded, in-memory processing is reasonable. In the business world, however, data comes in a variety of formats – web logs, customer transactions, store purchases, loyalty information, product information, margins, and metrics – all represented as large, separate datasets for analysis. R was not designed for this kind of data volume or workload.

The Challenge with R

R is an in-memory application that allocates and creates its data structures in memory during processing. Even basic algorithms such as linear regression create multiple copies of data structures as modifications are made. As a result, a simple analysis on a large dataset can quickly exceed memory limits, causing your system to slow to a crawl. One solution is to run R in parallel across multiple servers or compute nodes. The problem is that R is not thread-safe, meaning multiple instances of R cannot process the same data and provide reliable results. This limits concurrent R processing to row- or partition-independent tasks, such as calculating a customer score from data contained within a single row. Running an analytic function against all of your data instead requires specialized parallel programming skills. Some packages get around these memory limitations by using thread-safe routines written in other languages such as SQL, Java, or C++. To further optimize the process, these packages can run in-database, eliminating unnecessary data movement and leveraging the inherent MPP architecture of massively parallel databases.
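R’s copy-on-modify semantics are easy to observe directly. A minimal base R sketch, using `tracemem()` (available in standard R builds) to report when a shared vector gets duplicated:

```r
x <- as.numeric(1:5)
tracemem(x)      # start reporting whenever x's storage is duplicated
y <- x           # plain assignment does not copy; x and y share storage
y[1] <- 100      # the first modification forces a full copy of the vector
                 # (tracemem prints a "tracemem[...]" duplication message)
stopifnot(x[1] == 1, y[1] == 100)   # the original vector is untouched
```

With large objects, each such duplication consumes memory on the order of the object itself, which is why a model fit over a big dataset can exhaust RAM well before the raw data alone would.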

Why is Scalability Important?

From a business perspective, scalability is a necessity. Business analytics requires data from multiple sources, integrated in a meaningful way, to provide insight into customer behavior and business trends. Analyzing data in silos limits your view, making it impossible to get a full picture of your customer or your business. For example, an organization cannot understand the full scope of customer purchase behavior without a complete view of all transactions across channels. Analyzing online purchases and global store transactions across multiple store brands, combined with customer profiles, requires scalable analytics.

In these situations, a developer working within the confines of memory limits must sample the data down to a manageable size. Sampling, if not done right, can introduce sampling error and bias, and small but interesting patterns may not register. In a distributed environment, the analyst has to build a thread-safe, parallel version of the algorithm to run across all the data, which requires expertise in the algorithm, in parallel programming, and in the data itself. There is a real need for prebuilt analytics that operate across all of your data: they remove the data limitations and the need to understand parallel programming, resulting in a faster and more cost-effective analytic practice.
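Row-independent work, like the per-customer score mentioned earlier, can be distributed today with base R’s `parallel` package, because each row can be handled in isolation by a separate worker process. A minimal sketch, in which the `spend` and `visits` columns and the scoring formula are purely hypothetical:

```r
library(parallel)

set.seed(42)
# Hypothetical customer table; the columns are illustrative only.
customers <- data.frame(id     = 1:1000,
                        spend  = runif(1000, 0, 500),
                        visits = sample(1:20, 1000, replace = TRUE))

# The score depends only on one row, so rows are processed independently.
score_one <- function(row) 0.7 * row$spend + 0.3 * row$visits

cl     <- makeCluster(2)                   # worker processes, not threads
chunks <- split(customers, customers$id)   # one independent piece per row
scores <- parLapply(cl, chunks, score_one)
stopCluster(cl)

customers$score <- unlist(scores)
```

Splitting one row per chunk keeps the example simple; a real workload would split by partition to amortize the per-chunk overhead. The key point stands either way: this approach only works because no chunk depends on any other.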

This expedited process frees your team to think strategically and gives them more room for agility and experimentation. This is where innovation comes from – those “aha” moments that bring clarity to your datasets.

Selecting the Right R Solution

R programmers need to embrace parallel technologies to accelerate and simplify business analytics. Analytics is iterative by nature; the right solution encourages easy access to data and scalable analytic methods that can lead to further interesting insights. A solution that disrupts this creative process ultimately hinders the business.

The right R solution should address the following six requirements:

  1. R interface and tools: The enterprise R solution must leverage existing R client tools and the R language’s syntax and signatures. New packages and solutions should extend the R language rather than replacing it or requiring a new interface. Stay true to the R Project.
  2. Access to any data: The solution should allow self-service access to data from other sources from within the R language. The R user should be able to create data frames that access any dataset, regardless of its physical location.
  3. Scalable analytics: Look for pre-built data exploration, data manipulation, and analytic functions that run on all of your data at scale and leverage the system’s high-speed, parallel processing. Syntax and signatures should follow the R language and return the same results at scale.
  4. Flexibility: One of the benefits of R is the wide variety of packages contributed by the community. The ultimate solution provides a flexible environment that incorporates open source R packages and runs them at scale. Packages should be easy for the analyst to install and run in parallel.
  5. Parallel constructors: The solution should also offer advanced capabilities within the R language to parallelize the analytics contained in R packages, leveraging common techniques such as the Split/Apply/Combine construct.
  6. Deployment: Model deployment converts a creative analytic project into real business value. The solution should facilitate easy integration of R models and results into production applications and processes.
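The Split/Apply/Combine construct named in requirement 5 is already idiomatic in serial base R; a scalable solution parallelizes the “apply” step. A minimal serial sketch on the built-in `mtcars` dataset:

```r
# Split: one mpg vector per cylinder group
pieces <- split(mtcars$mpg, mtcars$cyl)

# Apply: summarize each group independently (a parallel solution would
# distribute this step, e.g. parallel::parLapply in place of vapply)
means <- vapply(pieces, mean, numeric(1))

# Combine: the result is a named vector with one mean per group
round(means, 2)
```

Because each group is summarized independently, the apply step is exactly the kind of partition-independent work that can be fanned out across nodes without thread-safety concerns.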

There are many R solutions on the market today. The first step is to understand your current and future analytic requirements. Then carefully evaluate solutions against your business requirements and priorities to help your organization grow.