Comparing Commercial Versus Open Source Software for Analytics

To set up an analytics environment, organizations must decide which hardware and software technologies to adopt. On the hardware side, big data requires specialized infrastructure to store, integrate, clean, and manage the data. On the software side, many vendors, such as SAS, IBM, Microsoft, Oracle, and MathWorks (MATLAB), currently provide commercial solutions for big data and analytics.

We also see more and more free, open source software solutions (e.g., R, Python, Weka, RapidMiner) being offered in the market. In fact, the popularity of open source analytical software has sparked a debate about the added value of commercial tools. Commercial and open source software each have their merits, which should be thoroughly evaluated before any analytical software investment decision is made.

Open Source Solutions

The key advantage of open source software is that it is available for free, which significantly lowers the barrier to entry. However, this poses a danger as well, since anyone can contribute to it without any quality assurance or extensive prior testing. In heavily regulated environments such as credit risk (Basel Accords), insurance (Solvency II Directive), and pharmaceuticals (FDA regulations), analytical models are subject to external supervisory review because of their strategic impact on society, which is now bigger than ever before. Hence, in these settings, many firms prefer to rely on mature commercial solutions that have been thoroughly engineered, tested, validated, and documented.

Many of these commercial solutions also include automatic reporting facilities to generate the documentation required in each of the settings mentioned. Open source software solutions come without any kind of quality control or warranty, which increases the risk of using them in a regulated environment.

Commercial Solutions

A key advantage of commercial solutions is that the software offered is no longer centered on dedicated analytical workbenches for data preprocessing or data mining but on well-engineered, business-focused solutions which automate the end-to-end activities.

As an example, consider credit risk modeling, which runs from framing the business problem (e.g., modeling default risk for a mortgage portfolio) through data preprocessing (e.g., taking care of missing values and outliers), analytical model development (e.g., estimating logistic regression or decision tree models), back-testing (e.g., using traffic light indicator approaches), benchmarking (e.g., using FICO scores), and stress testing (e.g., based on sensitivity and scenario analysis) to regulatory capital calculation.

To automate this entire chain of activities using open source would require various scripts, likely originating from heterogeneous sources, to be matched and connected together. The result can be a melting pot of software in which the overall functionality becomes unstable and the transparency unclear.
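To make one link in that chain concrete, the back-testing step could be scripted in Python roughly as follows. This is only a minimal sketch of a traffic light indicator, assuming independent defaults and a one-sided binomial test; the function names and the 5% and 1% thresholds are illustrative assumptions, not taken from any regulation or vendor product.

```python
from math import exp, lgamma, log


def binomial_tail(n: int, p: float, d: int) -> float:
    """Return P(X >= d) for X ~ Binomial(n, p), with 0 < p < 1."""
    def log_pmf(k: int) -> float:
        # Working in log-space keeps the binomial coefficient from overflowing.
        log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        return log_coeff + k * log(p) + (n - k) * log(1 - p)

    return sum(exp(log_pmf(k)) for k in range(d, n + 1))


def traffic_light(n: int, p: float, d: int,
                  yellow: float = 0.05, red: float = 0.01) -> str:
    """Classify d observed defaults among n obligors with forecast PD p.

    The thresholds are hypothetical: a tail probability below `red` is
    flagged red, below `yellow` is flagged yellow, otherwise green.
    """
    tail = binomial_tail(n, p, d)
    if tail < red:
        return "red"
    if tail < yellow:
        return "yellow"
    return "green"


if __name__ == "__main__":
    # Hypothetical rating grade: 100 obligors, forecast default rate of 2%.
    for defaults in (2, 6, 10):
        print(defaults, traffic_light(n=100, p=0.02, d=defaults))
```

Even this small script leaves open questions of validation, documentation, and integration with the preprocessing and capital calculation steps, which is exactly the kind of effort an integrated solution absorbs.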

Contrary to open source software, commercial software vendors also offer extensive help through FAQs, technical support hot lines, newsletters, and professional training courses. Another key advantage of commercial software vendors is business continuity: more specifically, the availability of centralized R&D teams (as opposed to worldwide, loosely connected open source developers). These teams closely follow up on new analytical and regulatory developments, offering a better guarantee that new software upgrades will provide the facilities required. In an open source environment, you need to rely on the community to voluntarily contribute, which provides less of a guarantee.

A disadvantage of commercial software is that it usually comes as pre-packaged, black-box routines that, although extensively tested and documented, cannot be inspected by the more sophisticated data scientist. This is in contrast to open source solutions, which provide full access to the source code of each of the scripts contributed.

Weighing the Pros and Cons

As a final note, we currently see more and more small and medium-sized enterprises interested in leveraging big data and analytics. Since these firms typically have only limited budgets, they are particularly interested in open source or freeware solutions that can be directly used to analyze their data. Actually, the most popular technologies in use here are web analytics tools (e.g., Google Analytics) to study how companies’ websites are being used, improve their search engine ranking, or decide upon their optimal organic versus paid search online marketing mix.

It is clear that both commercial and open source software have their strengths and weaknesses. Hence, it is likely that both will continue to coexist, and interfaces should be provided so that the two can work together, as is already the case for SAS and R/Python. The optimal mix also depends upon the size of the firm and the maturity of its big data and analytics projects.

This article first appeared in the Summer issue of Big Data Quarterly Magazine

