The Rise of the Data Scientist

The rise of "big data" solutions, often involving the increasingly common Hadoop platform, together with the growing use of sophisticated analytics such as collective intelligence and predictive analytics to drive business value, has led to a new category of IT professional: the data scientist.

At its simplest level, the data scientist merges the disciplines of data processing - such as programming and database management - with data analysis techniques that were previously the province of business analysts and statisticians. The value of data can only be unlocked by someone who has the ability not just to retrieve and process the data, but also to derive meaning and significance from it. This analysis is becoming increasingly mission-critical as enterprises strive to improve their competitiveness by leveraging their unique information assets.

Data science concepts are also showing promise as new paradigms for formal scientific research, where data analysis is becoming increasingly essential as the volume and complexity of scientific data sets grow.

Data analysis has always been a significant component of science. In the seventeenth century, Johannes Kepler famously analyzed mountains of astronomical data - laboriously collected in the previous century by Tycho Brahe - to derive the laws of planetary orbits. Yet most scientific discovery has been based on laboratory experiments rather than the exploitation of large sets of data.

However, over the past few decades we've experienced an explosion in the size of scientific data - from genomics, astronomy, climate, and so on - that needs to be mined and analyzed to power new scientific breakthroughs. The late Jim Gray called this new process of data-driven discovery the "fourth paradigm" of science.

The data managed by the data scientist will often be held in traditional systems supported by a relational database such as Oracle or MySQL.  A lot of data is held in an unstructured format, however, and converting it to relational form is expensive in terms of time, software licenses and storage costs.  For these reasons, the Hadoop platform - which allows massively parallel processing of vast sets of unstructured data on commodity hardware - is often the data scientist's platform of choice. 

Reaching meaningful conclusions from data is, of course, the whole point of data science, and formal statistical analysis is often the best way to do this. Traditional commercial packages such as SAS and SPSS (now owned by IBM) are often used to conduct such analysis, but, increasingly, the open source R language offers a capable and cost-effective solution.

For most of us, the easiest way to make sense of data is by creating effective visualizations using charts and other graphics. Existing Business Intelligence products such as Tableau, or even Excel, are often sufficient, although there are some very innovative alternative solutions for data display - some of which straddle conventional charting techniques and artistic concepts - that create "data art."

These packaged statistical and visualization solutions rarely offer a complete solution for the data scientist, who may need to create complex models or invoke sophisticated machine learning algorithms. The data scientist, therefore, needs to have solid programming skills, often involving the use of specialized techniques such as MapReduce.
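
To give a flavor of the MapReduce style of programming, here is a minimal single-process sketch in Python that counts words. It only imitates the pattern (map, group by key, reduce) and is not the Hadoop API; the function names map_phase, shuffle, reduce_phase and word_count are hypothetical names chosen for this illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three steps in sequence, in one process, over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "data science"]))
```

On a real cluster the map and reduce steps would run in parallel across many machines, with the framework handling the grouping in between; the sketch runs everything in one process purely to show the shape of the computation.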

Given the scientific and competitive value embedded in today's large data sets, it's not surprising that a new class of professional has emerged to exploit that value. The role of the data scientist looks set to be a new and exciting career option for the modern IT professional.