Data science, the ability to sift through massive amounts of data to discover hidden patterns and predict future trends, requires an understanding of many different elements of data analysis. In a 3-hour pre-conference workshop at Data Summit 2018 in Boston, Joe Caserta, founder and president of Caserta, a data analytics consulting and implementation firm, provided a deep dive into the fundamentals of data exploration, mining, and preparation, and into applying the principles of statistical modeling and data visualization to real-world problems.
Speed to value is the new metric that companies care about, and data is a key differentiator, said Caserta, noting that putting data to use must now be accomplished in a day, or at most a week. He added that 63% of organizations realize a positive return on analytics investments within a year, 69% of speed-driven analytics organizations have created a positive impact on business outcomes, and 74% of respondents anticipate that the speed at which executives expect new data-driven insights will continue to accelerate.
The reason data science is critical now, said Caserta, is that the costs of compute and storage are dramatically lower than they were just a few years ago, the volume of data generated across all aspects of society has increased dramatically, and there is a need to efficiently learn what our data has to tell us.
The three broad areas of expertise for a data scientist are modern data engineering/data preparation, domain knowledge/business knowledge, and advanced mathematics/statistics, said Caserta, adding that he himself does not know a single person who can do everything required of a data scientist. Instead, an organization typically benefits from a data science team consisting of individuals with a range of skills. To be a data scientist, however, he noted, it is imperative that an individual understand the business and its needs.
Caserta explained the steps involved in a data science project: Business Understanding, to gather insight on business requirements; Data Understanding, including data discovery, data profiling, cleansing, and munging; Data Preparation; Data Modeling, including machine learning and which models to use when; Evaluation; Deployment; and the role of Data Governance. Increasingly, organizations are employing a “chief data officer” to guide such projects through to completion.
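As an illustration of the Data Understanding step, a first profiling and cleansing pass over raw records might look like the following minimal sketch in plain Python. The records and field names here are invented for illustration, not taken from the workshop:

```python
from collections import Counter

# Hypothetical raw records as they might land from a source system:
# inconsistent casing, missing values, and numbers stored as strings.
records = [
    {"customer": "Acme Corp", "state": "MA", "revenue": "1200"},
    {"customer": "acme corp", "state": None, "revenue": "950"},
    {"customer": "Widgets Inc", "state": "ny", "revenue": None},
]

def profile(rows):
    """Count missing values per field -- a basic data-profiling step."""
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None:
                missing[field] += 1
    return dict(missing)

def cleanse(rows):
    """Normalize casing and types -- a basic cleansing/munging step."""
    cleaned = []
    for row in rows:
        cleaned.append({
            "customer": (row["customer"] or "").title(),
            "state": (row["state"] or "").upper() or None,
            "revenue": float(row["revenue"]) if row["revenue"] else None,
        })
    return cleaned

print(profile(records))    # shows which fields need attention
print(cleanse(records))    # normalized records, ready for preparation
```

Profiling first, then cleansing, mirrors the ordering in the project steps above: you cannot decide how to munge the data until you know where it is dirty.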
One of the most critical aspects of embarking on a data science project is the disruption and discomfort that doing things a new way can cause for longstanding employees who may be resistant to change. Providing training and guidance is necessary to give people in established roles a level of comfort, Caserta said.
In terms of key technology and technique takeaways, Caserta noted that the cloud and Spark can provide a relatively low-cost and extremely scalable platform for data science; AWS S3 and Google GCS offer great scalability and speed to value without the overhead of structuring data; Spark, with MLlib, offers a rich library of established machine learning algorithms, reducing development effort; and Python and SQL are the languages of choice for data science. Go Agile, follow best practices (CRISP-DM), and employ “Data Pyramid” concepts (Landing Area, Data Lake, Data Science Workspace, Big Data Warehouse) to ensure data has “just enough governance,” Caserta advised.
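The Python-plus-SQL pairing can be as lightweight as querying a local table from a script. A hypothetical sketch using Python's built-in sqlite3 module, with an invented `orders` table standing in for a data science workspace (in practice the same SQL could run against Spark or a cloud warehouse):

```python
import sqlite3

# In-memory database standing in for a data science workspace table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 120.0), ("east", 80.0), ("west", 250.0)],
)

# SQL handles the set-based aggregation; Python handles everything around it.
query = """
    SELECT region, COUNT(*) AS n, SUM(amount) AS total
    FROM orders
    GROUP BY region
    ORDER BY total DESC
"""
for region, n, total in conn.execute(query):
    print(f"{region}: {n} orders, {total:.2f} total")
```

The division of labor shown here is the point: declarative SQL for exploration and aggregation, Python for orchestration, modeling, and visualization.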
Data Summit 2019, presented by DBTA and Big Data Quarterly, is tentatively scheduled for May 21-22, 2019, at the Hyatt Regency Boston with pre-conference workshops on May 20.
Many Data Summit 2018 presenters are making their slide decks available for download.
For more information, go to www.dbta.com/DataSummit/2018/Presentations.aspx.