Experts Discuss Data Science and Machine Learning Best Practices

Surviving and thriving with data science and machine learning means not only having the right platforms, tools and skills, but identifying use cases and implementing processes that can deliver repeatable, scalable business value.

The challenges are numerous, ranging from selecting data sets and data platforms to architecting and optimizing data pipelines to training and deploying models. In response, new solutions have emerged to deliver key capabilities in areas including visualization, self-service and real-time analytics.

Along with the rise of DataOps, greater collaboration and automation have been identified as key success factors.

DBTA recently held a webinar with Bethann Noble, director of product marketing, machine learning, Cloudera; Gaurav Deshpande, VP of marketing, TigerGraph; and Will Davis, senior director of product marketing, Trifacta, who discussed new technologies and strategies for expanding data science and machine learning capabilities.

According to Noble, who cited Morgan Stanley Research, “AI, IoT and Augmented Reality are about to reinvent how industries utilize data—and could drive an era of productivity growth in our cities, farms, factories and hospitals.”

However, 91% of organizations struggle to reach data maturity, she said. Cloudera can help businesses overcome this problem, according to Noble.

Cloudera’s mission consists of:

  • Believing that data can make what is impossible today, possible tomorrow
  • Empowering people to transform complex data into clear and actionable insights
  • Delivering an enterprise data cloud for any data, anywhere, from the Edge to AI

Cloudera provides an enterprise data cloud architecture that offers multi-function analytics, runs in hybrid and multi-cloud environments, is secure and governed, and is built on an open platform.

Deshpande suggested enterprises consider TigerGraph. TigerGraph offers:

  • Real-time performance: Sub-second response for queries touching tens of millions of entities/relationships
  • Transactional (mutable) graph: Hundreds of thousands of updates per second, billions of transactions per day
  • Scalability for massive datasets: 100 billion+ entities, 1 trillion+ relationships
  • Deep link multi-hop analytics queries: traverse 10+ hops deep into the graph performing complex calculations
  • Ease of development & deployment
  • Enterprise-grade security: encryption at rest and in transit, and access control for sensitive data based on user role, department or organization with MultiGraph

Today, every company has access to the same algorithms but not the same data, Davis explained.

“80% of the work in any data project is in cleaning the data,” Davis said, citing DJ Patil, former chief data scientist of the United States. This affects the entire data team, but leading data platforms recognize that data cleaning and preparation are critical to machine learning.

Trifacta empowers domain experts with intelligent visual interfaces that automate assessment and transformation of data, enables IT to collaboratively curate and operationalize data pipelines authored by domain experts, and establishes an enterprise-wide platform that refines data from a variety of sources, supporting a range of users and use cases.

An archived on-demand replay of this webinar is available here.