Leveraging Data Management Solutions for Successful Machine Learning Projects

Effective machine learning projects are becoming a critical tool for mobilizing businesses to achieve a productive, agile, and innovative edge. Though certainly aspirational, a hankering for automation and intelligence is not enough to power successful machine learning.

Innovation is not without complexity; to address the ongoing obstacles preventing effective machine learning adoption—including data quality, integration, and governance—DBTA hosted a webinar, “Data Management Best Practices for Effective Machine Learning Projects,” gathering experts in the field to offer their insight.

Julian Forero, senior PMM at Snowflake, began the discussion by quantifying the challenges that enterprises face when strategically scaling AI, despite its 3x return. According to a 2020 Forrester survey, 62% of enterprises struggle to operationalize or are stuck in proof of concept; 24% have some models in production but are still formalizing the process; and a mere 14% have a defined, scalable, and repeatable process.

Forero explained that this gap between ML development and production is due to feature engineering, which is ultimately complicated by disparate teams and tools, processing complexity, and broken data foundations.

The Snowflake platform for data science and ML is designed to provide unified data access, a natively integrated ecosystem, and a multi-language, elastic engine. The platform’s most advantageous features include support for structured and semi-structured data, zero-copy cloning, governed collaboration, and reliable, fast performance. Most notable for ML projects, Snowflake can be used as a feature store, where features are stored and calculated.

Forero elaborated further by introducing Coalition, a provider of cyber insurance and security, as a key example of enhancing ML projects with Snowflake as a feature store. In this use case, Coalition has been able to increase underwriting throughput by 16%, as well as increase data science productivity and provide a simplified path for new features to accelerate their journeys to production.

Steve Franks, senior solutions architect at Dataiku, advocated for Dataiku’s unified platform as the solution for remediating data management for successful ML adoption. Built to systemize AI and model lifecycle management, the Dataiku platform offers a ready-to-use full stack that can dramatically accelerate production for ML projects.

Users can scale quickly while maintaining control of AI with Dataiku, explained Franks. Franks then divided the process into three steps: centralize and prioritize with a focused AI portfolio and control tower; explain and qualify with an accompanying project overview and project risk/value qualification; and finally, deploy and monitor with health monitoring, assertion checks, and scenario deployment.

Zohar Vittenberg, data science team leader at Explorium, positioned external data—and the right kind—as the necessary data management foundation toward efficient ML adoption. Vittenberg explained that the eruption of alternative data providers creates a fragmented domain approach that ultimately limits an enterprise’s ability to find the exact data it needs.

According to a 2022 Explorium survey, 44% of organizations acquire external data from five or more sources, yet many cannot find the right data, integrate it with their data, or feed it into predictive models. Fifty-two percent of enterprises desire an easy way to match and integrate external data with internal data, while 46% want a better way to find and source the right data for their use case.

Furthermore, the typical data acquisition process is lengthy and expensive; data scientists are forced to waste time and financial resources on manual data search, validation, procurement, integration, monitoring, and maintenance.

Vittenberg then introduced Explorium’s external data platform, designed to holistically catalog organizational data and automatically discover what’s most relevant to its user. The platform can easily integrate, transform, and feed external data into predictive models and end systems, enhancing and streamlining the ML development and production process.

Zach Imholte, deployment strategist at Palantir, introduced Foundry, an Ontology-powered operating system for the modern enterprise, which can be implemented to advance ML projects.

Foundry’s goal is to allow real-time connectivity between data, analytics, and operational teams, ultimately encouraging a system that enables the building and deployment of increasingly sophisticated and valuable applications.

The Ontology is at the center of building AI/ML with power and scale, according to Imholte; it is the nucleus of the system, providing real-time connectivity between data, analytics, and operational teams.

Ontology Hydration, which quickly scales pipelines, allows users to synchronize data, models, and applications in a matter of days. The Ontology can be custom-fitted to the context of an enterprise, ranging from easily defined objects to key decisions and behaviors to composable functions.

Accompanying the Ontology is an expanse of APIs that read from and enrich the system, allowing users to build individual apps or support third-party tools. Foundry works to close the loop between AI and ML operations, allowing these models to be continuously improved upon and managed over time.

To learn more about data management solutions for ML success and customer examples, you can view an archived version of the webinar here.