7 Key Considerations: Data Flows for Machine Learning Projects

The flexibility, agility and ultimate cost of machine learning projects can be significantly impacted by data logistics and dependencies, according to Jim Scott, VP, Enterprise Architecture, at MapR. By improving how they pursue machine learning, Scott contends, organizations can attain benefits in both long-term costs and maintenance.

Recently, Scott discussed common issues he sees in working with data scientists and ways to improve processes for implementing data flows into machine learning that will ultimately make their lives easier.

What are some of the issues involved in data flows for machine learning projects? Is it mainly model management?

Model management is definitely part of it, but there is more to it than that. For the data scientist, it is really the entire workflow. That workflow includes model management because it extends to going into production and then being able to swap models out while in production.

What do people fail to grasp about workflows for machine learning?

In thinking about the overarching topic of implementing machine learning or AI in a process, it is often assumed that a data scientist can simply do it, build it, and deliver it. However, doing it once is one thing. Building it a second, third, or tenth time, and turning something that is not quite standard software engineering into something that fits a software engineering type of model, is very different.

What are the key challenges?

Not only do data scientists need to write software like software engineers do, but they need to be able to consume the data created by the software engineer.

They need to be able to version their software, and there are technologies such as Git to handle that, but data scientists also need to be able to version the data that is being used as an input to those machine learning models. And they need to be able to version the outputs so that, as they change their models, they can see whether those changes are trending in the right direction or the wrong direction. If they have been changing the source data to fit how their models work, or changing the types of data they pull in, they can verify that over time and take a rational approach to it, as opposed to trying to figure out what they did 15 iterations ago.
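As a rough illustration of versioning inputs and outputs alongside code, the sketch below records one entry per model iteration. It is a minimal sketch, not a tool Scott describes: the names `fingerprint`, `RunRecord`, `record_run`, and `trend` are hypothetical, the content hash stands in for a real data versioning system, and the commit ids and accuracy figures are made-up sample values.

```python
import hashlib
import json
from dataclasses import dataclass


def fingerprint(rows):
    """Return a short content hash that identifies this exact data (assumed scheme)."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]


@dataclass
class RunRecord:
    code_version: str    # e.g. a Git commit id, supplied by the caller
    input_version: str   # fingerprint of the data fed to the model
    output_version: str  # fingerprint of what the model produced
    accuracy: float


def record_run(history, code_version, inputs, outputs, accuracy):
    """Append one iteration to the history so it can be traced later."""
    record = RunRecord(code_version, fingerprint(inputs), fingerprint(outputs), accuracy)
    history.append(record)
    return record


def trend(history):
    """Accuracy change between consecutive runs; positive means improving."""
    return [round(b.accuracy - a.accuracy, 4) for a, b in zip(history, history[1:])]


# Hypothetical sample iterations.
history = []
record_run(history, "abc123", [{"x": 1, "y": 0}], [0], 0.80)
record_run(history, "def456", [{"x": 1, "y": 0}, {"x": 2, "y": 1}], [0, 1], 0.85)
print(trend(history))
```

With a record like this, "what did I do 15 iterations ago" becomes a lookup of the code, input, and output versions for that entry rather than a reconstruction from memory.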

So the pipeline is different for software engineers and data scientists?

It is similar, but it is different enough. Data scientists have to take data into consideration. They are not just generating data; they have to input and output data, and be able to work on data on both sides of their pipeline. Then, of course, they have got to put this into a production process so, arguably, it is a bit more complicated than a standard software engineering approach.

What else?

The second problem is from the model management perspective.

When a new iteration of a model is required because the current one is not accurate enough, it is necessary to test the new iteration and compare its results to the first using an event-based model.

Then, if you are getting the results you want, you would flip the version you want to be live and turn the old one off. That is one of the things we have got documented as part of the MapR rendezvous architecture. You can put as many models as you want into production. They can all listen to the same event streams and they all put the results in one location, and then there is a little piece of business logic sitting on the other side that says: OK, everybody put their answers here and the best answer is selected for use. You might use one model now and then consider the others later or take a blended average of the top three. There is a lot of flexibility in the rendezvous architecture for model management.
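The idea of several models listening to the same events, with selection logic on the far side, can be sketched in a few lines. This is only an illustration under stated assumptions, not MapR's implementation: the `rendezvous` function and the three toy models are hypothetical, each model is assumed to return a score together with a self-reported confidence, and ranking by that confidence is an assumed stand-in for whatever "best answer" rule the business logic actually applies.

```python
from statistics import mean


# Hypothetical models: each takes an event and returns (score, confidence).
def model_a(event):
    return event["amount"] / 100, 0.70


def model_b(event):
    return event["amount"] / 50, 0.90


def model_c(event):
    return event["amount"] / 25, 0.80


MODELS = {"model_a": model_a, "model_b": model_b, "model_c": model_c}


def rendezvous(event, models, top_n=1):
    """Send one event to every model, gather the answers, then select a result."""
    # Every model sees the same event and writes to the same collection point.
    results = []
    for name, model in models.items():
        score, confidence = model(event)
        results.append({"model": name, "score": score, "confidence": confidence})

    # Business logic: rank by confidence (an assumed criterion), then either
    # use the single best answer or average the top few.
    ranked = sorted(results, key=lambda r: r["confidence"], reverse=True)
    chosen = ranked[:top_n]
    return {
        "answer": mean(r["score"] for r in chosen),
        "used": [r["model"] for r in chosen],
        "all": ranked,  # the other answers are kept for later comparison
    }


event = {"amount": 100}
print(rendezvous(event, MODELS))           # single best answer
print(rendezvous(event, MODELS, top_n=3))  # blended average of the top three
```

Because every model's answer is retained, switching which model is live amounts to changing the selection rule rather than redeploying anything.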

What is the goal of the seven steps you have identified?

The biggest problem that this is intended to help address is friction. It is the amount of effort required by the data science community to do their jobs.

A commonly referenced statistic is that 90% of a data scientist’s time is spent working on data logistics, which means that only 10% of their time is spent actually solving the problem that they have expertise in. The more that we can reduce the time spent on data logistics, the more opportunity there is for data scientists to focus on their core competency. Data jockeying is not the fun part. It is very mundane and it gets old very fast.

Scott’s seven key recommendations for implementing data flows into machine learning models include:

  1. Workflows are a critical next step for machine learning and data logistics management. Users must stop performing every step manually if they hope to move a workload into a production environment. Service levels can be maintained more easily when built into a workflow. Exception management becomes much easier within a well-defined pipeline. Of course, the workflow tooling itself must be easy to move between environments.
  2. Data dependencies cost more than code dependencies. Code dependencies and the technical debt they create are fairly well understood and, with disciplined development practices, are easy to track and control. Data dependencies, on the other hand, are less well understood. A Google white paper refers to machine learning as the high-interest credit card of technical debt and describes how long-term costs and obstacles to agility can unwittingly be created by relatively straightforward machine learning projects.
  3. Hidden costs absorb future developer and data scientist bandwidth. When users perform steps in the AI-SDLC manually, there is room for mistakes and for delays in completing tasks. When training has been performed on static data and the data then changes to real time, users will be impacted. Being able to version data along the way for traceability is a requirement not only for sanity checking, but also for tracking changes over time. Workflows must be leveraged in order to ease the burden on the end user.
  4. Take a broad view of future interactions. Initial applications could be part of broader operational use cases where shared data needs to be accessed or updated as part of other workflows. Recording the raw data is really a big deal for workflow traceability. Databases were made for updates; streams are safer and a more natural fit for a user pipeline, especially when dealing with real-time data. Workflows must be able to scale out to support successful use cases. When considering adding more data sources in the future, will copying data between systems cause a complete collapse, whether under data gravity or due to latency?
  5. Development with microservices in containers helps alleviate the burden of making changes to a user’s pipeline. Since each microservice is deployed as a single unit of work, it becomes very easy to string them together into workflows to create a more modular and nimble system.
  6. Flexibility with respect to AI and machine learning is paramount to success. When new algorithms or new tools come along, what do you do? A plan to swap versions of models in a production environment is required. Additionally, can multiple models be run side by side for testing?
  7. Plan for workflows to be moved. Whether the workflow is run in an on-premises data center, in the cloud, or at the edge, workflows must be portable. The pipelines must be abstracted from the physical infrastructure to be able to run anywhere.
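To make the points about replacing manual steps and managing exceptions inside a defined pipeline a little more concrete, here is a minimal sketch of steps strung together into a workflow. It assumes nothing about any particular workflow tool: `run_workflow`, `clean`, and `scale` are hypothetical names, and plain Python functions stand in for containerized microservices.

```python
def clean(values):
    """Drop missing readings."""
    return [v for v in values if v is not None]


def scale(values):
    """Scale readings relative to the largest one; raises if nothing is left."""
    top = max(values)
    return [v / top for v in values]


# The workflow is just an ordered list of named, independent steps,
# so a step can be swapped or added without touching the others.
STEPS = [("clean", clean), ("scale", scale)]


def run_workflow(steps, payload):
    """Run the steps in order; stop and report at the first failure."""
    log = []
    for name, step in steps:
        try:
            payload = step(payload)
            log.append((name, "ok"))
        except Exception as exc:
            # Exception handling lives in one place instead of in each manual step.
            log.append((name, f"failed: {exc}"))
            return None, log
    return payload, log


print(run_workflow(STEPS, [2, None, 4]))  # both steps succeed
print(run_workflow(STEPS, [None]))        # "scale" fails and the log says where
```

Nothing in the step list refers to where it runs, which is the property that would let the same definition be moved between a data center, the cloud, or the edge.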