Repeatable Machine Learning With Kubeflow

In the last couple of years, we have heard a lot about machine learning and the plethora of tools and frameworks available to support it. People have come to understand that without data and proper data operations practices, these machine learning tools are very difficult to leverage. While there is an abundance of great tools available, one persistent complaint remains: how to achieve simple repeatability. Workflow management must encompass the data scientist's pipeline for obtaining and preparing the data, building and testing models, and versioning the entire workflow. It must also support versioning the data inputs, outputs, logs, and even performance metrics.
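To make that bookkeeping concrete, here is a minimal sketch of what "versioning the inputs, outputs, and metrics" of a run means in practice. The function and field names are hypothetical, not drawn from any particular tool; the point is that a run is repeatable when everything that went into it is recorded together:

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    """Content hash used to version a data artifact."""
    return hashlib.sha256(data).hexdigest()


def record_run(inputs: dict, params: dict, outputs: dict, metrics: dict) -> str:
    """Build a JSON manifest tying together everything needed to repeat a run.

    `inputs` and `outputs` map artifact names to raw bytes; they are recorded
    by content hash so the exact data that was used can be identified later.
    """
    manifest = {
        "inputs": {name: sha256_of(blob) for name, blob in inputs.items()},
        "params": params,
        "outputs": {name: sha256_of(blob) for name, blob in outputs.items()},
        "metrics": metrics,
    }
    return json.dumps(manifest, sort_keys=True, indent=2)


# Two runs over the same data with the same parameters produce identical
# manifests -- which is exactly what "repeatable" means here.
manifest = record_run(
    inputs={"train.csv": b"a,b\n1,2\n"},
    params={"learning_rate": 0.01, "epochs": 10},
    outputs={"model.bin": b"\x00\x01"},
    metrics={"accuracy": 0.92},
)
print(manifest)
```

Real workflow tools keep this kind of record for every step automatically, which is what makes an entire pipeline auditable rather than just its final model.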

This workflow management task is one of the more critical pieces of infrastructure to support data scientists and the machine learning endeavors of any organization. In the book, Machine Learning Data Logistics, published by O’Reilly, it is noted that 90% of the time and effort that go into machine learning by a data scientist is devoted to data logistics. Anything that can be done to reduce that level of effort should be considered very important to the success of machine learning-based programs.

Most recently, we have seen some workflow tools rise to the occasion. They have gained rapid popularity for their flexibility, ease of adoption, simplicity in extending their capabilities, and—at least in some part—because they plug into Kubernetes (k8s). Plugging into k8s enables the workflow to be operated or scheduled by k8s, removing the often mundane and tedious task of identifying the hardware on which to run a workflow. This appeals to many end users because k8s has effectively become the data center resource manager of choice, the gold standard, if you will, for managing and allocating resources for jobs.

Kubeflow is a workflow tool that prides itself on making machine learning workflows simple to build, scalable, and portable. It provides graphical end-user tools to set up and define the steps in a pipeline. Most importantly, as data scientists build out their use cases and add more and more steps, with Kubeflow they end up with a documented, repeatable process.
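Conceptually, a pipeline is just a named sequence of steps with explicit data flow between them. The toy runner below is not the Kubeflow Pipelines SDK (which expresses pipelines in Python and compiles them to run on k8s); it is only a framework-free sketch, with hypothetical names, of why pipeline-as-code yields a documented, repeatable process:

```python
from typing import Callable


class Pipeline:
    """A minimal pipeline: ordered, named steps, where each step
    receives the previous step's output. Purely illustrative."""

    def __init__(self, name: str):
        self.name = name
        self.steps: list = []  # (step_name, function) pairs, in order

    def step(self, name: str):
        """Decorator that registers a function as the next pipeline step."""
        def register(fn: Callable) -> Callable:
            self.steps.append((name, fn))
            return fn
        return register

    def run(self, data):
        """Execute the steps in order; the step list doubles as documentation."""
        for name, fn in self.steps:
            print(f"[{self.name}] running step: {name}")
            data = fn(data)
        return data


ml = Pipeline("train-model")

@ml.step("ingest")
def ingest(_):
    return [3.0, 1.0, 2.0]

@ml.step("prepare")
def prepare(rows):
    return sorted(rows)

@ml.step("train")
def train(rows):
    # Stand-in for model fitting: just average the prepared values.
    return sum(rows) / len(rows)

result = ml.run(None)
print(result)  # → 2.0
```

Because the steps are declared rather than run ad hoc in a notebook, the pipeline itself is the documentation: rerunning it later, or handing it to a colleague, reproduces the same sequence.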

Kubeflow delivers on the promise of portability by enabling users to create and run workflows on their desktops. That same workflow can then be moved to another environment, such as production. Abstractions to separate the shape and size of the environment are built into k8s so the user does not need to worry about how to deploy into a test or production environment. This is a big benefit since development, testing, and production environments nearly always look different from one another, ultimately leading to reduced friction for the data scientist.

Since we are talking about repeatability for machine learning, it is important to keep GPUs in mind. The GPU has a place in many of these workflows, and a data scientist may use an algorithm that is CPU- or GPU-based. The process gets more complicated when a GPU is required in an environment while trying to abstract the environment to make the workflow portable. This is a problem that Kubeflow with k8s solves. Kubeflow supports Chainer for performing model training, and many more popular frameworks that leverage GPUs have been integrated into Kubeflow: Jupyter notebooks, TensorFlow, MXNet, PyTorch, Katib, Horovod, Istio, Caffe2, and TensorRT are all supported. This is just a short list of the many integrations that exist, and, if nothing else, it should give confidence to anyone considering Kubeflow that it is the workflow tool of choice for ML workloads.
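The reason this abstraction works is that, under k8s, a workload declares a GPU as a resource and the scheduler finds a node that can satisfy it; the workload definition itself stays portable. A minimal container spec of that kind (pod and image names are hypothetical) looks roughly like:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-step                      # hypothetical step name
spec:
  containers:
  - name: trainer
    image: example.com/train:latest     # hypothetical training image
    resources:
      limits:
        nvidia.com/gpu: 1               # scheduler places this pod on a GPU node
```

The same spec runs unchanged on any cluster that exposes GPUs through the device plugin, which is what lets a GPU-based step move between environments without edits.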

Something can be a great tool, but if the barrier to entry is too high, people won't use it. Fortunately for data scientists, Kubeflow is easy to get started with. Kubernetes is the key requirement, and it is easily attainable on the desktop with microk8s or minikube. Aside from that, there is a "quick start" on the Kubeflow website that takes minutes to work through.

While workflow management is one of the most important items in enabling machine learning success, it is also important to differentiate between generic workflow management tools and machine learning-based workflow management tools. This is the difference between the jack-of-all-trades and the master of machine learning workflow management. Having that depth baked into a solution is critical because the needs of data scientists are very different from those of traditional software engineers and even systems administrators.

