AI pipelines have ample room for improvement, from optimizing performance and accelerating model loading to asynchronous hot-swapping, multi-cloud and multi-region Kubernetes clusters, and faster spin-up times and autoscaling.
Erwann Millon, founding engineer at Krea.ai, led the Data Summit’s session, “Taking a Deep Dive Into Cutting-Edge Techniques for Data Management,” to discuss best practices for maximizing AI pipelines through an in-depth case study of an inference pipeline serving hundreds of models on multi-cloud clusters.
The annual Data Summit conference returned to Boston, May 10-11, 2023, with pre-conference workshops on May 9.
Inference needs revolve around serving many models at low cost with high controllability, fast scaling, and high availability, Millon explained; ultimately, optimizing inference is the key to optimizing AI pipelines.
To serve models at this scale, quick loading is critical; with PyTorch and a few lines of code, users can achieve fast load times using the torch meta device, a "fake" torch device that allocates no memory and initializes no parameters.
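A minimal sketch of the idea, assuming PyTorch 2.0 or later (where `torch.device` works as a context manager): constructing even a large model on the meta device is near-instant because no parameter memory is touched.

```python
import torch
import torch.nn as nn

# Build a model on the "meta" device: no memory is allocated and no
# parameters are initialized, so construction is near-instant even
# for very large architectures.
with torch.device("meta"):
    model = nn.Sequential(
        nn.Linear(4096, 4096),
        nn.ReLU(),
        nn.Linear(4096, 4096),
    )

# Parameters exist structurally (shapes, dtypes) but hold no data yet.
first_weight = model[0].weight
```

In a real pipeline, the meta-device skeleton would later be materialized (for example with `nn.Module.to_empty`) and filled with weights loaded from disk.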
Because loading from disk is dominated by the sheer number of parameters, Hugging Face safetensors can load models 2-10x faster than pickle while also supporting zero-copy, direct-to-CUDA loading. Large models can also benefit from chunked state dicts that are saved as separate files and loaded in parallel using multithreading and performance libraries.
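The chunked-state-dict idea can be sketched with plain PyTorch and a thread pool; the `save_chunked`/`load_chunked` helpers below are hypothetical illustrations, not Krea.ai's implementation, and a production version would use safetensors rather than `torch.save`.

```python
import concurrent.futures
import pathlib
import tempfile
import torch

def save_chunked(state_dict, directory, n_chunks=4):
    # Split the state dict into n_chunks separate files on disk.
    keys = list(state_dict)
    paths = []
    for i in range(n_chunks):
        chunk = {k: state_dict[k] for k in keys[i::n_chunks]}
        path = pathlib.Path(directory) / f"chunk_{i}.pt"
        torch.save(chunk, path)
        paths.append(path)
    return paths

def load_chunked(paths):
    # Load all chunk files in parallel and merge them back together.
    merged = {}
    with concurrent.futures.ThreadPoolExecutor() as pool:
        for chunk in pool.map(torch.load, paths):
            merged.update(chunk)
    return merged

with tempfile.TemporaryDirectory() as tmp:
    original = {"w1": torch.randn(256, 256), "w2": torch.randn(256, 256)}
    paths = save_chunked(original, tmp, n_chunks=2)
    restored = load_chunked(paths)
```

Since `torch.load` releases the GIL during file I/O, threads alone can meaningfully overlap the disk reads.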
Millon posed the question, “Can we load models intelligently and in the background during inference for zero latency?”
Yes, he argued, though limitations in Python make this somewhat difficult. By using parent and child processes that communicate with each other, models can be loaded in the background. Critically, he noted, this cannot be accomplished on the CPU; instead, the child process loads the model directly onto the GPU to circumvent this limitation.
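The parent/child pattern can be sketched with the standard library alone. This is a simplified, CPU-only stand-in (the `load_weights` function and its payload are hypothetical): it shows only the process structure, whereas, as Millon noted, a real implementation would have the child load directly onto the GPU.

```python
import multiprocessing as mp
import time

def load_weights(name, queue):
    # Hypothetical stand-in for slow deserialization from disk;
    # in practice this is where the child loads the model to GPU.
    time.sleep(0.1)
    queue.put((name, {"layers": 12}))  # fake weights payload

# Use the fork start method (Unix) so the child inherits this code
# without re-importing the module.
ctx = mp.get_context("fork")
queue = ctx.Queue()
worker = ctx.Process(target=load_weights, args=("model-b", queue))
worker.start()

# The parent keeps serving inference while the child loads in the
# background; here we just simulate handling requests.
served = ["request-1", "request-2"]

name, weights = queue.get()  # receive the preloaded model
worker.join()
```

The key property is that the parent never blocks on the load: by the time the next model is requested, it is already resident.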
These optimizations can serve hundreds of stable diffusion models from a single A100 with zero latency. Despite this efficiency, it's still not enough to support inference effectively.
Storage can be optimized by sharing models across Krea.ai's Kubernetes cluster with an NFS solution backed by a write-through caching system, balancing spin-up time, agility, and scalability.
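The write-through idea can be sketched in a few lines; the `save_model`/`fetch_model` helpers below are hypothetical illustrations of the pattern, not Krea.ai's system. Writes go to both fast local disk and the shared NFS tier, and reads prefer the local copy, falling back to NFS only on a cold node.

```python
import pathlib
import shutil
import tempfile

def save_model(name, data, local_dir, nfs_dir):
    # Write-through: persist to both the local cache and shared NFS.
    for directory in (local_dir, nfs_dir):
        (pathlib.Path(directory) / name).write_bytes(data)

def fetch_model(name, local_dir, nfs_dir):
    # Prefer the fast local copy; on a cold node, pull from NFS once
    # and cache it locally for future spin-ups.
    local = pathlib.Path(local_dir) / name
    if not local.exists():
        shutil.copy(pathlib.Path(nfs_dir) / name, local)
    return local.read_bytes()

local_dir = tempfile.mkdtemp()  # stands in for node-local SSD
nfs_dir = tempfile.mkdtemp()    # stands in for the shared NFS mount
save_model("sd-v1.bin", b"weights", local_dir, nfs_dir)
cached = fetch_model("sd-v1.bin", local_dir, nfs_dir)
```

Because every model written is immediately visible on NFS, any node in the cluster can serve it, while repeat loads on a warm node never touch the network.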
Millon explained that although A100 GPUs are critical for this kind of inference optimization, they are difficult to acquire. He offered the following alternatives:
- Google Anthos
- Scaling on single instances (EC2/GCE)
- NVIDIA Multi-Instance GPUs
- Alternative GPUs like A10Gs
- Compensating for limited VRAM with quantization, model offloading, and model parallelism
- Inference on TPU
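Of the VRAM-saving options above, the simplest to illustrate is reduced precision. A minimal sketch in PyTorch: casting weights to fp16 halves their memory footprint, and real pipelines may go further with int8 quantization or by offloading idle layers to CPU.

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)

# Measure the weight's memory footprint before and after casting.
fp32_bytes = model.weight.element_size() * model.weight.nelement()
model = model.half()  # cast parameters to float16, halving VRAM use
fp16_bytes = model.weight.element_size() * model.weight.nelement()
```

The trade-off is numerical precision; for diffusion-model inference, fp16 is typically acceptable, which is why it is a common first step before heavier techniques like quantization.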
Millon's biggest recommendation was to profile everything: "If you profile everything at depth, this is the best way to identify your biggest performance hits," he explained.
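As a minimal sketch of that advice using only the standard library's cProfile (the `simulate_inference` function is a hypothetical stand-in for a pipeline stage; a GPU pipeline would additionally use `torch.profiler` for device-side detail):

```python
import cProfile
import io
import pstats

def simulate_inference():
    # Stand-in for a pipeline stage whose cost we want to measure.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
result = simulate_inference()
profiler.disable()

# Render the top functions by cumulative time into a report string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```

Profiling at this depth, function by function, is what surfaces which stage of the pipeline actually dominates latency.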
He concluded by highlighting the following takeaways:
- Load models faster with safetensors and parallelism
- Cache models asynchronously with multiprocessing
- Schedule jobs intelligently
- Maximize throughput
- Colocate everything
- Profile everything
Many Data Summit 2023 presentations are available for review at https://www.dbta.com/DataSummit/2023/Presentations.aspx.