IBM Research and PyTorch Develop Open-Source, Cloud-Native AI Training Network

IBM Research and PyTorch are partnering to enable foundation models with billions of parameters to easily run on standard cloud networking infrastructure, such as Ethernet networking.

Researchers at IBM have been working with the distributed team within PyTorch, the open-source machine learning platform run by the Linux Foundation, to find a way to train large AI models on affordable networking hardware. The group’s research has shown it’s possible to scale and train large models using regular Ethernet-based networking on Red Hat’s OpenShift platform.

With PyTorch’s Fully Sharded Data Parallel (FSDP) API, the team was able to successfully train models with 11 billion parameters using standard Ethernet networking on IBM Cloud. For models of this size, the approach achieves training efficiency on par with high-performance computing (HPC) networking systems, making HPC networking infrastructure virtually obsolete for small and medium-scale AI models.

These types of models with billions of parameters, called foundation models, can be repurposed from one task to another with little fine-tuning, eliminating the countless hours of training and labelling otherwise needed to refit a model for a new task.

Foundation models have been primarily trained on high-end high-performance computing (HPC) infrastructure, which, while reliable, is a costly barrier to entry for many looking to train foundation models for their own uses. These systems for training AI models have to be custom designed, rarely relying on commodity hardware options. Top-of-the-line GPUs are paired with low-latency InfiniBand network systems, which are costly to set up and run and also require bespoke operating processes, raising the cost even further.

The infrastructure the team used for this work was essentially off-the-shelf hardware. Running on the IBM Cloud, the system consists of 200 nodes, each with eight Nvidia A100 80GB cards, 96 vCPUs, and 1.2TB CPU RAM.
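To put those figures in perspective, the short Python sketch below totals the cluster's GPU resources and estimates why an 11-billion-parameter model needs to be sharded across GPUs at all. The figure of roughly 16 bytes of training state per parameter is an assumption, not something stated by the team: it is a common rule of thumb for mixed-precision training with the Adam optimizer (weights, gradients, and optimizer state), and it ignores activation memory.

```python
# Figures quoted in the article.
NODES = 200
GPUS_PER_NODE = 8
GPU_MEMORY_GB = 80
MODEL_PARAMS = 11e9

# ASSUMPTION (not from the article): about 16 bytes of training state per
# parameter, a rule of thumb for mixed-precision training with Adam.
BYTES_PER_PARAM = 16

total_gpus = NODES * GPUS_PER_NODE
total_gpu_memory_tb = total_gpus * GPU_MEMORY_GB / 1000

# Estimated weights + gradients + optimizer state for the whole model.
state_gb = MODEL_PARAMS * BYTES_PER_PARAM / 1e9
fits_on_one_gpu = state_gb <= GPU_MEMORY_GB

# If that state is split evenly across the eight GPUs of a single node.
state_per_gpu_one_node_gb = state_gb / GPUS_PER_NODE

print(f"GPUs in cluster:            {total_gpus}")
print(f"Total GPU memory (TB):      {total_gpu_memory_tb:.0f}")
print(f"Estimated model state (GB): {state_gb:.0f}")
print(f"Fits on one 80GB GPU:       {fits_on_one_gpu}")
print(f"Per-GPU share, 1 node (GB): {state_per_gpu_one_node_gb:.0f}")
```

Under that assumption, the model's training state alone is larger than a single card's memory, which is the situation that sharding approaches such as FSDP are designed to address by splitting parameters, gradients, and optimizer state across GPUs.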

The GPU cards within a node are connected via NVLink with a card-to-card bandwidth of 600GBps, and the nodes are connected together by two 100Gbps Ethernet links with an SR-IOV based TCP/IP stack, providing a usable bandwidth of 120Gbps (for the 11B model, the team observed peak network bandwidth utilization of 32Gbps). Single root I/O virtualization (SR-IOV) is a PCIe specification that allows hardware like a network adapter to separate access to resources among PCIe hardware functions.
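The bandwidth numbers are easier to compare once they are in the same unit. The sketch below is a rough illustration only; it assumes that "GBps" means gigabytes per second and "Gbps" means gigabits per second, and the variable names are ours rather than the team's.

```python
# Figures quoted in the article.
NVLINK_GBYTES_PER_S = 600        # card-to-card bandwidth within a node
ETHERNET_LINKS = 2
LINK_GBITS_PER_S = 100           # per Ethernet link
USABLE_GBITS_PER_S = 120         # usable node-to-node bandwidth
PEAK_OBSERVED_GBITS_PER_S = 32   # peak utilization seen for the 11B model

# Raw node-to-node capacity before protocol and virtualization overhead.
raw_inter_node_gbps = ETHERNET_LINKS * LINK_GBITS_PER_S

# ASSUMPTION: "GBps" is gigabytes per second, so multiply by 8 for gigabits.
nvlink_gbps = NVLINK_GBYTES_PER_S * 8

usable_fraction = USABLE_GBITS_PER_S / raw_inter_node_gbps
peak_utilization = PEAK_OBSERVED_GBITS_PER_S / USABLE_GBITS_PER_S
intra_to_inter_ratio = nvlink_gbps / USABLE_GBITS_PER_S

print(f"Raw inter-node bandwidth (Gbps): {raw_inter_node_gbps}")
print(f"Usable share of raw bandwidth:   {usable_fraction:.0%}")
print(f"Peak use of usable bandwidth:    {peak_utilization:.0%}")
print(f"NVLink vs. usable Ethernet:      {intra_to_inter_ratio:.0f}x")
```

Read this way, training the 11B model used only about a quarter of the usable Ethernet bandwidth at its peak, even though the links between nodes are far slower than the NVLink connections inside each node.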

“We wanted to invest more in GPUs—not the networking hardware,” said Raghu Ganti, a master inventor at IBM Research working on scaling foundation models.

This GPU system has been running since May and is configured with the Red Hat OpenShift container platform to run AI workloads. The team is building a production-ready software stack for end-to-end training, fine-tuning, and inference of large AI models.

In 2023, the goal of the joint team is to continue scaling this technology to handle even larger models.

For more information about this news, visit