In the last few years, the industry has seen incredible interest and uptake in NVMe (Non-Volatile Memory Express) flash storage. Why? Put simply, the benefits of higher IOPS and throughput, along with lower latency, could not be ignored, even at costs significantly higher than those of spinning disks. However, as adoption has increased, even the traditional benefits of spinning disks have been eroded as the capacity of SSDs skyrocketed and economies of scale drove down the cost of flash.
NVMe quickly gained a foothold in the client space as the storage interface of choice for higher-end desktops, laptops, and even phones. In the enterprise space, NVMe has been used as a direct-attached storage (DAS) option, where storage is accessed through a PCIe interface on the CPU within the server. The advent of NVMe over Fabrics (NVMe-oF) promises to expand the utility of NVMe in the datacenter beyond just a DAS replacement, which will push the demand for flash storage even higher.
What is NVMe-oF?
NVMe-oF is a mapping of the already optimized NVMe protocol on top of a network fabric transport. The fabric technologies getting the most attention right now, and the ones we’ll focus on here, are RoCE (RDMA over Converged Ethernet) and Fibre Channel (FC). Although other fabric technologies are certainly possible and viable, the majority of products being developed now use one of these two fabric types.
NVMe-oF defines how NVMe packets can be carried over a RoCE or FC fabric with minimal overhead and extra processing. This is essentially "native" transportation of NVMe over a fabric, whether it's FC or RoCE. "Native" transportation means little to no translation of NVMe into an Ethernet or Fibre Channel packet. Translations increase latency, so fewer translations mean a shorter code path and lower latency. Thus, NVMe-oF maintains the low latency benefits of NVMe even when the storage media is remote.
What’s Driving NVMe-oF into the Data Center?
As more applications move to the cloud, a virtuous cycle has emerged where the cost of hosting applications in the cloud has decreased dramatically, leading to larger and more complicated workloads migrating to the cloud, leading to further investment in high performance data centers, which in turn drives down the cost of using the cloud. The use case of deploying NVMe beyond just a single server has become critical, and that is the driving force behind NVMe-oF.
By connecting NVMe flash storage directly to a fabric, NVMe-oF allows the creation of large pools of flash storage that enable operating on and storing massive datasets. Thus, NVMe-oF is another step in the disaggregation of storage and compute. With NVMe-oF, it's no longer necessary to access NVMe as DAS on a server. Instead, storage can be provisioned and managed from any single server, since it is available to many servers attached to the fabric. In a hyperscale datacenter with thousands of servers and storage arrays, this really starts to have a high dollar impact on operations. If a datacenter can run certain applications faster, it can schedule more workloads in the same amount of time. The ROI becomes clear very quickly.
Choosing the Right Fabric – RoCE or FC
A fabric enables high availability of storage and reliable access. Put simply, a fabric can have multiple routes between a server and storage, and between two storage nodes in the case of a zero-copy operation. Through a network fabric, it becomes much easier to design and deploy failover operations, as there are multiple redundant paths. Further, a fabric enables many more options for load balancing when part of the network becomes congested.
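To make the failover and load-balancing idea concrete, here is a minimal sketch of path selection over redundant routes. The `Path` class, its fields, and the `pick_path` helper are hypothetical names used only for illustration; they are not part of any NVMe-oF driver or multipath stack. The policy shown (skip failed paths, then prefer the one with the fewest outstanding I/Os) is one simple choice among many.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """One route from a server to a storage target (hypothetical model)."""
    name: str
    healthy: bool = True
    outstanding_io: int = 0


def pick_path(paths):
    """Return the healthy path with the fewest outstanding I/Os.

    Failed paths are skipped (failover); among the survivors the least
    loaded one wins (load balancing). Ties go to the earliest path.
    """
    candidates = [p for p in paths if p.healthy]
    if not candidates:
        raise RuntimeError("no healthy path to storage")
    return min(candidates, key=lambda p: p.outstanding_io)


# Two redundant routes through the fabric to the same storage target.
paths = [Path("fabric-a", outstanding_io=3), Path("fabric-b", outstanding_io=1)]
first_choice = pick_path(paths).name      # least-loaded healthy path

paths[1].healthy = False                  # simulate a link failure on fabric-b
after_failure = pick_path(paths).name     # traffic moves to the surviving path
```

With a single direct-attached path there is nothing to fall back to, which is the contrast the redundant fabric routes are meant to address.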
Performance and congestion can be mitigated and managed on an RDMA fabric by enabling PFC (Priority Flow Control) and ECN (Explicit Congestion Notification). Essentially, these can be used to create a lossless network out of Ethernet. A big challenge for implementers will be determining how best to tune their PFC and ECN use to specific NVMe workloads to get the best performance for their combination of hardware and application.
The sheer volume of already purchased Ethernet infrastructure, and the availability of technicians who understand Ethernet, make RoCE attractive. Using non-DCB switches and simple NICs can drive cost down further, although the ability to manage traffic will be reduced as well. Clearly, we will see many varieties of NVMe-oF deployments.
A fabric has further advantages in that it can be used for different applications and protocols simultaneously. For example, in a Fibre Channel network, FCP and FC-NVMe can be deployed and run across the same FC fabric. FCP was designed to carry SCSI, and has done that well for a long time. However, it is not "native" for flash, which means too many translations. FC-NVMe reduces the translations and allows much lower latency and higher performance. Both protocols can be run on the same FC fabric simultaneously.
Users can continue to use FCP to connect to SCSI storage for appropriate applications and workloads. The FC-NVMe protocol can be deployed on existing FC infrastructure (assuming bandwidth needs are met) for applications with higher performance needs. Plus, CIOs like infrastructure investments that let them further leverage equipment they already own.
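As a rough illustration of how the two mechanisms relate, the sketch below models one switch queue with two thresholds. ECN marks traffic early so senders slow down. PFC pauses the upstream port as a last resort before the buffer overflows. The function name, the threshold parameters, and the numbers are all assumptions for illustration. Real switches typically mark probabilistically between a minimum and maximum threshold and expose vendor-specific settings, so this is a simplified model, not a configuration recipe.

```python
def switch_action(queue_kb, ecn_mark_kb, pfc_xoff_kb):
    """Decide what a (simplified, hypothetical) switch queue does.

    queue_kb     -- current queue occupancy for one traffic class
    ecn_mark_kb  -- occupancy at which packets start being ECN-marked
    pfc_xoff_kb  -- occupancy at which a PFC pause is sent upstream
    """
    if ecn_mark_kb >= pfc_xoff_kb:
        # ECN is meant to react before PFC has to; reject the reverse order.
        raise ValueError("ECN threshold should sit below the PFC threshold")
    if queue_kb >= pfc_xoff_kb:
        return "pause"      # PFC: stop the sender so nothing is dropped
    if queue_kb >= ecn_mark_kb:
        return "mark"       # ECN: signal congestion, keep forwarding
    return "forward"        # no congestion


# Illustrative thresholds only; tuning these per workload is the hard part.
actions = [switch_action(q, ecn_mark_kb=150, pfc_xoff_kb=400) for q in (50, 200, 450)]
```

Tuning, in this simplified picture, amounts to choosing the gap between the two thresholds so that ECN resolves most congestion and PFC pauses stay rare.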
The Road Ahead for NVMe-oF
NVMe-oF adoption is still just beginning. In 2017, NVMe-oF support was added to the Linux kernel and is being picked up by various Linux distributions.
Real challenges emerge in trying to build systems that can fully leverage NVMe-oF performance. Several published performance tests have shown that 4-port NVMe drives can consistently push 20+ Gb/s for write operations and over 25 Gb/s for read operations. Therefore, to confidently eliminate bottlenecks, a 25GbE adapter or a 32GFC adapter for each drive is not unreasonable. But the consequences for the system cannot be ignored: 25GbE and 32GFC adapters generally require 8 lanes of PCIe on the host side. Building a system with no bottlenecks and no oversubscription is not easy. Many systems folks are eagerly awaiting PCIe 4.0 to be able to push NVMe-oF benefits even further.
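The lane and oversubscription arithmetic behind this can be sketched in a few lines. The per-lane figures follow from the PCIe 3.0 and 4.0 signaling rates (8 and 16 GT/s with 128b/130b encoding). Everything else is an assumption for illustration: the helper names are made up, the adapter is taken to be dual-port, the drive figure is read as roughly 25 gigabits per second, and packet and protocol overhead is ignored, so real systems need some extra headroom.

```python
# Usable bandwidth per PCIe lane in Gb/s, after 128b/130b line encoding.
PCIE_GEN3_GBPS_PER_LANE = 8.0 * 128 / 130
PCIE_GEN4_GBPS_PER_LANE = 16.0 * 128 / 130


def lanes_needed(link_gbps, per_lane_gbps, ports=1):
    """Smallest standard PCIe width that can feed `ports` links at line rate."""
    required = link_gbps * ports
    for width in (1, 2, 4, 8, 16):
        if width * per_lane_gbps >= required:
            return width
    raise ValueError("required bandwidth exceeds a x16 slot")


def oversubscription(drive_gbps, drives, link_gbps, links):
    """Ratio of what the drives can deliver to what the network can carry."""
    return (drive_gbps * drives) / (link_gbps * links)


# Assumed dual-port 25GbE adapter: 50 Gb/s to move across the host bus.
gen3_lanes = lanes_needed(25, PCIE_GEN3_GBPS_PER_LANE, ports=2)
gen4_lanes = lanes_needed(25, PCIE_GEN4_GBPS_PER_LANE, ports=2)

# Four drives at an assumed 25 Gb/s each, sharing two 25GbE links.
ratio = oversubscription(drive_gbps=25, drives=4, link_gbps=25, links=2)
```

Under these assumptions the adapter needs a x8 slot on PCIe 3.0 but only x4 on PCIe 4.0, and four drives behind two links are oversubscribed two to one. Avoiding that means either more adapters, which consume more host lanes, or faster lanes.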
From this discussion of NVMe-oF, we can see that there are clear benefits to deploying NVMe-oF in the datacenter, including lower latency, remote storage access, better management and provisioning of flash storage, and the ability to disaggregate compute and storage. Many flavors of implementation will follow, as customers will be able to choose the fabric technology they want and how they want to manage network congestion, failover, and performance. They will also have hardware choices among 25, 40, and 100G Ethernet, and 32GFC and 128GFC. The answers to those questions will depend heavily on the workloads being deployed. In a way, NVMe-oF is a complex machine with many moving parts, and no doubt users will find the best way to get maximum performance out of their deployments. It will be exciting to see more use case studies of NVMe-oF deployments in the coming months and years as we start to recognize this technology's true potential.