The IO Acceleration Platform for the Data Center
Companies are refining their data and becoming intelligence manufacturers. Data centers are becoming AI factories, enabled by accelerated computing, which has sped up computing by a factor of a million. However, accelerated computing requires accelerated IO. NVIDIA Magnum IO™ is the architecture for parallel, intelligent data center IO. It maximizes storage, network, and multi-node, multi-GPU communications for the world’s most important applications, including large language models, recommender systems, imaging, simulation, and scientific research.
NVIDIA Magnum IO utilizes storage IO, network IO, in-network compute, and IO management to simplify and speed up data movement, access, and management for multi-GPU, multi-node systems. Magnum IO supports NVIDIA CUDA-X™ libraries and makes the best use of a range of NVIDIA GPU and NVIDIA networking hardware topologies to achieve optimal throughput and low latency.
[Developer Blog] Magnum IO - Accelerating IO in the Modern Data Center
In multi-GPU, multi-node systems, slow single-threaded CPU performance sits in the critical path of data access from local or remote storage devices. With storage IO acceleration, the GPU bypasses the CPU and system memory and accesses remote storage directly through 8x 200 Gb/s NICs, achieving up to 1.6 Tb/s of raw storage bandwidth.
Technologies Included:
NVIDIA NVLink®, NVIDIA Quantum InfiniBand, Ethernet networks, and RDMA-based network IO acceleration reduce IO overhead, bypassing the CPU and enabling direct data transfers to GPUs at line rates.
In-network computing delivers processing within the network, eliminating the latency introduced by traversing to the endpoints and any hops along the way. Data processing units (DPUs) introduce software-defined, network hardware-accelerated computing, including pre-configured data processing engines and programmable engines.
To deliver IO optimizations across compute, network, and storage, users need deep telemetry and advanced troubleshooting techniques. Magnum IO management platforms empower research and industrial data center operators to efficiently provision, monitor, manage, and preventatively maintain the modern data center fabric.
NVIDIA Magnum IO interfaces with NVIDIA high performance computing (HPC) and AI libraries to speed up IO for a broad range of use cases—from AI to scientific visualization.
Today, data science and machine learning (ML) are the world's largest compute segments. Modest improvements in the accuracy of predictive ML models can translate into billions of dollars to the bottom line.
To accelerate these workloads, the RAPIDS™ Accelerator for Apache Spark includes a built-in accelerated shuffle based on UCX that can be configured to leverage GPU-to-GPU communication and RDMA capabilities. Combined with NVIDIA networking, NVIDIA Magnum IO software, GPU-accelerated Spark 3.0, and RAPIDS, the NVIDIA data center platform is uniquely positioned to speed up these huge workloads with unprecedented performance and efficiency.
GPUDirect Storage (GDS) has been integrated with the RAPIDS ORC, Parquet, CSV, and Avro readers. RAPIDS cuIO has achieved up to a 4.5X performance improvement with Parquet files using GDS on large-scale workflows.
Adobe Achieves 7X Speedup in Model Training with Spark 3.0 on Databricks for a 90% Cost Savings
To unlock next-generation discoveries, scientists rely on simulation to better understand complex molecules for drug discovery, physics for new sources of energy, and atmospheric data to better predict extreme weather patterns. Leading simulation codes and applications leverage NVIDIA Magnum IO to enable faster time to insight. Magnum IO exposes hardware-level acceleration engines and smart offloads, such as RDMA, NVIDIA GPUDirect, and NVIDIA SHARP, while building on the high bandwidth and ultra-low latency of NVIDIA InfiniBand and NVLink-connected GPUs.
In multi-tenant environments, user applications can suffer indiscriminate interference from neighboring application traffic, often without the user being aware of it. Magnum IO, on the latest NVIDIA Quantum-2 InfiniBand platform, features new and improved capabilities for mitigating that impact on a user’s performance. This delivers optimal results, as well as the most efficient HPC and ML deployments at any scale.
Magnum IO Libraries and HPC Apps
VASP performance improves significantly when MPI is replaced with NCCL. UCX accelerates scientific computing applications, such as VASP, Chroma, MIA-AI, FUN3D, CP2K, and SPEChpc 2021, delivering faster wall-clock run times.
NVIDIA HPC-X increases CPU availability, application scalability, and system efficiency for improved performance in applications distributed by various HPC ISVs. NCCL, UCX, and HPC-X are all part of the NVIDIA HPC SDK.
Fast Fourier Transforms (FFTs) are widely used in fields ranging from molecular dynamics, signal processing, and computational fluid dynamics (CFD) to wireless multimedia and ML applications. By using the NVIDIA Shared Memory Library (NVSHMEM™), cuFFTMp is independent of the MPI implementation and operates close to the speed of light, which is critical because performance can vary significantly from one MPI library to another.
QUDA, the library for Lattice Quantum Chromodynamics (QCD) on CUDA, can use NVSHMEM for communication to reduce overheads from CPU and GPU synchronization and to improve the overlap of compute and communication. This reduces latencies and improves strong scaling.
Multi-Node Multi-GPU: Using NVIDIA cuFFTMp FFTs at Scale
Largest Interactive Volume Visualization - 150TB NASA Mars Lander Simulation
The emerging class of exascale HPC and trillion-parameter AI models for tasks like superhuman conversational AI require months to train, even on supercomputers. Compressing this to the speed of business and completing training within days requires high-speed, seamless communication between every GPU in a server cluster so they can scale performance. The combination of NVIDIA NVLink, NVIDIA NVSwitch, Magnum IO libraries, and strong scaling across servers delivers AI training speedups of up to 9X on Mixture of Experts (MoE) models. This allows researchers to train massive models at the speed of business.
Magnum IO Libraries and Deep Learning Integrations
NCCL and other Magnum IO libraries transparently leverage the latest NVIDIA H100 GPU, NVLink, NVSwitch, and InfiniBand networks to provide significant speedups for deep learning workloads, particularly recommender systems and large language model training.
The benefits of NCCL include faster time to target model accuracy while achieving close to 100 percent of the available interconnect bandwidth between servers in a distributed environment.
Magnum IO GPUDirect Storage (GDS) has been enabled in the NVIDIA Data Loading Library (DALI) through the Numpy reader operator. With DALI, GDS delivers up to a 7.2X performance increase in deep learning inference compared to the baseline Numpy reader.
Enabling researchers to continue pushing the envelope of what's possible with AI requires powerful performance and massive scalability. The combination of NVIDIA Quantum-2 InfiniBand networking, NVLink, NVSwitch, and the Magnum IO software stack delivers out-of-the-box scalability for hundreds to thousands of GPUs operating together.
Performance Increases 1.9X on LBANN with NVSHMEM vs. MPI
GPUs are being used to accelerate complex and time-consuming tasks in a range of applications, from on-air graphics to real-time stereoscopic image reconstruction.
NVIDIA GPUDirect for Video technology allows third-party hardware to communicate efficiently with NVIDIA GPUs and minimizes historical latency issues. With NVIDIA GPUDirect for Video, IO devices are fully synchronized with the GPU and the CPU, minimizing cycles wasted copying data between device drivers.
GPUDirect Storage (GDS) integrates with cuCIM, an extensible toolkit designed to provide GPU-accelerated IO, computer vision, and image processing primitives for N-dimensional images, with a focus on biomedical imaging.
In the following two examples, NVIDIA IndeX® is used with GDS to accelerate the visualization of the very large data sets involved.
Visualize Microscopy Images of Living Cells in Real Time with NVIDIA Clara™ Holoscan
> NVIDIA Magnum IO GitHub
> NVIDIA GPUDirect Storage: A Direct Path Between Storage and GPU Memory
> Accelerating IO in the Modern Data Center: Network IO
> Accelerating NVSHMEM 2.0 Team-Based Collectives Using NCCL
> Optimizing Data Movement in GPU Applications with the NVIDIA Magnum IO Developer Environment
> Accelerating Cloud-Native Supercomputing with Magnum IO
> Access MOFED
Sign up for NVIDIA Magnum IO news and updates.
Facilitates IO transfers directly into and out of GPU memory, removing data path bottlenecks to and from the CPU and system memory. It avoids the latency overhead of an extra copy through system memory, which particularly impacts smaller transfers, and relieves the CPU utilization bottleneck by operating with greater independence. A minimal cuFile sketch follows the links below.
LEARN MORE ›
Read Blog: GPUDirect Storage: A Direct Path Between Storage and GPU Memory
Watch Webinar: NVIDIA GPUDirect Storage: Accelerating the Data Path to the GPU
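To illustrate the direct path GDS provides, here is a minimal sketch using the cuFile API. The file path, the use of O_DIRECT, and the 1 MiB transfer size are illustrative assumptions, and error checking is trimmed for brevity; this is a sketch of the pattern rather than a drop-in implementation.

```c
/* Minimal GPUDirect Storage sketch using the cuFile API (link with -lcufile -lcudart).
 * The file path and transfer size are illustrative assumptions. */
#define _GNU_SOURCE
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void) {
    const size_t size = 1 << 20;                     /* 1 MiB transfer (example) */
    int fd = open("/mnt/nvme/sample.bin", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    cuFileDriverOpen();                              /* initialize the GDS driver */

    CUfileDescr_t descr;
    memset(&descr, 0, sizeof(descr));
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);           /* register the file with cuFile */

    void *gpu_buf = NULL;
    cudaMalloc(&gpu_buf, size);
    cuFileBufRegister(gpu_buf, size, 0);             /* pin the GPU buffer for DMA */

    /* DMA directly from storage into GPU memory: no bounce buffer in system memory. */
    ssize_t n = cuFileRead(handle, gpu_buf, size, 0 /*file offset*/, 0 /*buffer offset*/);
    printf("read %zd bytes directly into GPU memory\n", n);

    cuFileBufDeregister(gpu_buf);
    cudaFree(gpu_buf);
    cuFileHandleDeregister(handle);
    cuFileDriverClose();
    close(fd);
    return 0;
}
```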
Logically presents networked storage, such as NVMe over Fabrics (NVMe-oF), as a local NVMe drive, allowing the host OS or hypervisor to use a standard NVMe driver instead of a remote networked storage protocol.
Set of libraries and optimized NIC drivers for fast packet processing in user space, providing a framework and common API for high-speed networking applications.
Gives the network adapter direct access to read or write data buffers in peer device memory. This allows RDMA-based applications to use the peer device’s computing power without copying data through host memory.
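As a rough sketch of what GPUDirect RDMA enables at the verbs level, the following registers a CUDA buffer directly with an RDMA NIC so the adapter can DMA into GPU memory. It assumes a node with the nvidia-peermem (or legacy nv_peer_mem) kernel module loaded and omits queue-pair setup, address exchange, and error handling.

```c
/* Hedged sketch: register CUDA device memory with an RDMA NIC (GPUDirect RDMA).
 * Assumes the nvidia-peermem kernel module is loaded; queue-pair setup and
 * connection exchange are omitted. Link with -libverbs -lcudart. */
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);   /* first HCA (example) */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    void *gpu_buf = NULL;
    size_t size = 1 << 20;                                 /* 1 MiB example buffer */
    cudaMalloc(&gpu_buf, size);

    /* With GPUDirect RDMA, ibv_reg_mr can pin GPU memory so the NIC reads and
     * writes it directly, with no staging copy through host memory. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { fprintf(stderr, "GPU memory registration failed\n"); return 1; }
    printf("registered GPU buffer: lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

    /* ... post RDMA work requests that reference mr->lkey / mr->rkey ... */

    ibv_dereg_mr(mr);
    cudaFree(gpu_buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```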
Production-grade communication framework, based on open source, for data-centric and high-performance applications. Includes a low-level interface that exposes fundamental network operations supported by the underlying hardware. The package includes MPI and SHMEM libraries, Unified Communication X (UCX), NVIDIA SHARP, KNEM, and standard MPI benchmarks.
Brings topology-aware communication primitives through tight synchronization between the communicating processors. NCCL accelerates collective operations and reduces wall-clock run time. NCCL is integrated with various RAPIDS ML components, the RAPIDS Analytics Framework Toolkit (RAFT), and Dask-cuML; cuML is a suite of libraries that implement ML algorithms and mathematical primitives. NCCL is also integrated with PyTorch, NVIDIA Merlin™ HugeCTR, NVIDIA NeMo Megatron, NVIDIA Riva, and the TensorFlow and MXNet containers.
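A minimal single-process sketch of an NCCL all-reduce across the local GPUs follows. Multi-node jobs would instead broadcast an ncclUniqueId (for example, over MPI) and create one communicator per process, a bootstrap step omitted here; buffer sizes are illustrative.

```c
/* Minimal single-process NCCL all-reduce across the local GPUs
 * (link with -lnccl -lcudart). */
#include <nccl.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 8) ndev = 8;                       /* fixed-size arrays in this sketch */

    ncclComm_t comms[8];
    cudaStream_t streams[8];
    float *buf[8];
    const size_t count = 1 << 20;                 /* 1M floats per GPU (example) */

    ncclCommInitAll(comms, ndev, NULL);           /* one communicator per local GPU */

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaMemset(buf[i], 0, count * sizeof(float));
    }

    /* Group the per-GPU calls so NCCL launches the collective as one operation. */
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
    }
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(buf[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("all-reduce complete on %d GPUs\n", ndev);
    return 0;
}
```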
NVSHMEM is the programming model that allows applications to issue fine-grained accesses across the distributed fourth-generation NVLink scale-up interconnect while overlapping them with computation. This enables significant speedups for distributed scientific computing applications, such as cuFFTMp, which uses NVSHMEM.
NVSHMEM offers a parallel programming interface based on the OpenSHMEM standard, creating a global address space for data spanning the memory of multiple GPUs across multiple servers.
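The following is a minimal NVSHMEM sketch in the style of the standard ring-shift example: each PE writes its ID into symmetric memory on the next PE directly from a CUDA kernel. It assumes one GPU per PE and a launcher such as nvshmrun or mpirun.

```c
/* Minimal NVSHMEM ring-shift sketch: each PE puts its ID to the next PE.
 * Compile with nvcc and the NVSHMEM libraries; launch with nvshmrun/mpirun. */
#include <nvshmem.h>
#include <nvshmemx.h>
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void simple_shift(int *destination) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;
    nvshmem_int_p(destination, mype, peer);   /* one-sided put into the peer's memory */
}

int main(void) {
    nvshmem_init();
    int mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(mype_node);                 /* one GPU per PE on the node */

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int *destination = (int *) nvshmem_malloc(sizeof(int));   /* symmetric allocation */

    simple_shift<<<1, 1, 0, stream>>>(destination);
    nvshmemx_barrier_all_on_stream(stream);   /* complete all puts before reading */

    int msg = -1;
    cudaMemcpyAsync(&msg, destination, sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    printf("PE %d received message %d\n", nvshmem_my_pe(), msg);

    nvshmem_free(destination);
    nvshmem_finalize();
    return 0;
}
```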
UCX is an open-source, production-grade communication framework for data-centric and high-performance applications. It includes a low-level interface that exposes fundamental network operations supported by the underlying hardware, as well as a high-level interface used to construct the protocols found in MPI, OpenSHMEM, PGAS, Spark, and other high-performance and deep learning applications.
UCX provides GPU-accelerated point-to-point communications, delivering the best performance while utilizing the NVLink, PCIe, Ethernet, or InfiniBand connectivity between GPU compute elements.
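A minimal sketch of how a UCP application bootstraps UCX follows: it reads the configuration, creates a context requesting tag-matching and RMA features, and creates a worker. Endpoint creation and the out-of-band exchange of worker addresses between peers are omitted.

```c
/* Hedged sketch: initialize a UCX (UCP) context and worker (link with -lucp -lucs).
 * Endpoint creation and address exchange between peers are omitted. */
#include <ucp/api/ucp.h>
#include <string.h>
#include <stdio.h>

int main(void) {
    ucp_config_t *config;
    if (ucp_config_read(NULL, NULL, &config) != UCS_OK) return 1;

    ucp_params_t params;
    memset(&params, 0, sizeof(params));
    params.field_mask = UCP_PARAM_FIELD_FEATURES;
    params.features   = UCP_FEATURE_TAG | UCP_FEATURE_RMA;   /* tag matching + RMA */

    ucp_context_h context;
    if (ucp_init(&params, config, &context) != UCS_OK) return 1;
    ucp_config_release(config);

    ucp_worker_params_t wparams;
    memset(&wparams, 0, sizeof(wparams));
    wparams.field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE;
    wparams.thread_mode = UCS_THREAD_MODE_SINGLE;

    ucp_worker_h worker;
    if (ucp_worker_create(context, &wparams, &worker) != UCS_OK) return 1;

    /* From here, peers exchange addresses (ucp_worker_get_address), create
     * endpoints (ucp_ep_create), and post transfers (e.g., ucp_tag_send_nbx);
     * UCX then selects NVLink, PCIe, Ethernet, or InfiniBand transports. */
    printf("UCX context and worker ready\n");

    ucp_worker_destroy(worker);
    ucp_cleanup(context);
    return 0;
}
```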
A set of features that accelerates switching and packet processing. ASAP2 offloads data steering and security from the CPU into the network, which boosts efficiency, adds control, and provides isolation from malicious applications.
The NVIDIA® BlueField® DPU offloads critical network, security, and storage tasks from the CPU, serving as the best solution for addressing performance, networking efficiency, and cybersecurity concerns in the modern data center.
Reduces MPI communication time and improves the overlap between compute and communication. Employed by NVIDIA Mellanox InfiniBand adapters to offload the processing of MPI messages from the host machine onto the network card, enabling zero-copy transfer of MPI messages.
Improves upon the performance of data reduction and aggregation algorithms, such as those in MPI, SHMEM, NCCL, and others, by offloading these algorithms from the GPU or the CPU to the network switching elements or DPU, eliminating the need to send data multiple times between InfiniBand and fourth-generation NVLink endpoints. SHARP integration boosts NCCL performance by 4X and demonstrates a 7X improvement in MPI collective latency. SHARP is supported by UFM, HPC-X, NCCL, and most industry-standard MPI packages.
Introduce holistic visibility, troubleshooting, and DevOps into your modern data center network with NVIDIA NetQ, a highly scalable, modern network operations toolset that validates your NVIDIA® Cumulus® Linux and SONiC fabrics in real time.
Provides debugging, monitoring, management, and efficient provisioning of InfiniBand fabrics in data centers. Supports real-time network telemetry with AI-powered cyber intelligence and analytics.