Researchers from NVIDIA, IBM, and Cornell University have been working on Big Accelerator Memory (BaM), a potential solution for fast, fine-grain access to large amounts of data storage from GPU-accelerated applications. BaM is an architecture whose software component turns an SSD into shared memory that the GPU can access directly. The research paper can be found here.
We propose a novel system architecture called BaM (Big accelerator Memory). The goal of BaM's design is to extend GPU memory capacity and enhance the effective storage access bandwidth while providing high-level abstractions for the GPU threads to easily make on-demand, fine-grain access to massive data structures in the extended memory hierarchy.
The idea is to replace the forms of pooled memory that GPUs use today, which incorporate system memory via the CPU or additional VRAM when multiple graphics cards are linked together. Memory is often the limiting factor for data-centric GPU applications, and these existing solutions tend to be inefficient. BaM aims to mitigate those limitations by providing an API for direct NVMe storage access from GPU threads.
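To make the idea concrete, here is a minimal CPU-side Python sketch of the underlying pattern: an array-like abstraction over storage backed by a software-managed cache, so callers issue fine-grain, on-demand reads instead of loading the whole dataset up front. This is only an analogy for illustration; the class and parameter names (`BamArray`, `page_size`) are hypothetical and not BaM's actual GPU-side API.

```python
import os
import tempfile

class BamArray:
    """Illustrative array-over-storage abstraction with an on-demand page cache."""

    def __init__(self, path, page_size=4096):
        self.file = open(path, "rb")
        self.page_size = page_size
        self.cache = {}  # page index -> bytes (simplified: unbounded cache)

    def __getitem__(self, offset):
        """Return the byte at `offset`, fetching its page from storage on demand."""
        page, within = divmod(offset, self.page_size)
        if page not in self.cache:  # cache miss: one page-sized storage read
            self.file.seek(page * self.page_size)
            self.cache[page] = self.file.read(self.page_size)
        return self.cache[page][within]

# Demo: touch two bytes of a "large" file; only their two pages are read.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(range(256)) * 64)  # 16 KiB stand-in for a big dataset
    path = f.name

data = BamArray(path)
print(data[0], data[5000])  # two fine-grain, on-demand accesses
print(len(data.cache))      # pages actually fetched from storage: 2
os.unlink(path)
```

The design choice this mirrors is that the accessor, not a host-side copy loop, decides what gets moved, so cold data never leaves storage.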
An existing solution to the problem of insufficient GPU memory capacity for these massive data structures is to pool together the memory capacity of multiple GPUs to meet the capacity requirement and use fast shared memory interconnects like NVLink for the GPUs to access each other's memory. The entire data structure is first sharded into the GPU memories. The computation then identifies and accesses the portions that are actually used. This approach has two drawbacks. First, the entire data structure needs to be moved from storage to the GPU memories even if only a portion might be accessed, which can add significantly to the application start-up latency. Second, the data structure size determines the number of GPUs required for the application, which may far exceed the compute resource requirement of the workload.
Using the host memory, whose capacity typically ranges from 128GB to 2TB today, to help hold the sharded data structure can reduce the total number of GPUs used. We will refer to the use of host memory to extend the GPU memory capacity as the DRAM-only solution. Because multiple GPUs tend to share the same CPU and thus the host memory in data center servers, such DRAM-only solutions tend to add only a fraction of the host-memory size to each GPU's memory capacity. For example, in NVIDIA DGX A100 systems, each host memory is shared by 8 GPUs. Thus, using host memory only extends the effective size of each GPU memory by 1/8 of the limited size of the host memory.
The hope is that by largely removing the CPU from the storage access path, BaM will incur much less I/O overhead, delivering faster transfers while freeing the processor for other tasks.
Unfortunately, such a CPU-centric strategy causes excessive CPU-GPU synchronization overhead and/or I/O traffic amplification, diminishing the effective storage bandwidth for emerging applications with fine-grain data-dependent access patterns like graph and data analytics, recommender systems, and graph neural networks.
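To see where the traffic amplification comes from, consider a workload that needs scattered 8-byte values (say, individual edge weights in a graph) but whose transfers happen at page granularity. A quick back-of-the-envelope calculation, with illustrative rather than measured numbers:

```python
# Illustrative arithmetic: I/O amplification when fine-grain accesses
# are served by page-granular transfers. Numbers are examples only.
value_size = 8     # bytes actually needed per access (e.g., one edge weight)
page_size = 4096   # bytes moved per access at page granularity

amplification = page_size / value_size
print(amplification)  # 512.0: 512x more data moved than is actually used
```

Fine-grain, GPU-initiated access sidesteps this by letting threads request only the data they need.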
The research team plans to open-source its work so that others can adapt BaM to their own needs. It has already been tested using off-the-shelf NVMe SSDs and GPUs.
This will not be the first time GPUs have been paired with an SSD, as AMD once tried it with its Radeon Solid State Graphics. We've also seen fantastic transfer speeds from the PlayStation 5, while Microsoft has incorporated its own means of communication between the GPU and SSD with the DirectStorage API in Windows 10/11.
One thing is certain: as NVMe transfer speeds increase, everyone, from data-centric applications to console and PC gaming, wants a slice of the NVMe pie.