How is the memory wall between the CPU and GPU removed?

When running deep learning training or scientific computing tasks on a cloud server equipped with a GPU, a common headache is that a significant portion of the program's time isn't spent on the GPU's computational tasks, but rather on data preparation. Specifically, this involves moving data back and forth between CPU-managed main memory and the GPU's dedicated video memory. This process is slow because the CPU and GPU have their own independent memory spaces. Data transfer requires copying and moving data through the PCIe bus "corridor." To improve efficiency, CUDA introduces a mechanism called Unified Virtual Address Space and Memory Mapping, aiming to open a more direct door in this wall, or even attempt to dismantle it.

The traditional workflow is straightforward but inefficient. The CPU prepares the data in system memory and then calls an explicit copy function, such as `cudaMemcpy`, to transfer the data to the GPU's video memory over the PCIe bus. After the GPU completes the computation, the result requires another explicit copy back to system memory before the CPU can use it. These two copies become a bottleneck for the whole pipeline, especially with large amounts of data or frequent interaction. CUDA's memory mapping mechanism, sometimes called zero-copy memory, offers a different approach: it allows GPU threads to access data stored in the system's main memory directly, eliminating the explicit copies. The data does not move by magic; rather, through the cooperation of the memory management unit and the driver, the GPU reads and writes specific regions of system memory directly over the PCIe bus.
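
As a point of reference, the conventional explicit-copy pattern looks roughly like the following. This is a minimal sketch: the kernel name `scale_kernel`, the grid configuration, and the array size `N` are illustrative rather than taken from any particular codebase.

```c
#include <cuda_runtime.h>
#include <stdlib.h>

// Illustrative kernel: scales each element in place.
__global__ void scale_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void) {
    const int N = 1 << 20;
    size_t size = N * sizeof(float);

    // CPU prepares the data in ordinary system memory
    float *h_data = (float *)malloc(size);
    for (int i = 0; i < N; ++i) h_data[i] = (float)i;

    // Separate allocation in the GPU's video memory
    float *d_data;
    cudaMalloc(&d_data, size);

    // Explicit copy #1: host -> device across the PCIe bus
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);

    scale_kernel<<<(N + 255) / 256, 256>>>(d_data, N);

    // Explicit copy #2: device -> host across the PCIe bus
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    free(h_data);
    return 0;
}
```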

The technological foundation for this is unified virtual addresses. On 64-bit operating systems and supported GPUs, the CUDA driver creates a large, unified virtual address space. This space encompasses host memory, GPU memory, and memory from other devices. For any pointer in the program, the CUDA runtime can determine which physical location it actually points to. With this foundation, the next crucial step is creating mapped memory. Programmers can use the `cudaHostAlloc` function, specifying the `cudaHostAllocMapped` flag, to allocate a special type of host memory. This memory is "locked" in physical memory during allocation and will not be swapped to disk by the operating system. Simultaneously, the driver creates a mapping for this same physical memory in the GPU's page table. Therefore, the same physical memory has a virtual address pointer on the host side and a corresponding virtual address pointer on the device side. GPU threads only need to use the pointer on the device side; their access requests are captured by the GPU's memory management unit and forwarded directly to system memory via the PCIe bus to complete the read/write operation. From a programming model perspective, the data seems to be right next to the GPU, even though it may be physically far away.

```c
float *host_ptr, *device_ptr;
size_t size = N * sizeof(float);

// Allocate pinned host memory that can also be mapped into the GPU's address space
cudaHostAlloc(&host_ptr, size, cudaHostAllocMapped);

// Obtain the pointer that refers to this same memory in the GPU's address space
cudaHostGetDevicePointer(&device_ptr, host_ptr, 0);

// The GPU kernel can now read and write the host memory behind host_ptr through device_ptr
my_kernel<<<...>>>(device_ptr, N);

// Kernel launches are asynchronous: wait before the CPU touches the buffer again
cudaDeviceSynchronize();
```

Of course, this convenience comes at a cost. The most direct cost is access latency and bandwidth: the GPU accesses system memory over PCIe far more slowly than it accesses its own high-speed video memory. Mapped memory is therefore best suited to access patterns that are infrequent, or where the GPU reads the data only once, for example as a staging buffer for input data or to receive small results produced by a GPU computation. If the data needs to be read and written repeatedly and intensively inside a GPU kernel, copying it into video memory before the computation is significantly more efficient. Another key point is synchronization. Because both the CPU and the GPU can access the same physical memory directly, their access order must be carefully coordinated to avoid data races. Typically, CUDA events or stream synchronization are used to ensure the CPU only reads data the GPU has written at a safe point, and vice versa.
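
The synchronization requirement looks like this in practice. A minimal sketch, assuming the mapped buffer is set up as in the earlier snippet; the kernel `fill_kernel` is illustrative:

```c
#include <cuda_runtime.h>
#include <stdio.h>

// Illustrative kernel: the GPU writes its results straight into mapped host memory.
__global__ void fill_kernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (float)i * 0.5f;
}

int main(void) {
    const int N = 1 << 16;
    float *host_ptr, *device_ptr;

    cudaHostAlloc(&host_ptr, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&device_ptr, host_ptr, 0);

    fill_kernel<<<(N + 255) / 256, 256>>>(device_ptr, N);

    // Kernel launches are asynchronous: without this barrier the CPU could
    // read the buffer before the GPU has finished writing it over PCIe.
    cudaDeviceSynchronize();

    printf("first result: %f\n", host_ptr[0]);  // safe to read only after the sync

    cudaFreeHost(host_ptr);
    return 0;
}
```

In a program that uses multiple streams, `cudaEventRecord` followed by `cudaEventSynchronize`, or `cudaStreamSynchronize`, can replace the device-wide barrier so that only the relevant stream is waited on.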

In cloud environments, you might use different GPU instances with varying PCIe bandwidths and topologies; the higher the PCIe bandwidth of an instance, the more pronounced the benefits of mapped memory. A practical pattern is to use mapped memory as a flexible buffer pool that can exceed the size of GPU video memory. When a dataset is too large to fit into video memory at once, a block of mapped memory can be reused so the GPU processes it in batches, although it is crucial to evaluate whether the PCIe transfer overhead of such batching is outweighed by the GPU's computational throughput. Another common use case is intermediate results that must be exchanged frequently between the CPU and GPU, where mapping avoids the overhead of repeated copies.

The unified memory concept introduced in later versions of CUDA can be seen as a further development and automation of this mapping approach. Programmers allocate from a single memory pool and receive one pointer that is valid on both the CPU and the GPU; the physical migration of data between system memory and video memory is managed automatically by the driver and hardware, triggered by page faults. This simplifies programming, but it still rests on the same underlying mechanisms of mapping and direct access.
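
For comparison, here is a minimal unified-memory sketch, assuming a GPU and CUDA toolkit that support `cudaMallocManaged`; the kernel `add_one` is illustrative:

```c
#include <cuda_runtime.h>
#include <stdio.h>

// Illustrative kernel operating on managed memory.
__global__ void add_one(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main(void) {
    const int N = 1 << 20;
    float *data;

    // One pointer, valid on both CPU and GPU; the driver migrates pages
    // between system memory and video memory on demand.
    cudaMallocManaged(&data, N * sizeof(float));

    for (int i = 0; i < N; ++i) data[i] = (float)i;   // touched on the CPU

    add_one<<<(N + 255) / 256, 256>>>(data, N);       // touched on the GPU
    cudaDeviceSynchronize();                          // wait before the CPU reads again

    printf("data[0] = %f\n", data[0]);

    cudaFree(data);
    return 0;
}
```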

 
