Backend services running on high-bandwidth cloud servers in Japan, especially systems that handle high-frequency data exchange, real-time analysis, or highly concurrent requests, often hit the same core bottleneck: the overhead of copying data as it flows between components or processes. Memory-mapped files and zero-copy techniques address exactly this. Their core idea is straightforward: keep data where it is, or move it only the minimum number of times necessary, and free the CPU for real work.
Let's delve into mmap. The essence of mmap is to let a process access a file as if it were ordinary memory. When you call `void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset)`, the operating system does not immediately read the file's contents into physical memory. It simply reserves a region of the requested size in your process's virtual address space and records the mapping to the target file, so the call itself is lightweight. The real work happens on first access: if you read a byte from the mapped region and that part of the file has not yet been loaded, the CPU raises a page fault. The operating system catches the fault, loads the corresponding "page" of data (usually 4 KB) from disk into physical memory, and resumes your program. This is completely transparent to the program; you appear to be indexing into a large, pre-loaded array. Writes go first to the page cache in memory, and the operating system decides in the background when to flush these "dirty pages" back to disk.
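As a minimal sketch of that access pattern (the file name `data.bin` is just a placeholder), mapping a file read-only and scanning it looks like ordinary array access:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    // Hypothetical file name, for illustration only
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st); // map exactly the file's current size

    // The kernel only sets up the mapping here; no file data is read yet.
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    // First access to each page triggers a page fault that pulls that page
    // from disk into the page cache; afterwards it behaves like plain RAM.
    unsigned long checksum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        checksum += (unsigned char)p[i];
    printf("checksum: %lu\n", checksum);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```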
On high-bandwidth cloud servers in Japan, this brings two major advantages. First, it greatly simplifies the logic for processing large data files: you no longer manage buffers or read/write offsets by hand. Second, and more importantly, when multiple processes map the same file, they are effectively looking at the same page cache in physical memory. Data written by process A becomes visible to process B almost instantly, which naturally yields efficient inter-process shared-memory communication without the cumbersome setup and management required by traditional shared memory APIs (such as System V SHM).
However, mmap alone is not enough. When data has to travel from such a memory-mapped area to the network, or from a disk file to the network, the traditional data path remains highly redundant. Consider a common cloud service task: sending a static file (a video or a software package, say) to a client as part of an HTTP response. The traditional `read` plus `write` (or `send`) workflow looks like this: the application calls `read`, which context-switches into kernel mode; the kernel reads the file data from disk (via the page cache) into a kernel buffer; the data is then copied from that kernel buffer into a user-space buffer supplied by the application. The application then calls `write` or `send`, triggering another context switch and copying the data from the user-space buffer into the kernel's socket buffer, from which the network card driver finally sends it out. Along the way the data crosses the kernel/user-space boundary at least twice, burning valuable CPU cycles and memory bandwidth.
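For contrast, here is that conventional path in code: a sketch of the copy loop, assuming `file_fd` is an open file and `sock_fd` a connected socket (both names are hypothetical).

```c
#include <unistd.h>

// Traditional path: every byte is staged through a user-space buffer.
ssize_t copy_via_user_space(int file_fd, int sock_fd) {
    char buf[64 * 1024];   // user-space staging buffer
    ssize_t n, total = 0;

    // read(): the kernel copies from the page cache into buf
    // (one copy, plus a switch into and out of the kernel).
    while ((n = read(file_fd, buf, sizeof(buf))) > 0) {
        ssize_t off = 0;
        // write(): the kernel copies buf into the socket buffer
        // (a second copy, plus another pair of context switches).
        while (off < n) {
            ssize_t w = write(sock_fd, buf + off, n - off);
            if (w < 0) return -1;
            off += w;
        }
        total += off;
    }
    return n < 0 ? -1 : total;
}
```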
Zero-copy technology was developed to eliminate these redundant copying operations. The most typical and widely used system call is `sendfile`. Its function prototype, `ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count)`, clearly indicates its intent: to directly send data from one file descriptor (usually pointing to a real file) to another file descriptor (usually pointing to a network socket). The entire data flow occurs entirely within the kernel address space: file data is moved directly from the page cache by the DMA (Direct Memory Access) engine to the network card buffer, ready for transmission. The entire process bypasses user space, achieving true "zero-copy." This is a revolutionary performance improvement for high-bandwidth cloud servers in Japan that provide services such as downloading large amounts of static files or streaming media. It significantly reduces CPU utilization, enabling the server to support a higher number of concurrent connections with the same hardware resources.
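A minimal sketch of a `sendfile` loop for streaming a whole file to a connected socket might look like this (the helper name `send_whole_file` and both descriptors are placeholders):

```c
#include <sys/sendfile.h>
#include <sys/stat.h>

// Stream an entire open file to a connected socket using sendfile.
int send_whole_file(int sock_fd, int file_fd) {
    struct stat st;
    if (fstat(file_fd, &st) < 0) return -1;

    off_t offset = 0;
    while (offset < st.st_size) {
        // Data moves page cache -> socket buffer inside the kernel;
        // no user-space buffer is ever touched.
        ssize_t sent = sendfile(sock_fd, file_fd, &offset, st.st_size - offset);
        if (sent <= 0) return -1; // sendfile advances `offset` on success
    }
    return 0;
}
```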
Combining mmap and zero-copy allows for the construction of extremely efficient data processing pipelines. A typical architecture: the producer process uses mmap to write data into a memory-mapped file (or a persistent shared memory region), and the consumer process mmaps the same file to read it directly. When the consumer needs to forward processing results or raw data to the network, it no longer reads the data into its own user-space buffer; it simply calls `sendfile` with the mapped file's descriptor and the relevant offset, and the kernel pushes that data straight to the network socket. From production to transmission over the network, at most one DMA copy happens up front when the data is loaded from disk into the page cache, and a second DMA copy moves it from the page cache to the network card buffer, with user space never involved.
```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <unistd.h>

// Example: the producer process writes to the shared memory region via mmap
int fd = open("/dev/shm/shared_data", O_RDWR | O_CREAT, 0666);
ftruncate(fd, DATA_SIZE); // Expand the file size to match the shared memory region
void *shm_ptr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
// Produce data into the memory pointed to by shm_ptr...
memcpy(shm_ptr, data, data_len);

// The consumer process maps the same region and prepares to send it via sendfile
int shm_fd = open("/dev/shm/shared_data", O_RDONLY);
void *shm_ptr_consumer = mmap(NULL, DATA_SIZE, PROT_READ, MAP_SHARED, shm_fd, 0);
// The consumer can read directly from shm_ptr_consumer...
// When sending over the network, use sendfile (assuming the target data sits at
// file offset `offset` and spans `length` bytes)
int sock_fd = ...; // Connected socket
off_t offset = ...;
sendfile(sock_fd, shm_fd, &offset, length);
```
Of course, adopting these "hardcore" optimizations calls for caution: they hand more control to the developer, and with it more responsibility. Synchronization has to be handled properly; when multiple processes read and write the same mapped region, you must protect data consistency with mechanisms such as semaphores or mutexes. For `mmap`, remember that the mapping granularity is the page, so watch for the inefficiencies that non-page-aligned access patterns can cause. For `sendfile`, note that it is designed for file-to-socket transfers, and some older systems impose limits on the size and type of data it can handle.
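One common way to get that synchronization, sketched here under the assumption that both processes map the same region with MAP_SHARED, is to embed a process-shared pthread mutex in a small header at the front of the region (the `shm_header` layout below is purely illustrative):

```c
#include <pthread.h>
#include <stddef.h>

// Illustrative layout: a header with a process-shared mutex at the start
// of the mapped region, followed by the payload it protects.
struct shm_header {
    pthread_mutex_t lock;
    size_t payload_len;
};

// Run once by the process that creates the region.
static void init_header(struct shm_header *hdr) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    // PTHREAD_PROCESS_SHARED lets the mutex synchronize across processes,
    // as long as it lives in memory they both map (e.g. via MAP_SHARED).
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&hdr->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    hdr->payload_len = 0;
}
```

Readers and writers then bracket their access to the payload with `pthread_mutex_lock(&hdr->lock)` and `pthread_mutex_unlock(&hdr->lock)`.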