Video server at full bandwidth but CPU not maxed out? The bottleneck may be network card interrupts

Time : 2026-06-05 14:04:02

Edit : Jtti

　　The video server is already pushing hundreds or even thousands of megabits per second and the network card is practically smoking, yet a quick look at `top` shows CPU utilization at only 20-30%, seemingly idle. The hardware is high-end and the bandwidth is sufficient, but video still buffers and users are complaining. At this point most people ask: Is the program at fault? Are the kernel parameters badly tuned? Is the server itself defective? All are possible. But there is one particularly common yet easily overlooked bottleneck: network card interrupts.

　　First, understand one thing: a CPU not running at full capacity doesn't mean it's not busy.

　　Many people have a misconception about CPU utilization, thinking that "a CPU utilization of 30% means it's idle." This idea barely made sense in the single-core era, but now we're in the multi-core era, and the situation is completely different.

　　On a 100-core machine, 30% overall utilization can mean that 30 cores are fully busy while 70 sit idle. Here is the problem: if those 30 busy cores are the ones handling network card interrupts, the 70 idle cores are no help, because the interrupts are bound to a few cores and the others cannot take over that work.

　　This is like a delivery station with 100 employees where only 30 have permission to open packages, while the remaining 70 can only watch as packages pile up. You ask the station manager, "Do we have enough manpower?" He says, "We're only using 30% of our manpower," but in truth the 30 people who can actually do the work are exhausted.

　　Therefore, don't just look at overall CPU utilization; look at which core is busy. On a 100-core machine, if a single core is at 100% while every other core is at 0%, overall utilization reads only 1%, yet network performance is already terrible. In `top`, pressing `1` shows the per-core breakdown.
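　　To see which core is busy rather than the average, you can compare two snapshots of `/proc/stat`. The following is a minimal Python sketch, not a finished tool. The function names `parse_stat` and `per_core_usage` are illustrative, and it assumes the usual Linux field order (user, nice, system, idle, iowait, irq, softirq, steal).

```python
def parse_stat(text):
    """Return {cpu_name: [user, nice, system, idle, iowait, irq, softirq, steal]}."""
    cores = {}
    for line in text.splitlines():
        parts = line.split()
        # Per-core lines look like "cpu0 ..."; the aggregate line is just "cpu".
        if parts and parts[0].startswith("cpu") and parts[0] != "cpu":
            cores[parts[0]] = [int(v) for v in parts[1:9]]
    return cores


def per_core_usage(before, after):
    """Return {cpu_name: (busy_pct, softirq_pct)} between two snapshots."""
    usage = {}
    for name, new in after.items():
        old = before.get(name)
        if old is None:
            continue
        delta = [n - o for n, o in zip(new, old)]
        total = sum(delta)
        if total <= 0:
            continue
        idle = delta[3] + delta[4]   # idle + iowait
        softirq = delta[6]           # packet receive processing is accounted here
        usage[name] = (100.0 * (total - idle) / total,
                       100.0 * softirq / total)
    return usage
```

　　On a live system you would read `/proc/stat` twice, about a second apart, and pass both results to `per_core_usage`. A core showing close to 100% busy with most of it in softirq, while its neighbours are near 0%, matches the "false idle" pattern described above.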

　　What exactly is a network card interrupt?

　　Let's briefly explain what an interrupt is. After the network card receives a data packet, it can't just dump the data into memory and leave it; it has to notify the CPU to retrieve it. How is it notified? It sends an "interrupt signal"—essentially the network card shouting to the CPU, "Work's here, get processing!"

　　Upon hearing this interrupt signal, the CPU has to drop what it's doing, retrieve the data packets received by the network card, unpack and reassemble them layer by layer, and deliver them to the application. This is the entire interrupt handling process.

　　Where's the problem? The network card is too fast. A 10 Gigabit network card can receive millions of data packets per second. If each packet triggers an interrupt, how busy would the CPU be? It wouldn't be able to do anything else; it would be constantly responding to interrupts, responding to interrupts, responding to interrupts.

　　This is the so-called "interrupt storm"—the CPU is overwhelmed by network card interrupts and has no time to process the actual business logic. Your video server may appear to be working incredibly hard, but in reality, the CPU spends most of its time dealing with network card interrupt requests.

　　Interrupt Affinity: A Seriously Overlooked Parameter

　　Now you know that interrupts consume CPU time. Can we distribute network card interrupts across different CPU cores for processing?

　　Yes. This is called "interrupt affinity." Simply put, you can manually specify which cores handle which interrupts: for example, interrupts from network card 0 go to CPU core 2, and interrupts from network card 1 go to CPU cores 3 and 4.

　　However, many servers are configured by default to send all network card interrupts to CPU core 0. The result is that core 0 is overwhelmed by interrupts, reaching 100% utilization, while dozens of other cores are idle.

　　How do you check this? Use `cat /proc/interrupts`. Each row is one interrupt (IRQ) number, each column is one CPU core, and each cell is the number of times that interrupt has been handled on that core. If the counts for the network card's interrupts are concentrated in the first column only, interrupt load balancing is not taking place.
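　　Reading that table by eye gets tedious on machines with many cores. Here is a small Python sketch that summarises it. The helper names and the sample text are made up for illustration (real device labels such as `eth0-TxRx-0` depend on the driver), and on a real system you would pass in the contents of `/proc/interrupts`.

```python
# Invented sample in the layout of /proc/interrupts (4 cores).
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
 47:    9000000          0     500000     500000   PCI-MSI 524288-edge      eth0-TxRx-0
 48:     250000     250000     250000     250000   PCI-MSI 524289-edge      eth0-TxRx-1
NMI:          0          0          0          0   Non-maskable interrupts
"""


def irq_distribution(text, keyword):
    """Return {irq_label: [count per CPU]} for rows whose description contains keyword."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts = fields[:ncpu]
        # Skip rows that do not carry one numeric count per CPU.
        if len(counts) < ncpu or not all(c.isdigit() for c in counts):
            continue
        if keyword in " ".join(fields[ncpu:]):
            result[label.strip()] = [int(c) for c in counts]
    return result


def busiest_core_share(counts):
    """Fraction of an interrupt's total count handled by its busiest core."""
    total = sum(counts)
    return max(counts) / total if total else 0.0
```

　　In the sample, IRQ 47 has 90% of its count on one core and IRQ 48 is spread evenly. A share close to 1.0 on a busy queue is the sign of concentration to look for.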

　　How do you fix this? As root, use `echo` to write the desired CPU mask to the interrupt's `smp_affinity` file. For example, to make interrupt 47 run only on CPU2 and CPU3, write the hex mask `c` (0xC, binary 1100, bits 2 and 3, that is, the 3rd and 4th cores) to `/proc/irq/47/smp_affinity`.
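　　Working out the mask by hand is error-prone once more than a few cores are involved. The Python sketch below converts between a list of CPU indices and the hex mask. The function names are illustrative, and the comma grouping for masks wider than 32 bits follows the format the kernel prints when you read the file back, so treat that part as an assumption to verify on your own system.

```python
def cpu_mask(cpus):
    """Return a hex affinity mask (no 0x prefix) with one bit set per CPU index."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU index must be non-negative")
        mask |= 1 << cpu
    # Split into 32-bit groups, lowest group first.
    groups = []
    while True:
        groups.append(mask & 0xFFFFFFFF)
        mask >>= 32
        if mask == 0:
            break
    top = format(groups[-1], "x")
    rest = [format(g, "08x") for g in reversed(groups[:-1])]
    return ",".join([top] + rest)


def cpus_from_mask(mask_hex):
    """Inverse of cpu_mask: list the CPU indices whose bits are set."""
    value = int(mask_hex.replace(",", ""), 16)
    return [i for i in range(value.bit_length()) if (value >> i) & 1]
```

　　For CPU2 and CPU3, `cpu_mask([2, 3])` gives `c`, which is the value to write into `/proc/irq/47/smp_affinity` in the example above. `cpus_from_mask` is handy for decoding what the file currently contains.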

　　Many modern systems have the `irqbalance` service, which can automatically perform interrupt load balancing. However, this service is sometimes not very intelligent—it might not recognize your NUMA architecture and distribute network card interrupts to distant CPU cores, leading to even worse performance. Therefore, in some performance-sensitive scenarios, `irqbalance` is turned off, and cores are manually bound.

　　More advanced methods: RPS and RFS

　　If you find manually binding interrupts too cumbersome, or your network card hardware does not support multiple queues, there are software-level solutions: RPS (Receive Packet Steering) and RFS (Receive Flow Steering).

　　With RPS, the core that takes the interrupt does not run the full protocol processing for the packet itself; it hashes the packet and hands the processing to another core. The hardware interrupt still lands on the original core, but the heavier protocol work no longer crowds onto it.

　　RFS goes a step further: it not only distributes packets to other cores but also tries to ensure that packets from the same data stream (such as the same TCP connection) are always processed by the same core. This results in a higher CPU cache hit rate and better performance.

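　　To enable RPS, write a CPU mask to `/sys/class/net/eth0/queues/rx-0/rps_cpus`, telling the system which cores may process packets from this receive queue. You can include all cores; for example, `ffff` selects the first 16 cores. RFS builds on RPS and needs two more settings: the global flow table size in `/proc/sys/net/core/rps_sock_flow_entries` and the per-queue `rps_flow_cnt` in the same `queues/rx-N` directory.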
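　　In the kernel's scaling documentation, RPS is configured per receive queue through `rps_cpus`, while RFS additionally relies on `net.core.rps_sock_flow_entries` and a per-queue `rps_flow_cnt`. The Python sketch below only builds the list of files and values to write; it does not touch the system. The function name, the default of 32768 flow entries, and the even split across queues are assumptions for illustration, not tuned recommendations.

```python
def rps_rfs_settings(iface, num_queues, cpu_mask_hex, flow_entries=32768):
    """Return {path: value} covering RPS and RFS for every RX queue of iface."""
    if num_queues < 1:
        raise ValueError("num_queues must be at least 1")
    # Global RFS flow table size.
    settings = {"/proc/sys/net/core/rps_sock_flow_entries": str(flow_entries)}
    # Assumption: split the flow table evenly across the receive queues.
    per_queue = flow_entries // num_queues
    for q in range(num_queues):
        base = f"/sys/class/net/{iface}/queues/rx-{q}"
        settings[f"{base}/rps_cpus"] = cpu_mask_hex        # RPS: cores allowed to process
        settings[f"{base}/rps_flow_cnt"] = str(per_queue)  # RFS: flows tracked per queue
    return settings
```

　　For example, `rps_rfs_settings("eth0", 4, "ffff")` yields nine entries: one global value plus two files for each of `rx-0` through `rx-3`. Applying them means writing each value to its path as root.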

　　However, this approach has its costs. Forwarding packets from one core to another incurs overhead, especially when your system is nearing its performance limits; this forwarding overhead can become a new bottleneck.

　　Another easily overlooked point: PCIe bandwidth

　　Many people stop troubleshooting at this point, thinking they've covered the basics. But let me tell you, there's an even more hidden bottleneck—PCIe bandwidth.

　　The network card sits in a PCIe slot, and all received data must cross the PCIe bus into memory before the CPU can process it. If your server has multiple 10 Gigabit Ethernet adapters plus other high-speed devices (such as NVMe SSDs and GPUs), the PCIe lanes may not be sufficient.

　　For example, the theoretical bandwidth of PCIe 3.0 x8 is approximately 7.88 GB/s, which seems large, right? But if you have two 10 Gigabit Ethernet adapters running at full capacity simultaneously, each at 1.25 GB/s, that's 2.5 GB/s. If NVMe SSD read/write operations and GPU DMA transfers share the same lanes or upstream link, the PCIe bandwidth can be used up in no time.
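　　The 7.88 GB/s figure follows from the per-lane transfer rate and the line encoding. Here is a short Python sketch of that arithmetic; the table and function names are illustrative, and the result is a theoretical per-direction ceiling that ignores PCIe packet overhead.

```python
# (transfer rate in GT/s, encoding efficiency) per PCIe generation
PCIE_GEN = {
    1: (2.5, 8 / 10),      # 8b/10b encoding
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
    4: (16.0, 128 / 130),  # 128b/130b encoding
}


def pcie_gbytes_per_s(gen, lanes):
    """Theoretical one-direction bandwidth of a PCIe link in GB/s."""
    rate, efficiency = PCIE_GEN[gen]
    return rate * efficiency * lanes / 8   # bits -> bytes


def nic_gbytes_per_s(gbits):
    """Line rate of a network port in GB/s."""
    return gbits / 8
```

　　`pcie_gbytes_per_s(3, 8)` comes out at about 7.88, and two 10 Gigabit ports add up to `2 * nic_gbytes_per_s(10)`, that is 2.5 GB/s, roughly a third of the link before any other device is counted. The same x8 card negotiated down to x4 would have only about 3.94 GB/s to work with.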

　　What happens when PCIe bandwidth is fully utilized? The network card statistics (for example in `ethtool -S eth0`; counter names vary by driver) may show a steadily growing `rx_missed_errors` count, meaning that the bus is too busy, the network card's own buffer is full, and new packets are dropped before they can be written to memory.

　　How do you troubleshoot this? Use `lspci -vvv` to check the network card's PCIe link information: compare `LnkSta` (the negotiated speed and width) against `LnkCap` (what the device supports) and see whether the link has reached its nominal values. Sometimes a card plugged into an x16 slot runs only in x4 mode, which could be due to a loose connection or a BIOS setting.
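　　In `lspci -vvv` output, the `LnkCap` line reports what the device supports and the `LnkSta` line reports what was actually negotiated. The Python sketch below compares the two for a single device. The function names and the sample text are illustrative, and the exact wording of these lines differs between `lspci` versions, so the regular expression is a best-effort assumption.

```python
import re

# Invented sample in the style of `lspci -vvv` for one device.
SAMPLE = """\
        LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <16us
        LnkSta: Speed 8GT/s (ok), Width x4 (downgraded)
"""

# Requires the colon right after the name, so LnkCap2/LnkSta2 lines are ignored.
_LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed ([\d.]+)GT/s.*?Width x(\d+)")


def link_info(lspci_text):
    """Return {'LnkCap': (speed_gt_s, width), 'LnkSta': (speed_gt_s, width)}."""
    info = {}
    for line in lspci_text.splitlines():
        m = _LINK_RE.search(line)
        if m:
            info[m.group(1)] = (float(m.group(2)), int(m.group(3)))
    return info


def is_downgraded(info):
    """True if the negotiated link is slower or narrower than the capability."""
    cap, sta = info.get("LnkCap"), info.get("LnkSta")
    if cap is None or sta is None:
        return False   # not enough information to tell
    return sta[0] < cap[0] or sta[1] < cap[1]
```

　　For the sample, `link_info(SAMPLE)` reports an x8-capable card running at x4, and `is_downgraded` returns `True`. That is the x16-slot-running-at-x4 situation described above, and the cue to reseat the card or check the BIOS.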

　　The Special Characteristics of Video Servers: Many Small Packets, Many Interrupts

　　Video servers differ from other types of servers because video streams are typically transmitted via RTP/UDP, which produces a large number of relatively small packets. The link MTU is usually 1500 bytes, but the actual payload per packet might only be around 1200 bytes. Furthermore, the video encoding frame rate might be 30fps or 60fps, with each frame further divided into multiple packets. This results in a considerable number of packets per second.

　　The smaller the packets, the more interrupts need to be processed for the same bandwidth. For the same 1Gbps of traffic, ignoring framing overhead, 1500-byte packets work out to approximately 80,000 packets per second, while 64-byte packets come close to 2 million packets per second. If every packet triggers an interrupt, 2 million interrupts per second are devastating for the CPU.
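　　The packet rates above are simple division, sketched here in Python. The function name and the `overhead_bytes` parameter are illustrative. Setting the overhead to 20 bytes (preamble plus inter-frame gap) for a minimum-size 64-byte frame gives the commonly quoted on-the-wire rate for Gigabit Ethernet.

```python
def packets_per_second(bits_per_s, packet_bytes, overhead_bytes=0):
    """Packets per second needed to carry bits_per_s at a given packet size."""
    return bits_per_s / ((packet_bytes + overhead_bytes) * 8)


large = packets_per_second(1e9, 1500)                    # about 83,000
small = packets_per_second(1e9, 64)                      # about 1.95 million
wire = packets_per_second(1e9, 64, overhead_bytes=20)    # about 1.49 million
```

　　Even with the wire overhead counted, the small-packet case is well over ten times the packet rate of the large-packet case for the same bandwidth, which is the point being made here.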

　　This is why some video servers, when running high-definition streaming, experience CPU spikes even when only 300Mbps of bandwidth is being used—because while the bitrate is low, the packet count is enormous.

　　A classic optimization technique for this is network interface card (NIC) aggregation. Multiple NICs are bonded together using load balancing (e.g., balance-rr or 802.3ad) to distribute traffic across them. Each NIC has its own interrupt queue, theoretically distributing the interrupt load across more CPU cores.

　　In summary: a video server running at full bandwidth with a seemingly idle CPU is highly misleading. Many people's first reaction is to "increase bandwidth," but the bandwidth is often already more than enough; or to "upgrade the CPU," but the CPU as a whole is far from saturated. The real problem often lies in the interrupt handling path: interrupts are concentrated on a few cores, the PCIe bus is contended by other devices, or too many small packets drive the interrupt rate too high.

　　When troubleshooting, remember this order: first check the utilization of each core, then the distribution of interrupts across cores, then the packet loss statistics of the network interface card (NIC), and finally the PCIe link status. Following this chain will most likely lead you to the root cause. Next time you encounter this kind of "false idle" situation, you'll know where to start.
