Under normal circumstances, CPU utilization fluctuates with business requests and computing tasks. However, if it remains close to 100% for an extended period, it indicates that the server performance has reached its limit, potentially leading to slower application response, increased database query latency, or even business interruption. In this situation, operations personnel need to promptly investigate the cause to determine whether certain processes are abnormally consuming CPU, or whether the server is under malicious attack or affected by other external factors, in order to implement targeted solutions.
First, basic troubleshooting starts with the server's CPU usage and load status. Linux servers can be monitored in real time using commands such as `top`, `htop`, `vmstat`, and `iostat`. For example, the `top` command can be used to view the CPU usage of each process:
top -b -n 1
In the output, the %CPU column shows each process's CPU usage, and the load average line shows the average system load over the last 1, 5, and 15 minutes. By default `top` reports %CPU relative to a single core, so a multithreaded process on a multi-core machine can show more than 100%, and the load average is best read against the number of CPU cores. If a process such as a PHP-FPM worker, a Java service, or a database process consistently uses more than 50% of the CPU, the problem most likely lies with that process. The `ps` command can be used for further confirmation:
ps aux --sort=-%cpu | head -n 20
This command lists the processes with the highest CPU usage (the header line plus the top 19 processes), making abnormal processes quick to spot. If only a single process shows abnormal CPU usage, the cause may be a flaw in business logic, an infinite loop, or poorly optimized database queries. In this case, analyze the application logs and code, then optimize the algorithms or adjust the service configuration to reduce CPU consumption.
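The filtering step described above can be sketched as a small shell function. This is a minimal sketch, not a definitive tool: the function name `flag_hot_processes` is hypothetical, the default of 50 simply reuses the figure mentioned earlier, and it assumes input in the three-column form produced by `ps -eo pid,pcpu,comm`. Keep in mind that `ps` reports %CPU averaged over the lifetime of the process, so a process that only recently started spinning may rank lower here than it does in `top`.

```shell
# Print processes whose %CPU is at or above a threshold.
# Input (stdin): output of `ps -eo pid,pcpu,comm`, header line included.
# $1: threshold in percent (hypothetical default: 50).
# Note: only the first word of the command name is printed.
flag_hot_processes() {
    awk -v limit="${1:-50}" '
        NR == 1 { next }                                   # skip the header line
        $2 + 0 >= limit + 0 { printf "%s %s %s\n", $1, $2, $3 }
    '
}
```

A typical call would be `ps -eo pid,pcpu,comm --sort=-pcpu | flag_hot_processes 50`, which prints the PID, %CPU, and command name of each process at or above the threshold.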
Besides single-process anomalies, sustained 100% CPU usage can also be caused by the combined load of many processes or threads. For example, highly concurrent requests can lead to a growing number of web server processes, a full database connection pool, or intensive background task execution. In such cases, it's necessary to analyze the server's business patterns and request characteristics. CPU usage can be sampled at fixed intervals with `sar` or `vmstat` (the commands below take ten one-second samples), and on systems where sysstat data collection is enabled, `sar -u` without an interval shows the day's recorded history, which helps determine whether the load coincides with peak business request periods:
sar -u 1 10
vmstat 1 10
These tools can help determine whether high CPU load is due to short-term business fluctuations or persistent anomalies, allowing for different strategies: short-term spikes can be mitigated by expanding server resources or adding load balancers, while persistent anomalies require further investigation of processes and potential attacks.
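To make the `vmstat` samples easier to compare across runs, they can be reduced to a few averages. The following is a minimal sketch under stated assumptions: the function name `summarize_vmstat` is hypothetical, the columns are located by the header names `r`, `us`, `sy`, and `wa` that procps `vmstat` prints, and the first data row is skipped because it reports averages since boot rather than current activity.

```shell
# Summarize `vmstat 1 10` output read from stdin.
# Columns are located by name from the header row rather than by fixed position.
summarize_vmstat() {
    awk '
        $1 == "r" {                               # header row: map names to positions
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        ("us" in col) && $1 ~ /^[0-9]+$/ {        # data rows only
            n++
            if (n == 1) next                      # first row is the average since boot
            busy += $(col["us"]) + $(col["sy"])   # user + system CPU time
            iow  += $(col["wa"])                  # time spent waiting on I/O
            runq += $(col["r"])                   # processes waiting for a CPU
        }
        END {
            if (n < 2) { print "not enough samples"; exit }
            s = n - 1
            printf "busy=%.1f%% iowait=%.1f%% runq=%.1f\n", busy / s, iow / s, runq / s
        }
    '
}
```

Running `vmstat 1 10 | summarize_vmstat` prints one line. A high `busy` value with a run queue well above the core count points to CPU saturation, while a high `iowait` value suggests the bottleneck is disk I/O rather than computation.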
Besides investigating processes, another possibility is that the server is under external attack. Common attack types include DDoS attacks and malicious scripts flooding APIs. A DDoS attack is usually accompanied by an abnormal increase in network traffic, and CPU usage rises because the system has to handle a large number of connection requests or filter a large volume of packets. In this case, network connections and traffic can be analyzed using `netstat`, `ss`, or `iftop`:
netstat -anp | grep ESTABLISHED | sort -k5
iftop -i eth0
These commands show which IP addresses or ports account for a large number of connections and how much traffic they generate. If certain IPs are connecting abnormally often, the system may be under attack, although heavy legitimate clients such as crawlers or NAT gateways can look similar and are worth ruling out first. Once an attack is confirmed, take protective measures immediately, such as blocking the abnormal IPs at the firewall level, using the DDoS protection provided by your cloud service provider, or restricting access to specific ports.
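Spotting a dominant source is easier when connections are counted per remote address instead of read line by line. The following is a minimal sketch, assuming the six-column layout printed by `netstat -ant` on Linux (state in the sixth field, remote address in the fifth); the function name `count_connections_by_ip` is hypothetical.

```shell
# Count ESTABLISHED TCP connections per remote address.
# Input (stdin): output of `netstat -ant`.
# Output: "<count> <address>" lines, busiest address first.
count_connections_by_ip() {
    awk '
        $6 == "ESTABLISHED" {
            ip = $5
            sub(/:[0-9]+$/, "", ip)               # drop the trailing :port
            count[ip]++
        }
        END { for (ip in count) print count[ip], ip }
    ' | sort -k1,1nr -k2,2
}
```

For example, `netstat -ant | count_connections_by_ip | head -n 20` lists the twenty remote addresses holding the most established connections, which is a reasonable starting point for deciding what to block or rate-limit.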
After locating the process experiencing abnormal CPU usage, operations personnel can take various measures to reduce the load. For business processes, they can optimize code logic, reduce loop calculations, or improve algorithm efficiency; for high database loads, they can add indexes, optimize queries, or implement database sharding; for web servers, they can adjust the number of threads, connection pool, or caching strategies. For example, in Nginx, you can adjust the number of worker processes and the maximum number of connections:
worker_processes auto;
events {
worker_connections 1024;
}
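When choosing these two values, it helps to estimate the connection ceiling they imply. The sketch below multiplies the number of worker processes by `worker_connections` and divides by the connections each client consumes. The function name `estimate_max_clients` is hypothetical, and the per-client figure is an assumption: roughly 1 when serving content directly and 2 when proxying, since `worker_connections` also counts connections to upstream servers. The real ceiling is further limited by the open file descriptor limit.

```shell
# Rough upper bound on simultaneous clients for an Nginx instance.
# $1: number of worker processes (with `auto`, typically the CPU core count)
# $2: worker_connections
# $3: connections consumed per client (assumed: 1 direct, 2 when proxying; default 1)
estimate_max_clients() {
    echo $(( $1 * $2 / ${3:-1} ))
}
```

With 4 workers and the configuration above, `estimate_max_clients 4 1024` gives the direct-serving bound and `estimate_max_clients 4 1024 2` the reverse-proxy bound.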
Furthermore, the proper use of caching and queues can effectively reduce CPU pressure. Caching frequently accessed data in memory using Redis or Memcached reduces the number of database queries. For asynchronous tasks, RabbitMQ, Kafka, or Celery can be used to process time-consuming operations in the background, so that synchronous requests are not held up by long-running work.
Establishing monitoring and alerting mechanisms is crucial during troubleshooting and optimization. Tools such as Prometheus, Grafana, and Zabbix, or the monitoring platforms offered by cloud service providers, can track CPU utilization, process status, load balancing, and network traffic in real time and raise alerts on thresholds. For example, when CPU utilization stays above 80% for 5 consecutive minutes, notifications can be sent to operations personnel via email, SMS, or webhooks for timely response. Combined with automated scripts, new requests can be rate-limited or services restarted automatically when thresholds are reached, reducing CPU pressure.
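The "above 80% for 5 consecutive minutes" rule can be sketched as a small check over a stream of samples. This is an illustrative sketch rather than a replacement for a monitoring platform: the function name `check_sustained_cpu` is hypothetical, and it assumes one utilization value per line, collected once per minute by whatever sampler is in use.

```shell
# Read one CPU utilization sample per line from stdin (assumed: one per minute).
# Prints "ALERT" once `window` consecutive samples exceed `limit`, otherwise "OK".
# $1: limit in percent (default 80)   $2: window in samples (default 5)
check_sustained_cpu() {
    awk -v limit="${1:-80}" -v window="${2:-5}" '
        { streak = ($1 + 0 > limit + 0) ? streak + 1 : 0 }   # reset on any sample at or below the limit
        streak >= window + 0 { alert = 1; exit }             # stop reading at the first sustained breach
        END { print (alert ? "ALERT" : "OK") }
    '
}
```

A calling script could compare the output with `ALERT` and then send the email, SMS, or webhook notification. Requiring consecutive samples rather than a single reading avoids alerting on brief spikes.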
Finally, operational experience shows that sustained high CPU usage is often the result of multiple factors. Besides process anomalies and attacks, insufficient hardware configuration, memory shortages that cause frequent swapping, and disk I/O bottlenecks can also indirectly increase CPU load. Therefore, a comprehensive investigation should combine system performance metrics, application logs, and business access patterns to narrow down the source of the problem step by step. For businesses under long-term high load, horizontal scaling of the server cluster, architecture optimization, and additional load balancing nodes should be considered to keep the system stable under high concurrency and large data volumes.
In summary, persistent 100% CPU usage on cloud servers requires investigation from multiple perspectives. First, analyze CPU usage using tools such as `top`, `ps`, and `vmstat` to determine whether the cause is a single abnormal process or excessive overall load. Second, combine network monitoring and log analysis to determine whether DDoS attacks or malicious programs are involved. Third, reduce CPU consumption by optimizing code, tuning the database, adjusting web server parameters, and using caching and asynchronous tasks. Finally, establish monitoring and alerting mechanisms to track server performance in real time. Through systematic investigation and optimization, operations personnel can quickly locate problems and take targeted measures, not only resolving high CPU usage but also improving overall server performance and stability and ensuring business continuity.