Troubleshooting a server at 100% CPU: practice and summary

Problem background

Yesterday afternoon I received an operations alert email. It showed that CPU utilization on the data platform server had reached 98.94% and had stayed above 70% for some time. At first glance this looks like a hardware bottleneck that calls for more capacity. On reflection, though, our business system is neither high-concurrency nor CPU-intensive, so this level of utilization is excessive and a hardware bottleneck should not have arrived so soon. There had to be a problem in the business code logic.

1. Troubleshooting approach

1.1 Locate high-load processes

First, log in to the server and run the top command to check its overall state, then analyze from there.

Comparing the load average against the load evaluation standard for this machine (8 cores) confirms that the server is under high load.

Looking at the resource usage of each process, the process with PID 682 has a high CPU share.
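The comparison of load average to core count can be sketched in Java as follows. This is a minimal sketch, not the check used in the original investigation: the class name LoadCheck and the 0.7 and 1.0 per-core thresholds are assumptions based on a common rule of thumb, and the post does not state which standard it applied.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper: classifies a load average relative to the core count.
final class LoadCheck {
    // Assumed rule of thumb, not taken from the post:
    // load per core >= 1.0 means saturated, >= 0.7 means worth investigating.
    static final double HIGH_RATIO = 0.7;
    static final double OVERLOADED_RATIO = 1.0;

    static double loadPerCore(double loadAverage, int cores) {
        if (cores <= 0) {
            throw new IllegalArgumentException("cores must be positive");
        }
        return loadAverage / cores;
    }

    static String classify(double loadAverage, int cores) {
        double ratio = loadPerCore(loadAverage, cores);
        if (ratio >= OVERLOADED_RATIO) {
            return "OVERLOADED";
        }
        if (ratio >= HIGH_RATIO) {
            return "HIGH";
        }
        return "NORMAL";
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // 1-minute system load average; negative when the platform cannot report it.
        double load = os.getSystemLoadAverage();
        int cores = os.getAvailableProcessors();
        if (load < 0) {
            System.out.println("load average is not available on this platform");
            return;
        }
        System.out.printf("load=%.2f cores=%d -> %s%n", load, cores, classify(load, cores));
    }
}
```

On an 8-core machine, a load average of 8 or more falls into the OVERLOADED band under these assumed thresholds.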

1.2 Locate specific abnormal services

Here we can use the pwdx command to find the working directory of the process from its PID, and from that identify the project and the person responsible for it.

This shows that the process is the web service of the data platform.

1.3 Locate abnormal threads and specific lines of code

The traditional approach generally takes 4 steps:
1. top, sorted by CPU with P: 1040 // first find the PID of the process with the highest load
2. top -Hp <process PID>: 1073 // find the ID of the busiest thread in that process
3. printf "0x%x\n" <thread ID>: 0x431 // convert the thread ID to hexadecimal, ready for searching the jstack output
4. jstack <process PID> | vim +/<hexadecimal thread ID> - // for example: jstack 1040 | vim +/0x431 -
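Steps 3 and 4 above amount to converting a decimal thread ID to hexadecimal and searching a thread dump for it. The sketch below shows that idea in Java. The class JstackNid and its method names are hypothetical, and it assumes the dump lists the OS thread ID as a hexadecimal nid=0x... field, which is what the manual steps rely on; some newer JDK builds may print this field differently.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper mirroring the printf and "vim +/0x..." steps.
final class JstackNid {
    // Same conversion as: printf "0x%x\n" <thread ID>, prefixed with "nid=".
    static String toHexNid(long osThreadId) {
        if (osThreadId < 0) {
            throw new IllegalArgumentException("thread ID must be non-negative");
        }
        return "nid=0x" + Long.toHexString(osThreadId);
    }

    // Returns the dump lines that mention exactly this thread's nid.
    // The trailing word boundary keeps 0x431 from matching 0x4312.
    static List<String> matchingLines(List<String> dumpLines, long osThreadId) {
        Pattern nid = Pattern.compile("\\b" + toHexNid(osThreadId) + "\\b");
        return dumpLines.stream()
                .filter(line -> nid.matcher(line).find())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // 1073 is the thread ID used in the example steps.
        System.out.println(toHexNid(1073)); // prints nid=0x431
    }
}
```

In practice the dump lines would come from the output of jstack for the affected process, for example read from a saved file.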
But when locating a problem in production, every second counts, and these 4 steps are still too tedious and time-consuming. oldratlee of Taobao, mentioned previously, has wrapped the above process into a tool that makes it convenient to locate this kind of problem online.

The result shows that a time utility method in the system is consuming a relatively large share of CPU. With the specific method located, the next step is to check whether its code logic has a performance problem.

2. Root cause analysis

After the analysis above, the problem was traced to a time utility method that was driving up the server load and CPU usage.
Faulty method logic: converts a timestamp into the corresponding date and time format.
Upper-level caller: takes every second from midnight today up to the current time, converts each one into the corresponding format, and puts it into a set that is returned as the result.
Logic layer: this is the query logic behind the real-time reports of the data platform. Real-time report queries arrive at a fixed interval, and a single query calls the method multiple (n) times.
It follows that at 10 am a single query performs 10 × 60 × 60 × n = 36,000n calculations, and the count per query grows linearly through the day, approaching 86,400n near midnight. Because many query requests from modules such as real-time query and real-time alerting each call this method several times, a large amount of CPU is occupied and wasted.
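The arithmetic above can be checked with a short Java sketch. The class and method names are hypothetical, and n is the number of method calls per query as defined in the text.

```java
import java.time.LocalTime;

// Hypothetical helper: estimates how many timestamp conversions one query triggers.
final class QueryCost {
    // One conversion per elapsed second of the day, repeated for each of the n calls.
    static long calculationsPerQuery(LocalTime now, int callsPerQuery) {
        if (callsPerQuery < 0) {
            throw new IllegalArgumentException("callsPerQuery must be non-negative");
        }
        return (long) now.toSecondOfDay() * callsPerQuery;
    }

    public static void main(String[] args) {
        // With n = 1 this shows the per-call cost at different times of day.
        System.out.println(calculationsPerQuery(LocalTime.of(10, 0), 1));      // 36000
        System.out.println(calculationsPerQuery(LocalTime.of(23, 59, 59), 1)); // 86399
    }
}
```

The cost depends only on the time of day and n, which is why the load climbs steadily from morning to midnight even when the query rate stays constant.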

3. Solution

After locating the problem, the first consideration was to reduce the amount of calculation by optimizing the faulty method. Investigation showed that the logic layer never uses the contents of the set returned by this method; it only uses the size of the set. After confirming this, we added a new method that simplifies the calculation (the current time in seconds minus the time in seconds at midnight today) and replaced the calls to the old method, which eliminated the excessive calculation. After the release, server load and CPU usage dropped by a factor of about 30 compared with the abnormal period and returned to normal. With that, the problem was resolved.
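The replacement described above can be illustrated with a sketch that puts the two approaches side by side. The original code is not shown in the post, so everything here is a reconstruction under assumptions: the class and method names are hypothetical, java.time types stand in for the raw timestamps, the date pattern is a guess, and the set is assumed to cover the half-open range from midnight up to but excluding the current second.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.HashSet;
import java.util.Set;

// Hypothetical reconstruction of the slow method and its replacement.
final class SecondsOfDay {
    // Assumed output format; the post does not specify it.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Slow path: formats every second since midnight and collects the results.
    // Cost grows linearly with the time of day.
    static Set<String> formattedSecondsSoFar(LocalDateTime now) {
        LocalDateTime end = now.truncatedTo(ChronoUnit.SECONDS);
        LocalDateTime midnight = end.toLocalDate().atStartOfDay();
        Set<String> result = new HashSet<>();
        for (LocalDateTime t = midnight; t.isBefore(end); t = t.plusSeconds(1)) {
            result.add(FORMAT.format(t));
        }
        return result;
    }

    // Fast path: the caller only needs the count, so subtract instead of enumerating.
    static long secondsSinceMidnight(LocalDateTime now) {
        LocalDateTime end = now.truncatedTo(ChronoUnit.SECONDS);
        return Duration.between(end.toLocalDate().atStartOfDay(), end).getSeconds();
    }

    public static void main(String[] args) {
        // Arbitrary example date; only the time of day matters.
        LocalDateTime tenAm = LocalDateTime.of(2024, 1, 1, 10, 0, 0);
        System.out.println(formattedSecondsSoFar(tenAm).size()); // 36000
        System.out.println(secondsSinceMidnight(tenAm));         // 36000
    }
}
```

Both methods yield the same number, but the second does constant work per call, which is the property the fix depends on.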

4. Summary

When coding, beyond implementing the business logic we must also pay attention to code performance. Merely meeting a business requirement and meeting it more efficiently and elegantly reflect two quite different levels of engineering ability, and the latter is an engineer's core competitiveness.
After the code is written, review it more and think more about whether it can be implemented in a better way.
Do not overlook any small detail in a production issue! The devil is in the details. Engineers need a thirst for knowledge and a drive for excellence; only then can they keep growing and improving.

