Yesterday afternoon I received an operations alert email. It showed that CPU utilization on the data platform server had reached 98.94% and had stayed above 70% for some time. At first glance it looked as though the hardware had hit a bottleneck and needed to be scaled up, but on reflection our business system is neither a high-concurrency nor a CPU-intensive application. That level of utilization was implausible, and a hardware bottleneck should not have arrived so soon. There had to be a problem in the business code logic.
1. Troubleshooting approach
1.1 Locate high-load processes
First, log in to the server and run the top command to see what the server is actually doing, then analyze from there.
Comparing the load average against the number of cores (8 on this machine) confirms that the server is under high load.
Looking at the resource usage of each process, the process with PID 682 has a noticeably high CPU share.
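The same load-versus-cores comparison can also be made from inside a JVM. The following is a minimal sketch, not part of the original investigation: the class name LoadCheck and the helper loadPerCore are hypothetical, and it relies on the standard OperatingSystemMXBean, whose load average is negative on platforms that do not report one.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class LoadCheck {
    // Normalize the load average by the core count. A value that stays
    // near or above 1.0 means runnable tasks are queuing for CPU time.
    static double loadPerCore(double loadAverage, int cores) {
        return loadAverage / cores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage();   // 1-minute average
        int cores = os.getAvailableProcessors();
        if (load < 0) {
            System.out.println("load average is not available on this platform");
            return;
        }
        System.out.printf("load=%.2f cores=%d per-core=%.2f%n",
                load, cores, loadPerCore(load, cores));
    }
}
```

On an 8-core machine, a load average of 8.0 gives a per-core value of 1.0, meaning every core is fully occupied.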
1.2 Locate specific abnormal services
Here we can use the pwdx command to find the working directory of the process from its PID (for example, pwdx 682), and from that path identify the project and the person responsible for it.
It can be concluded that the process corresponds to the web service of the data platform.
1.3 Locate abnormal threads and specific lines of code
The traditional approach generally takes 4 steps:
1. top, then press P to sort by CPU usage: 1040 // First find the PID of the process with the highest load
2. top -Hp <process PID>: 1073 // Find the ID of the busiest thread in that process
3. printf "0x%x\n" <thread ID>: 0x431 // Convert the thread ID to hexadecimal, ready for searching the jstack output
4. jstack <process PID> | vim +/<hexadecimal thread ID> - // For example: jstack 1040 | vim +/0x431 -
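Step 3 is needed because top shows thread IDs in decimal, while a HotSpot jstack dump labels each thread with its native ID in hexadecimal, as nid=0x.... The sketch below shows only that conversion, using the example value from the steps above. The class and method names are hypothetical.

```java
public class ThreadIdHex {
    // Convert a decimal thread ID (as shown by top -Hp) to the
    // lowercase hex form that jstack uses, e.g. 1073 -> "0x431".
    static String toHex(long threadId) {
        return "0x" + Long.toHexString(threadId);
    }

    // Build the token to search for in a jstack dump.
    static String jstackToken(long threadId) {
        return "nid=" + toHex(threadId);
    }

    public static void main(String[] args) {
        System.out.println(toHex(1073));        // prints 0x431
        System.out.println(jstackToken(1073));  // prints nid=0x431
    }
}
```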
But when troubleshooting a production problem, every second counts, and the 4 steps above are still too tedious and time-consuming. oldratlee of Taobao, whom I have mentioned before, wrapped this process into a tool, show-busy-java-threads.sh, which makes it convenient to locate this kind of problem in production.
Its output shows that a time utility method in the system is consuming a relatively large share of CPU. With the specific method located, the next step is to check whether its code logic has a performance problem.
2. Root cause analysis
The analysis above finally traced the high server load and CPU usage to a time utility method.
Abnormal method logic: converts a timestamp into the corresponding date and time format;
Calling method: computes every second from midnight today up to the current time, converts each one into that format, and puts the results in a set that it returns;
Logic layer: the query logic for the data platform's real-time reports. Real-time report queries arrive at a fixed interval, and a single query calls the method multiple (n) times.
It follows that if the current time is 10 a.m., a single query performs 10 × 60 × 60 × n = 36,000n calculations, and the number of calculations per query grows linearly through the day, peaking just before midnight. Because modules such as real-time query and real-time alarm issue a large number of query requests, each calling this method several times, a large amount of CPU is consumed and wasted.
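To make the cost concrete, here is a minimal sketch of the pattern described above. It is a reconstruction for illustration, not the platform's actual code: the class name, method name, and date pattern are assumptions. One call at 10 a.m. formats 36,000 timestamps.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashSet;
import java.util.Set;

public class SecondsOfTodaySlow {
    // Assumed output format; the original format is not given in the text.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Formats every second from midnight up to (but not including) 'now'
    // and collects the strings in a set: O(seconds elapsed today) per call.
    static Set<String> secondsOfToday(LocalDateTime now) {
        Set<String> result = new LinkedHashSet<>();
        LocalDateTime t = now.toLocalDate().atStartOfDay();
        while (t.isBefore(now)) {
            result.add(t.format(FMT));
            t = t.plusSeconds(1);
        }
        return result;
    }

    public static void main(String[] args) {
        LocalDateTime tenAm = LocalDateTime.of(2024, 1, 1, 10, 0, 0);
        System.out.println(secondsOfToday(tenAm).size()); // prints 36000
    }
}
```

With n calls per query, the work per query is 36,000n at 10 a.m. and approaches 86,400n just before midnight.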
With the problem located, the first consideration was to reduce the number of calculations by optimizing the abnormal method. Investigation showed that the logic layer never uses the contents of the set this method returns; it only uses the set's size. Having confirmed this, we wrote a new method that reduces the calculation to a single subtraction (the current time in seconds minus the time of midnight today in seconds) and replaced the calls to the old method, which eliminated the excess computation. After the release we monitored the server load and CPU usage: compared with the abnormal period they had dropped by a factor of about 30 and were back to normal. With that, the problem was resolved.
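A minimal sketch of such a replacement, assuming the caller only needs the count of seconds elapsed since midnight. The names are hypothetical, and LocalDateTime is used so daylight-saving shifts are deliberately ignored.

```java
import java.time.Duration;
import java.time.LocalDateTime;

public class SecondsSinceMidnight {
    // Constant-time replacement: subtract midnight from the current time
    // instead of materializing one formatted string per elapsed second.
    static long secondsSinceMidnight(LocalDateTime now) {
        LocalDateTime midnight = now.toLocalDate().atStartOfDay();
        return Duration.between(midnight, now).getSeconds();
    }

    public static void main(String[] args) {
        LocalDateTime tenAm = LocalDateTime.of(2024, 1, 1, 10, 0, 0);
        System.out.println(secondsSinceMidnight(tenAm)); // prints 36000
    }
}
```

The result equals the size of the set the old method would have built, but the cost no longer depends on the time of day.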
When writing code, besides implementing the business logic, we must also pay attention to performance. Merely meeting a business requirement and meeting it efficiently and elegantly reflect two very different levels of engineering ability, and the latter is an engineer's core competitiveness.
After the code is written, review it more than once and keep asking whether it could be implemented in a better way.
Don’t overlook any small detail in a production issue! The devil is in the details. Engineers need a thirst for knowledge and a drive for excellence. Only then can they keep growing and improving.