Although SPEC CPU2006 has been superseded by SPEC CPU2017, we have accumulated a great deal of experience with SPEC2006. Given the problems we ran into in our data center infrastructure, it is our best choice for a first round of raw performance analysis.
Single-threaded performance still matters, especially in maintenance and setup situations. In many cases the user is running a large bash script, trying out a very complex SQL query, or configuring new software, and is not using all of the cores at all.
Although SPEC CPU2006 is oriented more towards high-performance computing and workstations, it contains a wide variety of integer workloads. We firmly believe that we should try to mimic how performance-critical software is compiled rather than chase the highest score. To that end, we:
Use 64-bit GCC: currently the most commonly used compiler on Linux and a very good all-round compiler for integer workloads; it does not try to "break" benchmarks (libquantum...) and does not target only specific architectures;
Use versions 7.4 and 8.3: the standard compilers shipped with Ubuntu 18.04 LTS and 19.04;
Use -Ofast -fno-strict-aliasing optimization: a good balance between performance and keeping things simple;
Add "-std=gnu89" to the portability settings to solve the problem that some tests cannot be compiled.
The ultimate goal is to measure performance in applications that are not actively optimized, where, for whatever reason, a multi-thread-unfriendly task keeps us waiting. The downside is that there are still quite a few cases where GCC generates sub-optimal code, which creates a sizeable gap compared with results from ICC or AOCC, compilers that are tuned to find specific optimizations in the SPEC code.
First, the single-threaded results. It is worth noting that, thanks to turbo, all processors run at clock speeds above their base clock:
Xeon E5-2699 v4 ("Broadwell") can be upgraded to 6 GHz. Note: These are the old results compiled with GCC 5.4;
Xeon 8176 ("Skylake-SP") can be increased to 8 GHz;
EPYC 7601 ("Naples") can be increased to 2 GHz;
The frequency of EPYC 7742 ("Rome") is increased to 4 GHz. The result was compiled with GCC 7.4 and 8.3.
Unfortunately, we could not test the Intel Xeon 8280 in time. However, the Xeon 8280 should deliver very similar results; the main difference is that it clocks about 5% higher (4.0 GHz vs. 3.8 GHz), so we expect its results to be 3-5% higher than those of the Xeon 8176.
Under SPEC's licensing rules, since these results have not been officially submitted to the SPEC database, we must declare them as estimates.
SPEC CPU analysis is always complicated, as it mixes the effects of the code generated by the compiler and of the CPU architecture.
First of all, the most interesting data point is that the code generated by GCC 8 appears to be a big improvement for the EPYC processors. We repeated the single-threaded tests three times and the results were consistent.
Hmmer is one of the branch-intensive benchmarks, and the other two workloads that are more affected by branch prediction (gobmk and sjeng, which have slightly higher branch-miss percentages) also benefit from the new TAGE predictor, which performs better on the second-generation EPYC.
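As a rough, hypothetical illustration (not taken from gobmk or sjeng), the kind of code that stresses a branch predictor looks like the loop below: the direction of each branch depends on the data itself rather than on a regular pattern, which is exactly where a long-history predictor such as TAGE pays off.

```c
#include <stddef.h>

/* Hypothetical example of data-dependent branching: the outcome of each
 * 'if' is decided by the contents of 'board', so a simple predictor that
 * relies on regular patterns mispredicts often, while a history-based
 * predictor can learn correlations between earlier and later branches. */
long score_positions(const unsigned char *board, size_t n, unsigned char threshold)
{
    long score = 0;
    for (size_t i = 0; i < n; i++) {
        if (board[i] > threshold)        /* data-dependent branch */
            score += 2;
        else if (board[i] == threshold)  /* another hard-to-predict branch */
            score -= 1;
    }
    return score;
}
```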
Why the low-IPC omnetpp ("network sim") does not show any improvement is a mystery to us; we expected the larger L3 cache to help. However, this is a test that really likes large caches, so the Intel Xeon processors have a big advantage here (38.5-55 MB of L3).
The video encoding benchmark "h264ref" also relies on the L3 cache to some extent, but it depends even more on DRAM bandwidth, and the EPYC 7002 obviously has more DRAM bandwidth available.
The pointer-chasing benchmarks (XML processing and pathfinding) performed poorly on the previous-generation EPYC (compared to the Xeons), but show a very significant improvement on the EPYC 7002.
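For readers unfamiliar with the term, "pointer chasing" means that each memory access depends on the previous one, so performance is dominated by load latency rather than bandwidth; the hypothetical fragment below (not from the XML or pathfinding benchmarks themselves) shows the pattern.

```c
#include <stddef.h>

/* Hypothetical linked-list walk: the address of the next node is only known
 * after the current load completes, so every step pays close to the full
 * cache or memory latency. Larger caches and lower memory latency, not raw
 * bandwidth, are what speed this up. */
struct node {
    struct node *next;
    int value;
};

long sum_list(const struct node *head)
{
    long sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}
```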
Multi-core SPEC CPU2006
For the record, we believe that the SPEC CPU "rate" metric is not of much value for estimating server CPU performance. Most applications do not run many completely independent processes in parallel; there is at least some interaction between threads.
We need to emphasize this again: SPECint rate testing may not be realistic. Launching 112 to 256 instances creates a huge bandwidth bottleneck, with no synchronization and a perfectly uniform 100% CPU load, all of which is very unrealistic for most integer applications.
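To make clear what a rate run actually measures, here is a minimal, hypothetical sketch (not SPEC's own harness; the binary path and default copy count are invented) that launches N completely independent copies of a benchmark and waits for them. There is no communication or synchronization between the copies, which is precisely the objection raised above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical rate-style driver: fork N independent copies of a benchmark
 * binary and wait for all of them. The copies never talk to each other,
 * unlike threads in most real server workloads. */
int main(int argc, char **argv)
{
    const char *binary = (argc > 1) ? argv[1] : "./benchmark";
    int copies = (argc > 2) ? atoi(argv[2]) : 4;

    for (int i = 0; i < copies; i++) {
        pid_t pid = fork();
        if (pid == 0) {                       /* child: run one instance */
            execl(binary, binary, (char *)NULL);
            perror("execl");
            _exit(1);
        } else if (pid < 0) {
            perror("fork");
            return 1;
        }
    }
    for (int i = 0; i < copies; i++)          /* parent: wait for all copies */
        wait(NULL);
    return 0;
}
```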
The SPECint rate estimates emphasize all of the new EPYC processor's strengths: more cores and more bandwidth. At the same time, they ignore one small weakness: higher internal latency. So this is pretty much the ideal scenario for the EPYC processors.
However, even if we account for AMD's 45% memory bandwidth advantage and the roughly 7% to 8% extra performance that Intel's latest chip (the 8280) offers, the result is still astonishing: on average, the SPECint rate of the EPYC 7742 is twice that of the best available Intel Xeon processor.
Interestingly, we see that most of the rate benchmarks run at the P1 clock, or the highest p-state. For example, this is what we observed when running libquantum:
Some benchmarks, such as h264ref, run at a lower clock.
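For reference, these clock observations can be reproduced on a Linux box without special tooling; the hypothetical helper below simply reads the cpufreq sysfs entry for one core while the benchmark is running (the path is the standard one on Linux systems with cpufreq enabled, which is an assumption about the test machine).

```c
#include <stdio.h>

/* Hypothetical helper: print the current frequency (reported in kHz) of core 0
 * via the cpufreq driver. Run it in a loop alongside the benchmark to see
 * whether the cores sit at their boost clock or drop lower. */
int main(void)
{
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    unsigned long khz = 0;
    if (fscanf(f, "%lu", &khz) == 1)
        printf("cpu0: %.2f GHz\n", khz / 1e6);
    fclose(f);
    return 0;
}
```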
Our current servers do not allow us to make accurate power measurements, but it would be astonishing if the AMD EPYC 7742 could stay within a 225 W envelope while running integer workloads on all cores at 3.2 GHz. Long story short: the new EPYC 7742 appears to sustain higher all-core clocks on integer workloads than comparable Intel models.