
How much better is the SPEC performance of AMD's second-generation server CPU than Intel's?



Although SPEC CPU2006 has been superseded by SPEC CPU2017, we have built up a lot of experience with SPEC CPU2006. Given the problems we ran into with our data center infrastructure, it was our best option for a first round of raw performance analysis.

Single-threaded performance still matters a great deal, especially in maintenance and setup situations. Think of running a large bash script, trying out a very complex SQL query, or configuring new software: in many of these cases the user does not use all the cores at all.

Although SPEC CPU2006 leans towards high-performance computing and workstations, it contains a wide variety of integer workloads. We firmly believe that we should mimic how performance-critical software is actually compiled rather than chase the highest possible score. To this end, we:

Use 64-bit GCC: currently the most commonly used compiler on Linux and a very good all-round compiler for integer workloads; it does not try to "break" benchmarks (libquantum...) or favor a specific architecture;
Use GCC versions 7.4 and 8.3: the standard compilers shipped with Ubuntu 18.04 LTS and 19.04;
Use the -Ofast -fno-strict-aliasing optimization flags: a good balance between performance and keeping things simple;
Add "-std=gnu89" to the portability settings to work around some tests that otherwise fail to compile.
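The settings listed above can be collected into a single compiler invocation. The following is a minimal sketch, not the configuration actually used for these runs: the helper name `build_gcc_command`, the placeholder file names, and the use of `-m64` to request a 64-bit build are assumptions made for illustration.

```python
from typing import List

# Flags taken from the list above.
OPTIMIZATION_FLAGS = ["-Ofast", "-fno-strict-aliasing"]
PORTABILITY_FLAGS = ["-std=gnu89"]


def build_gcc_command(source: str, output: str, gcc: str = "gcc") -> List[str]:
    """Return an argv list for compiling one C source file (hypothetical helper)."""
    # -m64 is an assumption: it is one way to request 64-bit code from GCC on x86-64.
    return [gcc, "-m64", *OPTIMIZATION_FLAGS, *PORTABILITY_FLAGS, source, "-o", output]


if __name__ == "__main__":
    # "benchmark.c" is a placeholder name, not a real SPEC source file.
    print(" ".join(build_gcc_command("benchmark.c", "benchmark")))
```

Keeping the flags in two named lists mirrors the split between optimization and portability settings described in the text.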

The ultimate goal is to measure performance in applications that are not aggressively optimized, where, as is often the case, a task that is unfriendly to multi-threading keeps us waiting. The disadvantage is that there are still quite a few situations where GCC generates sub-optimal code, which causes quite a stir when the results are compared with those of compilers such as ICC or AOCC, which are tuned to look for specific optimizations in SPEC code.

First, the single-threaded results. Note that thanks to turbo technology, all processors run at clock speeds above their base clocks.

Xeon E5-2699 v4 ("Broadwell") boosts up to 3.6 GHz. Note: these are older results compiled with GCC 5.4;
Xeon 8176 ("Skylake-SP") boosts up to 3.8 GHz;
EPYC 7601 ("Naples") boosts up to 3.2 GHz;
EPYC 7742 ("Rome") boosts up to 3.4 GHz. Results were compiled with GCC 7.4 and 8.3.

Unfortunately, we could not test the Intel Xeon 8280 in time. However, it should deliver very similar results, the main difference being a roughly 5% higher clock speed (4 GHz vs. 3.8 GHz). We therefore expect its results to be 3-5% higher than those of the Xeon 8176.
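The clock-speed arithmetic behind that estimate can be checked quickly. This is only a sketch: `scaling_efficiency` is a hypothetical parameter describing how much of a clock increase turns into benchmark score, and the 0.6 lower bound is an illustrative assumption, not a measured value.

```python
def clock_uplift(new_ghz: float, old_ghz: float) -> float:
    """Relative clock increase of new_ghz over old_ghz, as a fraction."""
    return new_ghz / old_ghz - 1.0


def expected_gain(uplift: float, scaling_efficiency: float) -> float:
    """Score gain if only part of the clock increase shows up in the result.

    scaling_efficiency is a hypothetical factor between 0 and 1.
    """
    return uplift * scaling_efficiency


if __name__ == "__main__":
    uplift = clock_uplift(4.0, 3.8)  # Xeon 8280 vs. Xeon 8176 boost clocks
    print(f"clock uplift: {uplift:.1%}")
    # Assumed range: 60% to 100% of the clock gain reaches the score.
    for efficiency in (0.6, 1.0):
        print(f"efficiency {efficiency:.1f}: {expected_gain(uplift, efficiency):.1%}")
```

Under these assumptions the clock difference alone lands in roughly the 3-5% range quoted above; memory-bound sub-benchmarks would sit near the lower end.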

In accordance with the SPEC licensing rules, since these results have not been officially submitted to the SPEC database, we must declare them as estimated results.



SPEC CPU analysis is always complicated, because the results reflect a mix of the code the compiler generates and the CPU architecture.



First of all, the most interesting data point is that the code generated by GCC 8 appears to improve results considerably on the EPYC processors. We repeated the single-threaded test three times, and the results were consistent.

hmmer is one of the more branch-intensive benchmarks. The other two workloads where branch prediction has a larger impact (a somewhat higher percentage of branch misses), gobmk and sjeng, perform consistently better on the second-generation EPYC with its new TAGE predictor.

Why the low-IPC omnetpp ("network sim") does not show any improvement is a mystery to us; we expected the larger L3 cache to help. However, this is a test that loves very large caches, so the Intel Xeons have the advantage (38.5-55 MB L3).

The video encoding benchmark "h264ref" also relies on the L3 cache to some extent, but it depends more on DRAM bandwidth. It is clear that the EPYC 7002 has more DRAM bandwidth available.

The pointer-chasing benchmarks (XML processing and path finding) performed poorly on the previous-generation EPYC (compared to the Xeons), but show very significant improvements on the EPYC 7002.

Multi-core SPEC CPU2006

For the record, we do not believe that the SPEC CPU "rate" metric has much value for estimating server CPU performance. Most applications do not run many completely independent processes in parallel; there is at least some interaction between threads.



We need to emphasize this once more: SPECint rate testing is probably unrealistic. Launching 112 to 256 instances creates a massive bandwidth bottleneck, involves no synchronization, and keeps the CPU load at a constant 100%, all of which is very unrealistic for most integer applications.
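The instance counts quoted above are consistent with one benchmark copy per hardware thread. The sketch below assumes dual-socket systems with two threads per core, which this section does not state explicitly; the core counts are the public specifications of the two CPUs, and `rate_copies` is a hypothetical helper name.

```python
def rate_copies(sockets: int, cores_per_socket: int, threads_per_core: int = 2) -> int:
    """Number of independent benchmark copies when running one per hardware thread."""
    return sockets * cores_per_socket * threads_per_core


if __name__ == "__main__":
    # Assumed configurations: dual-socket, two threads per core.
    systems = {
        "Xeon 8176 (28 cores)": rate_copies(sockets=2, cores_per_socket=28),
        "EPYC 7742 (64 cores)": rate_copies(sockets=2, cores_per_socket=64),
    }
    for name, copies in systems.items():
        print(f"{name}: {copies} copies")
```

Each copy is a fully independent process, which is why the text notes there is no synchronization between them.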

The SPECint rate estimates emphasize all the strengths of the new EPYC processor: more cores and higher bandwidth. At the same time, they ignore one of its minor drawbacks: higher internal latency. So this is the ideal scenario for the EPYC processors.

However, even taking into account that AMD has a 45% memory bandwidth advantage and that Intel's latest chip (8280) offers about 7% to 8% better performance, this is still remarkable. On average, the SPECint rate of the EPYC 7742 is twice that of the best socketed Intel Xeon processor available.

Interestingly, we saw that most rate benchmarks ran at the P1 clock or at the highest P-state minus one. For example, this is what we saw when running libquantum:



Some benchmarks, such as h264ref, ran at lower clocks.




The current server does not allow us to take accurate power measurements, but it would be very impressive if the AMD EPYC 7742 can stay within its 225 W TDP while running integer workloads on all cores at 3.2 GHz. Long story short: the new EPYC 7742 appears able to sustain higher clocks than comparable Intel models while running integer workloads on all cores.
