
Which server CPU has the best memory performance: AMD or Intel?

Benchmark configuration and method

We tested EPYC 7601, Skylake, and Cascade Lake machines.
Except for the EPYC 7742 server, which ran Ubuntu 19.04, all tests were performed on Ubuntu Server 18.04 LTS. Ubuntu 19.04 has verified support for Rome, and we had only two weeks of testing time; Linux kernel 4.19 and later support Rome (including the X2APIC/IOMMU patches needed to use all 256 threads).
The server configurations in this test differ, and so do the DRAM capacities, because the Xeons have six memory channels while the EPYC processors have eight. All of the tests fit within 128 GB, so DRAM capacity should not have much impact on performance.
AMD "Daytona", dual EPYC 7742
AMD provided a "Daytona XT" server, a reference platform built by the ODM Quanta (D52BQ-2U).
Although the 225 W TDP CPUs require extra cooling capacity, the system still runs fine at room temperature.
AMD EPYC 7601 (2U chassis)
Intel Xeon "Purley" server S2P2S 3Q (2U chassis), with Hyper-Threading and Intel virtualization acceleration enabled

Memory subsystem: bandwidth

As mentioned earlier, measuring full bandwidth potential with John McCalpin's STREAM benchmark has become an exercise in extreme tuning that demands a very deep understanding of the platform.
With our previous binaries, neither the first- nor the second-generation EPYC can exceed 200-210 GB/s. Even though we now have eight channels of DDR4-3200, it feels as if we have hit a bandwidth wall. We therefore use the best binaries produced by Intel and AMD, built with AVX-512 (Intel) and AVX2 (AMD).
Results are expressed in gigabytes per second.
With the number of NUMA nodes per socket set to 4 (NPS4), AMD reports up to 353 GB/s. NPS4 makes each CCX access only the memory controller with the lowest latency on the central I/O hub die.
These numbers matter only for the small portion of HPC applications optimized with AVX(2/512). Compared with Intel's best 28-core SKU, AMD claims a 45% advantage; we have every reason to believe the claim, but it is relevant only to specific domains.
For the other 95% and more of workloads, memory latency has a much greater impact than peak bandwidth.

Memory subsystem: latency

For reasons of scalability and cost, AMD shares its core design across mobile, desktop, and server parts. Rome's Core Complex (CCX) is unchanged from the previous generation.
What has changed is that each CCX now communicates with a central I/O hub instead of with the other three dies of the previous 4-node NUMA layout. (The NPS4 switch can still pin each CCD to its own sIOD quadrant and the local memory controllers there, avoiding hops across sIOD quadrants and the latency they add.) Since the performance of modern processors relies heavily on the cache subsystem, we were very curious what latency a server thread sees as it accesses more and more pages across the cache hierarchy.
We use our own in-house latency test. We are particularly interested in the processor's structural latency, meaning we try to exclude TLB misses. For DRAM, where latency measurements between platforms become more complicated, we fall back to fully random access patterns.
On October 1st of this year we discovered that the initial latency figures were inaccurate, and we have since updated this article with more representative numbers from a new test tool.
Things get interesting once you look past the L2 depths. AMD's L2 ends at 512 KB where Intel's extends to 1 MB, yet AMD's L2 is faster than Intel's larger cache.
AMD's speed advantage becomes even more obvious in the L3, which is significantly faster than on Intel's chips. The big caveat is that on the EPYC 7742 the L3 is local to each 4-core CCX; at 16 MB per CCX, it is now twice the size of the 7601's.
For the AMD platform this is currently a double-edged sword. On one hand, the total cache of an EPYC processor is much larger: the 7742 carries 256 MB in total, four times that of the 7601 and far beyond anything Intel offers (the Xeon 8180, 8176, and 8280 have 38.5 MB of total cache; the Xeon E5-2699 v4 has 55 MB).
The downside is that, despite all that cache, the EPYC 7742 behaves like 16 CCXs, each with its own very fast 16 MB L3. The 64-core chip presents itself as one large NUMA node, but it is essentially 16 clusters of 4 cores, each cluster with 16 MB of L3. Once your working set exceeds 16 MB, the prefetchers can soften the blow, but you are going out to main DRAM.
The odd consequence is that accessing data that resides on the same die but in a different CCX is as slow as accessing data on a different die entirely. No matter where the other CCX sits, adjacent on the same die or on the far side of the package, the access must still travel over the Infinity Fabric to the I/O die and back.
Is this necessarily a bad thing? Mostly not. First, in most applications the L3 satisfies only a low percentage of accesses anyway. Second, each core in a CCX has no less than 4 MB of L3 available, far more than an Intel core (1.375 MB), which gives the prefetchers more room to make sure data is resident before it is needed.
Database performance, however, may still suffer. Keeping most of an index in cache improves performance, and OLTP accesses tend to be very random; on top of that, the relatively slow communication over the central hub slows down synchronization. Intel's claim that the OLTP benchmark HammerDB runs 60% faster on the 28-core Xeon 8280 than on the EPYC 7601 is consistent with this.
But most high-end CPUs spend their time on many parallel applications: microservices, Docker containers, virtual machines, map/reduce over smaller data blocks, and parallel HPC jobs. In almost all of these cases, 16 MB of L3 per 4 cores is plenty.
Thinking it through, an 8-core virtual machine spanning two CCXs might lose a little performance, but not much.
In short, by not moving to a larger 8-core CCX, AMD left some performance on the table. We look forward to seeing how the platform evolves.

Memory subsystem: TinyMemBench

We used Andrei's custom memory latency test to verify the LMBench numbers.
The latency tool can also measure bandwidth, and it shows much more clearly that we are accessing DRAM once we exceed 16 MB. When Andrei compared our numbers with his Ryzen 9 3900X figures, he noted:

Compared with Ryzen 3000, some of the prefetchers seem to have been tuned down for Rome; they are not as aggressive as on the consumer parts. We believe AMD made this choice because a considerable number of applications (Java and HPC) suffer if the prefetchers take up too much bandwidth. Reducing the aggressiveness of Rome's prefetchers can therefore help real-world performance.

Measuring the average time of random memory accesses across buffers of different sizes shows that the larger the buffer, the greater the relative contribution of TLB misses, L1/L2 cache misses, and DRAM accesses. All of these numbers represent extra time to be added on top of the L1 cache latency (4 cycles).
We also tested dual random reads, because we wanted to know how the memory system handles multiple read requests.
The figure shows how the larger L3 of the EPYC 7742 yields lower latency between 4 and 16 MB than the EPYC 7601. The L3 inside a CCX is also very fast (2 to 8 MB) compared with Intel's mesh (8280) and ring (E5) topologies.
However, once we access more than 16 MB, Intel has a clear advantage: its shared L3 is slower but much larger. When we tested the new EPYC in a more advanced NUMA configuration (NPS=4, meaning four NUMA nodes per socket), 64 MB latency dropped from 129 ns to 119 ns. Quoting AMD:
"In NPS4, the NUMA domains are reported to software in such a way that a chiplet always accesses the near (2-channel) DRAM. In NPS1, the 8 channels are hardware-interleaved, so there is additional latency to reach the more distant pairs of DRAM channels, with the furthest pair roughly 20-25 ns (depending on DRAM speed) further away than the nearest. In general, the three pairs beyond the physically closest one add about +6-8 ns, +8-10 ns, and +20-25 ns respectively."

