Benchmark configuration and method
We benchmarked EPYC 7601, Skylake, and Cascade Lake machines.
All tests were performed on Ubuntu Server 18.04 LTS, except for the EPYC 7742 server, which ran Ubuntu 19.04: 19.04 has verified support for Rome, and we had only about two weeks of testing time. Linux kernel 4.19 and later support Rome, including the X2APIC/IOMMU patches required to make use of all 256 threads.
The server configurations differ, and so does the amount of DRAM installed; this is simply because the Xeons have six memory channels while the EPYC processors have eight. All of our tests fit within 128 GB, so DRAM capacity should have little impact on performance.
AMD Daytona (dual EPYC 7742)
AMD provided a "Daytona XT" server, a reference platform built by the ODM Quanta (D52BQ-2U).
Although the 225W TDP CPUs call for heftier heatsinks and extra cooling, the system ran fine at room temperature.
AMD EPYC 7601 (2U chassis)
Intel Xeon "Purley" server-S2P2S 3Q (2U chassis) supports hyper-threading and Intel virtualization acceleration
Memory subsystem: bandwidth
As mentioned earlier, measuring full bandwidth potential with John McCalpin's STREAM benchmark has become an exercise in extreme tuning that demands a deep understanding of the platform.
With our older binaries, neither the first- nor the second-generation EPYC could get past 200-210 GB/s. Despite eight channels of DDR4-3200, it felt like we had hit a bandwidth wall. We therefore report the best results from the binaries supplied by Intel and AMD, compiled with AVX-512 (Intel) and AVX2 (AMD).
Results are expressed in gigabytes per second.
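The vendor binaries are heavily tuned AVX builds, but the heart of STREAM is a handful of simple loops. For a feel for what is actually being measured, here is a minimal Triad-style sketch in C with OpenMP; it is our own illustration, not the tuned code behind these results, and the array size and compiler flags are assumptions:

/* Minimal STREAM-Triad-style sketch. Illustrative only: the binaries
 * used in this test are heavily tuned vendor AVX builds.
 * Build (assumption): gcc -O3 -fopenmp -march=native stream_sketch.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1UL << 26)   /* 64M doubles (~512 MB) per array: far past any cache */

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    /* Parallel first-touch init: spreads pages across all NUMA nodes */
    #pragma omp parallel for
    for (size_t i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];                /* Triad: a = b + scalar*c */
    double t1 = omp_get_wtime();

    /* Triad touches three arrays per iteration: two reads plus one write */
    printf("Triad: %.1f GB/s (check %.1f)\n",
           3.0 * N * sizeof(double) / (t1 - t0) / 1e9, a[N / 2]);
    return 0;
}

The parallel first-touch initialization matters on machines like these: it spreads the arrays across all memory controllers, without which a single NUMA node would cap the measured bandwidth.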
AMD can reach even higher numbers by setting the number of NUMA nodes per socket (NPS) to 4. With NPS4, AMD reports up to 353 GB/s. NPS4 makes each CCX access only the memory controllers with the lowest latency on the central I/O hub chip.
These numbers only matter for the small fraction of HPC applications that are carefully AVX(-256/-512) optimized. AMD claims a 45% advantage over Intel's best 28-core SKUs, and we have every reason to believe them, but it is relevant only to those niches.
For the other 95+% of applications, memory latency has a much greater impact than peak bandwidth.
Memory subsystem: latency
For reasons of cost and scalability, AMD chose to share one core design across mobile, desktop, and server parts. Rome's Core Complex (CCX) is still the same as in the previous generation.
The difference is that each CCX now talks to a central I/O hub instead of to four dies in a four-node NUMA layout. (That behavior can still be approximated with the NPS4 switch, which keeps each CCD local to the memory controllers in its sIOD quadrant and avoids hops between sIOD quadrants, saving some latency.) Since the performance of modern processors depends heavily on the cache subsystem, we were very curious about the latency a server thread sees as it touches more and more pages across the cache hierarchy.
We used our own in-house latency test. We are particularly interested in the structural latency of the processors, meaning we try to factor out TLB misses. For the DRAM region, where latency measurements between platforms get more complicated, we revert to fully random access patterns.
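Our in-house tool is not public, but the core technique behind this kind of measurement is a dependent pointer chase: every load address comes from the previous load, so nothing can overlap and the chain exposes the raw access latency. Below is a minimal sketch under our own assumptions (a 64 MB buffer, plain 4 KB pages); the real tool additionally uses large pages to factor out TLB misses, as described above.

/* Minimal dependent pointer-chase latency sketch. Every load address
 * comes from the previous load, so accesses cannot overlap and the
 * chain exposes raw access latency. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void) {
    size_t bytes = 64UL << 20;              /* 64 MB: well past any L3 */
    size_t n = bytes / sizeof(void *);
    void **buf = malloc(bytes);
    size_t *idx = malloc(n * sizeof(size_t));
    if (!buf || !idx) return 1;

    /* Shuffle 0..n-1 (Fisher-Yates), then link the buffer into one
     * random cycle that visits every slot exactly once. */
    for (size_t i = 0; i < n; i++) idx[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        buf[idx[i]] = &buf[idx[(i + 1) % n]];

    void **p = &buf[idx[0]];
    size_t iters = 1UL << 24;
    double t0 = now_ns();
    for (size_t i = 0; i < iters; i++)
        p = (void **)*p;                    /* serialized by data dependency */
    double t1 = now_ns();

    printf("avg latency: %.1f ns (end %p)\n", (t1 - t0) / iters, (void *)p);
    return 0;
}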
On October 1st we discovered that our initial latency figures were inaccurate, and we have updated this article with more representative numbers from a new version of the test tool.
Things get interesting once you look beyond the L2. AMD's L2 runs out at 512 KB where Intel's continues to 1 MB, yet AMD's smaller L2 holds a latency advantage over Intel's larger cache.
AMD's latency advantage becomes even more pronounced in the L3, which is significantly faster than on Intel's chips. The big caveat is that AMD's L3 is local to a 4-core CCX; on the EPYC 7742 each CCX slice is 16 MB, double that of the 7601.
For now this is a double-edged sword for the AMD platform. On the one hand, the total cache of the EPYC processors is much larger: the 7742 carries a full 256 MB (16 CCXs at 16 MB each), four times that of the 7601, and both dwarf Intel's parts (the Xeon 8180, 8176, and 8280 have 38.5 MB of total cache; the Xeon E5-2699 v4 has 55 MB).
The disadvantage is that, for all that capacity, the EPYC 7742 behaves like 16 separate CCXs, each with its own very fast 16 MB L3. Although the 64 cores present themselves as one large NUMA node, the chip is essentially 16 clusters of 4 cores, each with 16 MB of L3. Go past 16 MB and, although the prefetchers can soften the blow, you are out in main DRAM.
The odd thing is that accessing data that resides on the same die but in a different CCX is as slow as accessing data on an entirely different die. That is because no matter where the other CCX sits, nearby on the same die or across the package, the access still has to travel over the Infinity Fabric to the I/O die and back.
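One way to see this effect directly (a sketch under our own assumptions, not the tool behind these measurements) is a cache-line ping-pong between two pinned threads. With 4 cores per CCX, pinning the pair within one CCX keeps the bounce inside the local L3, while pinning them to different CCXs forces it over the fabric; the gap shows up immediately in the round-trip time. Which core IDs belong to which CCX is machine-specific, so the values below are placeholders:

/* Cache-line ping-pong between two pinned threads (sketch). Running it
 * with the peer on the same CCX, and then on a different CCX or die,
 * shows the extra round trip over Infinity Fabric to the I/O die.
 * Core numbers are assumptions; check the real topology with lstopo.
 * Build (assumption): gcc -O2 -pthread pingpong.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000

static _Atomic int flag = 0;     /* the cache line bounced between cores */

static void pin(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *pong(void *arg) {
    pin(*(int *)arg);
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load(&flag) != 1) ;  /* wait for ping */
        atomic_store(&flag, 0);            /* answer with pong */
    }
    return NULL;
}

int main(void) {
    int peer = 4;          /* assumption: core 4 sits in a different CCX */
    pthread_t t;
    pin(0);
    pthread_create(&t, NULL, pong, &peer);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {
        atomic_store(&flag, 1);            /* ping */
        while (atomic_load(&flag) != 0) ;  /* wait for pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("round trip: %.0f ns\n", ns / ROUNDS);
    return 0;
}

Run it once with peer set to a core in the same CCX (assuming linear core numbering, peer = 1) and once with peer = 4 or further, and compare the round-trip times.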
Does this have to be a bad thing? Most of the time, no. First, in most applications only a low percentage of accesses are served by the L3 at all. Second, every core in a CCX has no less than 4 MB of L3 available, far more than an Intel core's 1.375 MB slice, which gives the prefetchers plenty of room to make sure data is in place before it is needed.
Database performance may still take a hit, however. Keeping most of the indexes in cache improves performance, and OLTP accesses tend to be very random; on top of that, the relatively slow traffic over the central hub slows down synchronization. Intel's claim that the OLTP benchmark HammerDB runs 60% faster on the 28-core Xeon 8280 than on the EPYC 7601 attests to this.
But these high-end CPUs mostly run heavily parallelized applications: microservices, Docker containers, virtual machines, map/reduce over smaller chunks of data, and parallel HPC jobs. In almost all of those cases, 16 MB of L3 per 4 cores is plenty.
Admittedly, an 8-core virtual machine has to span two CCXs, which could be a minor problem and cost a little performance.
In short, by not moving to a larger 8-core CCX, AMD leaves some performance on the table, and we look forward to seeing how the platform evolves.
Memory subsystem: TinyMemBench
We used Andrei's custom memory latency test to double-check our LMBench numbers.
The latency tool also measures bandwidth, and it makes very clear that we are accessing DRAM once we go beyond 16 MB. When Andrei compared the results with his Ryzen 9 3900X numbers, he noted:
Some of the prefetchers look like they have been tuned down on Rome compared with Ryzen 3000; they are not as aggressive as on the consumer parts. We believe AMD made this choice because quite a few applications (Java workloads, HPC) suffer when the prefetchers eat up too much bandwidth. Dialing back the aggressiveness of Rome's prefetchers can therefore help performance in those tests.

Average time is measured for random memory accesses within buffers of different sizes. The larger the buffer, the greater the relative contribution of TLB misses, L1/L2 cache misses, and DRAM accesses. All of these numbers represent extra time that must be added on top of the L1 cache latency (4 cycles).
We tested dual random reads because we wanted to see how well the memory subsystem handles multiple outstanding read requests.
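To make concrete what a "dual random read" is, here is a sketch of the idea (our own reconstruction, not TinyMemBench's actual code): the chain setup is the same random cycle as in the latency sketch above, but two independent chains are chased in one loop, so the core can keep two misses in flight at once.

/* Dual random read sketch, in the spirit of TinyMemBench's test.
 * If the memory subsystem serves requests in parallel, the average
 * time per access drops well below the single-chain latency. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Link a buffer into one random cycle, as in the earlier latency sketch */
static void **make_chain(size_t n) {
    void **buf = malloc(n * sizeof(void *));
    size_t *idx = malloc(n * sizeof(size_t));
    if (!buf || !idx) exit(1);
    for (size_t i = 0; i < n; i++) idx[i] = i;
    for (size_t i = n - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
        size_t j = rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        buf[idx[i]] = &buf[idx[(i + 1) % n]];
    void **start = &buf[idx[0]];
    free(idx);
    return start;
}

int main(void) {
    size_t n = (64UL << 20) / sizeof(void *);    /* 64 MB per chain */
    void **p1 = make_chain(n);
    void **p2 = make_chain(n);
    size_t iters = 1UL << 24;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++) {
        p1 = (void **)*p1;     /* the two loads are independent of each */
        p2 = (void **)*p2;     /* other, so both can be in flight at once */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg per access: %.1f ns (%p %p)\n",
           ns / (2.0 * iters), (void *)p1, (void *)p2);
    return 0;
}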
The figure shows how the EPYC 7742's larger L3 gives it lower latency than the EPYC 7601 between 4 and 16 MB. The L3 inside a CCX is also very fast (2-8 MB) compared with Intel's mesh (8280) and ring (E5) topologies.
However, once we go beyond 16 MB, Intel has a clear advantage thanks to its slower but much larger shared L3. When we tested the new EPYC in a NUMA-friendlier configuration (NPS4, i.e. four NUMA nodes per socket), latency at 64 MB dropped from 129 ns to 119 ns. To quote AMD's engineers:
"In NPS4, the NUMA domains are reported to software in such a way that a chiplet always accesses the near (2-channel) DRAM. In NPS1 the 8 channels are hardware-interleaved, and there is more latency to reach the further ones. It varies by pairs of DRAM channels, with the furthest pair roughly 20-25 ns away from the nearest (depending on the various speeds). Generally speaking, relative to the physically closest pair of channels, the latencies of the other pairs are +~6-8 ns, +~8-10 ns, and +~20-25 ns."