On server CPU interconnection efficiency

The new AMD SP3-socket processor is also called an SoC because it integrates functions of the traditional southbridge chipset, such as SATA controllers. Of course, a specific motherboard design can still attach a southbridge over PCIe.
I'm writing this largely as a follow-up to my March essay, "The Ideals and Realities of Going Beyond Xeon? AMD's Naples Server." At that time EPYC, which was released in the past two days, was not yet well understood, and the essay simply compared AMD's new-generation CPU with Intel Xeon (QPI/UPI) on paper, in interconnect bandwidth and other respects.

An old friend also left a comment at the bottom of that article. Having seen more information today, I can finally give an answer and go on to discuss how efficiently EPYC processors communicate between sockets and between cores, not just the numbers on paper :)
A slip in AMD's slides? Are there 4 or 6 die-to-die links?
In the picture above, to suit a WeChat article, I cropped part of a slide taken from an overseas website, and it finally shows the figures I was looking for. The Fabric bandwidth between the four dies in one CPU (within the same socket) is 42.6GB/s per 4B (32-bit wide) link. That figure is bidirectional; I prefer to think of it as 21.3GB/s in each direction, full-duplex.
So is the total Fabric bandwidth between the 4 dies 4 x 42.6≈170GB/s (85GB/s full-duplex)? I put a question mark here because I am not sure whether to multiply by 4 or by 6. Going by AMD's diagram, which draws the links in a figure-8 shape, 4 seems reasonable, but if those were the only links on the actual chip, the interconnect would be somewhat inefficient. I'll show a diagram below and continue the discussion.
Two other sets of numbers, by the way. First, the Fabric bandwidth across CPU sockets is 37.9GB/s per 2B (16-bit) link, or about 18.9GB/s in each direction, full-duplex. We know that in single-socket servers these Fabric links can be reconfigured as PCIe, so the rate used for the inter-CPU interconnect appears to be higher than the 8GT/s of PCIe 3.0.
The overall Fabric bandwidth between sockets is 4 x 37.9≈152GB/s (76GB/s full-duplex). The x4 is not in doubt here; a clear topology diagram follows later, which I will use to discuss efficiency. The figure above also lists memory bandwidth, but I won't talk about that today.
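The cross-socket figures above can be checked with a few lines of arithmetic. This is only a sketch: the function names are my own, the 37.9GB/s and 2B values come from the slide, and two things are assumptions on my part: that no encoding or protocol overhead is subtracted when deriving the transfer rate, and that the 16-bit link is compared against a PCIe 3.0 x16 link with 128b/130b encoding.

```python
# Figures from the AMD slide (bidirectional GB/s per cross-socket link).
CROSS_SOCKET_LINK_BIDIR_GBPS = 37.9
CROSS_SOCKET_LINK_WIDTH_BYTES = 2   # "2B", i.e. 16 bits wide
CROSS_SOCKET_LINKS = 4


def per_direction(bidir_gbps):
    """Split a bidirectional figure into one direction of a full-duplex link."""
    return bidir_gbps / 2


def implied_transfer_rate_gt(bidir_gbps, width_bytes):
    """Transfers per second (GT/s) needed to carry the per-direction bandwidth.

    Assumption: no encoding or protocol overhead is taken into account.
    """
    return per_direction(bidir_gbps) / width_bytes


def pcie3_per_direction_gbps(lanes):
    """PCIe 3.0 bandwidth per direction: 8 GT/s per lane, 128b/130b encoding."""
    return 8.0 * lanes * (128 / 130) / 8


link_one_way = per_direction(CROSS_SOCKET_LINK_BIDIR_GBPS)
total_bidir = CROSS_SOCKET_LINKS * CROSS_SOCKET_LINK_BIDIR_GBPS
rate_gt = implied_transfer_rate_gt(
    CROSS_SOCKET_LINK_BIDIR_GBPS, CROSS_SOCKET_LINK_WIDTH_BYTES
)
pcie3_x16 = pcie3_per_direction_gbps(16)

print(f"per link, one direction : {link_one_way:.2f} GB/s")
print(f"all four links, bi-dir  : {total_bidir:.1f} GB/s")
print(f"implied transfer rate   : {rate_gt:.3f} GT/s (PCIe 3.0 runs at 8 GT/s)")
print(f"PCIe 3.0 x16, one way   : {pcie3_x16:.2f} GB/s")
```

Under these assumptions the per-direction figure of one Fabric link exceeds what a PCIe 3.0 x16 link can carry, which is consistent with the observation that the inter-CPU links run faster than 8GT/s.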
This is the architecture of the Dell PowerEdge R830 server that I used in my series "Intel Optane P4800X SSD Review".
The main difference from the E7 is the number of QPI links: three per CPU has been reduced to two, so the two CPUs on a diagonal need two hops to communicate. There is an article online, "Several rounds of PK to help you choose a 'true four-socket' server!", which mentions that technically minded peers have found some low-end Xeon E7 four-socket servers whose motherboard designs are cut down to the point that, like the Xeon E5, they use only two QPI links.
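The two-hop point can be illustrated with a small graph search. This is a minimal sketch, not a model of any real board: the socket numbering 0 to 3 and the link lists are hypothetical, chosen so that two links per CPU form a square and three links per CPU add both diagonals.

```python
from collections import deque


def hops(links, src, dst):
    """Breadth-first search: fewest link traversals from src to dst."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    distance = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return distance[node]
        for nxt in neighbours.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    raise ValueError("no path between the two sockets")


# Two QPI links per CPU: the four sockets form a square.
two_links_per_cpu = [(0, 1), (1, 2), (2, 3), (3, 0)]
# Three QPI links per CPU: the square plus both diagonals.
three_links_per_cpu = two_links_per_cpu + [(0, 2), (1, 3)]

print("CPU0 -> CPU2 with two links per CPU  :", hops(two_links_per_cpu, 0, 2))
print("CPU0 -> CPU2 with three links per CPU:", hops(three_links_per_cpu, 0, 2))
```

With only two links per CPU the diagonal pair has to route through a neighbour; with three links every socket is one hop from every other.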
Although today's discussion is about links "on the chip", it has something in common with the four-socket case above. The following slide from AMD, I think, shows the correct interconnect.
Look at the connections between the 4 eight-core dies: doesn't it look like a Xeon E7? Careful readers may notice that the total die-to-die interconnect bandwidth appears to be based on 6 x 42GB/s bi-directional. So what went wrong in the earlier official slide?
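The "4 or 6" question comes down to counting links. As a rough check, the sketch below counts the links in a fully connected set of four dies and in a layout with only the four outer edges, then multiplies by the 42.6GB/s per-link figure from the earlier slide (this slide rounds it to 42). The die labels and variable names are hypothetical.

```python
from itertools import combinations

DIE_LINK_BIDIR_GBPS = 42.6                 # per 32-bit link, from the AMD slide
DIES = ["die0", "die1", "die2", "die3"]    # hypothetical labels

# Every die wired directly to every other die: one link per pair of dies.
fully_connected = list(combinations(DIES, 2))
# Only the four outer edges, with no diagonals.
outer_edges_only = [(DIES[i], DIES[(i + 1) % 4]) for i in range(4)]


def total_bidir_gbps(links):
    """Aggregate bidirectional bandwidth over a list of links."""
    return len(links) * DIE_LINK_BIDIR_GBPS


for name, links in [("outer edges only", outer_edges_only),
                    ("fully connected", fully_connected)]:
    total = total_bidir_gbps(links)
    print(f"{name:17s}: {len(links)} links, "
          f"{total:.1f} GB/s bi-dir, {total / 2:.1f} GB/s per direction")
```

Four dies that each talk directly to the other three need 6 links, which matches the 6 x 42GB/s on this slide; 4 links would leave the diagonal pairs without a direct connection.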
I can't help it: I like to dig into details and won't rest until the truth is clear. I just said that AMD's EPYC is a bit like an Intel four-socket interconnect. Is this an advantage? Not necessarily... It is understood that AMD chose to build 8-core dies and then combine them in an MCM package because of limited wafer yield. And of course another thing that comes to mind is that desktop and server parts can share the die design to reduce R&D costs.
Intel Xeon Scalable: a grid interconnect replaces the ring bus
In Skylake, each cache slice appears to be local to one core. One rule of thumb: communication inside a die is generally more efficient than communication across dies.
Above is a schematic of the upcoming Xeon SP (Skylake architecture). You can see that the CPU cores are connected by a grid, which is quite different from the previous ring bus. A similar architecture was already adopted on the Xeon Phi code-named KNL, and I feel it stays relatively efficient at high core counts.
The current Intel Xeon design is shown above; I discussed it at the end of my Xeon E5-2600 V4 test data article. The 24-core design was enabled up to a maximum of 22 cores on V4, compared with 28 cores in the first generation of Xeon SP.
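To give a feel for why a grid scales better than a ring as core counts grow, here is a deliberately simplified comparison of average hop counts. It assumes a single bidirectional ring with 28 stops and a hypothetical 4 x 7 grid with one core per node; real Xeon layouts (multiple bridged rings, memory controller and I/O stops) are more complicated, so treat the output as an illustration of the trend only.

```python
def ring_average_hops(stops):
    """Mean shortest distance between distinct stops on a bidirectional ring."""
    total = 0
    for a in range(stops):
        for b in range(stops):
            if a != b:
                d = abs(a - b)
                total += min(d, stops - d)   # go whichever way round is shorter
    return total / (stops * (stops - 1))


def grid_average_hops(rows, cols):
    """Mean Manhattan distance between distinct nodes of a rows x cols grid."""
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    total = sum(
        abs(r1 - r2) + abs(c1 - c2)
        for (r1, c1) in nodes
        for (r2, c2) in nodes
    )
    return total / (len(nodes) * (len(nodes) - 1))


# 28 cores: one simplified ring versus a hypothetical 4 x 7 grid.
print(f"ring of 28 stops: {ring_average_hops(28):.2f} hops on average")
print(f"4 x 7 grid      : {grid_average_hops(4, 7):.2f} hops on average")
```

In this toy model the grid needs roughly half as many hops on average as the ring at 28 cores, and the gap widens as the core count rises.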
AMD EPYC dual-socket interconnect efficiency: bandwidth is not the same as efficiency
Before I saw this picture, I even considered drawing it by hand. The interconnect between the two CPUs is actually made up of four die-to-die cross-socket links. Let's not look only at the bandwidth numbers, because each EPYC die communicates most efficiently only with the one die in the other CPU socket that it is directly connected to; reaching the other three dies there takes 2 hops. This relatively complex behavior does not exist on the Intel platform, where two Xeons can be joined by two QPI/UPI links.
Take the example I marked in red, Core 1 and Core 8: if the dies inside a package did not have those two red diagonal links between them, the path would take 3 hops rather than 2.
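The 2-hop and 3-hop cases can be reproduced with a small model of the two-socket topology. This is a sketch under assumptions: the die names A0 to A3 and B0 to B3 are hypothetical, and each die is assumed to have exactly one cross-socket link to the same-numbered die in the other package, which is how I read the four cross-socket links in the slide.

```python
from collections import deque
from itertools import combinations


def build_links(with_diagonals):
    """Links for two packages (A and B) of four dies each."""
    links = []
    for socket in "AB":
        dies = [f"{socket}{i}" for i in range(4)]
        if with_diagonals:
            links += list(combinations(dies, 2))      # 6 links per package
        else:
            links += [(dies[i], dies[(i + 1) % 4]) for i in range(4)]  # 4 links
    # Assumed pairing of the four cross-socket links: A0-B0, A1-B1, ...
    links += [(f"A{i}", f"B{i}") for i in range(4)]
    return links


def hop_counts(links, src):
    """Breadth-first search from src; returns {die: fewest hops}."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    distance = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance


def worst_case_hops(links):
    """Largest hop count between any two dies in the system."""
    dies = {die for link in links for die in link}
    return max(max(hop_counts(links, die).values()) for die in dies)


from_a0 = hop_counts(build_links(with_diagonals=True), "A0")
print("A0 to the other socket:", {d: h for d, h in sorted(from_a0.items()) if d[0] == "B"})
print("worst case with diagonals   :", worst_case_hops(build_links(True)))
print("worst case without diagonals:", worst_case_hops(build_links(False)))
```

In this model a die reaches one remote die in a single hop and the other three in two, and removing the diagonals inside each package pushes the worst case to three hops.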
A little joke: some friends say that AMD is a "PPT company", and sure enough, the slides compare the socket bandwidth figure with Intel's existing Xeon E5 :) Finally, credit where credit is due. A single-socket AMD EPYC can provide 128 PCIe lanes, which is the best expandability available, and peers say it is well suited to attaching NVMe SSDs. But I also have a small question: how much CPU does it take to keep 24-32 M.2/U.2 SSDs running together? I mean in real application environments rather than under a specific benchmark. And if you only want capacity, isn't SATA cheaper?
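For scale, the lane budget for that many drives is easy to work out. The sketch below assumes each M.2/U.2 NVMe SSD uses a x4 link and that the lanes run at PCIe 3.0 speed (8GT/s, 128b/130b encoding); both are my assumptions, and the names are hypothetical.

```python
EPYC_PCIE_LANES = 128       # single-socket figure from the text
LANES_PER_NVME_SSD = 4      # assumption: every M.2/U.2 drive uses a x4 link
# Assumption: PCIe 3.0, 8 GT/s per lane with 128b/130b encoding,
# which is about 0.985 GB/s per lane in each direction.
PCIE3_GBPS_PER_LANE = 8.0 * (128 / 130) / 8


def lanes_needed(drives):
    """PCIe lanes consumed by the given number of x4 drives."""
    return drives * LANES_PER_NVME_SSD


def max_drives():
    """Drives that fit if every lane goes to storage."""
    return EPYC_PCIE_LANES // LANES_PER_NVME_SSD


def aggregate_link_gbps(drives):
    """Upper bound on one-direction traffic if every drive saturated its link."""
    return lanes_needed(drives) * PCIE3_GBPS_PER_LANE


for drives in (24, 32):
    print(f"{drives} SSDs: {lanes_needed(drives)} of {EPYC_PCIE_LANES} lanes, "
          f"up to {aggregate_link_gbps(drives):.0f} GB/s of link bandwidth")
print("maximum x4 drives on one socket:", max_drives())
```

The link bandwidth here is only an upper bound. It says nothing about how many cores are needed to drive that much I/O in a real workload, which is exactly the open question.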

