Microsoft’s new AI-focused Azure servers are powered by AMD’s MI300X datacenter GPUs, but are paired with Intel’s Xeon Sapphire Rapids CPUs. AMD’s flagship fourth-generation EPYC Genoa CPUs are powerful, but Sapphire Rapids appears to have a couple of key advantages when it comes to pushing along AI compute GPUs. It’s not just Microsoft choosing Sapphire Rapids either, as Nvidia also seems to prefer it over AMD’s current-generation EPYC chips.
There are likely several factors that convinced Microsoft to go with Intel’s Sapphire Rapids instead of AMD’s Genoa, but support for Intel’s Advanced Matrix Extensions (AMX) instructions could be among the most important. According to Intel, these instructions are tailored to AI and machine learning tasks and can accelerate them by up to seven times.
While Sapphire Rapids isn’t particularly efficient and has worse multi-threaded performance than Genoa, its single-threaded performance is quite good for some workloads. That benefit isn’t specific to AI; it’s a general advantage in some types of compute.
It’s also worth noting that servers using Nvidia’s datacenter-class GPUs go with Sapphire Rapids too, including Nvidia’s own DGX H100 systems. Nvidia’s CEO Jensen Huang said the “excellent single-threaded performance” of Sapphire Rapids was a specific reason why he wanted Intel’s CPUs for the DGX H100 rather than AMD’s.
The new Azure instances also feature Nvidia’s Quantum-2 CX7 InfiniBand switches, bringing together the hardware of all three tech giants. That goes to show that in the cutting-edge world of AI, companies want the best overall hardware for the job and aren’t particularly picky about who makes it, regardless of rivalries.
With eight MI300X GPUs containing 192GB of HBM3 memory each, these AI-oriented Azure instances offer a combined 1,536GB of VRAM, which is crucial for training AI. All this VRAM was likely a big reason why Microsoft selected MI300X instead of Nvidia’s Hopper GPUs. Even the latest and greatest H200 chip only has 141GB of HBM3e per GPU, significantly less than the MI300X’s 192GB.
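The memory arithmetic above is easy to check. The short Python sketch below uses only the per-GPU capacities and the eight-GPU count quoted in this article; the eight-GPU H200 configuration is a hypothetical comparison point rather than a described product, and the function name is purely illustrative.

```python
# Per-GPU memory capacities in GB, as quoted in the article.
MI300X_GB = 192
H200_GB = 141

# The Azure instances described here use eight MI300X GPUs.
GPUS_PER_INSTANCE = 8


def total_memory_gb(per_gpu_gb: int, gpu_count: int = GPUS_PER_INSTANCE) -> int:
    """Return the combined GPU memory of one instance, in GB."""
    return per_gpu_gb * gpu_count


mi300x_total = total_memory_gb(MI300X_GB)
# Hypothetical eight-GPU H200 node, used only as a comparison point.
h200_total = total_memory_gb(H200_GB)

print(f"8x MI300X: {mi300x_total} GB")
print(f"8x H200:   {h200_total} GB")
print(f"Difference: {mi300x_total - h200_total} GB")
print(f"Per-GPU advantage: {(MI300X_GB / H200_GB - 1) * 100:.0f}%")
```

Under these assumptions, the MI300X configuration comes out 408GB ahead, or roughly a third more memory per GPU.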
Microsoft also praised AMD’s open-source ROCm software. AMD has been hard at work bringing ROCm to parity with Nvidia’s CUDA software stack, which largely dominates professional and server GPU computing. That Microsoft is putting its faith in ROCm is perhaps a sign that AMD’s hardware-software ecosystem is improving rapidly.