AMD’s Instinct MI200 GPU Uses Multi-Chip Design for Exascale Supercomputer

(Image credit: AMD)

A recent Linux patch posted by AMD reveals that the company’s Instinct MI200 next-generation compute GPU, codenamed ‘Aldebaran,’ will use a multi-chip module (MCM) design. That means the GPU will come with two dies in a single chip package instead of the single die we’re accustomed to with standard GPUs. The accelerator is based on the CDNA 2 architecture and is set to be used for the Frontier exascale supercomputer due to be delivered this year.  

“On Aldebaran, only primary die fetches valid power data,” an AMD Linux patch reads. “Show power/energy values as 0 on secondary die. Also, power limit should not be set through secondary die.” 

AMD has a patent called ‘GPU Chiplets Using High-Bandwidth Crosslinks,’ as noted by Coelacanth-dream, so AMD has been working on its multi-chip compute GPU technology for some time. Meanwhile, according to the Linux patch, AMD’s MCM GPU technology requires one of the chiplets to become the primary and manage secondary chiplets, which helps the multi-chip GPU look and behave like one big processor to the host system.
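
The behavior the patch describes can be sketched in a few lines of Python. This is only an illustrative model, not AMD's driver code: the `Die` class, its field names, and the wattage figures are all hypothetical, and only the three rules (valid power data on the primary die, zeros on the secondary, no power-limit writes through the secondary) come from the patch text quoted above.

```python
from dataclasses import dataclass


@dataclass
class Die:
    """Hypothetical model of one die in a two-die package (not a real driver type)."""
    index: int
    is_primary: bool
    sensor_power_watts: float = 0.0   # assumed raw sensor value, for illustration only
    power_limit_watts: float = 0.0

    def read_power(self) -> float:
        # Per the patch: only the primary die fetches valid power data,
        # so a secondary die reports 0 regardless of its raw value.
        return self.sensor_power_watts if self.is_primary else 0.0

    def set_power_limit(self, watts: float) -> bool:
        # Per the patch: the power limit should not be set through a secondary die.
        if not self.is_primary:
            return False
        self.power_limit_watts = watts
        return True


primary = Die(index=0, is_primary=True, sensor_power_watts=300.0)
secondary = Die(index=1, is_primary=False, sensor_power_watts=300.0)

print(primary.read_power(), secondary.read_power())
print(primary.set_power_limit(250.0), secondary.set_power_limit(250.0))
```

In this sketch, a monitoring tool that sums power across both dies would still get the package total from the primary die alone, which is presumably why the secondary reports zero rather than a duplicate value.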

Making a multi-chip compute GPU is akin to making a multi-core MCM CPU, like the Ryzen 5000 or Threadripper processors. Firstly, bringing dies closer together increases compute efficiency: AMD’s Infinity architecture provides a high-performance interconnect that promises to bring the efficiency of two dies close to that of a single die. Secondly, it is easier to mass-produce multiple small chips on an advanced process technology than large ones, as a smaller die is less likely to contain a defect and therefore yields better.
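
The yield argument can be made concrete with the simple Poisson yield model, in which the fraction of defect-free dies is exp(-area × defect density). The sketch below is only an illustration under stated assumptions: the die areas and the defect density are made-up round numbers, not figures from AMD or its foundry, and real fabs use more elaborate models.

```python
import math


def poisson_yield(die_area_cm2: float, defects_per_cm2: float) -> float:
    """Fraction of dies expected to be defect-free under the Poisson yield model."""
    return math.exp(-die_area_cm2 * defects_per_cm2)


# Assumed, illustrative values (not AMD or foundry data).
DEFECT_DENSITY = 0.1   # defects per square centimeter
BIG_DIE_AREA = 7.0     # one large monolithic die, in cm^2
SMALL_DIE_AREA = 3.5   # each of two smaller dies, in cm^2

big_yield = poisson_yield(BIG_DIE_AREA, DEFECT_DENSITY)
small_yield = poisson_yield(SMALL_DIE_AREA, DEFECT_DENSITY)

print(f"monolithic die yield: {big_yield:.1%}")
print(f"half-size die yield:  {small_yield:.1%}")
```

Because dies can be tested before they are packaged, a higher per-die yield means a larger share of each wafer ends up in sellable products, even though every package needs two good dies.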

While multi-chip graphics subsystems have never been truly popular, since many graphics workloads do not scale well (and some do not scale at all), multiple compute GPUs per server are quite common because supercomputing and datacenter workloads are highly parallel and scale well.

The devil is in the software details: applications have to be coded to extract the utmost performance from these types of architectures. Still, broad industry support for MCM seems to be coming to the fore.

Intel’s Xe-HP and Xe-HPC GPUs also rely on MCM designs, so AMD is not alone with its MCM GPU plans. Furthermore, Nvidia’s upcoming Hopper compute GPUs are rumored to feature multiple dies, too. 

AMD’s partner HPE confirmed that the forthcoming Frontier supercomputer, which is expected to be the world’s fastest with a peak performance of 1.5 ExaFLOPS, will use AMD’s CPU codenamed Trento (most probably a version of Milan with extra cache and/or other enhancements) and the Instinct MI200 accelerator.