Server accelerator AMD Instinct MI100: Without Radeon, but with 11.5 FP64 TFlops

Source: Heise.de added 16th Nov 2020

The AMD Instinct MI100 is the first compute accelerator card with the new CDNA architecture and is manufactured by TSMC in a 7-nanometer process. With the PCIe 4.0 card and its 32 GByte of HBM2 memory, AMD wants to compete with Nvidia's A100. To that end, AMD has not only revised the compute units known from the GCN architecture but also built significantly more of them into the chip. The card is to be available to system integrators from 6400 US dollars, clearly undercutting Nvidia's A100 PCIe version, which is available from 10,700 euros.

A lot of flops, a lot of honor

When it comes to technical specifications, AMD is once again going all out, probably also to put competitor Nvidia and its A100 accelerators in their place in some areas.

On the MI100, the CDNA chip "Arcturus" has 120 active compute units (CUs). Although AMD confirmed on request that this is the full configuration, die shots of the chip suggest that there are 8 more CUs. We have asked AMD again for clarification and are currently waiting for the answer.

Block diagram of the AMD Instinct MI100

(Image: AMD)

As with the closely related graphics chips with GCN (Graphics Core Next) architecture, each of the 120 CUs has 64 shader cores, which results in 7680 ALUs for the entire chip. Together with a maximum boost clock rate of 1502 MHz, this yields a throughput of 23.07 TFlops at single precision (FP32). As befits an HPC accelerator, the FP64 rate is half of that, so 11.5 TFlops. This is not only above the 10-TFlops mark but also around 19 percent above the comparable value of Nvidia's A100 accelerator in SXM4 format.
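The peak figures can be checked with a short calculation. The following is a minimal sketch, assuming the 1502 MHz boost clock listed in the comparison table and the usual convention that one fused multiply-add counts as two floating-point operations; the function name `peak_tflops` is purely illustrative.

```python
# Peak throughput sketch: ALUs x 2 FLOPs per fused multiply-add x clock.
CUS = 120                 # active compute units on the MI100
SHADERS_PER_CU = 64       # shader cores (ALUs) per CU
BOOST_CLOCK_HZ = 1502e6   # boost clock as listed in the comparison table
FLOPS_PER_FMA = 2         # assumption: one FMA counts as two FLOPs
A100_FP64_TFLOPS = 9.7    # Nvidia A100 (SXM) value from the comparison table


def peak_tflops(cus, shaders_per_cu, clock_hz, rate=1.0):
    """Return peak TFlops; `rate` scales relative to FP32 (0.5 for FP64)."""
    alus = cus * shaders_per_cu
    return alus * FLOPS_PER_FMA * clock_hz * rate / 1e12


fp32 = peak_tflops(CUS, SHADERS_PER_CU, BOOST_CLOCK_HZ)
fp64 = peak_tflops(CUS, SHADERS_PER_CU, BOOST_CLOCK_HZ, rate=0.5)
lead_percent = (fp64 / A100_FP64_TFLOPS - 1) * 100

print(f"FP32: {fp32:.2f} TFlops, FP64: {fp64:.2f} TFlops")
print(f"FP64 lead over the A100: {lead_percent:.0f} percent")
```

Under these assumptions the sketch gives 23.07 TFlops for FP32, 11.54 TFlops for FP64 and a lead of about 19 percent over the A100.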

Arcturus die shot: CDNA accelerator with 120 active compute units and four HBM2 stacks

(Image: AMD)

As with the MI50 and MI60, the compute beast is fed by four HBM2 stacks. These hold 8 GByte each and are clocked at 1200 MHz, which is good for a transfer rate of 1.228 TByte/s. An 8 MByte level 2 cache (6 TByte/s) is supposed to cushion the memory accesses. From the registers to the HBM2, everything is protected by ECC (SECDED).
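The transfer rate follows from the bus width and the memory clock. The following is a minimal sketch, assuming the 4096-bit interface listed in the comparison table and two transfers per clock cycle (double data rate); the function name is illustrative.

```python
# HBM2 bandwidth sketch: bus width in bytes x clock x transfers per clock.
STACKS = 4
GBYTE_PER_STACK = 8
BUS_WIDTH_BITS = 4096      # memory interface as listed in the comparison table
MEMORY_CLOCK_HZ = 1200e6
TRANSFERS_PER_CLOCK = 2    # assumption: double data rate


def hbm_bandwidth_gbyte_s(bus_bits, clock_hz, transfers_per_clock):
    """Return the peak memory bandwidth in GByte/s."""
    return bus_bits / 8 * clock_hz * transfers_per_clock / 1e9


capacity_gbyte = STACKS * GBYTE_PER_STACK
bandwidth = hbm_bandwidth_gbyte_s(BUS_WIDTH_BITS, MEMORY_CLOCK_HZ, TRANSFERS_PER_CLOCK)

print(f"Capacity: {capacity_gbyte} GByte")
print(f"Bandwidth: {bandwidth:.1f} GByte/s")
```

With these values the result is 32 GByte and 1228.8 GByte/s, i.e. the roughly 1.2 TByte/s quoted for the card.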

In addition to PCI Express 4.0, every MI100 card has three Infinity Fabric links at 92 GByte/s each, so 276 GByte/s in total. This makes directly connected groups of four MI100 cards possible, which can form a coherent memory space.
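Why three links fit a group of four can be illustrated with a small calculation. The following is a sketch only, assuming that "directly connected" means every card has its own link to each of the other three; the article does not spell out the topology, and the helper name is hypothetical.

```python
from itertools import combinations

LINKS_PER_CARD = 3
GBYTE_S_PER_LINK = 92
GROUP_SIZE = 4


def fully_connected_links(n_cards):
    """Return every card pair of a fully connected group (assumed topology)."""
    return list(combinations(range(n_cards), 2))


links = fully_connected_links(GROUP_SIZE)
# Count how many links end at each card.
links_used = {card: sum(card in pair for pair in links) for card in range(GROUP_SIZE)}
bandwidth_per_card = LINKS_PER_CARD * GBYTE_S_PER_LINK

print(f"Links in the group: {len(links)}")
print(f"Links used per card: {links_used}")
print(f"Link bandwidth per card: {bandwidth_per_card} GByte/s")
```

Under this assumption each card uses exactly its three links, which would explain why the group size stops at four.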

Matrix Core Engines: A bit of Tensor

The compute units of the MI100 are similar to those of the previous Graphics Core Next generation, but AMD has upgraded them further for compute use. In order to achieve a higher throughput in matrix-matrix multiplications, AMD has expanded the circuits and register ports and calls the result the Matrix Core Engine.

AMD takes a different approach than Nvidia does with its Tensor Cores. The Matrix Core Engines work consistently with full FP32 accuracy. However, their maximum throughput is lower and they are not suitable for FP64 calculations. It is therefore difficult to compare the maximum throughput of the two approaches. Anyone who consistently depends on full FP32 accuracy is better served by AMD; for anyone who can also use the alternative TF32 format or lower accuracy, the Nvidia accelerators promise more performance.

What both approaches have in common is that they support the BFloat16 format (BF16), which combines the value range of FP32 (8-bit exponent) with the 16-bit width of FP16 (7-bit mantissa, plus 1 sign bit) and has established itself in AI training as a de facto alternative to full FP32. In the CDNA white paper, AMD does state a 10-bit mantissa and a 5-bit exponent for BFloat16, but that actually corresponds to FP16.
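The relationship between FP32 and BF16 can be made concrete in a few lines. The following is a minimal sketch that converts by simple truncation, i.e. by keeping the upper 16 bits of the FP32 bit pattern; real hardware typically rounds instead, and all function names here are illustrative.

```python
import struct


def fp32_bits(x):
    """Return the 32-bit IEEE 754 pattern of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def to_bfloat16_bits(x):
    """Truncate FP32 to BF16 by keeping the upper 16 bits (no rounding)."""
    return fp32_bits(x) >> 16


def from_bfloat16_bits(b):
    """Expand a BF16 bit pattern back to a float by zero-filling the low bits."""
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]


def split_bfloat16(b):
    """Split a BF16 pattern into 1 sign bit, 8 exponent bits, 7 mantissa bits."""
    return b >> 15, (b >> 7) & 0xFF, b & 0x7F


pi_bf16 = from_bfloat16_bits(to_bfloat16_bits(3.14159))
big_bf16 = from_bfloat16_bits(to_bfloat16_bits(1e38))

print(split_bfloat16(to_bfloat16_bits(1.0)))
print(pi_bf16)   # only about two to three decimal digits survive
print(big_bf16)  # still finite, while FP16 tops out at 65504
```

The example shows both properties mentioned above: a value around 1e38 remains representable thanks to the 8-bit exponent, while the 7-bit mantissa keeps only the leading digits of 3.14159.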

| | Instinct MI100 (PCIe) | A100 (SXM) | Tesla V100 | Tesla P100 |
|---|---|---|---|---|
| Manufacturer | AMD | Nvidia | Nvidia | Nvidia |
| GPU | CDNA Arcturus | A100 (Ampere) | GV100 (Volta) | GP100 (Pascal) |
| CUs / SMs | 120 | 108 | 80 | 56 |
| FP32 cores / SM | 64 | 64 | 64 | 64 |
| FP32 cores / GPU | 7680 | 6912 | 5120 | 3584 |
| FP64 cores / SM | 32 | 32 | 32 | 32 |
| FP64 cores / GPU | 3840 | 3456 | 2560 | 1792 |
| Matrix multiply engines / GPU (Matrix Core Engines / Tensor Cores) | 480 | 432 | 640 | – |
| GPU boost clock | 1502 MHz | N/A | 1455 MHz | 1480 MHz |
| Peak FP32 / FP64 TFlops | 23.07 / 11.54 | 19.5 / 9.7 | 15 / 7.5 | 10.6 / 5.3 |
| Peak Tensor Core TFlops | – | 156 (TF32) / 312 (TF32 Structural Sparsity) | 120 (Mixed Precision) | – |
| Peak Matrix Core Engine TFlops | 46.1 (FP32) | – | – | – |
| Peak FP16 / BF16 TFlops | 184.6 / 92.3 | 312 / 312 (624 / 624 Structural Sparsity) | 125 / 125 | 21.1 / – |
| Peak INT8 / INT4 TOps | 184.6 / 156.6 | 624 / 1248 (1248 / 2496 Structural Sparsity) | 62 / – | 21.1 / – |
| Memory interface | 4096-bit HBM2 | 5120-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 |
| Memory size | 32 GByte | 40 GByte | 16 GByte | 16 GByte |
| Memory transfer rate | 1.2 TByte/s | 1.55 TByte/s | 0.9 TByte/s | 0.73 TByte/s |
| TDP | 300 watts | 400 watts (SXM) | 300 watts | 300 watts |
| Transistors | N/A | 54 billion | 21.1 billion | 15.3 billion |
| GPU die size | N/A | 826 mm² | 815 mm² | 610 mm² |
| Manufacturing | 7 nm | 7 nm | 12 nm FFN | 16 nm FinFET+ |

AMD Instinct MI100 with a complex IF connection and solder points for up to three eight-pin connectors.

(Image: AMD)

Without Radeon, without displays

Following Nvidia's abandonment of the Tesla and Quadro names, AMD is now also changing the branding of its accelerator cards and removing Radeon from the product name. The card is simply called AMD Instinct MI100, where the number, unlike with earlier Instinct cards, no longer stands for the FP16 computing power.

In order to fit a lot of computing power into the TDP budget of 300 watts, AMD has, according to its own information, omitted from the first CDNA chip "Arcturus" many hardwired functions that a graphics card needs. These include the rasterization units, tessellator hardware, special graphics buffers, the blending units in the raster output stages and the display engine. The MI100 does not need them, and Crysis does not run on it either.

However, AMD has not removed the video engines, i.e. the specialized decoders and encoders. The reason: machine learning is often used to analyze video streams or for image recognition.

One of the first rack servers comes from Supermicro (Dell, HPE and Gigabyte also have similar products in their range). With the real cards, it is noticeable that only an eight-pin connector is sufficient.

(Image: AMD / Supermicro)

(csp)
