NVIDIA at SC23: H200 Accelerator with HBM3E and Jupiter Supercomputer for 2024


With faster and higher capacity HBM3E memory set to come online early in 2024, NVIDIA has been preparing its current-generation server GPU products to use the new memory. Back in August we saw NVIDIA's plans to release an HBM3E-equipped version of the Grace Hopper GH200 superchip, and now for the SC23 tradeshow, NVIDIA is announcing their plans to bring to market an updated version of the stand-alone H100 accelerator with HBM3E memory, which the company will be calling the H200.

Like its Grace Hopper counterpart, the purpose of the H200 is to serve as a mid-generation upgrade to the Hx00 product line by rolling out a version of the chip with faster and higher-capacity memory. Tapping the HBM3E memory that Micron and others are set to roll out, NVIDIA will be able to offer accelerators with better real-world performance in memory bandwidth-bound workloads, but also parts that can handle even larger workloads. This stands to be especially helpful in the generative AI space – which has been driving virtually all of the demand for H100 accelerators thus far – as the largest of the large language models can max out the 80GB H100 as it is.

Meanwhile, with HBM3E memory not shipping until next year, NVIDIA has been using the gap to announce HBM3E-updated parts at their leisure. Following this summer's GH200 announcement, it was only a matter of time until NVIDIA announced a standalone version of the Hx00 accelerator with HBM3E, and this week NVIDIA is finally making that announcement.

NVIDIA Accelerator Specification Comparison

                         H200                 H100                 A100 (80GB)
FP32 CUDA Cores          16896?               16896                6912
Tensor Cores             528?                 528                  432
Boost Clock              1.83GHz?             1.83GHz              1.41GHz
Memory Clock             ~6.5Gbps HBM3E       5.24Gbps HBM3        3.2Gbps HBM2e
Memory Bus Width         6144-bit             5120-bit             5120-bit
Memory Bandwidth         4.8TB/sec            3.35TB/sec           2TB/sec
VRAM                     141GB                80GB                 80GB
FP64 Vector              33.5 TFLOPS?         33.5 TFLOPS          9.7 TFLOPS
INT8 Tensor              1979 TOPS?           1979 TOPS            624 TOPS
FP16 Tensor              989 TFLOPS?          989 TFLOPS           312 TFLOPS
FP64 Tensor              66.9 TFLOPS?         66.9 TFLOPS          19.5 TFLOPS
Interconnect             NVLink 4             NVLink 4             NVLink 3
                         18 Links (900GB/sec) 18 Links (900GB/sec) 12 Links (600GB/sec)
GPU                      GH100 (814mm2)       GH100 (814mm2)       GA100 (826mm2)
Transistor Count         80B                  80B                  54.2B
TDP                      700W                 700W                 400W
Manufacturing Process    TSMC 4N              TSMC 4N              TSMC 7N
Interface                SXM5                 SXM5                 SXM4
Architecture             Hopper               Hopper               Ampere

Based on the same GH100 GPU as found in the original H100, the new HBM3E-equipped version of the H100 accelerator will be getting a new model number, H200, to set it apart from its predecessor and align it with the GH200 superchip (whose HBM3E version is not getting a distinct model number).

Looking at the specifications being disclosed today, the H200 essentially looks like the Hopper half of the GH200 as its own accelerator. The big difference here, of course, is swapping out HBM3 for HBM3E, which is allowing NVIDIA to boost both memory bandwidth and capacity – as well as NVIDIA enabling the 6th HBM memory stack, which was disabled in the original H100. This will increase the H200's memory capacity from 80GB to 141GB, and memory bandwidth from 3.35TB/second to what NVIDIA is preliminarily expecting to be 4.8TB/second – an approximately 43% increase in bandwidth.
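
For those who want to double-check the uplift math, here is a minimal Python sketch using the figures from the spec table above (TB = 10^12 bytes):

```python
# Back-of-the-envelope check of the H200's memory uplift over the H100.
h100_vram_gb, h200_vram_gb = 80, 141
h100_bw_tbps, h200_bw_tbps = 3.35, 4.8

print(f"Capacity uplift:  +{(h200_vram_gb / h100_vram_gb - 1) * 100:.0f}%")  # +76%
print(f"Bandwidth uplift: +{(h200_bw_tbps / h100_bw_tbps - 1) * 100:.0f}%")  # +43%
```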

Working backwards here based on total bandwidth and memory bus width, this indicates that the H200's memory will be running at approximately 6.5Gbps/pin, a roughly 25% frequency increase versus the original H100's 5.3Gbps/pin HBM3 memory. This is well below the memory frequencies that HBM3E is rated for – Micron wants to hit 9.2Gbps/pin – but since it's being retrofitted to an existing GPU design, it's not surprising to see that NVIDIA's current memory controllers don't have the same range.
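
The back-calculation itself is simple enough to sketch out. Keep in mind that NVIDIA's 4.8TB/second figure is itself rounded, so plugging it in lands a little under the ~6.5Gbps/pin estimate:

```python
# Back-calculating the per-pin data rate from quoted bandwidth and bus width.
bus_width_bits = 6144             # six 1024-bit HBM3E stacks
bandwidth_bytes_per_sec = 4.8e12  # NVIDIA's preliminary bandwidth figure

pin_rate_gbps = bandwidth_bytes_per_sec * 8 / bus_width_bits / 1e9
print(f"Implied pin rate: ~{pin_rate_gbps:.2f} Gbps/pin")  # ~6.25 Gbps/pin
```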

The H200 will also retain the GH200's unusual memory capacity of 141GB. The HBM3E memory itself physically has a capacity of 144GB – coming in the form of six 24GB stacks – however NVIDIA is holding back some of that capacity for yield reasons. As a result, customers don't get access to all 144GB on board, but compared to the H100 they are getting access to all six stacks, with the capacity and memory bandwidth benefits thereof.
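
The holdback math is trivial, but for completeness (the 3GB figure is simply the difference between the two capacities, not something NVIDIA has itemized):

```python
# Physical vs. exposed memory capacity on the H200.
stacks, gb_per_stack = 6, 24
physical_gb = stacks * gb_per_stack  # 144 GB across six HBM3E stacks
exposed_gb = 141                     # what customers actually get

print(f"Held back for yield: {physical_gb - exposed_gb} GB")  # 3 GB
```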

As we've noted in past articles, shipping a part with all 6 working stacks essentially requires a perfect chip, as the H100's specs very generously allowed NVIDIA to ship parts with a non-functional stack. So this is likely to be a lower volume, lower yielding part than comparable H100 accelerators (which are already in short supply).

Otherwise, nothing NVIDIA has disclosed so far indicates that the H200 will have better raw computational throughput than its predecessor. While real-world performance should improve from the memory changes, the 32 PFLOPS of FP8 performance that NVIDIA is quoting for an HGX H200 cluster is identical to the HGX H100 clusters available on the market today.
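
As a sanity check, the quoted 32 PFLOPS lines up with FP8 tensor throughput with structured sparsity, i.e. 2x the dense FP8 rate of 1,979 TFLOPS per GPU (numerically the same as the INT8 figure in the spec table). A minimal sketch, assuming that is the accounting NVIDIA is using:

```python
# Reconstructing the quoted 32 PFLOPS figure for an 8-way HGX H200.
gpus_per_hgx = 8
fp8_dense_tflops = 1979  # per GPU; FP8 runs at the same dense rate as INT8
sparsity_factor = 2      # assumed: NVIDIA quoting with structured sparsity

hgx_pflops = gpus_per_hgx * fp8_dense_tflops * sparsity_factor / 1000
print(f"HGX H200 FP8: ~{hgx_pflops:.1f} PFLOPS")  # ~31.7 PFLOPS, quoted as 32
```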

Finally, as with HBM3E-equipped GH200 systems, NVIDIA is expecting H200 accelerators to be available in the second quarter of 2024.

HGX H200 Announced: Compatible With H100 Systems

Alongside the H200 accelerator, NVIDIA is also announcing their HGX H200 platform, an updated version of the 8-way HGX H100 that uses the newer accelerator. The true backbone of NVIDIA's H100/H200 family, the HGX carrier boards house 8 SXM form factor accelerators linked up in a pre-arranged, fully-connected topology. The stand-alone nature of the HGX board allows it to be plugged into suitable host systems, allowing OEMs to customize the non-GPU parts of their high-end servers.

Given that HGX goes hand-in-hand with NVIDIA's server accelerators, the announcement of the HGX H200 is largely a formality. Still, NVIDIA is making sure to announce it at SC23, as well as making sure that HGX H200 boards are cross-compatible with H100 boards. So server builders can use the HGX H200 in their current designs, making this a relatively seamless transition.

Quad GH200 Announced: 4 GH200s Baked Into a Single Board

With NVIDIA now shipping both Grace and Hopper (and Grace Hopper) chips in volume, the company is also announcing some additional products using those chips. The latest of these is a 4-way Grace Hopper GH200 board, which NVIDIA is simply calling the Quad GH200.

Living up to its name, the Quad GH200 places four GH200 accelerators onto a single board, which can then be installed in larger systems. The individual GH200s are wired up to each other in an 8-chip, 4-way NVLink topology, with the idea being to use these boards as the building blocks for larger systems.

In practice, the Quad GH200 is the Grace Hopper counterpart to the HGX platforms. The inclusion of Grace CPUs technically makes each board independent and self-supporting, unlike the GPU-only HGX boards, but the need to connect them to host infrastructure remains unchanged.

A Quad GH200 node will offer 288 Arm CPU cores and a combined 2.3TB of high-speed memory. Notably, NVIDIA does not mention using the HBM3E version of the GH200 here (at least not initially), so these figures appear to be for the original, HBM3 version. Which means we're looking at 480GB of LPDDR5X per Grace CPU, and 96GB of HBM3 per Hopper GPU. Or a total of 1920GB of LPDDR5X and 384GB of HBM3 memory.
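
The per-board math works out as follows, as a quick sketch; the 72 cores per Grace CPU is implied by the 288-core total rather than quoted directly here:

```python
# Per-board totals for the Quad GH200 (original HBM3 version of GH200).
superchips = 4
cores_per_grace = 72        # implied by the 288-core total
lpddr5x_gb_per_grace = 480  # per Grace CPU
hbm3_gb_per_hopper = 96     # per Hopper GPU

print(f"CPU cores: {superchips * cores_per_grace}")          # 288
print(f"LPDDR5X:   {superchips * lpddr5x_gb_per_grace} GB")  # 1920 GB
print(f"HBM3:      {superchips * hbm3_gb_per_hopper} GB")    # 384 GB
total_tb = superchips * (lpddr5x_gb_per_grace + hbm3_gb_per_hopper) / 1000
print(f"Combined:  ~{total_tb:.1f} TB")                      # ~2.3 TB
```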

Jupiter Supercomputer: 24K GH200s at 18.2 Megawatts, Installing in 2024

Finally, NVIDIA is announcing a new supercomputer design win this morning with Jupiter. Ordered by the EuroHPC Joint Undertaking, Jupiter will be a new supercomputer built out of 23,762 GH200 superchips. Once it comes online, Jupiter will be the largest Hopper-based supercomputer announced thus far, and is the first one that is explicitly (and publicly) targeting standard HPC workloads as well as the low-precision, tensor-driven AI workloads that have come to define the first Hopper-based supercomputers.

Contracted to Eviden and ParTec, Jupiter is a showcase of NVIDIA technologies through and through. Based on the Quad GH200 node that NVIDIA is also announcing today, Grace CPUs and Hopper GPUs sit at the heart of the supercomputer. The individual nodes are backed by a Quantum-2 InfiniBand network, no doubt based on NVIDIA's ConnectX adapters.

The company is not disclosing specific core count or memory capacity figures, but since we know what a single Quad GH200 board offers, the math is simple enough. At the top end (assuming no salvaging/binning for yield reasons), this would be 23,762 Grace CPUs, 23,762 Hopper H100-class GPUs, and roughly 10.9 PB of LPDDR5X plus another 2.2 PB of HBM3 memory.
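
Here is that math in sketch form. Reproducing the rounding in those figures requires binary petabytes (1 PB taken as 1024^2 GB), so that assumption is baked in:

```python
# Aggregate memory math for Jupiter, assuming all 23,762 GH200s are fully
# populated (no salvage/binning), per the caveat above.
superchips = 23_762
lpddr5x_gb, hbm3_gb = 480, 96  # per GH200 (HBM3 version)

print(f"LPDDR5X: ~{superchips * lpddr5x_gb / 1024**2:.1f} PB")  # ~10.9 PB
print(f"HBM3:    ~{superchips * hbm3_gb / 1024**2:.1f} PB")     # ~2.2 PB
```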

The system is slated to offer 93 EFLOPS of low-precision performance for AI uses, or over 1 EFLOPS of delivered high-precision (FP64) performance for traditional HPC workloads. The latter figure is especially notable, as it would make Jupiter the first NVIDIA-based exascale system for HPC workloads.

That said, NVIDIA's HPC performance claims should be taken with a word of caution, as NVIDIA is still counting tensor performance here – 1 EFLOPS of FP64 is something 23,762 H100s can only deliver with FP64 tensor operations. The traditional metric for theoretical HPC supercomputer throughput is vector performance rather than matrix performance, so this figure isn't entirely comparable to other systems. Still, with HPC workloads also making significant use of matrix math in places, it's not an entirely irrelevant claim, either. Otherwise, for anyone looking for the obligatory Frontier comparison, the straight vector performance of Jupiter would be around 800 PFLOPS, versus over twice that for Frontier. How close the two systems get in real-world conditions, on the other hand, will come down to how much matrix math is used in their respective workloads (LINPACK results should be interesting).
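
To put numbers on the tensor-versus-vector gap (and on the AI figure), here is a quick sketch using the per-GPU rates from the spec table; the 3,958 TFLOPS FP8-with-sparsity rate is our assumption for how NVIDIA arrives at 93 EFLOPS:

```python
# Jupiter throughput across 23,762 Hopper GPUs, per-GPU rates from the table.
gpus = 23_762
fp64_tensor_tflops = 66.9  # matrix (tensor core) rate
fp64_vector_tflops = 33.5  # traditional vector rate
fp8_sparse_tflops = 3958   # assumed basis for the low-precision AI figure

print(f"FP64 tensor: ~{gpus * fp64_tensor_tflops / 1e6:.2f} EFLOPS")  # ~1.59 EFLOPS
print(f"FP64 vector: ~{gpus * fp64_vector_tflops / 1e3:.0f} PFLOPS")  # ~796 PFLOPS
print(f"FP8 sparse:  ~{gpus * fp8_sparse_tflops / 1e6:.0f} EFLOPS")   # ~94 (quoted: 93)
```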

No price tag has been announced for the system, but power consumption has: a toasty 18.2 Megawatts of electricity (~3MW less than Frontier). So whatever the actual price of the system is, like the system itself, it will be anything but petite.

According to NVIDIA's press release, the system will be housed at the Forschungszentrum Jülich facility in Germany, where it will be used for "the creation of foundational AI models in climate and weather research, material science, drug discovery, industrial engineering and quantum computing." Installation of the system is scheduled for 2024, though no date has been announced for when it's expected to come online.
