AMD talks 1.2 million GPU AI supercomputer to compete with Nvidia — 30X more GPUs than world's fastest supercomputer (2024)

AMD talks 1.2 million GPU AI supercomputer to compete with Nvidia — 30X more GPUs than world's fastest supercomputer (1)

Demand for more computing power in the data center is growing at a staggering pace, and AMD has revealed that it has had serious inquiries to build single AI clusters packing a whopping 1.2 million GPUs or more.

AMD's admission comes from a lengthy discussion The Next Platform had with Forrest Norrod, AMD's EVP and GM of the Datacenter Solutions Group, about the future of AMD in the data center. One of the most eye-opening responses was about the biggest AI training cluster that someone is seriously considering.

When asked if the company has fielded inquiries for clusters as large as 1.2 million GPUs, Forrest replied that the assessment was virtually spot on.

Morgan: What’s the biggest AI training cluster that somebody is serious about – you don’t have to name names. Has somebody come to you and said with MI500, I need 1.2 million GPUs or whatever.Forrest Norrod: It’s in that range? Yes.

Morgan: You can’t just say “it’s in that range.” What’s the biggest actual number?Forrest Norrod: I am dead serious, it is in that range.

Morgan: For one machine.Forrest Norrod: Yes, I’m talking about one machine.

Morgan: It boggles the mind a little bit, you know?

1.2 million GPUs is an absurd number (mind-boggling, as Forest quips later in the interview). AI-training clusters are often built with a few thousand GPUs connected via a high-speed interconnect across several server racks or less. By contrast, creating an AI cluster with 1.2 million GPUs seems virtually impossible.

We can only imagine the pitfalls someone will need to overcome to try and build an AI cluster with over a million GPUs, but latency, power, and the inevitability of hardware failures are a few factors that immediately come to mind.

AI workloads are extremely sensitive to latency, particularly tail latency and outliers, wherein certain data transfers take much longer than others and disrupt the workload.Additionally, today's supercomputers have to mitigate the GPU or other hardware failures that, at their scale, occur every few hours. Those issues would become far more pronounced when scaling to 30X the size of today's largest known clusters. And that's before we even touch on the nuclear power plant-sized power delivery required for such an audacious goal.

Even the most powerful supercomputers in the world don't scale to millions of GPUs. For instance, the fastest operational supercomputer right now, Frontier, "only" has 37,888 GPUs.

The goal of million-GPU clusters speaks to the seriousness of the AI race that is molding the 2020s. If it is in the realm of possibility, someone will try to do it if it means greater AI processing power. Forest didn't say which organization is considering building a system of this scale but did mention that "very sober people" are contemplating spending tens to hundreds of billions of dollars on AI training clusters (which is why millions of GPU clusters are being considered at all).

Stay On the Cutting Edge: Get the Tom's Hardware Newsletter

Get Tom's Hardware's best news and in-depth reviews, straight to your inbox.

Aaron Klotz

Freelance News Writer

Aaron Klotz is a freelance writer for Tom’s Hardware US, covering news topics related to computer hardware such as CPUs, and graphics cards.

Most Popular

Sony cuts 250 jobs at optical media plant — recordable disc production to be phased out, says report

SK hynix plans $74.6 billion investment to strengthen its memory chip business — hopes for AI business boost

Microsoft's Copilot+ PC exclusive features are a bad joke, even for AI fans

Overseas Ryzen 9000 CPU preorders shed light on potential MSRPs — Ryzen 9 9950X for $707, Ryzen 5 9600X for $332

Hard drive, SSD puncher puts four holes through your drives — Puncher P30 destroys physical media with 12 tons of pressure

Intel Arc Battlemage GPU surfaces — BMG-G31 silicon reportedly wields 32 Xe2 Cores

Alibaba Cloud ditches Nvidia's interconnect in favor of Ethernet — tech giant uses own High Performance Network to connect 15,000 GPUs inside data center

Laptop mod lets you type in Morse code by slamming the lid shut at the correct rhythm — No parties involved take responsibility for damage to the screen, hinge, or sanity

Best Buy listings show Ryzen AI 300 laptops coming on July 28 — retailer moves availability date back by two weeks

AI becomes leading server workload as shipments of other servers drop

Qualcomm Snapdragon X Elite laptops suffer compatibility issues with many games — even Intel's integrated Arc Graphics is up to 3x faster