The infrastructure used to train LLaMA 3
Zuck is going big on AI. The massive GPU clusters he mentioned on this quarter's earnings call are becoming reality. They're designed to train the next generation of crazy-smart AI models. Plus, we get a dose of open-source hardware and software goodness.
What's going on here?
Meta just unveiled two huge AI training clusters, sharing details about the design and performance.
What does this mean?
Each of these clusters packs massive compute: 24,576 GPUs apiece, to be exact. Meta has even bigger ambitions for the end of 2024, aiming for a total of 600k H100-equivalent GPUs. That's fuel for training complex LLMs like Llama 3 (which is already being trained on these new clusters).
Both clusters are built on Meta's Grand Teton hardware platform (open-sourced, of course) and train with PyTorch. Meta is also experimenting with network design: one cluster uses RoCE (RDMA over Converged Ethernet) and the other InfiniBand, which will let them figure out the best way to scale up even further in the future.
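Meta's post doesn't share training code, but as a rough illustration of how that stack fits together, here's a minimal sketch of a PyTorch distributed job. The key point: NCCL handles the gradient all-reduce and runs over whichever fabric the cluster exposes (RoCE or InfiniBand), so the training script itself looks identical on both cluster designs. The toy model and hyperparameters here are placeholders, not Meta's.

```python
# Minimal PyTorch DDP sketch. NCCL's collectives run over RoCE or
# InfiniBand transparently; nothing in this script changes per fabric.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in model; a real LLM would be sharded with FSDP or similar.
    model = torch.nn.Linear(4096, 4096).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(8, 4096, device=f"cuda:{local_rank}")
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()  # gradients are all-reduced over the fabric here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=8 train.py`. At Meta's scale, the interesting engineering is exactly what sits underneath that one `backend="nccl"` line.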
Why should I care?
Zuck keeps on giving. First the Llama models, and now details about their hardware work too. To be fair, a big part of releasing this publicly is often to attract insane talent. The promise is simple: Meta has the resources you need to do awesome research.
And unlike other labs that are hush-hush about what's cooking, Meta is open about Llama 3 being in the pipeline. No random shocks (looking at you, Sora 👀👀).