May 24, 2022
Meta AI wants to build the world's most powerful supercomputer
On January 24, 2022, Meta introduced the AI Research SuperCluster (RSC), which is among the fastest AI supercomputers running today and, according to Meta, will be the fastest in the world once fully built out in mid-2022.
AI can currently perform tasks like translating text between
languages and helping identify potentially harmful content, but developing the
next generation of AI will require powerful supercomputers capable of
quintillions of operations per second.
RSC will help Meta’s AI researchers build better AI models
that can learn from trillions of examples; work across hundreds of different
languages; seamlessly analyze text, images and video together; develop new
augmented reality tools and more. Ultimately, the work done with RSC will pave
the way toward building technologies for the next major computing platform —
the metaverse, where AI-driven applications and products will play an important
role.
Why We Need AI at This Scale
Since 2013, Facebook has been making significant strides in AI, including self-supervised learning, where algorithms can learn from vast numbers of unlabeled examples, and transformers, which allow AI models to reason more effectively by focusing on the most relevant parts of their input. To fully realize the benefits of advanced AI, domains such as vision, speech, and language will require training increasingly large and complex models, especially for critical use cases like identifying harmful content. In early 2020, Meta decided that the best way to accelerate progress was to design a new computing infrastructure: RSC.
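To make the transformer idea concrete, here is a minimal sketch of scaled dot-product attention, the mechanism that lets a model focus on certain parts of its input. This is an illustrative NumPy implementation, not Meta’s code; the names and shapes are chosen for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention (the core of a transformer).

    Q, K, V have shape (seq_len, d_model). Each output position is a
    weighted average of V, where the weights measure how strongly the
    corresponding query matches each key (the "focusing" described above).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # weighted sum of values

# Toy usage: self-attention over 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
print(scaled_dot_product_attention(x, x, x).shape)    # (4, 8)
```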
Using RSC to Build for the Metaverse
With RSC, Meta can more quickly train models that use multimodal signals to determine whether an action, sound, or image is harmful or benign. This research will help keep people safe not only on Meta’s services today, but also in the future, as the company builds the metaverse.
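As a rough illustration of how multimodal signals can be combined, the sketch below fuses precomputed text, image, and audio embeddings in a small PyTorch classification head. The architecture, embedding dimensions, and class labels here are hypothetical stand-ins, not Meta’s production model.

```python
import torch
import torch.nn as nn

class MultimodalClassifier(nn.Module):
    """Hypothetical fusion head: concatenate per-modality embeddings,
    then classify the combined signal as benign or harmful."""

    def __init__(self, text_dim=768, image_dim=512, audio_dim=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + image_dim + audio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 2),  # logits for [benign, harmful]
        )

    def forward(self, text_emb, image_emb, audio_emb):
        fused = torch.cat([text_emb, image_emb, audio_emb], dim=-1)
        return self.head(fused)

# Toy batch of 4 examples with (randomly generated) precomputed embeddings
model = MultimodalClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 512), torch.randn(4, 128))
print(logits.argmax(dim=-1))  # predicted class per example
```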
RSC: Under the hood
AI supercomputers are built by combining multiple GPUs into
compute nodes, which are then connected by a high-performance network fabric to
allow fast communication between those GPUs. RSC today comprises 760 NVIDIA DGX A100 systems as its compute nodes, for a total of 6,080 GPUs, and each A100 GPU is more powerful than the V100 GPUs used in Meta’s previous system. Each DGX communicates over an NVIDIA Quantum 1,600 Gb/s InfiniBand two-level Clos fabric with no oversubscription. RSC’s storage tier consists of 175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage in Penguin Computing Altus systems, and 10 petabytes of Pure Storage FlashBlade.
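The headline numbers are internally consistent: each DGX A100 system contains 8 A100 GPUs, so 760 nodes yield exactly 6,080 GPUs. A quick back-of-the-envelope check:

```python
# Sanity check of RSC's published phase-one specs
nodes = 760                    # NVIDIA DGX A100 compute nodes
gpus_per_node = 8              # each DGX A100 contains 8 A100 GPUs
print(nodes * gpus_per_node)   # 6080, matching the stated GPU total

storage_pb = {"FlashArray": 175, "Altus cache": 46, "FlashBlade": 10}
print(sum(storage_pb.values()), "PB across the storage tiers")  # 231 PB
```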
Early benchmarks on RSC, compared with Meta’s legacy
production and research infrastructure, have shown that it runs computer vision
workflows up to 20 times faster, runs the NVIDIA Collective Communication
Library (NCCL) more than nine times faster, and trains large-scale NLP models
three times faster. That means a model with tens of billions of parameters can
finish training in three weeks, compared with nine weeks before.
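NCCL is the collective-communication library that frameworks such as PyTorch use to synchronize gradients across GPUs, so its speed directly bounds large-scale training throughput. Below is a minimal sketch of the core collective, an all-reduce over the NCCL backend via torch.distributed; the launch command and script name are illustrative.

```python
# Minimal all-reduce over NCCL, the collective used to average gradients
# in data-parallel training. Launch with, e.g.:
#   torchrun --nproc_per_node=8 allreduce_demo.py   (script name illustrative)
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")     # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    # Each rank holds its own tensor; all_reduce sums them across all GPUs.
    t = torch.ones(4, device="cuda") * (dist.get_rank() + 1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    if dist.get_rank() == 0:
        print(t)  # each element equals 1 + 2 + ... + world_size

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```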
Phase two and beyond
RSC is up and running today, but its development is ongoing.
Once phase two of building out RSC is completed, it will be the fastest AI supercomputer in the world, performing at nearly 5 exaflops of mixed-precision compute. Through 2022, Meta will work to increase the number of GPUs from 6,080
to 16,000, which will increase AI training performance by more than 2.5x. The
InfiniBand fabric will expand to support 16,000 ports in a two-layer topology
with no oversubscription. The storage system will have a target delivery
bandwidth of 16 TB/s and exabyte-scale capacity to meet increased demand.
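The GPU growth alone roughly accounts for the projected speedup, since 16,000 / 6,080 ≈ 2.63:

```python
# Phase-two scaling check
gpus_now, gpus_target = 6_080, 16_000
print(f"{gpus_target / gpus_now:.2f}x more GPUs")  # 2.63x, in line with >2.5x
```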