Five years ago, Nvidia set out to design a supercomputer-class system powerful enough to train and run its own AI models, such as models for autonomous vehicles, but flexible enough to serve just about any deep-learning researcher. After building multiple iterations of its DGX Pods, Nvidia learned valuable lessons about building a system with modular, scalable units. Then, the pandemic struck.
The COVID-19 outbreak brought new challenges for Nvidia as it set out to build Selene, the fourth generation of its DGX SuperPODs. Reduced staff and building restrictions complicated the task, but Nvidia managed to go from bare racks in the data center to a fully operational system in just three-and-a-half weeks.
Selene is now a Top 10 supercomputer, the fastest industrial system in the US and the fastest MLPerf machine that’s commercially available.
The task of building Selene during the pandemic underscored the benefits of the design principles Nvidia has adopted, and it created a new sense of urgency to get the machine up and running.
“The whole point initially was to enable deployment very, very quickly so that we could get our researchers up and on,” Michael Houston, a chief architect who leads the Nvidia systems team, told reporters. “Nvidia’s the first customer for our machines — so we prove everything out and make sure that the machines and how we specify the pod architectures are rock solid.”
But with the onset of the pandemic, he said, “we wanted to get the machine ready to start doing COVID research, to enable some of our partners like Argonne National Labs, who also have SuperPODs.”
Selene sits in a standard data center next to Nvidia’s Silicon Valley headquarters. It comprises 280 DGX A100 systems, for a total of 2,240 Tensor Core GPUs. It has 494 Mellanox switches and seven petabytes of all-flash storage.
Since its spring deployment, Selene has run thousands of jobs a week, often simultaneously. It runs AI, data analytics, traditional machine learning and HPC applications.
“It’s not just an AI machine,” Houston said. “It’s one of the best HPC machines in the world, it’s one of the best ML machines in the world, and it’s one of the best AI machines.”
Typically, it takes dozens of engineers months to assemble and deploy a supercomputer-class system. Nvidia managed to do it with two-person teams, isolated from each other as a form of social distancing, who unboxed and racked the systems. Engineers racked up to 60 systems in a day, the maximum their loading dock could handle. Cabling was completed with six-foot distances between people, with virtual logins that let administrators validate cabling remotely.
The systems team defined modules of 20 nodes connected by relatively simple “thin switches.” These scalable units could be laid down incrementally — turned on and tested before the next one was added. Cables were cut at set lengths and bundled together with Velcro.
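As a rough illustration of the arithmetic behind this modular layout, the Python sketch below groups Selene's 280 systems into 20-node scalable units and estimates the minimum racking time at the 60-systems-per-day loading dock limit. The function names are hypothetical, and the sketch only restates figures quoted in the article; it is not Nvidia's deployment tooling.

```python
import math

# Figures quoted in the article.
NODES_TOTAL = 280          # DGX systems in Selene
NODES_PER_UNIT = 20        # nodes per scalable unit
GPUS_TOTAL = 2240          # Tensor Core GPUs in Selene
MAX_RACKED_PER_DAY = 60    # loading dock limit


def scalable_units(total_nodes: int, nodes_per_unit: int) -> int:
    """Number of scalable units needed to hold all nodes."""
    return math.ceil(total_nodes / nodes_per_unit)


def bring_up_schedule(total_nodes: int, nodes_per_unit: int):
    """Yield (unit_number, nodes_online) as each unit is added and tested."""
    online = 0
    for unit in range(1, scalable_units(total_nodes, nodes_per_unit) + 1):
        online = min(online + nodes_per_unit, total_nodes)
        yield unit, online


def min_racking_days(total_nodes: int, racked_per_day: int) -> int:
    """Lower bound on racking time if the daily maximum is hit every day."""
    return math.ceil(total_nodes / racked_per_day)


if __name__ == "__main__":
    print("scalable units:", scalable_units(NODES_TOTAL, NODES_PER_UNIT))
    print("GPUs per system:", GPUS_TOTAL // NODES_TOTAL)
    print("minimum racking days:", min_racking_days(NODES_TOTAL, MAX_RACKED_PER_DAY))
    for unit, online in bring_up_schedule(NODES_TOTAL, NODES_PER_UNIT):
        print(f"unit {unit:2d} tested, {online} nodes available to users")
```

Because each unit is tested before the next is added, users can start on the first 20 nodes while later units are still being racked.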
“It’s designed to be deployed very, very fast,” Houston explained. “Once you have the racks in and power in, the whole design and methodology of what we’ve done is to enable very rapid deployment. You’re actually getting users on incrementally, so our average time from a racked machine to doing a full checkout and handing off to a user is four hours.”
Selene is based on an open architecture Nvidia shares with its customers. In addition to the Argonne National Lab, the University of Florida plans to use the design to build the fastest AI computer in academia. Companies like Lockheed Martin and Microsoft are also using DGX SuperPODs.
The aim of Nvidia’s design, Houston said, is to be deployable in any available data center, from telco data centers, which tend to use 7 kW racks, all the way up to HPC data centers.
That said, “the biggest pull that we’ve been seeing is in HPC facilities and in AI research companies that need to get machines up that are big and fast and proven,” Houston said, “without having to waste a lot of time trying to do custom builds and figure out all of the interconnect trade-offs, all the software to make it all work… Both from a software full stack point of view but also from a data center full stack point of view, we’re able to deliver very fast and enable their research.”
Nvidia targeted 28 kilowatts per rack, the most common power envelope for high-density hyperscaler infrastructure. The team spent significant time on interconnect design to make the system easy to deploy and expandable.
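To see how rack power shapes the physical footprint, the Python sketch below compares a 28 kW rack with the 7 kW racks typical of telco data centers. It assumes a peak draw of 6.5 kW per system, an illustrative figure that does not come from the article, and it ignores the power used by switches and storage. The function names are hypothetical.

```python
import math

# ASSUMPTION: illustrative peak draw per system; not a figure from the article.
ASSUMED_SYSTEM_KW = 6.5


def systems_per_rack(rack_kw: float, system_kw: float = ASSUMED_SYSTEM_KW) -> int:
    """Whole systems that fit within a rack's power budget."""
    return int(rack_kw // system_kw)


def racks_needed(total_systems: int, rack_kw: float,
                 system_kw: float = ASSUMED_SYSTEM_KW) -> int:
    """Compute racks required to host all systems at a given rack power."""
    per_rack = systems_per_rack(rack_kw, system_kw)
    if per_rack == 0:
        raise ValueError("rack power budget is below a single system's draw")
    return math.ceil(total_systems / per_rack)


if __name__ == "__main__":
    for rack_kw in (7, 28):
        print(f"{rack_kw} kW rack: {systems_per_rack(rack_kw)} systems per rack, "
              f"{racks_needed(280, rack_kw)} racks for 280 systems")
```

Under these assumptions, the same 280 systems occupy four times as many racks in a 7 kW facility, which is why a design meant to fit any data center has to treat rack density as a variable.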
“We learned a lot of hard lessons on expandability on previous architectures,” Houston said, “where to expand the machine we had to massively rewire it. So we wanted a very different approach as we went through this.”
The team split up compute, storage and management fabrics into independent planes, with two network-interface cards per GPU. With this SuperPOD, Nvidia also increased the capacity and throughput of memory and storage links. Four storage tiers span 100 terabyte/second memory links to 100 Gb/s storage pools.
The system was built with layers of automation. For instance, Selene talks to Nvidia staff over a Slack channel as if it were a co-worker, reporting loose cables and isolating malfunctioning hardware so the system can keep running.
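The article does not describe how this automation is implemented, so the following Python sketch is purely hypothetical. It shows one way a health check could turn node status into chat-style messages and take faulty nodes out of scheduling. Every class, field and function name here is invented for illustration, and the sketch only builds the message text; it does not call any chat service.

```python
from dataclasses import dataclass, field


@dataclass
class NodeStatus:
    """Hypothetical health record for one node."""
    name: str
    cable_ok: bool = True
    hardware_ok: bool = True


@dataclass
class Cluster:
    """Hypothetical cluster view: all nodes plus those taken out of service."""
    nodes: list
    drained: set = field(default_factory=set)


def triage(cluster: Cluster) -> list:
    """Return human-readable reports and drain nodes with hardware faults."""
    messages = []
    for node in cluster.nodes:
        if not node.cable_ok:
            messages.append(f"{node.name}: possible loose cable, please check")
        if not node.hardware_ok and node.name not in cluster.drained:
            cluster.drained.add(node.name)
            messages.append(f"{node.name}: hardware fault, removed from scheduling")
    return messages


def schedulable(cluster: Cluster) -> list:
    """Names of nodes that can still accept jobs."""
    return [n.name for n in cluster.nodes if n.name not in cluster.drained]


if __name__ == "__main__":
    cluster = Cluster(nodes=[
        NodeStatus("node-001"),
        NodeStatus("node-002", cable_ok=False),
        NodeStatus("node-003", hardware_ok=False),
    ])
    for message in triage(cluster):
        print(message)  # a real system would post this to a chat channel
    print("still schedulable:", schedulable(cluster))
```

In this sketch a loose cable is reported but the node stays in service, while a hardware fault removes the node so the rest of the system keeps running; where the real system draws that line is not stated in the article.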
Meanwhile, Nvidia uses a telepresence robot from Double Robotics — named Trip — to let staff virtually observe Selene via the robot’s camera and microphone.