One of the key elements to deep learning and training is lots of very dense compute, as well as the dense servers to go through the computation. Intel’s Nervana NNP-T Spring Crest silicon, which we saw at Hot Chips earlier this year, is the ‘big training silicon’ that came out of the acquisition with Nervana: a 680 mm2 built on TSMC 16nm with CoWoS with four stacks of HBM2. The servers using this silicon are now starting to appear, and we caught sight of two at the Supermicro booth at Supercomputing.
To start the week, Supermicro showed off its first server using the NNP-T hardware, all based on PCIe cards. As one might imagine, these fit into any server that was previously designed to accommodate GPUs, and so Supermicro’s system is a very typical 2P unit with 8 cards in a 4U design. The cards talk to each other, with 3.58 Tbps total bi-directional bandwidth per chip, and off-chip connections supporting scalability up to 1024 nodes. As noted by the single 8-pin PCIe power per card, it means that each card has a peak power of 225W as per PCIe standards.
Later in the week, we were told by Supermicro that they had been given permission to show off the 8-way OAM (OCP Accelerator Module) version of the server, which keeps the chip-to-chip communications through the PCB of the baseboard, rather than using PCIe card-to-card like connectors. This also allows for substantial air cooling implementations, and compatibility with the OCP standards, as well as modularization.
The chip is Intel’s first to support bfloat16 numerics for training in deep learning. Each chip supports up to 119 TOPs, with 60 MB of on-chip memory and 24 dedicated ‘tensor’ processing clusters which have dual 32×32 matrix multiply arrays. The chip has a total of 27 billion transistors, and the cores run at 1.1 GHz. This is supported by 32 GB of HBM2-2400 memory, and technically the PCIe connection is a PCIe 4.0 x16 connection, however Intel does not have CPUs to support this yet.
We were told that for this sort of compute, in order to drive it all, some training customers are moving up from 2P per head node up to 4P, and bigger installations such as Facebook are already using 8P systems to drive their deep learning and training hardware.
Supermicro states that their NNP-T systems are ready for deployment.