Hot Chips 31 Live Blogs: Cerebras' 1.2 Trillion Transistor Deep Learning Processor

08:49PM EDT – One of the big news items today is Cerebras announcing its wafer-scale 1.2 trillion transistor solution for deep learning. The talk goes into detail about the technology.

08:51PM EDT – Wafer-scale chip: 46,225 mm², 1.2 trillion transistors, 400k AI cores, fed by 18 GB of on-chip SRAM

08:51PM EDT – TSMC 16nm

08:51PM EDT – 215 mm x 215 mm – about 8.5 inches per side

08:51PM EDT – 56 times larger than the largest GPU today
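
(ed: a quick back-of-the-envelope check on those numbers; the ~815 mm² figure for today's largest GPU die is our assumption, roughly a reticle-limit part like NVIDIA's V100)

```python
# Sanity-checking the quoted figures (illustrative only)
side_mm = 215                              # 215 mm per side, as quoted
area_mm2 = side_mm ** 2                    # 215 * 215 = 46,225 mm^2
side_inches = side_mm / 25.4               # ~8.46 in, rounded to ~8.5 in the talk

largest_gpu_mm2 = 815                      # assumption: reticle-limit-class GPU die
ratio = area_mm2 / largest_gpu_mm2         # ~56.7x, matching the "56x larger" claim

print(area_mm2, round(side_inches, 2), round(ratio, 1))
```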

08:52PM EDT – Built for Deep Learning

08:52PM EDT – DL training is hard (ed: this is an understatement)

08:52PM EDT – Peta-to-exa scale compute range

08:53PM EDT – The shape of the problem is difficult to scale

08:53PM EDT – Fine grain has a lot of parallelism

08:53PM EDT – Coarse grain is inherently serial

08:53PM EDT – Training is the process of applying small changes, serially

08:53PM EDT – Size and shape of the problem makes training NN really hard

08:53PM EDT – Today we have dense vector compute

08:54PM EDT – For coarse-grained scaling, a high-speed interconnect is required to run multiple instances. Still limited

08:54PM EDT – Scaling is limited and costly

08:54PM EDT – Specialized accelerators are the answer

08:55PM EDT – NN: what is the right architecture

08:55PM EDT – Need a core to be optimized for NN primitives

08:55PM EDT – Need a programmable NN core

08:55PM EDT – Needs to do sparse compute fast

08:55PM EDT – Needs fast local memory

08:55PM EDT – All of the cores should be connected with a fast interconnect

08:56PM EDT – Cerebras uses flexible cores. Flexible general ops for control processing

08:56PM EDT – Core should handle tensor operations very efficiently

08:56PM EDT – Forms the bulk of the compute in any neural network

08:56PM EDT – Tensors as first class operands

08:57PM EDT – fmac native op

08:57PM EDT – NN naturally creates sparse networks

08:58PM EDT – The core has native sparse processing in the hardware with dataflow scheduling

08:58PM EDT – All the compute is triggered by the data

08:58PM EDT – Filters out all the sparse zeros, and filters out the unnecessary work

08:58PM EDT – Saves power and energy, and gains performance and acceleration by moving on to the next piece of useful work

08:58PM EDT – Enabled because arch has fine-grained execution datapaths

08:58PM EDT – Many small cores with independent instructions

08:59PM EDT – Allows for very non-uniform work
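
(ed: a minimal sketch of the zero-skipping idea as we understand it – our own Python illustration, not the hardware's actual scheduler; compute is triggered only by non-zero data, so no fmac is wasted on zeros)

```python
# Illustrative dataflow-style sketch: only non-zero activations trigger work.
def sparse_accumulate(activations, weights):
    acc = 0.0
    useful = ((x, w) for x, w in zip(activations, weights) if x != 0.0)
    for x, w in useful:          # compute is "triggered by the data"
        acc += x * w             # one fmac per useful (non-zero) input
    return acc

acts = [0.0, 1.5, 0.0, 0.0, 2.0]     # mostly zeros, as ReLU activations tend to be
wts  = [0.3, 0.4, 0.1, 0.9, 0.2]
print(sparse_accumulate(acts, wts))  # 1.0, spending only 2 of 5 possible fmacs
```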

08:59PM EDT – Next is memory

08:59PM EDT – Traditional memory architectures are not optimized for DL

08:59PM EDT – Traditional memory requires high data reuse for performance

09:00PM EDT – The normal matrix-vector multiply in NNs has low data reuse

09:00PM EDT – Translating Mat*Vec into Mat*Mat (batching) increases reuse, but changes the training dynamics
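
(ed: a rough way to see the reuse argument, with our own illustrative numbers: a Mat*Vec touches each weight once, while batching into Mat*Mat reuses each weight across the whole batch)

```python
# Illustrative arithmetic-intensity estimate: FLOPs performed per weight loaded,
# for an M x K weight matrix applied to a batch of B input vectors.
def flops_per_weight(M, K, B):
    flops = 2 * M * K * B        # one multiply + one add per weight, per batch element
    weight_loads = M * K         # each weight fetched once if the batch is blocked on-chip
    return flops / weight_loads

print(flops_per_weight(1024, 1024, 1))    # Mat*Vec: 2 FLOPs per weight, reuse-starved
print(flops_per_weight(1024, 1024, 64))   # Mat*Mat with B=64: 128 FLOPs per weight
```

With distributed on-chip SRAM next to the cores (the next point), the claim is that the Mat*Vec case no longer needs that reuse to keep the datapaths fed.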

09:00PM EDT – Cerebras has high-perf, fully distributed on-chip SRAM next to the cores

09:01PM EDT – Getting orders of magnitude higher bandwidth

09:01PM EDT – ML can be done the way it wants to be done

09:01PM EDT – High bandwidth, low latency interconnect

09:01PM EDT – fast and fully configurable fabric

09:01PM EDT – All HW-based communication avoids SW overhead

09:02PM EDT – 2D mesh topology

09:02PM EDT – Higher utilization and efficiency than global topologies
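
(ed: an illustration of how a 2D mesh moves data – the simple dimension-order/XY routing shown here is our assumption for the sketch, not a disclosed Cerebras routing scheme)

```python
# Illustrative only: in a 2D mesh, a message travels hop-by-hop between
# neighboring cores; here we sketch dimension-order (XY) routing.
def xy_route(src, dst):
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:               # walk along X first...
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:               # ...then along Y
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```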

09:02PM EDT – Need more than a single die

09:02PM EDT – Solution is wafer scale

09:03PM EDT – Build Big chips

09:03PM EDT – Cluster scale perf on a single chip

09:03PM EDT – GB of fast memory (SRAM) 1 clock cycle from the core

09:03PM EDT – That’s impossible with off-chip memory

09:03PM EDT – Full on-chip interconnect fabric

09:03PM EDT – Model parallel, linear performance scaling

09:04PM EDT – Map the entire neural network onto the chip at once

09:04PM EDT – One instance of the NN; don't have to increase batch size to get cluster-scale perf

09:04PM EDT – Vastly lower power and less space

09:04PM EDT – Can use TensorFlow and PyTorch

09:05PM EDT – Software performs place and route to map neural network layers onto the fabric

09:05PM EDT – Entire wafer operates on the single neural network
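
(ed: a sketch of the programming model as described – the model below is plain PyTorch; the compilation/placement step is paraphrased from the talk, not an actual API call)

```python
# Plain PyTorch model definition; per the talk, the Cerebras software stack
# takes a graph like this, then places and routes each layer onto its own
# region of the wafer-scale fabric, so the whole network runs as a single
# model-parallel instance.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 10),
)

x = torch.randn(1, 1024)        # batch of 1: no big batch needed for cluster-scale perf
print(model(x).shape)           # torch.Size([1, 10])
```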

09:05PM EDT – Challenges of wafer scale

09:05PM EDT – Need cross-die connectivity, yield, thermal expansion

09:06PM EDT – Scribe line separates the die. On top of the scribe line, create wires

09:07PM EDT – Extends 2D mesh fabric across all die

09:07PM EDT – Same connectivity between cores and between die

09:07PM EDT – More efficient than off-chip

09:07PM EDT – Full BW at the die level

09:08PM EDT – Redundancy helps yield

09:08PM EDT – Redundant cores and redundant fabric links

09:08PM EDT – Reconnect the fabric with links

09:08PM EDT – Drive yields high

09:09PM EDT – Transparent to software
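
(ed: a toy sketch of how redundancy can be made transparent to software – our own illustration, not Cerebras' actual scheme: logical cores are mapped around defective physical cores using spares, and the fabric links are re-routed to follow the new mapping)

```python
# Illustrative only: map a logical row of cores onto a physical row that
# includes spares, skipping defective cores so software sees a dense array.
def map_logical_row(physical_len, defective, logical_len):
    mapping = []
    for phys in range(physical_len):
        if phys in defective:
            continue                      # skip the bad core, a spare takes its place
        mapping.append(phys)
        if len(mapping) == logical_len:
            return mapping
    raise RuntimeError("not enough spare cores in this row")

# 10 physical cores with 2 spares per row exposing 8 logical cores
print(map_logical_row(10, defective={3, 7}, logical_len=8))
# -> [0, 1, 2, 4, 5, 6, 8, 9]
```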

09:09PM EDT – Thermal expansion

09:09PM EDT – With normal attach technology, thermal expansion creates too much mechanical stress

09:09PM EDT – Custom connector developed

09:09PM EDT – Connector absorbs the variation in thermal expansion

09:10PM EDT – All components need to be held with precise alignment – custom packaging tools

09:10PM EDT – Power and Cooling

09:11PM EDT – Power planes don't work – there isn't enough copper in the PCB to do it that way

09:11PM EDT – Heat density too high for direct air cooling

09:12PM EDT – Current is brought in perpendicular to the wafer. Water cooling is perpendicular too

09:14PM EDT – Q&A Time

09:14PM EDT – Already in use? Yes

09:15PM EDT – Can you make a round chip? Square is more convenient

09:15PM EDT – Yield? Mature processes are quite good and uniform

09:16PM EDT – Does it cost less than a house? Everything is amortised across the wafer

09:17PM EDT – Regular processor for housekeeping? They can all do it

09:17PM EDT – Is it fully synchronous? No

09:20PM EDT – Clock rate? Not disclosed

09:20PM EDT – That’s a wrap. Next is Habana
