At the end of the concrete plaza that forms the courtyard of the Salk Institute in La Jolla, California, there is a three-hundred-fifty-foot drop to the Pacific Ocean.
Sometimes people explore that drop from high up in a paraglider. If they’re less adventuresome, they can walk down a meandering trail that hugs the cliff all the way to the bottom.
It’s a good spot from which to reflect on the mathematical tool called “stochastic gradient descent,” a technique at the heart of today’s machine learning approach to artificial intelligence.
Terry Sejnowski has been exploring gradient descent for decades. Sejnowski, who leads a team at Salk studying what’s called computational neuroscience, has been a mentor to Geoffrey Hinton of Google, one of the three winners of this year’s ACM Turing award for contributions to computing in machine learning. He regularly shares ideas with Hinton’s co-recipients, Yann LeCun of Facebook and Yoshua Bengio of Montreal’s MILA institute for machine learning.
This week, I sat down with Sejnowski in his cozy office, lined to the ceiling with books, inside the concrete bungalows at Salk, for a wide-ranging chat about A.I. One intriguing theme stood out: the notion that the entire A.I. field is only just beginning to understand the profound phenomenon of gradient descent.
“What the mathematicians are discovering is that all your intuitions are wrong about stochastic gradient descent,” Sejnowski said.
To understand why that is requires a brief history lesson from Sejnowski. He is well suited to the task, having authored a superb book on the topic that is part memoir and part science lesson, called The Deep Learning Revolution.
Sejnowski recalled how A.I. had progressed from its birth in the 1950s. The “rules-based” researchers in A.I., people who took approaches based on logic and symbol manipulation, tried for decades to make their approach work, and failed. Their failure made room in the eighties and nineties for quiet progress by the alternative school of thought, the “connectionists,” including Sejnowski, Hinton, LeCun, and Bengio. Connectionism, as it achieved stunning success in the 2000s, was rechristened deep learning.
The difference between failed logic systems and deep learning is scale. The connectionists’ neural networks, unlike the rules-based, logic-based approach, were able to scale up to larger and larger problems as computers got more and more powerful and data more plentiful. Rules didn’t scale, but learning from data did. The rest is history, at least to Sejnowski.
“See, the people who went for logic had fifty years to show that it didn’t scale. And now, we had thirty years, from the eighties to today, to show that it [connectionism] does scale.
“Here, at least with some patterns, with pattern recognition, with reinforcement learning and so forth, we have something that scales,” he said.
While big data and rising compute made all that possible, nothing would have scaled if it weren’t for the mysterious underlying reality of the gradient.
“It turns out, it looks as if the stochastic gradient descent is the magic, the secret sauce,” he said.
“There’s something special about it.”
Gradient descent is an optimization approach for neural networks. A neural network has what are called weights that decide how much any single component of a neural network should contribute to the final answer that is generated by the network.
To find the right mixture of weights, the neural network adjusts those weights by searching a landscape of geometric coordinates that resembles a valley. The neural network repeatedly adjusts weights in response to data in order to find a path from the top of the valley, which represents the greatest error, to the lowest point in the valley, which represents the smallest amount of error the neural network can achieve.
If it were as easy as jumping off the cliff at La Jolla, this process would be a simple matter for the computer. Instead, stochastic gradient descent is like wandering through an uncharted mountainside, trying to find the quickest way down.
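To make the valley picture concrete, here is a minimal sketch of plain gradient descent in Python. It is not a neural network and not anything Sejnowski described: the bowl-shaped `loss` function, the starting weights, and the `learning_rate` are all assumptions chosen for illustration. The loop simply nudges two weights downhill, step by step, until they settle near the lowest point of the bowl.

```python
def loss(w):
    """Hypothetical bowl-shaped error surface; its lowest point is at (3, -1)."""
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2


def gradient(w):
    """Slope of the error surface along each weight."""
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]


def gradient_descent(w, learning_rate=0.1, steps=200):
    """Repeatedly step each weight a little way in the downhill direction."""
    for _ in range(steps):
        g = gradient(w)
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w


start_weights = [0.0, 0.0]
final_weights = gradient_descent(start_weights)
print("error at start:", loss(start_weights))
print("error at end:  ", loss(final_weights))
```

On a smooth bowl like this one the walk downhill is trivial. The difficulty the article goes on to describe comes from surfaces that are far less tidy.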
Because gradient descent is just a mathematical construct, a geometric model of what’s going on in the search for a solution, the entire field of A.I. is only beginning to grasp what the mystery of that search means, Sejnowski contends.
In the ’80s, navigating that gradient was derided by MIT scientist Marvin Minsky as mere “hill climbing.” (Hill climbing is gradient descent turned upside down: rather than descending to the point of lowest error, the search ascends to a summit of highest accuracy.) In Minsky’s view, it was an unremarkable search, nothing like true learning and nothing representing actual intelligence. Similar attacks are leveled against deep learning to this day.
But such attacks fail to understand what is coming into focus ever so slowly as greater and greater computing power reveals aspects of the gradient, Sejnowski contends.
“Here is what we’ve discovered, and what Minsky could never have imagined,” he said, “because he lived in the low-dimensional universe of problems that are so small, you can’t really explore what happens when you have a vast space with a billion parameters in it.”
What has been discovered is that the way people think about gradient descent is generally wrong.
In simple neural network searches, in geometry of just two or three dimensions, the quest for that place at the bottom of the valley is fraught with wrong turns, called spurious local minima, like a ridge along the way that only looks to be the valley floor.
Deep learning was able to overcome those local minima via a combination of larger data sets, more network layers, and techniques such as “dropout,” where units of the network are randomly switched off during training.
However, Sejnowski’s point is that inside of the trap of local minima is something potentially very powerful. As the math gets more complex with more powerful computer models, all those wrong turns start to form something more meaningful.
“If you have a million dimensions, and you’re coming down, and you come to a ridge or something, even if half the dimensions are going up, the other half are going down! So you always find a way to get out,” explains Sejnowski. “You never get trapped” on a ridge, at least, not permanently.
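Sejnowski’s picture of a ridge with escape routes can be sketched with a toy example. The code below is an illustration under stated assumptions, not a model of any real network: `saddle_loss` is a made-up surface in 1,000 dimensions where the even-numbered coordinates curve upward and the odd-numbered ones curve downward, so the origin is a flat spot that is neither a peak nor a valley floor. Starting a hair away from that flat spot, gradient descent slides off along the downward-curving directions.

```python
import random


def saddle_loss(x):
    """Toy surface: even-indexed coordinates curve up, odd-indexed curve down."""
    return sum(xi ** 2 if i % 2 == 0 else -(xi ** 2) for i, xi in enumerate(x))


def saddle_gradient(x):
    """Slope of the toy surface along each coordinate."""
    return [2.0 * xi if i % 2 == 0 else -2.0 * xi for i, xi in enumerate(x)]


rng = random.Random(0)  # fixed seed so the run is repeatable
dims = 1000

# Begin almost exactly on the flat spot at the origin.
x = [rng.uniform(-1e-3, 1e-3) for _ in range(dims)]
loss_at_start = saddle_loss(x)

for _ in range(50):
    g = saddle_gradient(x)
    x = [xi - 0.05 * gi for xi, gi in zip(x, g)]

loss_at_end = saddle_loss(x)
print("loss near the flat spot:", loss_at_start)
print("loss after 50 steps:    ", loss_at_end)
```

The toy surface has no floor, so the loss would keep falling forever. The only point is that a flat spot with some downhill directions does not hold the search in place.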
In this view, the classic statistical trap of “over-fitting” the data, which can lead to local minima, is actually a blessing in disguise.
“It turns out that over-parameterizing is not a sin in higher-dimensional spaces. In fact it gives you degrees of freedom that you can use for learning,” Sejnowski said.
Even something as simple as linear regression, Sejnowski said, which is not machine learning per se but merely elementary statistics, takes on a strange new form in a gradient of potentially infinite scale.
“It turns out that even regression — something that is kind of elementary, a closed book, how you fit a straight line through a bunch of points — it turns out when you are dealing with a million-dimensional space, is a much more interesting problem; it’s like, you can actually fit every single point with a straight line, except for a very small number.”
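A small experiment hints at what he means. The sketch below assumes a linear model with 50 adjustable weights and only five data points, all of them random numbers invented for the illustration. With that many spare degrees of freedom, ordinary gradient descent drives the error on every one of the five points to essentially zero. It is a toy, far from a million dimensions, and it does not reproduce any result Sejnowski cited.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable
num_points, num_features = 5, 50  # far more weights than data points

# Hypothetical data: random features and arbitrary targets.
X = [[rng.gauss(0, 1) for _ in range(num_features)] for _ in range(num_points)]
y = [rng.uniform(-1, 1) for _ in range(num_points)]


def predict(w, row):
    """Linear model: weighted sum of one row of features."""
    return sum(wj * xj for wj, xj in zip(w, row))


w = [0.0] * num_features
learning_rate = 0.01

for _ in range(500):
    residuals = [predict(w, row) - target for row, target in zip(X, y)]
    # Gradient of 0.5 * (sum of squared residuals) with respect to each weight.
    grad = [sum(r * row[j] for r, row in zip(residuals, X))
            for j in range(num_features)]
    w = [wj - learning_rate * gj for wj, gj in zip(w, grad)]

worst_miss = max(abs(predict(w, row) - target) for row, target in zip(X, y))
print("largest error on any training point:", worst_miss)
```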
Sejnowski is confident that the gradient is leading the mathematicians who study deep learning toward insights that will someday form a theory of machine learning.
“It’s the geometry of these high-dimensional spaces, in terms of how they are organized, in terms of the way you get from one place in space to another.
“All of these things point toward something that tends to be very rich mathematically. And once we’ve understood it — we’re beginning to explore it — we’ll come up with even more, incrementally more efficient ways of exploring this space of these architectures.”
For current machine learning research, there is an immediate implication: more precise optimization techniques are less desirable, not more.
“If you use a fancier optimization technique that does it more accurately, it doesn’t work as well,” he observes.
“So there’s something special about an optimization technique that is noisy, where you are taking in mini-batches and it’s not going down the perfect gradient, but going down in a direction that’s only an approximate downhill.”
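The “approximate downhill” he describes can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the synthetic data (points scattered around the line y = 2x + 1), the batch size of 16, and the learning rate. Each step estimates the slope of the error from a small random handful of points rather than from the whole data set, so the direction is a little off every time, yet the fitted line still ends up close to the true one.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical data: y = 2x + 1 plus a little noise.
xs = [rng.uniform(-1, 1) for _ in range(1000)]
data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.1)) for x in xs]


def batch_gradient(w, b, batch):
    """Mean-squared-error gradient, estimated only from the given batch."""
    n = len(batch)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return gw, gb


def sgd(points, batch_size=16, learning_rate=0.1, epochs=20):
    """Mini-batch stochastic gradient descent for a one-variable line fit."""
    points = list(points)  # copy so the caller's list is not reordered
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(points)
        for i in range(0, len(points), batch_size):
            gw, gb = batch_gradient(w, b, points[i:i + batch_size])
            w -= learning_rate * gw
            b -= learning_rate * gb
    return w, b


# The exact gradient versus one noisy mini-batch estimate at the start.
full_gradient = batch_gradient(0.0, 0.0, data)
noisy_gradient = batch_gradient(0.0, 0.0, data[:16])

w, b = sgd(data)
print("exact gradient at start:     ", full_gradient)
print("mini-batch gradient at start:", noisy_gradient)
print("fitted slope and intercept:  ", w, b)
```

The sketch shows only that the noisy method works on an easy problem. Why noise seems to help on hard ones is the open question Sejnowski is pointing to.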
The field is “just beginning to explore” the mysteries of gradient descent, Sejnowski said. “We have something that works, and we don’t actually know why it works.
“Once we do, we will be able to build an even more efficient machine that will be much more powerful.”