Deep learning - Geoffrey Hinton AMA

Highlights from Dr. Hinton’s AMA

Geoffrey (Geoff) Everest Hinton is a British-born cognitive psychologist and computer scientist, most noted for his work on artificial neural networks. He now divides his time between Google and the University of Toronto. He is a co-inventor of the backpropagation and contrastive divergence training algorithms and is an important figure in the deep learning movement.

  • Are we any closer to understanding biological models of computation?

I think the success of deep learning gives a lot of credibility to the idea that we learn multiple layers of distributed representations using stochastic gradient descent. However, I think we are probably a long way from understanding how the brain does this.

Evolution must have found an efficient way to adapt features that are early in a sensory pathway so that they are more helpful to features that are several stages later in the pathway. I now think there is a small chance that the cortex really is doing backpropagation through multiple layers of representation. The only way I can see for this to work is for a neuron to use the temporal derivative of the underlying Poisson rate of its output to represent the derivative of the error with respect to its input. Using this representation in a stack of autoencoders makes the idea that the cortex does multi-layer backprop not totally crazy, though there are still lots of other issues to solve before this would be a plausible theory, especially the issue of how we could do backprop through time. Interestingly, the idea of using temporal derivatives to represent error derivatives predicts one type of spike-timing-dependent plasticity for bottom-up connections and a different type for top-down connections. I talked about this at the first deep learning workshop in 2007 and the slides have been on the web for 7 years with zero comments. I moved them to my web page recently (left-hand column) and also updated them.

I think that the way we currently use an unstructured “layer” of artificial neurons to model a cortical area is utterly crazy. It’s just the first thing to try because it’s easy to program and it’s turned out to be amazingly successful. But I want to replace unstructured layers with groups of neurons that I call “capsules” that are a lot more like cortical columns. There is a lot of highly structured computation going on in a cortical column and I suspect we will not understand it until we have a theory of what it’s for. My current favorite theory is that it’s for finding sharp agreements between multi-dimensional predictions. This is a very different computation from simply adding up evidence in favor of a binary hypothesis or combining weighted inputs to compute some scalar property of the world. It’s much more robust to noise, much better for dealing with viewpoint changes and much better at performing segmentation (by grouping together multi-dimensional predictions that agree).
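To make "finding sharp agreements between multi-dimensional predictions" concrete, here is a minimal toy sketch. It is not the capsule algorithm, which the answer above does not spell out. The function name `sharpest_agreement`, the `tolerance` parameter and the example vectors are all hypothetical, chosen only for illustration. The point it shows is that tight agreement among several multi-dimensional predictions is unlikely to happen by chance, so the agreeing group can be separated from the outliers.

```python
import math

def sharpest_agreement(predictions, tolerance):
    """Return the largest group of predictions within `tolerance` of one member.

    Each prediction is a tuple of floats (hypothetical example: a pose vector).
    This is an illustrative brute-force search, not a learned routing procedure.
    """
    best = []
    for centre in predictions:
        # Collect every prediction that lies close to this candidate centre.
        group = [p for p in predictions if math.dist(centre, p) <= tolerance]
        if len(group) > len(best):
            best = group
    return best

# Three predictions that nearly coincide, plus two unrelated ones (made-up values).
votes = [
    (1.00, 2.00, 0.50),
    (1.02, 1.98, 0.51),
    (0.99, 2.01, 0.49),
    (4.00, -1.00, 3.00),
    (-2.00, 0.30, 1.10),
]
agreeing = sharpest_agreement(votes, tolerance=0.1)
print(len(agreeing), "of", len(votes), "predictions agree")
```

Grouping the agreeing predictions and ignoring the rest is also what makes this kind of computation useful for segmentation: the outliers presumably belong to something else in the scene.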

  • What are your thoughts on DeepMind’s Neural Turing Machines? Is this a promising approach?

The NTM is a great model. It’s very impressive that they can get an RNN to invent a sorting algorithm. It’s the first time I’ve believed that deep learning would be able to do real reasoning in the not too distant future. There will be a lot of future work in making the NTM (or its descendants) learn much more complicated algorithms and it will probably have many applications. Given where it was developed, I think it’s a good bet that it will be combined with reinforcement learning.

  • Here are some of my beliefs about the brain that have made a big difference to the kinds of machine learning I have done

The cortex is pretty much the same all over and if parts are lost early, other parts can take on the functions they would have implemented. This suggests it’s really worth taking a bet on there being a general purpose learning procedure.

The brain is clearly using distributed representations.

The brain does complex tasks like object recognition and sentence understanding with surprisingly little serial depth to the computation. So artificial neural nets should do the same.

The brain has about 10^14 synapses and we only live for about 10^9 seconds. So we have a lot more parameters than data. This motivates the idea that we must do a lot of unsupervised learning since the perceptual input (including proprioception) is the only place we can get 10^5 dimensions of constraint per second.
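The arithmetic behind that argument can be checked in a few lines. This is only a back-of-the-envelope sketch using the rough figures quoted above (about 10^14 synapses and about 10^9 seconds of life); the function name is hypothetical.

```python
# Rough figures quoted in the text; both are order-of-magnitude estimates.
SYNAPSES = 10**14          # approximate number of synapses (parameters)
LIFETIME_SECONDS = 10**9   # approximate length of a life in seconds

def constraint_needed_per_second(n_params: int, n_seconds: int) -> float:
    """Number of parameters that each second of experience has to constrain."""
    return n_params / n_seconds

rate = constraint_needed_per_second(SYNAPSES, LIFETIME_SECONDS)
print(f"{rate:.0e} dimensions of constraint per second")  # prints 1e+05 ...
```

A supervised signal supplies far fewer bits per second than this, which is why the argument points to the raw perceptual input as the only source rich enough.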

Roughly speaking, spikes are noisy samples from an underlying Poisson rate. Over the short time periods involved in perception, this is an incredibly noisy code. One of the motivations for the idea of dropout was that very noisy spikes are a good way to get a very strong regularizer that can help the brain deal with the fact that it has thousands of times more parameters than experiences.
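For readers who have not seen dropout, a minimal sketch of the standard "inverted dropout" operation follows. It is an assumption-laden illustration of the regularizer mentioned above, not a model of spiking neurons: each unit is silenced at random, and the survivors are rescaled so the expected activation is unchanged. The names `dropout`, `drop_prob` and `hidden` are chosen here for illustration.

```python
import random

def dropout(activations, drop_prob, rng):
    """Inverted dropout: zero each unit with probability `drop_prob`.

    Surviving activations are divided by the keep probability so that the
    expected value of each output equals the corresponding input.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)            # seeded so repeated runs give the same mask
hidden = [0.5, 1.0, 1.5, 2.0]     # made-up hidden-unit activations
noisy = dropout(hidden, drop_prob=0.5, rng=rng)
print(noisy)
```

Like a very noisy spike code, the output on any single pass is a poor estimate of the underlying activation, but averaged over many passes it is unbiased.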

Over a short time period, a neuron really is a binary all-or-none device (so far as other neurons are concerned). This was one of the motivations behind Boltzmann machines. Another was the paper by Crick and Mitchison suggesting that we do unlearning during sleep. There now seems to be quite a lot of evidence for this.

  • What do you think about the work of Numenta and Vicarious, startups that claim to do cortical-based learning?

I have not been following what Vicarious or Numenta have been doing recently. When they can solve a problem that no one was able to solve before, I’ll take notice.

I think Jeff Hawkins has good intuitions and a very sensible goal, but I do not think he has nearly as much experience at developing machine learning systems that actually work as someone like Yann LeCun. You could say this experience is irrelevant to understanding the brain but I do not agree. I am in the camp that believes in developing artificial neural nets that work really well and then making them more brain-like when you understand the computational advantages of adding an additional brain-like property. For example, if someone (maybe Sebastian Seung?) can show me a good computational reason for never allowing a synaptic weight to change sign, I’d be happy to add that restriction to my models. But currently it just makes the models work worse and in these circumstances I think it’s silly to add it just to be more brain-like. It hurts the technology without advancing the science. Another example is my current work on capsules. I now think I understand why a linear filter followed by a scalar non-linearity (and possibly preceded by multiplicative interactions with the outputs of other linear filters or neurons) is NOT the right computation to be doing in the later stages of a sensory pathway. So I am very happy to experiment with group non-linearities that can implement multi-dimensional coincidence filtering.
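As an illustration of what "never allowing a synaptic weight to change sign" could mean in a model, here is one simple way to impose it: take an ordinary gradient step, then clamp any weight that would cross zero. This is a hypothetical sketch, not something described in the answer above. The function name, the fixed `signs` list (+1 for excitatory, -1 for inhibitory) and the example numbers are all assumptions.

```python
def sign_constrained_step(weights, signs, grads, lr):
    """One gradient-descent step that never lets a weight leave its assigned sign.

    `signs` holds +1 or -1 per weight. A step that would push a weight across
    zero is clamped to exactly zero instead.
    """
    updated = []
    for w, s, g in zip(weights, signs, grads):
        candidate = w - lr * g                    # unconstrained update
        updated.append(candidate if s * candidate >= 0 else 0.0)
    return updated

weights = [0.2, -0.4, 0.05]   # made-up starting weights
signs = [1, -1, 1]            # first and third excitatory, second inhibitory
grads = [1.0, -1.0, 1.0]      # made-up gradients
new_weights = sign_constrained_step(weights, signs, grads, lr=0.1)
print(new_weights)            # the third weight is clamped to 0.0
```

The clamp discards part of the update whenever it fires, which is one way to see why such a restriction can make a model train worse unless it brings some compensating computational benefit.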