This is an overview of most of the lectures I found illuminating at the NIPS 2015. This was my first time there and more than the extraordinary research, I was really impressed with how open and friendly everyone was. From elevator conversations with Prof. LeCun to drinks with David from the FB translation team and dinner with Doug from Google Brain, these and similar encounters were the highlights for me.

Frank Wood on Probabilistic Programming

  • The CS way of doing Stats is probabilistic programming
  • Functional or Imperative programs with
    • ability to draw values at random from distributions
    • ability to condition values of vars in a program via observations
  • Goals
    • Increase programmer productivity
    • Commodify inference
    • New kinds of models
      • In graphical models, languages are BUGS and STAN
      • In factor graphs, languages are Factorie,, and Figaro
      • Presented mostly in Anglican, a clojure syntax PP language.

Andrew Ng in Panel:

  • Superforecasting the book: No one can see more than 5 yrs ahead.
  • I had to learn to see more than 1 year ahead and it’s unbelievably advantageous to see more than 2 yrs ahead.
  • We are running out of supervised data and figuring out how to utilize unsupervised data will be very valuable.
  • What are the domains where we’ll run out of supervised data faster than others?

Demis Hassabis Talk:

  • DeepMind: Learn automatically from raw inputs and Generalize across domains
  • Grounded cognition: from sensorimotor data stream to actions
  • Games: unlimited training data, no testing bias, parallel testing, measurable progress
  • Game Research was inspired by hippocampal replay in the human brain
  • Neuroscience: Two purposes:
    • Provides direction/inspiration for new algos/architecutres/analysis
    • Validation testing: does the also constitute viable component in AGI? How much effort should be put into X?
  • If we can point to a system in the brain that does X then it gives a lot of confidence that this is the right path.
  • Points to interesting work in Walker et al (2007) in PNAS on hippocampal understanding of transitive inference (A>C, C>D -> A>D)
  • Can rats imagine?
    • Place cells are neurons in rat hippocampus that determine where in a box the rat is
    • When rat sleeps, trajectories of place cells are replayed.
    • They placed food for the rat just outside of its reach but still within its vision (around the corner and behind a barrier). Then when the rat was asleep, it replayed the steps and thought it had gone to the food, but it never actually did. This showed that rats are imagining.

Josh Tenenbaum: Building machines that learn like humans

  • Intelligence is more than just pattern recognition; It’s about modeling the world.
  • Say we had a physics learning engine and a memory engine. Is it reasonable to expect them to be integrated as the same engine or blocks that were assembled together and act mostly modularly?
  • Has been doing a lot with one-shot learning (how we can learn so quickly and generalize from just a few examples).
  • See Lake, Salakhutdinov, Tenenbaum (2015) in Science on Character Learning.
  • Been using the Omniglot database for training, which is a transpose of MNIST. Instead of 1000 examples of each of 10 categories, it’s 10 examples of each of a 1000 categories.

Larry Heck on Future of Conversation Systems

  • Seq2Seq works well but for chit-chat, not so well for content-rich dialog, and even worse for goal-oriented dialog
  • Need to expand to not be domain-dependent so that people don’t think twice and instead reach for search.

Li Deng: deep unsupervised learning for SLU

  • How can we explore 300k hrs of speech acoustics that have no paired labels? There’s only 3k of paired acoustic and word labels.

Bengio: small steps toward bio-plausible deep learning

  • How does the brain do Credit Assignment?
  • Spike-timing dependent plasticity is worth studying. It is explained by a learning rule.
  • Neuron is likely a leaky integrator
  • Shows that denoting auto encoder w reconstruction function R(s) converges towards R(s) - s = gradient of energy
  • Open Qs:
    • How about when not at a fixed point?
    • How to handle unsupervised case?
    • What if we do not have a true energy function (only approximate symmetry)?
  • Mentioned a few times that this is only a part of the story. Doesn’t cover much of anything, just one part and probably not right yet.
  • “In ML, we are doing a wide exploration for unsupervised learning. Why not look at what the brain does to aid that search?”

Pulkin: Do CNNs approximate what’s happening in Brain?

  • Found a reasonable amount of correlation tying the levels of hierarchy (of 7) in a CNN to the levels of hierarchy in the brain’s vision system.

LeCun: How the hell do we do unsupervised learning?

  • We are in the same situation as Physics has with Dark Energy. 95% of learning is unsupervised and we have no idea how to do it.
  • Boltzmann Machines don’t work because they don’t scale.
  • Check out What-Were Autoencoders (“Zhao, Lecun” —> Stacked What Were Autoencoders paper on Arxiv)

Neil Lawrence: Modeling the Brain

  • Understanding one neuron and then simulating the brain is insufficient although it’s persuasive enough to get you 1B euros.
  • DL is equivalent to System 1. Fitting a mechanistic model is equivalent to System 2. The robotics switch to using DL for supervised learning is an example.
  • System 1 is the elephant guiding the System 2 boy, i.e. it’s doing most of the work and is in control even though the boy thinks he is in control.
  • System 1 can deal with uncertainty; System 2 cannot.

Panel with Neil Lawrence, LeCun, Bengio, Konrad Kording, Surya Ganguli, and Matthias Bethge.

  • LeCun:
    • Need to learn planning and memory because we can do translation, etc
    • Babies learn perception before motor control
    • The progression of going from now to figuring out more stuff is guided by where people are saying that the current boundaries are. His example was Gary Marcus frequently telling him that “You can’t do that yet in DL.” And then 6 months later they do it and then he says another gotcha and they go tackle that.
  • Ganguli:
    • We have 10^10 neurons and 10^6 optical intake, so the likely view is that we have an internal state that we occasionally check to make sure that the outside world aligns.
  • Bengio and LeCun told the neuroscientists to stop researching the neuron because it wasn’t helpful. So Konrad asked them “What do you want Neuroscience to do to help you?”
    • Bengio: “Help us figure out how the brain learns.”
    • Lecun: “And it’s not STDP. That’s probably a consequence of something else. Another one - what the hell are all these feedback connections used for? Are they necessary?”
    • Bengio: “If we consider a shorter time scale, what is the role of systems of neurons evolving over time?”
    • This was super interesting. They said more and had concrete experiments they wanted done.
  • Suriya: What if the objective function is a straitjacket that we should reconsider considering? In the brain, most all objective functions are local.”
    • LecUN: “Backpropagation is local.” —> See Bengio’s TargetProp paper
  • Bengio: We need more people at the interface studying Neuro and ML. Right now, very few people at that juncture.

Graves: Rise of Differentiable Attention

  • Presentation on different attention papers.
  • Pointed out Neural Programmer Interpreters (Reed 2015), DRAW paper (Gregor 2015), and Spatial Transformer Networks (Jaderberg at NIPS 2015)

Cho: NMT

  • Didn’t go into the details of NMT, which was good.
  • Instead described three issues / directions for Translation that need to be addressed:
    • Word-level
    • Sentence-wise
    • Bilingual
  • Word-level: go to subword level
    • “Run”, “Running”, etc are all diff words but shouldn’t be treated that way
    • Problem is that meaning is highly nonlinear over the character set —> “quiet”, “quite”, “quit” start the same but mean very different things
    • Computation complexity increases greatly when using character-level
    • Papers here include Wang Ling (2015) and Gillick et all 2015 (Unicode)
    • Unicode doesn’t work so well for all languages, e.g. Korean
  • Sentence-wise: Need context greater than sentence
    • Wang and Cho, 2015
    • Chris Dyer doing great work here out of CMU
    • Serbian (2015), Sardoni (2015) are each doing hierarchical modeling for dialogue
    • Also need to blend world knowledge with context … no good ideas how to do that
  • Multilingual Translation: most interesting of the three that he’s been looking at
    • Does knowing Language F help with Language E? Unsolved for both humans and machines
    • Do one NxM language model to avoid quadratic explosion of # of parameters
    • Long et al (2015): Multilingual Seq2Seq
    • Is this an interlingua?
    • Is attention model universal or language specific?
    • This is interesting because multilingual corresponds to multimodal and multitask

Gary Marcus Talk

  • “We have language translators (that can’t answer anything else)”
  • Wrote a paper with Marblestone and Tom Dean in Science —> check out the Arxiv supplement for better coverage
  • His book from 2001 “The Algebraic Mind” presents a candidate neuro-ontology.
  • Can one capture the richness of human language and thought from a reduced set of neurocognitive primitives?
  • Variables and Operations —> How to encode? They give you open-ended generalizations.
  • Newborns can see simple patterns like “wo wo fe” + “be be ca” —> “ta ta do” and NOT “do ta ta”
  • Opposite approach: Canonical Cortex View
  • His conjecture: The cortex consists of a heterogenous set of basic circuit types, the FPGA model of the brain
  • Half of the genome is expressed in the brain. Do we really need that many genes just to build an MLP? Why are we still assuming that we don’t want structure?
  • Wants to approach it with a rich set of basic computational circuit types –> Memory Networks, NTMs, NPI all good steps


  • Lecunn: not aware of climate change and deep learning being done together. Suggested the questioner try it.
  • Pascal Vincent gave a great talk, really interesting, on a way to speed up neural networks. He looked at the underlying calculations in doing the feed forward and found that by making two tricks he could speed it up by a factor of D/4d, where D is the output size and little d is the input size. Only applies to things that fit his spherical objective definition.