# d539: ICML 2017 DeepMind papers

ICML 2017 DeepMind papers

##### Decoupled Neural Interfaces using Synthetic Gradients

Authors: Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David Silver, Koray Kavukcuoglu

When training neural networks, the modules (layers) are locked: they can only be updated after backpropagation. We remove this constraint by incorporating a learnt model of error gradients, Synthetic Gradients, which means we can update networks without full backpropagation. We show how this can be applied to feed-forward networks which allows every layer to be trained asynchronously, to RNNs which extends the time over which models can remember, and to multi-network systems to allow communication.

For further details and related work, please see the paper.

##### Parallel Multiscale Autoregressive Density Estimation

Authors: Scott Reed, Aäron van den Oord, Nal Kalchbrenner, Ziyu Wang, Dan Belov, Nando de Freitas

The parallel multiscale autoregressive density estimator generates high-resolution (512 by 512) images, with orders of magnitude speedup over other autoregressive models. We evaluate the model on class-conditional image generation, text-to-image synthesis, and action-conditional video generation, showing that our model achieves the best results among non-pixel-autoregressive density models that allow efficient sampling.

For further details and related work, please see the paper.

##### Understanding Synthetic Gradients and Decoupled Neural Interfaces

Authors: Wojtek Czarnecki, Grzegorz Świrszcz, Max Jaderberg, Simon Osindero, Oriol Vinyals, Koray Kavukcuoglu

Synthetic gradients has been shown to work empirically in both feed-forward and recurrent cases. This work focuses on why and how it actually works – it shows that under mild assumptions critical points are preserved and that in the simplest case of linear model, synthetic gradients based learning does converge to the global optimum. On the other hand, we present empirically that trained models might be qualitatively different from those obtained using backpropagation.

For further details and related work, please see the paper.

##### Minimax Regret Bounds for Reinforcement Learning

Authors: Mohammad Gheshlaghi Azar, Ian Osband, Remi Munos

We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of order (HSAT)1/2  (up to a logarithmic factor) where H is the time horizon, S the number of states, A the number of actions and T the number of time-steps. This result improves over the best previous known bound HS(AT)1/2 achieved by the UCRL2 algorithm of [Jaksch, Ortner, Auer, 2010]. The key significance of our new results is that for large T, the sample complexity of our algorithm matches the optimal lower bound of Ω(HSAT)1/2. Our analysis contains two key insights. We use careful application of concentration inequalities to the optimal value function as a whole, rather than to the transitions probabilities (to improve scaling in S), and we define Bernstein-based “exploration bonuses” that use the empirical variance of the estimated values at the next states (to improve scaling in H).

For further details and related work, please see the paper.

##### Video Pixel Networks

Authors: Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals,Alex Graves, Koray Kavukcuoglu

Predicting the continuation of frames in a video is a hallmark task in unsupervised learning. We present a video model, the VPN, that is probabilistic and that is able to make accurate and sharp predictions of future video frames. The VPN achieves, for the first time, a nearly perfect score on the Moving MNIST dataset and produces plausible futures of up to 18 frames of robotic arm movements.

For further details and related work, please see the paper.

##### Sharp Minima Can Generalize For Deep Nets

Authors: Laurent Dinh (Univ. Montreal), Razvan Pascanu, Samy Bengio (Google Brain), Yoshua Bengio (Univ. Montreal)

Empirically, it has been observed that deep networks generalise well, even when they have the capacity to overfit the data. Additionally, it seems that stochastic gradient descent results in models that generalise better than batch method. One hypothesis for explaining this phenomena is that the noise of SGD helps model to find wide minina which generalise better than sharp (narrow) minima. In this work we try to improve our understanding of this hypothesis. We show that it does not hold for proposed definitions of wideness or sharpness due to the structure of neural networks. This suggest that there is no causality connection between batchsize size and generalisation.

For further details and related work, please see the paper.

##### Why is Posterior Sampling Better than Optimism for Reinforcement Learning?

Authors: Ian Osband, Benjamin Van Roy

Computational results demonstrate that posterior sampling for reinforcement learning (PSRL) dramatically outperforms existing algorithms driven by optimism, such as UCRL2. We provide insight into the extent of this performance boost and the phenomenon that drives it. We leverage this insight to establish an $\tilde{O}(H\sqrt{SAT})$ Bayesian regret bound for PSRL in finite-horizon episodic Markov decision processes. This improves upon the best previous Bayesian regret bound of $\tilde{O}(H S \sqrt{AT})$ for any reinforcement learning algorithm. Our theoretical results are supported by extensive empirical evaluation.

For further details and related work, please see the paper.

##### DARLA: Improving Zero-Shot Transfer in Reinforcement Learning

Authors: Irina Higgins*, Arka Pal*, Andrei Rusu, Loic Matthey, Chris Burgess, Alexander Pritzel, Matt Botvinick, Charles Blundell, Alexander Lerchner

Modern deep reinforcement learning agents rely on large quantities of data to learn how to act. In some scenarios, such as robotics, obtaining a lot of training data may be infeasible. Hence such agents are often trained on a related task where data is easy to obtain (e.g. simulation) with the hope that the learnt knowledge will generalise to the task of interest (e.g. reality). We propose DARLA, a DisentAngled Representation Learning Agent, that exploits its interpretable and structured vision to learn how to act in a way that is robust to various novel changes in its environment – including a simulation to reality transfer scenario in robotics. We show that DARLA significantly outperforms all baselines, and that its performance is crucially dependent on the quality of its vision.

For further details and related work, please see the paper.

##### Automated Curriculum Learning for Neural Networks

Authors: Alex Graves, Marc G. Bellemare, Jacob Menick, Koray Kavukcuoglu, Remi Munos

As neural networks are applied to ever more complex problems, the need for efficient curriculum learning becomes more pressing. However, designing effective curricula is difficult and typically requires a large amount of hand-tuning. This paper uses reinforcement learning to automate the path, or syllabus, followed by the network through the curriculum so as to maximise the overall rate of learning progress. We consider nine different progress indicators, including a novel class of complexity-gain signal. Experimental results on three problems show that an automatically derived syllabus can lead to efficient curriculum learning, even on data (such as the bAbI tasks) that were not explicitly designed for curriculum learning.

For further details and related work, please see the paper.

Authors: Yutian Chen, Matthew Hoffman, Sergio Gomez, Misha Denil, Timothy Lillicrap, Matthew Botvinick , Nando de Freitas

We learn recurrent neural network optimisers trained on simple synthetic functions by gradient descent. The learned optimisers exhibit a remarkable degree of transfer in that they can be used to efficiently optimise a broad range of derivative-free black-box problems, including continuous bandits, control problems, global optimization benchmarks and hyper-parameter tuning tasks.

For further details and related work, please see the paper.

##### A Distributional Perspective on Reinforcement Learning

Authors: Marc G. Bellemare*, Will Dabney*, Remi Munos

We argue for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning agent. This is in contrast to the common approach to reinforcement learning which models the expectation of this return, or value. Although there is an established body of literature studying the value distribution, thus far it has always been used for a specific purpose such as implementing risk-aware behaviour. We begin with theoretical results in both the policy evaluation and control settings, exposing a significant distributional instability in the latter. We then use the distributional perspective to design a new algorithm which applies Bellman’s equation to the learning of approximate value distributions. We evaluate our algorithm using the suite of games from the Arcade Learning Environment. We obtain both state-of-the-art results and anecdotal evidence demonstrating the importance of the value distribution in approximate reinforcement learning. Finally, we combine theoretical and empirical evidence to highlight the ways in which the value distribution impacts learning in the approximate setting.

For further details and related work, please see the blog post and the paper.

##### A Laplacian Framework for Option Discovery in Reinforcement Learning

Authors: Marlos Machado (Univ. Alberta), Marc G. Bellemare, Michael Bowling

Representation learning and option discovery are two of the biggest challenges in reinforcement learning (RL). Proto-value functions (PVFs) are a well-known approach for representation learning in MDPs. In this paper we address the option discovery problem by showing how PVFs implicitly define options. We do it by introducing eigenpurposes, intrinsic reward functions derived from learned representations. The options discovered from eigenpurposes traverse the principal directions of the state space. They are useful for multiple tasks because they are discovered without taking the environment’s rewards into consideration. Moreover, different options act at different time scales, making them helpful for exploration. We demonstrate features of eigenpurposes in traditional tabular domains as well as in Atari 2600 games.

For further details and related work, please see the paper.

##### Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders

In this paper, we introduce a powerful new WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform. We also introduce NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets. Using NSynth, we demonstrate improved qualitative and quantitative performance of the WaveNet autoencoder over a well-tuned spectral autoencoder baseline. Finally, we show that the model learns a manifold of embeddings that allows for morphing between instruments, meaningfully interpolating in timbre to create new types of sounds that are realistic and expressive.

For further details and related work, please see the paper.

##### Cognitive Psychology for Deep Neural Networks: A Shape Bias Case Study

Authors: Samuel Ritter*, David Barrett*, Adam Santoro, Matt Botvinick

Deep neural networks (DNNs) have achieved unprecedented performance on a wide range of tasks, rapidly outpacing our understanding of the nature of their solutions. In this work, we propose to address this interpretability problem in modern DNNs using the problem descriptions, theories and experimental methods developed of cognitive psychology. In a case study, we apply a theory and method from the psychology of human word learning to better understand how modern one-shot learning systems work. Results revealed not only that our DNNs exhibit the same inductive bias as humans, but also several unexpected features of the DNNs.

For further details and related work, please see the paper.

##### Count-Based Exploration with Neural Density Models

Authors: Georg Ostrovski, Marc Bellemare, Aaron van den Oord, Remi Munos

Count-based exploration based on prediction gain of a simple graphical density model has previously achieved  state-of-the-art results on some of the hardest exploration games in Atari. We investigate the open questions 1) whether a better density model leads to better exploration, and 2) what role the mixed Monte Carlo update rule used in this work plays for exploration. We show that a neural density model – PixelCNN – can be trained online on the experience stream of an RL agent and used for count-based exploration to achieve even better results on a wider set of hard exploration games, while preserving higher performance on easy exploration games. We also show that the Monte Carlo return is crucial in making use of the intrinsic reward signal in the sparsest reward settings, and cannot easily be replaced by a softer lambda-return update rule.

For further details and related work, please see the paper.

##### The Predictron: End-to-End Learning and Planning

Authors: David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, Thomas Degris

One of the key challenges of artificial intelligence is to learn models that are effective in the context of planning. In this document we introduce the predictron architecture. The predictron consists of a fully abstract model, represented by a Markov reward process, that can be rolled forward multiple “imagined” planning steps. Each forward pass of the predictron accumulates internal rewards and values over multiple planning depths. The predictron is trained end-to-end so as to make these accumulated values accurately approximate the true value function. We applied the predictron to procedurally generated random mazes and a simulator for the game of pool. The predictron yielded significantly more accurate predictions than conventional deep neural network architectures.

For further details and related work, please see the paper.

##### FeUdal Networks for Hierarchical Reinforcement Learning

Authors: Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Hees, Max Jaderberg, David Silver, Koray Kavukcuoglu

How to create agents that can learn to decompose their behaviour into meaningful primitives and then reuse them to more efficiently acquire new behaviours is a long standing research question. The solution to this question may be an important stepping stone towards agents with general intelligence and competence. This paper introduced FeUdal Networks (FuN), a novel architecture that formulates sub-goals as directions in latent state space, which, if followed, translates into a meaningful behavioural primitives. FuN clearly separates the module that discovers and sets sub-goals from the module that generates behaviour through primitive actions. This creates a natural hierarchy that is stable and allows both modules to learn in complementary ways. Our experiments clearly demonstrate that this makes long-term credit assignment and memorisation more tractable. This also opens many avenues for further research, for instance: deeper hierarchies can be constructed by setting goals at multiple time scales, scaling agents to truly large environments with sparse rewards and partial observability.

For further details and related work, please see the paper.

##### Neural Episodic Control

Authors: Alex Pritzel, Benigno Uria, Sriram Srinivasan, Adria Puigdomenech, Oriol Vinyals, Demis Hassabis, Daan Wierstra, Charles Blundell

Deep reinforcement learning algorithms have achieved state of the art performance on a variety of tasks, however they tend to be grossly data inefficient. In this work we propose a novel algorithm that allows rapid incorporation of new information collected by the agent. For this we introduce a new differentiable data structure, a differentiable neural dictionary, that can incorporate new information immediately, while being able to update it’s internal representation based on the task the algorithm is supposed to solve. Our agent, Neural Episodic Control, is built on top of the differentiable data structure and is able to learn significantly faster across a wide range of environments.

For further details and related work, please see the paper.