Best Deep Reinforcement Learning Research of 2019
Since my mid-2019 report on the state of deep reinforcement learning (DRL) research, much has happened to accelerate the field further. Read my previous article for a bit of background, a brief overview of the technology, a comprehensive survey paper reference, and some of the best research papers at that time. For this article, I’ve gone back to my favorite source, the arXiv.org pre-print server, surveyed the research centered on DRL through the end of the year, and picked out some of my favorite deep reinforcement learning papers of 2019. Working through this list is a great way to start off the New Year!
[Related Article: Best Deep Reinforcement Learning Research of 2019 So Far]
Model-Based Reinforcement Learning for Atari
Model-free reinforcement learning (RL) can be used to learn effective policies for complex tasks, such as Atari games, even from image observations. However, this typically requires very large amounts of interaction, substantially more than a human would need to learn the same games. How can people learn so quickly? Part of the answer may be that people can learn how the game works and predict which actions will lead to desirable outcomes. This paper explores how video prediction models can similarly enable agents to solve Atari games with fewer interactions than model-free methods. It describes Simulated Policy Learning (SimPLe), a complete model-based deep RL algorithm built on video prediction models, and presents a comparison of several model architectures, including a novel architecture that yields the best results in the proposed setting. Experiments evaluate SimPLe on a range of Atari games in a low-data regime of 100K interactions between the agent and the environment, which corresponds to two hours of real-time play. In most games, SimPLe outperforms state-of-the-art model-free algorithms, in some cases by over an order of magnitude.
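To make the alternating structure concrete, here is a heavily simplified sketch of a SimPLe-style loop: collect a little real experience, fit a model of the environment, then train the policy entirely inside that model. The toy environment, the lookup-table "world model," and the trivial policy-improvement step are stand-ins of my own; the actual paper uses Atari frames, a video prediction network, and PPO.

```python
# A heavily simplified sketch of a SimPLe-style training loop. The toy
# environment, the lookup-table "world model", and the trivial policy
# improvement step are all stand-ins for illustration; the paper uses Atari
# frames, a video prediction network, and PPO.
import random

class ToyEnv:
    """Tiny stand-in environment: move right (action 1) to reach state 5."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, action):                      # action in {0: stay, 1: right}
        self.s = min(self.s + action, 5)
        done = self.s == 5
        return self.s, float(done), done

def collect_real_experience(env, policy, n_steps):
    """Step 1: interact with the *real* environment under the current policy."""
    data, s = [], env.reset()
    for _ in range(n_steps):
        a = policy(s)
        s_next, r, done = env.step(a)
        data.append((s, a, r, s_next))
        s = env.reset() if done else s_next
    return data

def fit_world_model(data):
    """Step 2: fit a dynamics model on collected transitions
    (SimPLe trains a video prediction network; a lookup table suffices here)."""
    return {(s, a): (s_next, r) for s, a, r, s_next in data}

def improve_policy_in_model(model, n_imagined_samples=500):
    """Step 3: improve the policy using the learned model only ("in imagination").
    SimPLe runs PPO on imagined rollouts; this toy version just scores actions."""
    value = {0: 0.0, 1: 0.0}
    for _ in range(n_imagined_samples):
        s, a = random.choice(list(model))
        _, r = model[(s, a)]
        value[a] += r
    best_action = max(value, key=value.get)
    return lambda state: best_action

env = ToyEnv()
policy = lambda state: random.choice([0, 1])     # start from a random policy
for _ in range(3):                               # alternate the three steps
    real_data = collect_real_experience(env, policy, n_steps=100)
    world_model = fit_world_model(real_data)
    policy = improve_policy_in_model(world_model)
```

The essential point is that steps 2 and 3 consume no additional real interactions, which is where the sample-efficiency gain comes from.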
A Survey and Critique of Multiagent Deep Reinforcement Learning
Deep reinforcement learning (RL) has achieved outstanding results in recent years, leading to a dramatic increase in the number of applications and methods. Recent works have explored learning beyond single-agent scenarios and have considered multiagent learning (MAL) scenarios. Initial results report successes in complex multiagent domains, although several challenges remain to be addressed. The primary goal of this paper is to provide a clear overview of the current multiagent deep reinforcement learning (MDRL) literature. Additionally, the researchers complement the overview with a broader analysis: (i) they revisit key components, originally presented in MAL and RL, and highlight how they have been adapted to multiagent deep reinforcement learning settings; (ii) they provide general guidelines for new practitioners in the area, describing lessons learned from MDRL works, pointing to recent benchmarks, and outlining open avenues of research; and (iii) they take a more critical tone, raising practical challenges of MDRL (e.g., implementation and computational demands). The article aims to unify and motivate future research to take advantage of the abundant literature that already exists (e.g., in RL and MAL) in a joint effort to promote fruitful research in the multiagent community.
Dota 2 with Large Scale Deep Reinforcement Learning
On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game. The game of Dota 2 presents novel challenges for AI systems, such as long time horizons, imperfect information, and complex, continuous state-action spaces, challenges that will become increasingly central to more capable AI systems. OpenAI Five leveraged existing reinforcement learning techniques, scaled to learn from batches of approximately 2 million frames every 2 seconds. The authors developed a distributed training system and tools for continual training that allowed them to train OpenAI Five for 10 months. By defeating the Dota 2 world champions (Team OG), OpenAI Five demonstrates that self-play reinforcement learning can achieve superhuman performance on a difficult task.
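As a rough illustration of the rollout-worker / central-learner pattern such a system is built around, here is a schematic sketch. The thread-based workers, the in-process queue, and the pretend "policy update" are all illustrative assumptions of mine; OpenAI's actual system was a large distributed cluster, not a single process.

```python
# Schematic sketch of a self-play rollout-worker / central-learner split.
# The thread-based workers, in-process queue, and pretend "policy update" are
# illustrative assumptions, not OpenAI's actual infrastructure.
import queue
import random
import threading

experience_queue = queue.Queue()

def rollout_worker(worker_id, n_games=100):
    """Plays self-play games and ships observed frames to the learner."""
    for _ in range(n_games):
        frames = [random.random() for _ in range(32)]   # stand-in for game frames
        experience_queue.put((worker_id, frames))

def learner(n_updates=5, batch_size=2048):
    """Consumes large batches of frames and periodically publishes a new policy."""
    batch, updates = [], 0
    while updates < n_updates:
        try:
            batch.extend(experience_queue.get(timeout=1)[1])
        except queue.Empty:
            break                                # workers are done; stop
        if len(batch) >= batch_size:             # OpenAI Five: ~2M frames per batch
            updates += 1                         # stand-in for a gradient step + weight broadcast
            batch.clear()

threads = [threading.Thread(target=rollout_worker, args=(i,)) for i in range(4)]
threads.append(threading.Thread(target=learner))
for t in threads:
    t.start()
for t in threads:
    t.join()
```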
Generative Adversarial Imagination for Sample Efficient Deep Reinforcement Learning
Reinforcement learning has seen great advances in the past five years. The successful introduction of deep learning in place of more traditional methods allowed reinforcement learning to scale to very complex domains, achieving super-human performance in environments like the game of Go and numerous video games. Despite great successes in multiple domains, these new methods suffer from issues that often make them inapplicable to real-world problems. An extreme lack of data efficiency, along with high variance and the difficulty of enforcing safety constraints, is among the most prominent issues in the field: millions of data points sampled from the environment are usually necessary for these algorithms to converge to acceptable policies. This paper proposes a novel Generative Adversarial Imaginative Reinforcement Learning algorithm. It takes advantage of recent, highly effective generative adversarial models, and of the Markov property that underpins the reinforcement learning setting, to model the dynamics of the real environment within an internal imagination module. Rollouts from the imagination module are then used to simulate the real environment in a standard reinforcement learning process, avoiding the often expensive and dangerous trial and error in the real environment. Experimental results show that the proposed algorithm uses experience from the real environment more economically than the current state-of-the-art Rainbow DQN algorithm, and thus takes an important step towards sample-efficient deep reinforcement learning.
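For intuition, here is a hedged sketch (assuming PyTorch) of the core mechanism the summary describes: a transition generator trained adversarially so that imagined rollouts are hard to distinguish from real ones. The 4-dimensional state, the one-hot 2-action encoding, the network sizes, and the plain GAN losses are arbitrary choices of mine, not details from the paper.

```python
# Hedged sketch of an adversarially trained dynamics ("imagination") model.
# The 4-dim state, one-hot 2-action encoding, network sizes, and plain GAN
# losses are arbitrary choices for illustration, not details from the paper.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, NOISE_DIM = 4, 2, 8

# Generator: (state, one-hot action, noise) -> imagined next state
generator = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM + NOISE_DIM, 64),
                          nn.ReLU(), nn.Linear(64, STATE_DIM))
# Discriminator: (state, action, next state) -> probability the transition is real
discriminator = nn.Sequential(nn.Linear(2 * STATE_DIM + ACTION_DIM, 64),
                              nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

def adversarial_model_update(states, actions, next_states):
    """One GAN-style update of the imagination module on a batch of real transitions."""
    noise = torch.randn(states.size(0), NOISE_DIM)
    imagined_next = generator(torch.cat([states, actions, noise], dim=1))

    # Discriminator: push real transitions toward 1, imagined transitions toward 0.
    real_score = discriminator(torch.cat([states, actions, next_states], dim=1))
    fake_score = discriminator(torch.cat([states, actions, imagined_next.detach()], dim=1))
    d_loss = bce(real_score, torch.ones_like(real_score)) + \
             bce(fake_score, torch.zeros_like(fake_score))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: make imagined next states indistinguishable from real ones.
    g_score = discriminator(torch.cat([states, actions, imagined_next], dim=1))
    g_loss = bce(g_score, torch.ones_like(g_score))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

# Rollouts produced by `generator` can then feed a standard RL update (e.g. DQN)
# in place of stepping the expensive or dangerous real environment.
```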
A Survey of Deep Reinforcement Learning in Video Games
Deep reinforcement learning (DRL) has made great achievements since it was first proposed. Generally, DRL agents receive high-dimensional inputs at each step and take actions according to deep-neural-network-based policies. This learning mechanism updates the policy to maximize the return in an end-to-end manner. This paper surveys the progress of DRL methods, including value-based, policy gradient, and model-based algorithms, and compares their main techniques and properties. DRL also plays an important role in game artificial intelligence (AI). The paper reviews the achievements of DRL in various video games, including classic arcade games, first-person perspective games, and multi-agent real-time strategy games, from 2D to 3D and from single-agent to multi-agent settings. A large number of DRL-based video game AIs have achieved superhuman performance, but challenges remain in this domain. The paper therefore also discusses key points in applying DRL methods to this field, including exploration-exploitation, sample efficiency, generalization and transfer, multi-agent learning, imperfect information, and delayed sparse rewards, as well as some promising research directions.
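For readers new to the algorithm families the survey covers, the following compact sketch (assuming PyTorch; the networks and shapes are arbitrary placeholders) contrasts the two classic update rules: a DQN-style temporal-difference loss for value-based methods and a REINFORCE-style loss for policy-gradient methods.

```python
# Compact sketch (assuming PyTorch; networks and shapes are placeholders) of the
# two classic update rules behind the surveyed families.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, gamma = 4, 3, 0.99
q_net = nn.Linear(obs_dim, n_actions)        # value-based: estimates Q(s, a)
policy_net = nn.Linear(obs_dim, n_actions)   # policy-gradient: parameterizes pi(a | s)

def value_based_loss(s, a, r, s_next, done):
    """DQN-style TD loss: regress Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)

def policy_gradient_loss(s, a, returns):
    """REINFORCE-style loss: raise log-probabilities of actions, weighted by return."""
    log_probs = F.log_softmax(policy_net(s), dim=1).gather(1, a.unsqueeze(1)).squeeze(1)
    return -(log_probs * returns).mean()
```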
Dynamics-Aware Embeddings
This paper considers self-supervised representation learning to improve sample efficiency in reinforcement learning (RL). The authors propose a forward prediction objective for simultaneously learning embeddings of states and actions. These embeddings capture the structure of the environment’s dynamics, enabling efficient policy learning. The paper demonstrates that the action embeddings alone improve the sample efficiency and peak performance of model-free RL on control from low-dimensional states. By combining state and action embeddings, the authors achieve efficient learning of high-quality policies on goal-conditioned continuous control from pixel observations in only 1–2 million environment steps.
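A minimal sketch of what a forward-prediction objective over learned state and action embeddings can look like is shown below (assuming PyTorch). The encoder sizes, the plain MSE loss, and the stop-gradient on the target are my own simplifications rather than the paper's exact architecture or loss.

```python
# Minimal sketch (assuming PyTorch) of a forward-prediction objective over
# learned state and action embeddings. The encoder sizes, plain MSE loss, and
# stop-gradient target are simplifications, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, emb_dim = 8, 2, 16
state_encoder = nn.Linear(obs_dim, emb_dim)    # e_s = f(s)
action_encoder = nn.Linear(act_dim, emb_dim)   # e_a = g(a)
forward_model = nn.Sequential(nn.Linear(2 * emb_dim, 64), nn.ReLU(),
                              nn.Linear(64, emb_dim))

def forward_prediction_loss(s, a, s_next):
    """Predict the next state's embedding from the current state and action embeddings."""
    e_s, e_a = state_encoder(s), action_encoder(a)
    predicted_next = forward_model(torch.cat([e_s, e_a], dim=1))
    target_next = state_encoder(s_next).detach()   # stop gradients through the target
    return F.mse_loss(predicted_next, target_next)

# The learned embeddings e_s and e_a can then replace raw observations and raw
# actions as the input and output spaces of a standard RL agent.
```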
Pre-training Neural Networks with Human Demonstrations for Deep Reinforcement Learning
Deep reinforcement learning (deep RL) has achieved superior performance in complex sequential tasks by using a deep neural network as its function approximator and by learning directly from raw images. A drawback of using raw images is that deep RL must learn the state feature representation from the raw images in addition to learning a policy. As a result, deep RL can require a prohibitively large amount of training time and data to reach reasonable performance, making it difficult to use deep RL in real-world applications, especially when data is expensive. This paper proposes a method to speed up training by addressing half of what deep RL is trying to solve: learning features. The approach is to learn some of the important features by pre-training the deep RL network’s hidden layers via supervised learning on a small set of human demonstrations. The researchers empirically evaluate the approach with the deep Q-network (DQN) and asynchronous advantage actor-critic (A3C) algorithms on the Atari 2600 games Pong, Freeway, and Beamrider. The results show that 1) pre-training with human demonstrations in a supervised manner discovers better features than naive pre-training in DQN, and 2) initializing a deep RL network with a pre-trained model significantly reduces training time, even when pre-training from only a small number of human demonstrations.
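The two-stage recipe is easy to picture in code. Below is a hedged sketch (assuming PyTorch): supervised pre-training on human (frame, action) pairs, followed by reusing the pre-trained feature layers in the RL network. The tiny convolutional stack, the 84x84 four-frame input, and the six actions are illustrative stand-ins, not the paper's exact setup.

```python
# Hedged sketch (assuming PyTorch) of the two-stage recipe: supervised
# pre-training on human (frame, action) pairs, then reusing the pre-trained
# feature layers in the RL network. The tiny convolutional stack, the 84x84
# 4-frame input, and the 6 actions are illustrative stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_actions = 6
features = nn.Sequential(nn.Conv2d(4, 16, 8, stride=4), nn.ReLU(),
                         nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(), nn.Flatten())

def feature_size():
    with torch.no_grad():
        return features(torch.zeros(1, 4, 84, 84)).shape[1]

head = nn.Linear(feature_size(), n_actions)
pretrain_opt = torch.optim.Adam(list(features.parameters()) + list(head.parameters()), lr=1e-4)

def pretrain_step(demo_frames, demo_actions):
    """Stage 1: behavioral-cloning-style supervised step on human demonstrations."""
    logits = head(features(demo_frames))
    loss = F.cross_entropy(logits, demo_actions)
    pretrain_opt.zero_grad(); loss.backward(); pretrain_opt.step()
    return loss.item()

# Stage 2: build the DQN (or A3C) network on top of the pre-trained feature
# layers and continue with ordinary reinforcement learning on environment reward.
q_network = nn.Sequential(features, nn.Linear(feature_size(), n_actions))
```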
Learning to Predict Without Looking Ahead: World Models Without Forward Prediction
Much of model-based reinforcement learning involves learning a model of an agent’s world and training an agent to leverage this model to perform a task more efficiently. While these models are demonstrably useful for agents, every naturally occurring model of the world of which we are aware (e.g., a brain) arose as the byproduct of competing evolutionary pressures for survival, not the minimization of a supervised forward-predictive loss via gradient descent. That useful models can arise out of the messy and slow optimization process of evolution suggests that forward-predictive modeling can emerge as a side-effect of optimization under the right circumstances, and that the optimization need not explicitly involve a forward-predictive loss. This paper introduces a modification to traditional reinforcement learning termed “observational dropout,” whereby the agent’s ability to observe the real environment at each time-step is limited. In doing so, an agent can be coerced into learning a world model to fill in the observation gaps during reinforcement learning. The paper shows that the emergent world model, while not explicitly trained to predict the future, can help the agent learn key skills required to perform well in its environment. Videos of the results are available at https://learningtopredict.github.io/.
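The mechanism itself is simple to sketch. Below is a minimal, illustrative wrapper (assuming a gym-style step interface) that withholds the real observation with some probability and substitutes the agent's own world-model prediction; the world model passed in here is just a stand-in, whereas in the paper it emerges end-to-end from task reward alone.

```python
# Minimal, illustrative wrapper implementing "observational dropout" around a
# gym-style environment (the 4-tuple step interface is an assumption). The
# world model passed in is a stand-in; in the paper it is learned end-to-end
# from task reward, with no explicit forward-prediction loss.
import random

class ObservationalDropout:
    def __init__(self, env, world_model, p_observe=0.1):
        self.env = env                    # real environment
        self.world_model = world_model    # callable: (last_obs, action) -> predicted_obs
        self.p_observe = p_observe        # probability of seeing the real observation
        self.last_obs = None

    def reset(self):
        self.last_obs = self.env.reset()
        return self.last_obs

    def step(self, action):
        real_obs, reward, done, info = self.env.step(action)
        if random.random() < self.p_observe:
            agent_obs = real_obs          # occasional peek at the real world
        else:
            # Fill the observation gap with the internal model's prediction.
            agent_obs = self.world_model(self.last_obs, action)
        self.last_obs = agent_obs
        return agent_obs, reward, done, info
```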
[Related Article: The Most Influential Deep Learning Research of 2019]
A survey on intrinsic motivation in reinforcement learning
The reinforcement learning (RL) research area is very active, with a large number of new contributions, especially in the emerging field of deep RL (DRL). However, a number of scientific and technical challenges still need to be addressed, among them the ability to abstract actions and the difficulty of exploring the environment, both of which can be tackled with intrinsic motivation (IM). This paper provides a survey on the role of intrinsic motivation in DRL. It categorizes the different kinds of intrinsic motivation and, for each category, details its advantages and limitations with respect to these challenges. Additionally, the authors conduct an in-depth investigation of substantial research questions that are currently under study or not yet addressed in DRL. They survey these works from the perspective of learning how to achieve tasks, and suggest that solving the current challenges could lead to a larger developmental architecture capable of tackling most such tasks. This developmental architecture is described in terms of several building blocks, each composed of an RL algorithm and an IM module that compresses information.
Figure 1: Illustration of the sparse reward issue in a very simple setting. The agent, represented by a circle, strives to reach the star. The reward function is one when the agent reaches the star and zero otherwise. On the left side, the agent explores with standard methods such as ε-greedy; as a result, it stays within its surrounding area because of the temporal inconsistency of its behavior. On the right side, we can imagine an ideal exploration strategy in which the agent covers the whole state space to discover where rewards are located.
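To make the idea of an IM module concrete, here is one simple, widely used instance: a count-based novelty bonus added to the sparse extrinsic reward, which pushes an agent like the one in Figure 1 to keep visiting unfamiliar states. The 1/sqrt(N(s)) bonus form and the coefficient beta are common illustrative choices of mine, not something prescribed by the survey.

```python
# One simple instance of an IM module: a count-based novelty bonus added to the
# sparse extrinsic reward. The 1/sqrt(N(s)) form and the coefficient beta are
# common illustrative choices, not something prescribed by the survey.
from collections import defaultdict
from math import sqrt

class CountBasedBonus:
    def __init__(self, beta=0.1):
        self.counts = defaultdict(int)
        self.beta = beta

    def reward(self, state, extrinsic_reward):
        """Extrinsic reward plus a novelty bonus that decays with visitation count."""
        self.counts[state] += 1
        return extrinsic_reward + self.beta / sqrt(self.counts[state])

bonus = CountBasedBonus()
print(bonus.reward(state=(0, 0), extrinsic_reward=0.0))   # 0.1 for a brand-new state
print(bonus.reward(state=(0, 0), extrinsic_reward=0.0))   # ~0.071 on the first revisit
```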
Want to learn more about deep reinforcement learning? Attend ODSC East 2020 this April 13–17 in Boston and hear in person from the researchers and practitioners who use it daily.