The Most Influential Deep Learning Research of 2019
Deep learning continued its forward march in 2019, with advances in many exciting research areas such as generative adversarial networks (GANs), autoencoders, and reinforcement learning. In terms of deployments, deep learning is the darling of many contemporary application areas, including computer vision, image recognition, speech recognition, natural language processing, machine translation, autonomous vehicles, and many more.
[Related Article: Best Machine Learning Research of 2019]
Over the past year, we saw Google AI Language revolutionize the NLP segment of deep learning with BERT (Bidirectional Encoder Representations from Transformers), a new language representation model. The already seminal paper first appeared on arXiv in October 2018, with a revised version posted on May 24, 2019, and it has led to a storm of follow-on research results. This is just one specific area of deep learning; many more are pushing forward just as quickly.
Although deep learning is officially a subset of machine learning, its creative use of artificial neural networks makes it finely tuned to certain high-dimensional problem domains. For typical business problems, traditional machine learning algorithms, with gradient boosting chief among them, often perform better.
In this article, I’ll help kick-start your effort to keep pace with this research-heavy field by curating 2019’s large pool of research published on arXiv.org down to the manageable short list of my favorites that follows. Enjoy!
A Comprehensive Survey on Graph Neural Networks
Deep learning has revolutionized many machine learning tasks in recent years, ranging from image classification and video processing to speech recognition and natural language understanding. The data in these tasks are typically represented in Euclidean space. However, a growing number of applications generate data from non-Euclidean domains, represented as graphs with complex relationships and interdependencies between objects. The complexity of graph data has imposed significant challenges on existing machine learning algorithms, and many studies extending deep learning approaches to graph data have emerged in response. This survey provides a comprehensive overview of graph neural networks (GNNs) in the data mining and machine learning fields. The researchers propose a new taxonomy that divides the state-of-the-art graph neural networks into four categories: recurrent graph neural networks, convolutional graph neural networks, graph auto-encoders, and spatial-temporal graph neural networks. Also included are a discussion of GNN applications across various domains and a summary of open-source code, benchmark data sets, and model evaluation for graph neural networks. The paper concludes by proposing potential research directions in this rapidly growing field.
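To make the "convolutional GNN" category concrete, here is a minimal numpy sketch of a single graph-convolution layer. It follows the widely used Kipf & Welling formulation (symmetrically normalized adjacency with self-loops), one common instance of the family the survey covers rather than any single method from it; the toy graph and dimensions are illustrative.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution step: aggregate neighbor features, then transform.

    A: (n, n) adjacency matrix, H: (n, d_in) node features, W: (d_in, d_out).
    Uses the symmetric normalization D^-1/2 (A + I) D^-1/2 of Kipf & Welling.
    """
    A_hat = A + np.eye(A.shape[0])                  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))   # degree normalization
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)          # ReLU activation

# Toy graph: 4 nodes on a path, 3-dimensional features, 2 output channels.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
H = np.random.randn(4, 3)
W = np.random.randn(3, 2)
print(gcn_layer(A, H, W).shape)  # (4, 2)
```

Stacking a few such layers lets each node's representation absorb information from progressively larger neighborhoods, which is what makes the approach work on graph-structured data.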
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. This paper from Google Research systematically studies model scaling and identifies that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, a new scaling method is proposed that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. The paper demonstrates the effectiveness of this method on scaling up MobileNets and ResNet. To go even further, neural architecture search is used to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. Source code is available on GitHub.
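The compound scaling rule is simple enough to show directly. The sketch below uses the constants α = 1.2, β = 1.1, γ = 1.15 reported in the paper (chosen subject to α·β²·γ² ≈ 2); the base depth, width, and resolution values are illustrative placeholders, not the actual EfficientNet-B0 architecture.

```python
# Compound scaling from the EfficientNet paper: depth, width, and input
# resolution grow together, governed by a single coefficient phi.
alpha, beta, gamma = 1.2, 1.1, 1.15  # grid-searched constants from the paper

def scale(phi, base_depth=20, base_width=64, base_resolution=224):
    """Scale a hypothetical baseline network by compound coefficient phi."""
    depth = round(base_depth * alpha ** phi)             # number of layers
    width = round(base_width * beta ** phi)              # channels per layer
    resolution = round(base_resolution * gamma ** phi)   # input image size
    return depth, width, resolution

for phi in range(4):  # EfficientNet-B0 .. B3 style scaling steps
    print(phi, scale(phi))
```

The point of the single coefficient is that doubling available FLOPs no longer requires hand-tuning three knobs independently; one φ moves all of them in balance.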
Deep Learning for Anomaly Detection: A Survey
Anomaly detection is an important problem that has been well studied within diverse research areas and application domains. The aim of this survey is two-fold: first, to present a structured and comprehensive overview of research methods in deep learning-based anomaly detection, and second, to review the adoption of these methods for anomaly detection across various application domains and assess their effectiveness. The paper groups state-of-the-art research techniques into categories based on their underlying assumptions and approach. Within each category, the paper outlines the basic anomaly detection technique along with its variants, and presents the key assumptions used to differentiate between normal and anomalous behavior. For each category, the paper also presents the advantages and limitations, and discusses the computational complexity of the techniques in real application domains. Finally, the paper outlines open research issues and the challenges faced when adopting these techniques.
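One recurring pattern in this literature is autoencoder-based detection: train a model to reconstruct normal data, then flag inputs it reconstructs poorly. Below is a minimal numpy sketch of that criterion, using a toy one-hidden-layer autoencoder and an assumed 99th-percentile threshold; it illustrates the idea, not any specific method from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: normal points lie near a 2-D subspace of R^5; a few anomalies don't.
normal = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 5))
anomalies = rng.normal(size=(10, 5)) * 3.0
X = np.vstack([normal, anomalies])

# Tiny autoencoder (5 -> 2 -> 5) trained on normal data with gradient descent.
d, h, lr = 5, 2, 0.01
W1 = rng.normal(scale=0.1, size=(d, h))
W2 = rng.normal(scale=0.1, size=(h, d))
for _ in range(3000):
    Z = np.tanh(normal @ W1)                  # encode
    E = Z @ W2 - normal                       # reconstruction error
    dZ = (E @ W2.T) * (1 - Z ** 2)            # backprop through tanh
    W2 -= lr * Z.T @ E / len(normal)
    W1 -= lr * normal.T @ dZ / len(normal)

# Anomaly score = per-sample reconstruction error; flag scores above a
# threshold calibrated on the normal data (99th percentile here).
scores = ((np.tanh(X @ W1) @ W2 - X) ** 2).mean(axis=1)
flagged = np.where(scores > np.quantile(scores[:500], 0.99))[0]
print(flagged)  # the injected anomalies (rows 500-509) should dominate
```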
Deep Learning for Symbolic Mathematics
Neural networks have a reputation for being better at solving statistical or approximate problems than at performing calculations or working with symbolic data. This paper from Facebook AI Research shows that they can be surprisingly good at more elaborate tasks in mathematics, such as symbolic integration and solving differential equations. The paper proposes a syntax for representing mathematical problems, along with methods for generating large data sets that can be used to train sequence-to-sequence models. The resulting models outperform commercial computer algebra systems such as Matlab or Mathematica on these tasks.
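The key representational trick is serializing expression trees as prefix-notation (Polish notation) token sequences, so a standard seq2seq model can consume them token by token. The sketch below shows that serialization step; the tuple encoding and operator names are my illustrative choices, not the paper's exact vocabulary.

```python
# Expressions as nested tuples: (operator, operand, ...) with leaves as strings.
# E.g. x**2 + cos(x) could be written:
expr = ("add", ("pow", "x", "2"), ("cos", "x"))

def to_prefix(node):
    """Serialize an expression tree to a prefix-notation token list."""
    if isinstance(node, str):
        return [node]           # leaf: a variable or constant
    op, *args = node
    tokens = [op]               # operator comes first in prefix notation
    for arg in args:
        tokens += to_prefix(arg)
    return tokens

print(to_prefix(expr))  # ['add', 'pow', 'x', '2', 'cos', 'x']
```

Because prefix notation needs no parentheses, the model sees math as a flat token sequence, which is exactly the form existing translation architectures are built for.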
Green AI
The computations required for deep learning research have been doubling every few months, resulting in an estimated 300,000x increase from 2012 to 2018. These computations have a surprisingly large carbon footprint. Ironically, deep learning was inspired by the human brain, which is remarkably energy efficient. Moreover, the financial cost of the computations can make it difficult for academics, students, and researchers, in particular those from emerging economies, to engage in deep learning research. This position paper advocates a practical solution: making efficiency an evaluation criterion for research alongside accuracy and related measures. In addition, the paper proposes reporting the financial cost or “price tag” of developing, training, and running models to provide baselines for the investigation of increasingly efficient methods. The goal is to make AI both greener and more inclusive, enabling any inspired undergraduate with a laptop to write high-quality research papers.
The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design
The past decade has seen a remarkable series of advances in machine learning, in particular deep learning approaches based on artificial neural networks, that have improved our ability to build more accurate systems across a broad range of areas, including computer vision, speech recognition, language translation, and natural language understanding. This paper by Jeffrey Dean of Google Research discusses some of these advances and their implications for the kinds of computational devices we need to build, especially in the post-Moore’s Law era. It also discusses some of the ways machine learning may be able to help with aspects of the circuit design process itself. Finally, it provides a sketch of at least one interesting direction: much larger-scale multi-task models that are sparsely activated and employ far more dynamic, example- and task-based routing than today’s machine learning models.
Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks
Batch Normalization (BN) is a highly successful and widely used batch-dependent training method. Its use of mini-batch statistics to normalize the activations introduces dependence between samples, which can hurt training if the mini-batch size is too small or if the samples are correlated. Several alternatives, such as Batch Renormalization and Group Normalization (GN), have been proposed to address these issues. However, they either do not match the performance of BN for large batches, still exhibit degraded performance for smaller batches, or introduce artificial constraints on the model architecture. This paper by Google Research proposes the Filter Response Normalization (FRN) layer, a novel combination of a normalization and an activation function that can be used as a drop-in replacement for other normalizations and activations. The new method operates on each activation map of each batch sample independently, eliminating the dependency on other batch samples or channels of the same sample. The method outperforms BN and all alternatives in a variety of settings for all batch sizes.
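In code, FRN plus its companion Thresholded Linear Unit (TLU) is only a few lines. This is a minimal numpy sketch following the paper's formulation, normalizing each channel of each sample by its mean squared activation over the spatial extent and then applying a learned threshold; the parameter shapes and toy input are illustrative.

```python
import numpy as np

def frn_tlu(x, gamma, beta, tau, eps=1e-6):
    """Filter Response Normalization followed by the Thresholded Linear Unit.

    x: activations of shape (N, H, W, C). Statistics are computed per sample
    and per channel, so there is no dependence across the batch.
    """
    nu2 = np.mean(x ** 2, axis=(1, 2), keepdims=True)  # mean square over H, W
    y = gamma * x / np.sqrt(nu2 + eps) + beta          # normalize, re-scale
    return np.maximum(y, tau)                          # TLU: learned threshold

# Toy usage with per-channel learned parameters (broadcast over N, H, W).
x = np.random.randn(2, 8, 8, 4)
gamma = np.ones((1, 1, 1, 4))
beta = np.zeros((1, 1, 1, 4))
tau = np.zeros((1, 1, 1, 4))
print(frn_tlu(x, gamma, beta, tau).shape)  # (2, 8, 8, 4)
```

Note that nothing here touches a batch dimension statistic, which is exactly why the method behaves identically at batch size 1 and batch size 1,024.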
Neural Random Forest Imitation
This paper presents Neural Random Forest Imitation, a novel approach for transforming random forests into neural networks. Existing mapping methods produce very inefficient architectures and do not scale. The new method instead generates training data from a random forest and learns a neural network that imitates it. Without any additional training data, this transformation creates very efficient neural networks that learn the decision boundaries of the random forest. The generated model is fully differentiable and can be combined with feature extraction in a single pipeline, enabling further end-to-end processing. Experiments on several real-world benchmark datasets demonstrate outstanding performance in terms of scalability, accuracy, and learning with very few training examples. Compared to state-of-the-art mappings, this method significantly reduces the network size while achieving the same or even improved accuracy due to better generalization.
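A rough flavor of the imitation idea can be shown with scikit-learn: fit a forest, label synthetic inputs with its predictions, and train a network on those labels. This is only a sketch of the concept under a naive uniform-box sampling assumption, not the paper's data-generation scheme, which is considerably more careful.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Train a random forest on a small labeled set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Generate synthetic inputs, label them with the forest's predictions, and
# fit a network that imitates the forest's decision boundaries.
rng = np.random.default_rng(0)
X_gen = rng.uniform(X.min(axis=0), X.max(axis=0), size=(5000, X.shape[1]))
y_gen = forest.predict(X_gen)
imitator = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                         random_state=0).fit(X_gen, y_gen)

print("agreement with forest:",
      (imitator.predict(X) == forest.predict(X)).mean())
```

The payoff of the distilled network over the forest itself is differentiability: the imitator can be dropped into a larger pipeline and fine-tuned end to end.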
When Does Label Smoothing Help?
The generalization and learning speed of a multi-class neural network can often be significantly improved by using soft targets that are a weighted average of the hard targets and the uniform distribution over labels. Smoothing the labels in this way prevents the network from becoming over-confident, and label smoothing has been used in many state-of-the-art models, including for image classification, language translation, and speech recognition. Despite its widespread use, label smoothing is still poorly understood. This paper from Google Brain Toronto shows empirically that, in addition to improving generalization, label smoothing improves model calibration, which can significantly improve beam search. The researchers, including Geoffrey Hinton, also observe that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective. To explain these observations, the paper visualizes how label smoothing changes the representations learned by the penultimate layer of the network, showing that label smoothing encourages the representations of training examples from the same class to group in tight clusters. This results in a loss of information in the logits about resemblances between instances of different classes, which is necessary for distillation but does not hurt generalization or calibration of the model’s predictions.
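The mechanism itself is a one-liner: with smoothing parameter α and K classes, each target mixes (1 − α) times the one-hot vector with α times the uniform distribution, so the true class keeps (1 − α) + α/K and every other class receives α/K. A minimal numpy sketch:

```python
import numpy as np

def smooth_labels(y, num_classes, alpha=0.1):
    """Mix one-hot targets with the uniform distribution over labels."""
    one_hot = np.eye(num_classes)[y]
    return (1.0 - alpha) * one_hot + alpha / num_classes

# Three examples, four classes: each target keeps 0.925 on the true class
# and places 0.025 on every class (alpha / K spread uniformly).
print(smooth_labels(np.array([0, 2, 3]), num_classes=4))
```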
On the Learning Dynamics of Deep Neural Networks
While a lot of progress has been made in recent years, the dynamics of learning in deep nonlinear neural networks remain to this day poorly understood. This paper from Microsoft Research studies the case of binary classification and proves various properties of learning in such networks under strong assumptions such as linear separability of the data. Extending existing results from the linear case, the paper confirms empirical observations by proving that the classification error also follows a sigmoidal shape in nonlinear architectures. The paper shows that, given proper initialization, learning proceeds along parallel, independent modes, and that certain regions of parameter space might lead to failed training. The paper also demonstrates that input norm and a feature’s frequency in the data set lead to distinct convergence speeds, which might shed some light on the generalization capabilities of deep neural networks. Included is a comparison between the dynamics of learning with cross-entropy and hinge losses, which could prove useful for understanding recent progress in the training of generative adversarial networks. Finally, the paper identifies a phenomenon dubbed “gradient starvation,” in which the most frequent features in a data set prevent the learning of other, less frequent but equally informative features.
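The gradient-starvation intuition can be glimpsed in even a toy logistic-regression run: once a strong feature drives the margins up, the logistic factor in the gradient shrinks, and a weaker but still informative feature stops being learned. The sketch below is my own illustration of that intuition, not the paper's construction or its formal setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, n) * 2 - 1                 # labels in {-1, +1}

# Feature 0 carries a strong signal, feature 1 a weaker but real one.
X = np.stack([y * 3.0 + rng.normal(scale=0.5, size=n),
              y * 0.5 + rng.normal(scale=0.5, size=n)], axis=1)

# Gradient descent on the logistic loss log(1 + exp(-y * w.x)).
w = np.zeros(2)
for _ in range(500):
    margins = y * (X @ w)
    sigma = 1.0 / (1.0 + np.exp(margins))         # shrinks as margins grow
    w += 0.1 * (X * (y * sigma)[:, None]).mean(axis=0)

print(w)  # w[0] typically dwarfs w[1]: the strong feature starves the weak one
```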
[Related Article: Best Deep Reinforcement Learning Research of 2019 So Far]
Want to learn more about these novel deep learning techniques and findings from the people who work on them? Attend ODSC East 2020 in Boston April 13–17 and learn from them directly!