Word Embedding and Natural Language Processing
Editor’s note: Check out Mayank’s talk at ODSC East 2019 this April 30 to May 3 in Boston, “Let’s Embed Everything!”
Researchers who work with the nuts and bolts of deep neural networks (or even shallow neural networks, like skip-gram) know that the success of these methods can be attributed in no small part to representation learning, more colloquially known as embeddings. But what is representation learning really? And why should we care? This is a topic that has interested me greatly for the last few years, because it goes to the heart of an issue that is part engineering, part science, and part philosophy. I’ve found that mastering it can make all the difference between success and failure, and can give your system an edge that can’t be easily copied.
Let’s start at the very beginning. If you’ve programmed before, you are (most likely) acquainted with data structures. Even more likely, data structures were your first challenging foray into computational representations. The number and definitions of data structures were never the hard part; we all knew intuitively what a tree or a list was. Rather, the challenge was in representing a problem in terms of these data structures. Then came the problem of devising an algorithm for ‘manipulating’ them (which includes operations like sorting, but also ‘virtual’ manipulation like searching) to arrive at the answer you were looking for.
[Related article: Why Word Vectors Make Sense in Natural Language Processing]
Embeddings and machine learning
Now let’s switch to the kinds of data we (typically) see in machine learning (ML): images, text, and documents. Because my background is in knowledge graphs and NLP, I’ll use text as the running example. Imagine you’re trying to build an ML system to do sentiment analysis. In its simplest form, the task is to build an ML model that takes text (usually a short document like a review or a tweet) as input and outputs the probability that the text expresses positive sentiment. This is nothing but probabilistic binary classification. Traditionally, this problem was approached by first converting the text into a ‘feature vector’ and then training a supervised ML model (such as a random forest or logistic regression) on those vectors.
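To make that two-step recipe concrete, here is a minimal sketch (my illustration, not a pipeline prescribed in this article) of how it could look with scikit-learn, using a tf-idf featurizer of the kind discussed in the next paragraph; the tiny labeled review dataset is invented for the example.

```python
# Sketch of the classic recipe: text -> feature vector -> supervised classifier.
# Assumes scikit-learn is installed; the tiny labeled dataset is made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = [
    "great movie, I loved every minute of it",
    "absolutely terrible, a waste of time",
    "wonderful acting and a touching story",
    "boring plot and awful dialogue",
]
labels = [1, 0, 1, 0]  # 1 = positive sentiment, 0 = negative

# The vectorizer turns each review into a sparse feature vector;
# logistic regression then learns a probabilistic binary classifier on top.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reviews, labels)

# predict_proba returns [P(negative), P(positive)] for each input text.
print(model.predict_proba(["what a wonderful, touching film"]))
```

The important point is the division of labor: the vectorizer decides how text becomes numbers, and the classifier only ever sees those numbers.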
However, feature engineering can be a cumbersome problem. One of the first approaches to deal with this was the famous bag-of-words representation (and its weighted variant, tf-idf), which can still be hard to outperform in many domains. Then came topic models like Latent Dirichlet Allocation or LDA (giving us the ubiquitous word clouds), followed, most recently, by word embedding algorithms like word2vec. Word2vec attempts to ‘learn’ low-dimensional, real-valued vectors for words by modeling its objective function after Firth’s axiom: ‘you shall know a word by the company it keeps’. Assuming normal usage of words, this implies that words like ‘cat’ and ‘dog’ would end up with similar embeddings, as would cities like ‘Paris’ and ‘London’. Given enough data, one would start to see intriguing, subtle differences: e.g., large, culturally influential cities may end up clustered more closely to one another (since they occur in similar contexts) than to other cities. If there is bias in the data, word embeddings pick that up too: infamously, it has been found that professions like cook, housekeeper, and secretary tend to be associated strongly with women, while CEO and executive are associated with men. It is still not a settled matter how one could compensate for these biases when learning embeddings.
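As a rough sketch of how such embeddings are typically trained in practice (the article does not prescribe a library; gensim is my choice here, assuming gensim 4.x and pre-tokenized sentences), the code below trains a skip-gram word2vec model and inspects a few vectors. The toy corpus is far too small to reproduce the ‘cat’/‘dog’ or ‘Paris’/‘London’ effects, which require large amounts of text.

```python
# Sketch: training word2vec embeddings with gensim (assumes gensim >= 4.0).
# The toy corpus is far too small to learn meaningful vectors; in practice
# word2vec needs many millions of tokens before 'cat' and 'dog' drift together.
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["paris", "is", "a", "large", "city"],
    ["london", "is", "a", "large", "city"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,  # dimensionality of the learned word vectors
    window=3,        # context window: the 'company' a word keeps
    min_count=1,     # keep every word in this tiny corpus
    sg=1,            # 1 = skip-gram objective
)

# Each word now maps to a low-dimensional, real-valued vector.
print(model.wv["cat"].shape)              # (50,)
print(model.wv.similarity("cat", "dog"))  # cosine similarity of the two vectors
print(model.wv.most_similar("paris", topn=3))
```

The window parameter is where Firth’s axiom enters the picture: it defines the ‘company’ each word keeps during training.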
Using the same basic neural network that word2vec uses, entire documents, sentences, and paragraphs can also be embedded. The algorithm has also been made more robust in recent years to artifacts like misspellings and out-of-vocabulary (OOV) tokens, a notorious issue in real-world text inputs. Embeddings are here to stay, and they keep getting better. It is safe to say that they are used, in some form or another, in almost every modern NLP pipeline, be it for information extraction, semantic role labeling, or text classification.
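One well-known way that robustness is achieved is subword modeling, as in fastText, which composes a word’s vector from character n-grams so that misspelled or unseen words still receive an embedding. A small sketch using gensim’s FastText implementation (again assuming gensim 4.x, with an invented toy corpus):

```python
# Sketch: fastText-style subword embeddings with gensim (assumes gensim >= 4.0).
# Because vectors are built from character n-grams, even a misspelled or
# out-of-vocabulary token can be embedded by combining its subword vectors.
from gensim.models import FastText

corpus = [
    ["this", "restaurant", "was", "excellent"],
    ["the", "restaurant", "was", "terrible"],
    ["excellent", "food", "and", "friendly", "service"],
]

model = FastText(sentences=corpus, vector_size=50, window=3, min_count=1)

# "excelent" (misspelled) never appears in the corpus, but it still
# receives a vector via the character n-grams it shares with "excellent".
print("excelent" in model.wv.key_to_index)           # False: truly out-of-vocabulary
vec = model.wv["excelent"]                           # still returns a 50-d vector
print(model.wv.similarity("excelent", "excellent"))  # typically high, thanks to shared n-grams
```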
[Related article: An Idiot’s Guide to Word2vec Natural Language Processing]
Where’s this movement headed?
What started out with words has now branched out into multiple ML domains. Representation learning and its benefits were already well known in the computer vision community, but the techniques have since been applied to videos, text-image-video combinations, speech, time series, graphs, and networks. Research in one area (e.g., convolutional neural networks in vision) is frequently applicable to another (e.g., graph convolutional networks on network data). Embedding algorithms now exist that can work with, feed into, or even be jointly trained with supervised, semi-supervised, and unsupervised ML models. But challenges remain, including the aforementioned issue of bias, as well as transfer learning and the need for big, uniform-domain datasets to learn reliable embeddings.
Editor’s note: Want to learn more about embedding in-person? Check out Mayank’s talk at ODSC East 2019 this April 30 to May 3 in Boston, “Let’s Embed Everything!”
— — — — — — — — — —
Read more data science articles on OpenDataScience.com, including tutorials and guides from beginner to advanced levels! Subscribe to our weekly newsletter here and receive the latest news every Thursday.