Introduction to GPT-3
Natural Language Processing (NLP) has become the darling of the deep learning community in the past several years and is now an accelerating area of research. Over this time, many NLP tasks and benchmarks have seen significant gains from a two-step process: pre-training on very large text data sets, followed by a task-specific fine-tuning step using much smaller collections of data.
The “generative pre-training” model, or GPT, has gained the most recent attention, and the latest iteration of this language generation model, GPT-3, uses up to 175 billion parameters. That is more than 100 times the size of its predecessor, GPT-2, and roughly 10 times the size of the largest previous dense language model. In contrast to previous methods, GPT-3 is applied without any gradient updates or fine-tuning, and it achieves strong performance on several NLP data sets.
In this article, my goal is to get you up to speed with the GPT-3 phenomenon by offering a brief historical timeline of major results over the past few years, pointing you to several seminal papers, and sharing a few caveats associated with the technology.
History of Language Models Leading to GPT-3
GPT-3 is the most recent language model from the OpenAI research lab. They announced GPT-3 in a May 2020 research paper, “Language Models are Few-Shot Learners.” I really enjoy reading seminal papers like this, especially when they involve such popular technology. The paper is 42 pages, 75 with all the appendices, so it’s some great summer reading. Later, in July, OpenAI offered API access to the model to a select group of beta testers, who proceeded to develop a number of compelling use case examples.
The research efforts leading up to GPT-3 started around 2010 when NLP researchers fully embraced deep neural networks as their primary methodology. First, the 2013 Word2vec paper showed that word vectors have remarkable properties. Later, the 2014 GloVe paper came out describing another vector representation algorithm that became very popular.
Next on stage were recurrent neural networks (RNNs), an important innovation that could read sentences word by word. An RNN had the advantage that it could read arbitrarily long sequences of words and, in principle, maintain long-range coherence. The seq2seq paper came out around this time, in 2014, and the approach became very popular, but RNN-based models were still lacking in many areas.
Another series of big advances started in 2017 at the NIPS Conference with the release of the “Attention is All You Need” paper by a team of Google Brain and University of Toronto researchers. The paper introduced the “transformer” architecture, which enabled the creation of much deeper neural networks. Research groups then began publishing ever bigger models, such as BERT-base with 110 million parameters, BERT-large with 340 million parameters, and CTRL from Salesforce with 1.6 billion parameters.
Most of these are either autoregressive language models, which take a given sentence and try to predict what the next word should be, or masked language models, which take a sentence where a random word has been “masked” and try to predict what the masked word should be. In both cases, the model doesn’t need a human-generated label; it can learn from any text.
Transformer models have changed the world of NLP research, but they come with a cost. With so many parameters and such big data, training is very slow. Even downstream fine-tuning requires thousands of labeled samples.
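The two training objectives described above can be sketched in a few lines of Python. This is only an illustration of how unlabeled text turns into training examples: it splits on whitespace rather than using the subword tokens real models rely on, the helper names are hypothetical, and the masked position is fixed rather than random so the output is reproducible.

```python
def next_word_examples(sentence):
    """Autoregressive objective: predict each word from the words before it."""
    words = sentence.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]


def masked_word_example(sentence, position, mask_token="[MASK]"):
    """Masked objective: hide one word and predict it from the words on both sides."""
    words = sentence.split()
    target = words[position]
    masked = words[:position] + [mask_token] + words[position + 1:]
    return masked, target


sentence = "the cat sat on the mat"

# Each (context, target) pair comes for free from the raw text.
for context, target in next_word_examples(sentence):
    print(" ".join(context), "->", target)

# Same idea for masking: the hidden word is the label.
masked, target = masked_word_example(sentence, position=2)
print(" ".join(masked), "->", target)
```

Neither helper needs a human annotator: the text itself supplies the answer the model is asked to predict.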
GPT-3 in a Nutshell
GPT-3 and its previous incarnations, GPT and GPT-2, are all transformer models. The main difference is scale: GPT had about 117 million parameters, while GPT-2 had 1.5 billion parameters. GPT-2 was so good at generating text that, early on, OpenAI withheld the full model weights for fear that the technology could lead to rampant fake news.
Now, GPT-3 has a whopping 175 billion parameters (an order of magnitude greater than Microsoft’s 17 billion parameter Turing-NLG), but what does that scale bring to the table? The GPT-3 paper suggests that the model is so large that fine-tuning is no longer necessary. The model can perform what is called “few-shot,” “one-shot,” or “zero-shot” learning, which means it can pick up a task from only a few examples placed in its input, or from no examples at all. Since costly labeled data is no longer necessary, few-shot learning could democratize AI and extend it to many more problem domains.
GPT-3 was pre-trained on five data sets: the Common Crawl data set (~60% of the training mix), WebText2, two book corpora, and English Wikipedia. GPT-3 can be steered by providing instructions in plain English (predecessors required task-specific tuning).
By consuming text written by humans during training, GPT-3 learns to write like humans, complete with humanity’s best and worst characteristics. As you might expect from a model trained largely on text scraped from the internet, the GPT-3 paper reports several problematic areas of bias, including gender, race, and religion, that surface as racist and sexist tropes in generated text.
GPT-3 shows that language model performance scales as a power law of model size, data set size, and the amount of compute. Further, such a language model trained on enough data can solve NLP tasks that it hasn’t seen before. This means that GPT-3 offers a general solution for many downstream tasks without fine-tuning. Still, it’s not clear what is going on behind the scenes. Up for debate is whether the model has learned “reasoning” or is just able to memorize training examples in a more optimized manner. GPT-3 performance continues to improve as the number of parameters increases, with no upper bound observed so far. However, GPT-3 was reportedly very expensive to train (~$5 million), so continuing to increase the number of parameters could lead to a situation where using the model becomes economically impractical or reserved only for groups with very deep pockets.
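To make the zero-, one-, and few-shot settings concrete, here is a minimal sketch of how such a prompt might be assembled. The task and example pairs are modeled on the English-to-French illustration in the GPT-3 paper; the `build_prompt` helper and the `=>` layout are illustrative choices, not a required format.

```python
def build_prompt(task_description, examples, query):
    """Assemble a prompt: instruction, zero or more solved examples, then the query."""
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    # The final line is left open for the model to complete.
    lines.append(f"{query} =>")
    return "\n".join(lines)


task = "Translate English to French:"
demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]

zero_shot = build_prompt(task, [], "cheese")
one_shot = build_prompt(task, demos[:1], "cheese")
few_shot = build_prompt(task, demos, "cheese")

print(few_shot)
```

The only difference between the three settings is how many solved examples sit in the prompt. The model's weights are never updated; the examples are simply part of the text it is asked to continue.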
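As a rough illustration of what power-law scaling implies, the sketch below models loss as a function of parameter count alone. The functional form follows the scaling claim above, but the constants `n_c` and `alpha` are placeholders chosen for illustration, not values fitted to GPT-3.

```python
def power_law_loss(n_params, n_c=1.0e14, alpha=0.08):
    """Loss modeled as (n_c / n_params) ** alpha; the constants are illustrative only."""
    return (n_c / n_params) ** alpha


# Parameter counts for GPT-2, Turing-NLG, and GPT-3 as quoted in this article.
for n in (1.5e9, 17e9, 175e9):
    print(f"{n:.1e} parameters -> modeled loss {power_law_loss(n):.3f}")

# Under a power law, every 10x increase in size multiplies the loss by the
# same factor, 10 ** -alpha, regardless of the starting size.
ratio_small = power_law_loss(1.5e10) / power_law_loss(1.5e9)
ratio_large = power_law_loss(1.75e12) / power_law_loss(1.75e11)
print(ratio_small, ratio_large)
```

With these placeholder constants, each tenfold increase in parameters trims the modeled loss by roughly 17%. Gains keep coming, but each one costs ten times more than the last, which is the economic concern raised above.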
Data scientists with early-stage beta API access to GPT-3’s generative power have crafted some unexpected applications, all done with just a handful of well-crafted examples as input via few-shot learning. The API works like this: send it an HTTP request with a text string and it responds with GPT-3 generated text. The API appears to be slow, and there is no information about the infrastructure running GPT-3. And there is still no information available about API pricing.
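The request-and-response pattern described above might look something like the following. This is a sketch under stated assumptions: the URL is a placeholder, and the JSON field names (`prompt`, `max_tokens`) and the bearer-token header are guesses at a typical design rather than documented details of the GPT-3 API. The code only builds the request; it does not send it.

```python
import json
import urllib.request

# Hypothetical endpoint, for illustration only.
API_URL = "https://api.example.com/v1/completions"


def build_completion_request(prompt, api_key, max_tokens=64):
    """Package a text prompt as an HTTP POST request with a JSON body."""
    # Field names here are assumptions, not a documented schema.
    payload = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(API_URL, data=payload, headers=headers, method="POST")


request = build_completion_request("Once upon a time", api_key="YOUR_KEY")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# Sending would be urllib.request.urlopen(request); it is omitted here
# because the URL above is a placeholder.
```

A few-shot prompt is just a longer text string, so the same request shape would carry it unchanged.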
GPT-3 sets the stage for continued research and development into new and innovative language models. For instance, Google AI recently announced Language-Agnostic BERT Sentence Embedding (LaBSE), a multilingual BERT model for generating cross-lingual sentence embeddings. This is an area of deep learning that is accelerating quickly.
As mentioned above, GPT-3 has various forms of algorithmic bias. The creators of the language model intend to continue their research into addressing these biases. For now, however, the responsibility is simply passed along to any organization or individual willing to bear the risk. Most data scientists agree that all models have an inherent bias, and this should not be the sole reason to avoid AI. Over time, it may turn out that the benefits will outweigh the risks.
Another issue is that people have difficulty distinguishing between GPT-3 generated news stories and real ones. The GPT-3 paper notes that language models will eventually become advanced enough to power large-scale misinformation campaigns.
In order for us to benefit from language models like GPT-3 and its future generations, we need to keep a foot on the brakes during the rush to deploy such powerful AI across mission-critical problem domains. Sufficient governance is key to understanding and monitoring points of failure while still pursuing the goal of providing societal value.