Using Text Features to Predict the Great Stock Market Crash of 1929
Predicting financial crises is notoriously difficult. This is primarily a consequence of the infrequency of such events and the instability of relationships between financial variables. However, it is also related to the contagious nature of financial crises: if one bank expects another to liquidate its holdings of a certain asset, then it may also feel pressured to do the same. As such, a financial crisis may be triggered by a belief cascade across financial market participants, rather than movements in hard financial variables.
While beliefs are not directly observable, they can sometimes be inferred from documents produced by financial market participants, regulators, and central banks. This brief article will describe two simple natural language processing (NLP) techniques for inferring beliefs from text: sentiment analysis and text regression. We will apply each to content from the Federal Reserve Bulletin, a monthly publication produced by the U.S. Federal Reserve Board, over the period preceding (and including) the Great Stock Market Crash of 1929.
Dataset Compilation and Cleaning
Our dataset consists of 44 issues of the Federal Reserve Bulletin that span the period between May 1926 and December 1929. This data is available in the St. Louis Federal Reserve Bank’s archive and can be scraped with the urllib module in Python and processed using BeautifulSoup. Below, we provide a code example for scraping and extracting text from the May 1926 issue.
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Define URL.
url = "https://fraser.stlouisfed.org/title/federal-reserve-bulletin-62/may-1926-20653/fulltext"

# Send GET request and process result using BeautifulSoup.
html = urlopen(url)
# Name the parser explicitly so bs4 does not guess one and emit a warning.
soup = BeautifulSoup(html.read(), "html.parser")
print(soup.text)
…Citrus fruits in both Florida and California are reported to be making satisfactory progress.
Southern peach orchards, however, were damaged by the March freeze and crop prospects
have been somewhat impaired.
…
We repeat this process for the 43 remaining issues and store soup.text for each document in the list texts. We also store the date for each document in the list dates.
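The repetition described above can be sketched as a small loop. This is only an illustration under stated assumptions: the helper names (html_to_text, fetch_html, collect_issues) and the issue_urls mapping are hypothetical, and only the May 1926 URL comes from this article. The addresses of the other 43 issues would have to be looked up in the archive.

```python
from datetime import date
from urllib.request import urlopen

from bs4 import BeautifulSoup


def html_to_text(html):
    """Strip markup from an HTML string (or bytes) and return its text."""
    return BeautifulSoup(html, "html.parser").text


def fetch_html(url):
    """Download a page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def collect_issues(issue_urls, fetch=fetch_html):
    """Return (dates, texts) for a mapping of issue date -> URL, in date order."""
    dates, texts = [], []
    for issue_date in sorted(issue_urls):
        dates.append(issue_date)
        texts.append(html_to_text(fetch(issue_urls[issue_date])))
    return dates, texts


# Hypothetical mapping: only this one URL is taken from the article.
issue_urls = {
    date(1926, 5, 1): "https://fraser.stlouisfed.org/title/federal-reserve-bulletin-62/may-1926-20653/fulltext",
}

# Uncomment to perform the network requests.
# dates, texts = collect_issues(issue_urls)
```

Passing the download function as an argument keeps the loop easy to check offline, since a stub can stand in for the real request.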
Sentiment Analysis
We will first try to infer the Federal Reserve Board’s beliefs about the state of the macroeconomy and financial system by measuring its sentiment via sentiment analysis. While we can do this using sophisticated and general models (such as transformer-based NLP models), another option is to use simple dictionary-based methods, which count words that are “positive” or “negative” according to pre-defined word lists (dictionaries). We will do this with the Loughran-McDonald dictionary, which is commonly employed in finance and was constructed using the textual content of financial filings.
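To make the counting idea concrete, here is a minimal sketch of a dictionary-based score. The word lists are toy examples invented for illustration, not the Loughran-McDonald lists, and the polarity formula (positive minus negative counts, divided by their sum) is an assumption about how such a score is commonly defined rather than a statement of any library's exact implementation.

```python
import re

# Toy word lists, invented for illustration; NOT the Loughran-McDonald lists.
POSITIVE = {"satisfactory", "improved", "gain", "strong"}
NEGATIVE = {"damaged", "impaired", "decline", "loss"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def polarity(tokens):
    """Net positivity in [-1, 1]; 0.0 when no dictionary word appears."""
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return (pos - neg) / total if total else 0.0


sample = (
    "Citrus fruits in both Florida and California are reported to be making "
    "satisfactory progress. Southern peach orchards, however, were damaged by "
    "the March freeze and crop prospects have been somewhat impaired."
)
print(polarity(tokenize(sample)))
```

With these toy lists, the sample passage contains one positive word and two negative words, so its score is negative.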
In the code below, we work with pysentiment2, which implements the Loughran-McDonald dictionary. After importing modules, we instantiate the dictionary object, whose tokenize method cleans a string and splits it into a list of word tokens, and then apply it to each document in the corpus. Finally, we use the get_score method to recover the “polarity” or net positivity of each document and store it in a pandas DataFrame, indexed by dates.
# Install pysentiment2.
!pip install pysentiment2

# Import modules.
import pysentiment2 as ps
import pandas as pd
import matplotlib.pyplot as plt

# Instantiate tokenizer for LM dictionary.
lm = ps.LM()

# Tokenize texts.
tokens = [lm.tokenize(t) for t in texts]

# Compute sentiment for each document.
sentiment = [lm.get_score(p)['Polarity'] for p in tokens]

# Convert to DataFrame and plot rolling mean.
sentiment = pd.DataFrame(sentiment, index = dates)
sentiment.rolling(3).mean().plot(figsize=(15,7))
Plotting the three-month rolling mean of the net sentiment series, we can see that it initially rose from late 1927 to 1929, but then declined in the months leading up to the Great Crash in October 1929.
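As a small aside on the smoothing step, the sketch below shows what rolling(3).mean() does to a short series. The numbers and month labels are made up for illustration and are not values from the Bulletin data.

```python
import pandas as pd

# Made-up polarity scores for four consecutive months (illustration only).
scores = pd.Series(
    [0.1, 0.4, -0.2, 0.3],
    index=["month 1", "month 2", "month 3", "month 4"],
)

# Each value is the mean of the current month and the two preceding months.
smoothed = scores.rolling(3).mean()
print(smoothed)
```

The first two entries are missing because a full three-month window is not yet available, which is why the plotted line starts two months after the first issue.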
Text Regression
In many cases, it will be unclear ex-ante which text features are useful for explaining variation in the data. If, for instance, we want to find signs of the Great Crash in the Federal Reserve Bulletin, what features of the text should we use?
In such cases, we will often select a category of token, such as words, and then compute counts without targeting any particular set of words. We will then use the matrix of counts or frequencies as features in a LASSO model or neural network. Rather than assuming that a set of features explains variation in stock returns, this approach uses supervised learning to discover terms that have predictive power. In the code below, we compute a matrix of word counts that could be used as an input for such a model using the CountVectorizer class of sklearn.
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate vectorizer, keeping the 100 most frequent terms.
vectorizer = CountVectorizer(max_features=100)

# Transform texts into a document-term matrix of raw word counts.
tf = vectorizer.fit_transform([' '.join(token) for token in tokens])

# Recover feature names (get_feature_names was removed in scikit-learn 1.2).
feature_names = vectorizer.get_feature_names_out()

# Plot word count time series.
feature_matrix = pd.DataFrame(tf.toarray(), columns = feature_names,
                              index = dates)
feature_matrix.plot(figsize=(15,7), legend = None)
Each column in feature_matrix contains a time series of word counts for a given term. Again, we find that there is a considerable amount of activity in some series in the months leading up to the Great Crash. Furthermore, if we look at the feature names for those series, they are mostly related to credit and assets.
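The regression step itself is not run in this article, so the following is only a sketch of how a count matrix could be passed to a LASSO model. Everything in it is synthetic: the five term names, the Poisson-distributed counts, the target series, and the penalty alpha=0.5 are all assumptions chosen for illustration, and the output says nothing about the actual Bulletin data. In practice, the target would be something like monthly stock returns aligned with dates.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic stand-ins: 44 "documents" and five hypothetical terms.
terms = ["credit", "loans", "crop", "freight", "gold"]
counts = rng.poisson(lam=20, size=(44, len(terms))).astype(float)

# Synthetic target that depends only on the first term, plus noise.
target = 0.5 * counts[:, 0] + rng.normal(scale=0.5, size=44)

# Standardize columns so the L1 penalty treats all terms on the same scale.
X = (counts - counts.mean(axis=0)) / counts.std(axis=0)

# Fit LASSO; the L1 penalty can shrink uninformative coefficients to exactly zero.
model = Lasso(alpha=0.5).fit(X, target)

# Terms with non-zero coefficients are the ones the model retains.
selected = [term for term, coef in zip(terms, model.coef_) if coef != 0]
print(selected)
```

With only 44 observations and up to 100 candidate terms, any real application would need to choose the penalty by cross-validation and evaluate predictions out of sample before reading much into the selected terms.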
Conclusion
In this short article, we demonstrated how to measure two types of simple text features that could have predictive power for stock returns or financial crises by using sentiment analysis and text regression.