Intro to Language Processing with the NLTK

ODSC - Open Data Science
4 min read · May 23, 2019

Hidden information often lies beyond what we can perceive with our eyes and ears. Some look to data for that purpose, and most of the time data can tell us more than we thought imaginable. But sometimes the data isn't clear-cut enough to support any sort of analytics. So what do you do when you're at a standstill? Luckily, if you have a large amount of text-rich data that would be impossible to read through, natural language processing can distill all that text into simple insights.

Language, tone, and sentence structure can explain a lot about how people are feeling, and can even be used to predict how people might feel about similar topics using a combination of the Natural Language Toolkit (NLTK), a Python library for analyzing text, and machine learning. For our purposes, we'll work on a single body of text to clean and analyze key parts of past presidents' inaugural speeches, which are included in NLTK's corpus library. Once you have the basics, applying these techniques to a machine learning classification task should be straightforward with just about any text-rich data. Here's how to get started.

As always, we start by installing and importing the proper packages for our project. Here’s the list of libraries I used in my notebook:
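The original notebook's import cell isn't reproduced here, but a minimal set that covers every step below might look like this:

```python
import nltk
import matplotlib.pyplot as plt

from nltk.corpus import inaugural, stopwords
from nltk.stem import WordNetLemmatizer
```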

Next, we’ll download the inaugural speech data from NLTK’s corpus library. The speech I’ll be analyzing is Obama’s from 2009.
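A sketch of the download step, assuming the imports above (nltk.download caches each corpus after the first run):

```python
# Fetch the corpora used in this walkthrough.
nltk.download('inaugural')
nltk.download('stopwords')
nltk.download('wordnet')

# Each speech is stored under a file ID of the form 'YYYY-Name.txt'.
print(inaugural.fileids()[-5:])
```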

When working with text files using NLTK, it’s essential to separate, or tokenize, each word in the document. Luckily, NLTK’s corpus library has built-in calls to tokenize files, so all we’ll need to do is specify the exact speech we want to explore.
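For our speech, that could look like this, with the file ID following the corpus's 'YYYY-Name.txt' naming convention:

```python
# The corpus reader tokenizes for us: .words() yields one string
# per word or punctuation mark in the chosen speech.
tokens = inaugural.words('2009-Obama.txt')
print(len(tokens), tokens[:10])
```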

Another important step is to remove stop words from the data. Stop words are common English words like and, or, are, and am that carry little meaning on their own. These words aren't that helpful in examining the language used in the speech, so it's best to do away with them.
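One common way to filter them, assuming the tokens from the previous step (punctuation gets dropped at the same time):

```python
# Lowercase everything, then drop stop words and punctuation tokens.
stop_words = set(stopwords.words('english'))
cleaned = [w.lower() for w in tokens
           if w.isalpha() and w.lower() not in stop_words]
```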

We can start looking at our data visually now with the help of the matplotlib library. If you're unfamiliar with matplotlib, it's a fairly simple tool that allows you to generate charts from raw data in Python. Its website has several tutorials listed if you would like to toy around with data visualization.
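A minimal sketch of that visualization using NLTK's FreqDist, which draws its chart through matplotlib:

```python
# FreqDist counts how often each token occurs; .plot() draws the counts.
freq = nltk.FreqDist(cleaned)
plt.figure(figsize=(12, 5))
freq.plot(25)  # the 25 most frequent words in the cleaned speech
```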

That looks pretty good, but I think we can do a little bit more cleaning. We need to simplify our data even further so it can be learned easier if we end up applying machine learning algorithms to it. This process is called normalization, and it is important when working with even larger sets of data. For our purposes, we’ll just lemmatize the words in Obama’s speech, which will take words and reduce them to their base form.
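A sketch of that step with NLTK's WordNetLemmatizer, assuming the cleaned list from above:

```python
# Reduce each word to its base (dictionary) form; with no part-of-speech
# tag supplied, WordNetLemmatizer treats every word as a noun.
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(w) for w in cleaned]
print(lemmas[:20])
```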

If everything worked, the outcome should be a list of lowercase base forms, with plurals such as "nations" reduced to "nation."

Great! Now we have the cleanup tools necessary to work on data using the Natural Language Toolkit. We can use these packages on larger sets of data to perform tasks like sentiment analysis. These tricks can be helpful when looking into largely inconsistent data, like the comments on a YouTube thread, and can help us understand how people react to things on a large scale.

Original post here.

Read more data science articles on OpenDataScience.com, including tutorials and guides from beginner to advanced levels! Subscribe to our weekly newsletter here and receive the latest news every Thursday.
