Creating Word Clouds from Text

ODSC - Open Data Science
4 min read · Nov 4, 2021


Word clouds are a useful visualization tool. They show the most frequent words in a text, where the relative size of the word correlates with frequency.

This is an example word cloud:

Word clouds are useful for at least two purposes:

  1. An initial exploration of a text, to discover which words occur most frequently. While this can also be done by printing a list of words in descending order of frequency, a word cloud gives an easier-to-read visual representation.
  2. In conjunction with a topic model, it creates visuals for each of the topics, where the most representative words for each topic are evident.

Let’s create a word cloud based on the book The Adventures of Sherlock Holmes by Arthur Conan Doyle. You can download the file with the text of the book on GitHub: https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook/blob/master/Chapter01/sherlock_holmes.txt.

First, import the necessary packages:

import nltk
import numpy as np
import matplotlib.pyplot as plt
from nltk.probability import FreqDist
from wordcloud import WordCloud, STOPWORDS
from PIL import Image

nltk.download("punkt")  # tokenizer models needed for word_tokenize

Now read in the text file and lowercase it:

text_file = "sherlock_holmes.txt"  # Modify this path accordingly
with open(text_file, "r", encoding="utf-8") as f:
    text = f.read().lower()

We need to remove stopwords from the text; otherwise the most prominent words will be function words like I, he, and the. There are different ways to remove stopwords, such as using a precompiled list or filtering out the most frequent words. Here we write a function that collects the top 2% most frequent words in the text, which we will later use as the stopword list:

def compile_stopwords_list_frequency(text, freq_percentage=0.02):
    words = nltk.tokenize.word_tokenize(text)
    freq_dist = FreqDist(word.lower() for word in words)
    # Sort the (word, frequency) pairs by ascending frequency
    sorted_words = sorted(freq_dist.items(), key=lambda tup: tup[1])
    # Take the top freq_percentage of unique words as stopwords
    length_cutoff = int(freq_percentage * len(sorted_words))
    stopwords = [word for word, freq in sorted_words[-length_cutoff:]]
    return stopwords

The function takes two arguments: the text and the cutoff percentage, which defaults to 2%. First, the function tokenizes the text into words and creates a FreqDist object, which holds the word-frequency information for the text. From it we get a list of (word, frequency) tuples and sort the list by frequency. We then compute the length cutoff as a percentage of the number of unique words, and return the most frequent words above that cutoff as the stopword list.
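The same cutoff logic can be illustrated without NLTK, using a plain collections.Counter over whitespace-split tokens. This is a minimal sketch: the function name, sample text, and 30% cutoff are made up for illustration, and naive splitting is no substitute for real tokenization.

```python
from collections import Counter

def frequency_stopwords(text, freq_percentage=0.02):
    # Count each lowercased token (naive whitespace tokenization)
    freq = Counter(text.lower().split())
    # Sort unique words by ascending frequency
    sorted_words = sorted(freq.items(), key=lambda tup: tup[1])
    # Keep the top freq_percentage of unique words as stopwords
    length_cutoff = int(freq_percentage * len(sorted_words))
    return [word for word, _ in sorted_words[-length_cutoff:]]

sample = "the cat sat on the mat and the dog sat on the rug " * 5
print(frequency_stopwords(sample, freq_percentage=0.3))
```

With a 30% cutoff on this tiny sample, the most frequent tokens (such as the) end up in the stopword list, exactly as they do for the full book with the 2% default.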

Next, use this function to create the stopwords list:

stopwords = compile_stopwords_list_frequency(text)
stopwords.remove("holmes")
stopwords.remove("watson")

We remove the words holmes and watson from the list: although they are frequent, they are central characters rather than stopwords.

Now create the word cloud:

output_filename = "odsc_wordcloud.png"
wordcloud = WordCloud(
    min_font_size=10,
    max_font_size=100,
    stopwords=stopwords,
    width=1000,
    height=1000,
    max_words=1000,
    background_color="white",
).generate(text)
wordcloud.to_file(output_filename)

You can adjust the parameters of the WordCloud object to control the image size, the minimum and maximum font sizes, and the colors.

Use the following code to display the image while the program is running:

plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

The resulting image will look something like this (it changes from run to run):

This word cloud still contains some stopwords (said, off, without), and you can experiment with modifying the stopwords list to get a cleaner result. In any case, you can see that the book talks about a woman, paper, police, business, money.
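For example, the leftover words can simply be appended to the stopword list before regenerating the cloud. A minimal sketch, using a stand-in list in place of the one built earlier:

```python
# Stand-in for the list returned by compile_stopwords_list_frequency
stopwords = ["the", "and", "a", "said"]

# Leftover words spotted in the first cloud (illustrative, not definitive)
extra_stopwords = ["said", "off", "without"]
stopwords += [w for w in extra_stopwords if w not in stopwords]
```

The membership check avoids duplicates, so words already caught by the frequency cutoff are not added twice.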

You can also see some phrases in the word cloud, such as Sherlock Holmes, said Holmes, of course, and others. Remove these by setting the collocations parameter to False when creating the WordCloud object:

wordcloud = WordCloud(
    min_font_size=10,
    max_font_size=100,
    stopwords=stopwords,
    width=1000,
    height=1000,
    max_words=1000,
    background_color="white",
    collocations=False,
).generate(text)

Finally, you can apply a shape to the cloud image. We will use the following shape:

Read in the image and generate the word cloud using it as a mask:

output_filename = "odsc_wordcloud_mask.png"
sherlock_data = Image.open("sherlock.png")
sherlock_mask = np.array(sherlock_data)
wordcloud = WordCloud(
    background_color="white",
    max_words=2000,
    mask=sherlock_mask,
    stopwords=stopwords,
    min_font_size=10,
    max_font_size=100,
)
wordcloud.generate(text)
wordcloud.to_file(output_filename)

The result will look approximately like this:
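If you don't have a mask image handy, a mask array can also be built directly with NumPy. In the wordcloud library, pure-white (255) regions of the mask are masked out and words are drawn only in the remaining area, so the sketch below leaves a black circle to draw in (the size and shape here are arbitrary, chosen just for illustration):

```python
import numpy as np

def circle_mask(size=500):
    # Coordinate grids centered on the middle of the image
    y, x = np.ogrid[:size, :size]
    center = size / 2
    inside = (x - center) ** 2 + (y - center) ** 2 <= (size / 2) ** 2
    # White (255) areas are masked out; words are drawn
    # only inside the black (0) circle
    mask = np.full((size, size), 255, dtype=np.uint8)
    mask[inside] = 0
    return mask

mask = circle_mask(500)
```

The resulting array can then be passed as the mask parameter, e.g. WordCloud(background_color="white", mask=mask, stopwords=stopwords).generate(text), just as with the image-based mask above.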

While word clouds are useful for visualizing text data, topic models are a more formal tool for analyzing the topics in a text. I will discuss topic models in my tutorial, Introduction to NLP and Topic Modeling, at the ODSC West conference (https://odsc.com/speakers/introduction-to-nlp-and-topic-modeling/).

More code recipes can be found in my book, Python Natural Language Processing Cookbook: https://www.amazon.com/Python-Natural-Language-Processing-Cookbook/dp/1838987312/.

Original post here.

Read more data science articles on OpenDataScience.com, including tutorials and guides from beginner to advanced levels! Subscribe to our weekly newsletter here and receive the latest news every Thursday. You can also get data science training on-demand wherever you are with our Ai+ Training platform.
