An Introduction to Sentence-Level Sentiment Analysis with sentimentr

ODSC - Open Data Science
Nov 2, 2018

Sentiment analysis algorithms understand language word by word, estranged from context and word order. But our languages are subtle, nuanced, infinitely complex, and entangled with sentiment. They defy summaries cooked up by tallying the sentiment of constituent words.

Unsophisticated sentiment analysis techniques calculate sentiment/polarity by matching words back to a dictionary of words flagged as “positive,” “negative,” or “neutral.” This approach is too reductive. It cleaves off useful information and bastardizes our syntactically complex, lexically rich language. It’s also just not the way humans intuit language: we listen to an entire sentence and derive meaning that is gestalt, or greater than the sum of the individual words, and we parse incoming words through the complex latticework of lifelong social learning. Our algorithms have little hope.

The sentimentr package by Tyler Rinker gets our machines just a hair closer to this by bolstering sentiment analysis with a lexicon of words that tend to slide sentiment a teeny bit in one direction or the other. These words are known as valence shifters.

Rinker’s package incorporates 130 valence shifters that often reverse or overrule the sentiment calculated by lexicon-lookup methods, which don’t sense this sort of subtlety. The four types of valence shifters accounted for are: negators (not, can’t), amplifiers (absolutely, certainly), de-amplifiers (almost, barely), and adversative conjunctions (although, that being said). This is an important (necessary?) step because, as Rinker points out, up to 20 percent of polarized words co-occur with one of these shifters across the corpora he examined.
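To get a feel for what these shifters do, here’s a quick illustrative sketch using sentimentr (the exact scores will vary with the package version, so treat them as directional):

library(sentimentr)

sentiment(get_sentences("I like this package."))$sentiment        # mildly positive baseline
sentiment(get_sentences("I do not like this package."))$sentiment # negator flips the score negative
sentiment(get_sentences("I really like this package."))$sentiment # amplifier pushes it higher
sentiment(get_sentences("I barely like this package."))$sentiment # de-amplifier shrinks it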

Putting sentimentr to use

This post explores the basics of sentence-level sentiment analysis, unleashing sentimentr on the entire corpus of R package help documents on CRAN, which we programmatically mine from a simple HTML table using the htmltab package.

For starters, I need a corpus. I had an earlier idea to mine the (likely hyperbolic) sentiment of news articles of various topics, but since I’d need a benchmark to compare it against, I thought I’d assemble a corpus of what I expect to be fairly unsentimental, prosaic text: technical help pages of the packages on CRAN.

To get all the PDFs of package documentation from CRAN, I’ll:

  1. Get package names by scraping this page
  2. Build the URL to each package’s PDF, which follows a predictable format (assembled in the code below)
  3. Iterate through each link and download the PDF
  4. Run sentiment analysis on the extracted text

First I’ll load my libraries:

library(tidyverse) # of course
library(htmltab) # to scrape an html table
library(pdftools) # for sucking out text from a PDF
library(tm) # for stripWhitespace(), used later to clean sentences

htmltab() collects information from the structured contents in the doc argument and spits it out as a data frame.

url <- "https://cran.r-project.org/web/packages/available_packages_by_date.html"
htmltab(doc = url, which = '/html/body/table') -> r_packs

Now I just need to build the URLs and I’ll be ready to loop through them to download the PDFs.

r_packs <- r_packs %>% 
  mutate(Date = as.Date(Date, "%Y-%m-%d"),
         yr = lubridate::year(Date),
         pdf_url = paste0("https://cran.r-project.org/web/packages/", Package, "/", Package, ".pdf"),
         pdf_name = paste0(Package, ".pdf")) # destination file name, used in the download loop below

Now I’ll write a simple for loop to download and save all the PDFs to a local directory.

setwd("./All R Package Docs")
for (p in seq_along(r_packs$Package)) {
  download.file(url = r_packs[p, "pdf_url"],
                destfile = r_packs[p, "pdf_name"],
                quiet = TRUE,
                method = "auto",
                mode = "wb", # binary mode, so the PDFs aren't corrupted
                cacheOK = TRUE,
                extra = getOption("download.file.extra"))
}

Looks like it worked!
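One caveat: not every CRAN package ships a PDF manual, so a long run of download.file() calls can die partway through. If that happens, a more defensive variant of the loop wraps each call in tryCatch() and skips the stragglers:

for (p in seq_along(r_packs$Package)) {
  tryCatch(
    download.file(url = r_packs[p, "pdf_url"],
                  destfile = r_packs[p, "pdf_name"],
                  quiet = TRUE, mode = "wb"),
    error = function(e) message("Skipping ", r_packs[p, "pdf_name"]) # missing PDF or network hiccup
  )
}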

Right now the best we can do with these files is read them ourselves, and my computer isn’t so hot at reading PDFs. To unlock the text from its PDF prison, I’ll wrap pdftools::pdf_text in purrr::map to iteratively vacuum out the text of each PDF.

First, I set a variable to the directory of the R Docs:

dir <- "/Users/brandondey/Desktop/All R Package Docs"

Then I create a vector of pathnames:

pdfs <- paste0(dir, "/", list.files(dir, pattern = "\\.pdf$"))

Then I create a vector of package names:

pdf_names <- list.files(dir, pattern = "\\.pdf$")

Then I suck out the text from each PDF using pdftools::pdf_text wrapped in purrr::map to iterate over each PDF:

pdfs_text <- purrr::map(pdfs[1:1000], pdftools::pdf_text) # first 1,000 docs, matching pdf_names below

Next I create a dataframe with one row for each package:

my_data <- tibble(package = pdf_names[1:1000], text = pdfs_text)

Next I need to figure out where my sentences end and calculate a sentiment score on each one using sentimentr::get_sentences() and sentimentr::sentiment().

my_data %>% 
  unnest(text) %>%                  # one row per page of PDF text
  sentimentr::get_sentences() %>%   # split each page into sentences
  sentimentr::sentiment() %>%       # score each sentence
  mutate(characters = nchar(stripWhitespace(text))) %>% # sentence length, whitespace collapsed
  filter(characters > 1) -> bounded_sentences

summary(bounded_sentences$sentiment)

Sentence-level scores should rarely stray outside [-1, 1], so I’m removing the values that do, just 466 of ~260,000 observations:

bounded_sentences %>% filter(between(sentiment, -1, 1)) -> bounded_sentences

Summarize the df to plot:

dat <- with(density(bounded_sentences$sentiment), data.frame(x, y))

Then plot:

ggplot(dat, aes(x = x, y = y)) +
  geom_line() +
  geom_area(mapping = aes(x = ifelse(x >= 0 & x <= 1, x, 0)), fill = "green") +
  geom_area(mapping = aes(x = ifelse(x <= 0 & x >= -1, x, 0)), fill = "red") +
  scale_y_continuous(limits = c(0, 7.5)) +
  theme_minimal(base_size = 16) +
  labs(x = "Sentiment",
       y = "",
       title = "The Distribution of Sentiment Across R Package Help Docs") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.y = element_blank()) -> gg

gg # render the plot

Behold the density of sentiment.

Downfalls of word-level sentiment analysis

In a longer post, I’d explore the nuance of these scores, scrutinize the data more, validate the classifier, and even build a custom lexicon to match on. But the gist of the approach is in place.

On another note, you may wonder why I’m analyzing at the sentence level and not at the unigram (word) level. My reasoning: lexicon approaches are too reductive to push the state of the art to begin with, and a unigram-level lexicon analysis is even worse because it assigns polarity piecemeal. This tends to exacerbate some of the documented issues (here and here) with sentiment mining of complex natural language, such as how tough it is to capture nuance, sarcasm, negation, idiomatic subtlety, domain dependency, homonymy, synonymy, and bipolar words (words whose polarity shifts with their domain).

The list goes on. So I didn’t want to be even more reductive when deploying an already reductive technique.

As a toy example of the limitations of unigram-level sentiment analysis, consider how unintuitive and fallacious the results are when I ask the syuzhet package to handle basic negation: “I don’t love apple pie” is scored positive because of the word “love,” even though the statement is obviously negative.

sentimentr, by contrast, catches this negation and forces the sentiment negative accordingly, while the syuzhet package erroneously assigns the sentence the same sentiment score as “I love apple pie” (Jockers made a solid defense of his package here). sentimentr even reckons a higher sentiment score for “I really really love apple pie!!!” because the algorithm captures the nuance of those crafty amplifiers, really really, which are missed by the syuzhet approach.
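You can reproduce the comparison in a few lines; the exact numbers depend on your package versions, but the signs tell the story:

library(syuzhet)
library(sentimentr)

get_sentiment("I love apple pie")        # positive
get_sentiment("I don't love apple pie")  # same positive score: the negation is missed
sentiment(get_sentences("I don't love apple pie"))$sentiment            # negative: negator caught
sentiment(get_sentences("I really really love apple pie!!!"))$sentiment # amplifiers boost the score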

sentimentr is not without its shortcomings. It’s still a lexicon approach that suffers from reductiveness, even if its default lexicon is a combined and augmented version of the syuzhet lexicon (Jockers 2017) and Rinker’s augmented Hu & Liu (2004) dictionary from the lexicon package. Still a lexicon.

The proof is in the pudding. Below is a snippet of an HTML file created by another of sentimentr’s cool functions, highlight(), which paints sentences by sentiment. Clearly it thought I concluded this post on a negative note, but do you think so? I hope not…

The limits of lexicon-based sentiment analysis are clear.

# text holds the closing paragraphs of this post as a character vector
sentimentr::sentiment_by(text) %>% sentimentr::highlight()

To validate the classifier I just built (not technically a classifier, since I never dichotomized the continuous sentiment score into positive, negative, and neutral groups), I’d need labeled data to test against. Failing that, I could turn to a more sophisticated unsupervised approach, which is appealing but well beyond the scope of this post.
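If I did have labels, the check could be as simple as thresholding the scores and computing accuracy. A hypothetical sketch, where labeled is an imagined data frame with a text column and a human-coded label column, and the ±0.1 neutral band is an arbitrary choice:

# Hypothetical: 'labeled' has columns text (character) and label
# ("positive" / "negative" / "neutral"); neither exists in this post's data
scores <- sentimentr::sentiment_by(sentimentr::get_sentences(labeled$text))
labeled$predicted <- ifelse(scores$ave_sentiment > 0.1, "positive",
                            ifelse(scores$ave_sentiment < -0.1, "negative", "neutral"))
mean(labeled$predicted == labeled$label) # share of sentences classified correctly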

But hey, now that I have an entire corpus of some 12k+ help docs, I have data aplenty to cut my teeth on in a later post!

The .R scripts from this post are here.

References and Further Reading:

Original story here.

— — — — — — — — — — — — — — — — — —

Read more data science articles on OpenDataScience.com, including tutorials and guides from beginner to advanced levels! Subscribe to our weekly newsletter here and receive the latest news every Thursday.
