ggplot2: a Framework for Thinking with Graphs

Data detectives

Creating high-quality visualizations takes time and practice, and performing Exploratory Data Analysis (EDA) is one of the best ways to get both. American mathematician John Tukey coined the term “Exploratory Data Analysis” in his seminal 1977 text. In it, Tukey compares creating graphs and visualizations to doing detective work.

EDA: A method for learning visualization

EDA is also a great way to hone your visualization skills. The graphs we create require us to combine the technical skills for building accurate charts with critical thinking and reasoning. The process starts with data in a spreadsheet, from which we calculate some basic counts and summaries. Next, we build visualizations for each column (univariate graphs) and then compare the columns to each other (bivariate graphs). Finally, if necessary, we create charts for three or more columns (multivariate graphs). All along the way, we use these visualizations to answer (and ask) questions about the data. We may also make discoveries that require us to restructure and reformat (or ‘wrangle’) the data before we can create a visualization that communicates what the data contain.
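The first step of that process, basic counts and summaries, can be sketched in a few lines of R. This is only an illustration, not a prescribed workflow: it assumes the palmerpenguins package is installed and borrows the penguins data used later in this article.

```r
# A minimal sketch of the "counts and summaries" step of EDA.
# Assumption: the palmerpenguins package is installed and supplies `penguins`.
library(palmerpenguins)

# How many observations (rows) and variables (columns) do we have?
dim(penguins)

# Basic counts for a categorical column
species_counts <- table(penguins$species)
species_counts

# Basic summary for a numeric column (min, quartiles, mean, max, NA count)
bill_summary <- summary(penguins$bill_length_mm)
bill_summary
```

Counts and summaries like these often raise the first questions (for example, are there missing values?) that the univariate and bivariate graphs then help answer.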

What are ‘data’?

Graphs are illustrations drawn from data, often to reduce the data’s complexity into a display we can process visually. We’ll consider data to be any rectangular arrangement of information, with rows representing different observations (e.g., participants in a survey, movies, US cities, etc.), and columns representing variable characteristics (e.g., answers to survey questions, movie critic scores, city population, etc.). The values representing a single measurement unit are at the intersection of the rows and columns (see the example below).
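As a rough sketch of this rectangular layout, here is a tiny data frame in base R. The movie titles and numbers are hypothetical, invented only to show rows, columns, and the value at their intersection.

```r
# Hypothetical values, made up purely to illustrate the row/column layout.
movies <- data.frame(
  title        = c("Movie A", "Movie B", "Movie C"),  # one row per movie
  critic_score = c(87, 64, 92),                        # a variable (column)
  runtime_min  = c(101, 134, 96)                       # another variable
)

nrow(movies)  # number of observations (rows)
ncol(movies)  # number of variables (columns)

# A single value sits at the intersection of one row and one column
score_b <- movies[2, "critic_score"]
score_b
```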

A grammar for graphics

I highly recommend using the ggplot2 package (built for the statistical programming language R) to create data visualizations. The underlying system for constructing graphs with ggplot2 is a comprehensive vocabulary and grammar of graphics (from Leland Wilkinson’s book, The Grammar of Graphics). Grammar exists for a reason: to have precisely and unambiguously defined concepts. Dedicating an entire language to building graphs might seem excessive, but like all technical endeavors, designing visualizations benefits from having a shared vocabulary for describing their attributes. A shared language can also provide a framework for building a mental model for graphs (mental models are internal representations of how some aspect of the world works).

Graph components

We’ll use the diagram below to define some standard graph components:

An example: Palmer penguins

These terms and definitions can seem a little abstract, so we’ll work through an example. Consider the data below, which contains ten measurements of penguin bill length from the Palmer Archipelago in Antarctica. We’ve stored these data in the penguins dataset.
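One way to look at a few of these measurements yourself is sketched below. It assumes the penguins data come from the palmerpenguins package (the URL used in the plot captions points there), and it simply takes the first ten rows, which may not be the same ten rows the original display showed.

```r
# Assumption: `penguins` is the data frame shipped with palmerpenguins.
library(palmerpenguins)

# Keep a few identifying columns plus the bill length, first ten rows only
first_ten <- head(penguins[, c("species", "island", "bill_length_mm")], 10)
first_ten
```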

Start with the labels

We’ll start by looking at the distribution of the bill length column (bill_length_mm) using a histogram. As I mentioned above, we’ll begin by making the labels for this graph with ggplot2’s labs() function:

library(ggplot2) # provides labs(), ggplot(), and the geom functions

bill_labels <- labs(title = "Distribution of Palmer penguins bill length",
                    subtitle = "Histogram of bill_length_mm",
                    caption = "https://allisonhorst.github.io/palmerpenguins/",
                    x = "Bill Length (mm)")

Build a canvas

Now that we have the labels for our plot, we can build the first layer of our graph. In ggplot2, layers are “a collection of geometric elements and statistical transformations.” We’ll use the ggplot() function to initialize the graph with the penguins data:

library(palmerpenguins) # assumption: the penguins data come from this package
ggplot(data = penguins) # initialize plot

Map data to aesthetics

The code above is the beginning of our plot’s first ‘layer.’ The display we’ve created is the canvas we’ll add a geom function to; in this case, it’s a geom_histogram(). Inside geom_histogram(), we’ll ‘map’ the bill length (bill_length_mm) to the x axis:

ggplot(data = penguins) +
  geom_histogram(mapping = aes(x = bill_length_mm))

# the same plot, with the labels added
ggplot(data = penguins) +
  geom_histogram(mapping = aes(x = bill_length_mm)) +
  bill_labels

Checkpoint

Let’s recap how we’ve created the graph above:

  1. We identified a dataset (penguins) and variable we wanted to investigate (bill_length_mm)
  2. We built the labels for our plot with the labs() function and stored them in bill_labels
  3. We initialized the plot with ggplot(data = ...)
  4. We added a geom function for the type of graph we wanted to build and mapped the aesthetics (geom_histogram(mapping = aes(x = ...)))
  5. We added our graph labels so the graph still makes sense when we come back to it in the future
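Put together, the five steps can be sketched as one self-contained script. The object name bill_hist is my own addition for illustration, and the script assumes both ggplot2 and palmerpenguins are installed.

```r
# Assumption: ggplot2 and palmerpenguins are installed.
library(ggplot2)
library(palmerpenguins)

# Steps 1 and 2: pick the data and variable, then build the labels
bill_labels <- labs(title = "Distribution of Palmer penguins bill length",
                    subtitle = "Histogram of bill_length_mm",
                    caption = "https://allisonhorst.github.io/palmerpenguins/",
                    x = "Bill Length (mm)")

# Steps 3 to 5: initialize the plot, add the geom with its mapping, add labels
# (bill_hist is a hypothetical name; storing the plot is optional)
bill_hist <- ggplot(data = penguins) +
  geom_histogram(mapping = aes(x = bill_length_mm)) +
  bill_labels

bill_hist # printing the object draws the plot
```

Storing the plot in an object is a design choice rather than a requirement; it makes it easy to redraw the graph or add more layers later.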

A visualization template

The great thing about creating visualizations with ggplot2 is that once we start thinking about graphs in terms of data, columns, and layers, we can build visualizations using any of ggplot2’s many geom functions. We can put the steps above into a template for creating graphs with ggplot2:

ggplot(data = <DATA>) +
  geom_function(mapping = aes(<AESTHETIC MAPPINGS>)) +
  <LABELS>
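To see the template at work with a different geom, here is one possible way to fill it in. The choice of geom_density() and the label text are my own, used only as an illustration of swapping one geom for another.

```r
# Assumption: ggplot2 and palmerpenguins are installed.
library(ggplot2)
library(palmerpenguins)

# <LABELS>: hypothetical label text for this illustration
density_labels <- labs(title = "Distribution of Palmer penguins bill length",
                       subtitle = "Density plot of bill_length_mm",
                       x = "Bill Length (mm)")

# <DATA> = penguins, geom_function = geom_density,
# <AESTHETIC MAPPINGS> = x = bill_length_mm
bill_density <- ggplot(data = penguins) +
  geom_density(mapping = aes(x = bill_length_mm)) +
  density_labels

bill_density
```

Only the geom function and the labels changed; the data and the aesthetic mapping are the same as in the histogram.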

Recreate an existing graph

Having a grammar of graphics also gives us terms and definitions that serve as a mental model for thinking about graphs. We can look at an existing visualization and break it down into the data, columns, aesthetics, and geoms. Consider the plot below of body mass and flipper length for the penguins in the penguins dataset:

penguin_labels <- labs(title = "Penguins body mass vs. flipper length",
                       subtitle = "Penguins from the Palmer Archipelago, Antarctica",
                       caption = "https://allisonhorst.github.io/palmerpenguins/",
                       x = "Body mass (g)",
                       y = "Flipper Length (mm)")

# TEMPLATE
ggplot(data = <DATA>) +
  geom_function(mapping = aes(<AESTHETIC MAPPINGS>)) +
  <LABELS>
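One way to fill in the template for this graph is sketched below. Since the labels put body mass on the x axis and flipper length on the y axis, I am assuming the plot is a scatter plot drawn with geom_point(); the column names body_mass_g and flipper_length_mm come from the palmerpenguins data.

```r
# Assumption: ggplot2 and palmerpenguins are installed, and the target
# graph is a scatter plot (the geom is inferred from the axis labels).
library(ggplot2)
library(palmerpenguins)

penguin_labels <- labs(title = "Penguins body mass vs. flipper length",
                       subtitle = "Penguins from the Palmer Archipelago, Antarctica",
                       caption = "https://allisonhorst.github.io/palmerpenguins/",
                       x = "Body mass (g)",
                       y = "Flipper Length (mm)")

# <DATA> = penguins, geom_function = geom_point,
# <AESTHETIC MAPPINGS> = x and y positions
penguin_scatter <- ggplot(data = penguins) +
  geom_point(mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
  penguin_labels

penguin_scatter
```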

“Infinitely extensible”

In closing

EDA is an excellent opportunity to practice building graphs because it’s an iterative, creative process. ggplot2’s grammar is an ideal tool for EDA because it supports rapid prototyping and quick feedback.

ggplot(data = <DATA>) +
  geom_function(mapping = aes(<AESTHETIC MAPPINGS>)) +
  geom_function(mapping = aes(<AESTHETIC MAPPINGS>)) +
  ... +
  <LABELS>
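As a sketch of this extended template, the example below stacks two geoms on the same canvas. The combination of geom_point() and geom_smooth() with a linear fit is my own choice for illustration, as is the label text.

```r
# Assumption: ggplot2 and palmerpenguins are installed.
library(ggplot2)
library(palmerpenguins)

# Hypothetical labels for this illustration
layer_labels <- labs(title = "Penguins body mass vs. flipper length",
                     x = "Body mass (g)",
                     y = "Flipper Length (mm)")

# Two geoms, each with its own aesthetic mapping, joined with trailing `+`
penguin_layers <- ggplot(data = penguins) +
  geom_point(mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_smooth(mapping = aes(x = body_mass_g, y = flipper_length_mm),
              method = "lm") +
  layer_labels

penguin_layers
```

Note that each `+` sits at the end of a line. If a line starts with `+`, R treats the previous line as a finished expression and the next layer is not added to the plot.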

ODSC - Open Data Science
