In this section, we’ll step back into theory for a bit to talk about where the “tidy” in tidyverse comes from and why it is an important feature of data. We’ll also see how to transform data from “messy”/wide to “tidy”/long and vice versa.

What is Tidy Data?

Why Tidy Data?

Tidy data enables us to do lots of things!

  1. Great ggplots
  2. Summarize/slice the data in multiple ways
  3. Enable Exploratory Data Analysis
  4. Ensure assumptions are met for methods
  5. Enable Confirmatory Data Analysis

Beware of columns masquerading as variables!

library(tidyverse)
library(readr)
fertility_data <- read_csv("data/total_fertility.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Total fertility rate` = col_character()
## )
## See spec(...) for full column specifications.
fertility_data

These column headers are actually values of a variable! 1800 is not the name of something that was measured; it is the year in which the values below it were recorded. The same is true for every other year column here. The values themselves are fertility rates.

Ask yourself: do these columns go together as a single observation for your analysis?

Also ask yourself: What is the unit of observation?

Making data tidy: gather()

Use gather() when you need to make a bunch of columns into one column. In other words, when you want to convert “wide data” to “long data.”

# gather() has three standard arguments: data, key, and value
# data is usually piped in via %>%
# key is the name of the new column that will hold the old column names
# value is the name of the new column that will hold the values from those columns

# We don't want the `Total fertility rate` column to be included as part of the
# gather() operation, so we use the `-` notation to exclude it.

fertility_tidy <- fertility_data %>% 
  gather(key = "Year", value = "fertilityRate", -`Total fertility rate`) %>% 
  # Re-arrange and rename columns
  select(Country = `Total fertility rate`, Year, fertilityRate) %>% 
  # Remove rows with missing values 
  # (there are countries that have little to no information)
  na.omit()

fertility_tidy
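
If you'd like to try the reshaping step without the CSV file, here is a minimal sketch on a made-up table. The `toy_wide` data and its values are invented for illustration and are not part of the workshop data. It also shows `pivot_longer()`, which in tidyr 1.0.0 and later performs the same reshaping as `gather()`.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data in the same wide layout: one column per year
toy_wide <- tibble(
  Country = c("A", "B"),
  `1800`  = c(7.0, 6.5),
  `1801`  = c(6.9, 6.4)
)

# Stack every column except Country into Year / fertilityRate pairs
toy_long <- gather(toy_wide, key = "Year", value = "fertilityRate", -Country)
toy_long

# The same reshaping with pivot_longer() (requires tidyr >= 1.0.0);
# the rows come out in a different order but hold the same values
toy_long2 <- pivot_longer(toy_wide, cols = -Country,
                          names_to = "Year", values_to = "fertilityRate")
toy_long2
```

Each country-year pair now sits on its own row, which is the shape that `group_by()` and `summarize()` expect.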

Your Task: using this tidy data

Exercise 3.1

As a refresher from earlier in the workshop, how would we find the average fertility for each year?

# Write and check your answer here
# ONE SOLUTION
fertility_tidy %>% 
  group_by(Year) %>% 
  summarize(mean_fert = mean(fertilityRate))

How about from 1860 on?

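fertility_tidy %>% 
  # Year holds the old column names, so it is stored as text; convert before comparing
  filter(as.numeric(Year) >= 1860) %>% 
  group_by(Year) %>% 
  summarize(mean_fert = mean(fertilityRate))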
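
One thing to watch when filtering on Year: the key column that `gather()` creates holds the old column names, so it is stored as text by default. The sketch below uses made-up data (the `toy_wide` table is hypothetical) to show the `convert = TRUE` argument of `gather()`, which asks it to guess a more suitable type for the key column.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data, not the workshop's fertility file
toy_wide <- tibble(
  Country = c("A", "B"),
  `1859`  = c(7.0, 6.5),
  `1860`  = c(6.9, 6.4)
)

# Default: the key column keeps the old column names as text
class(gather(toy_wide, key = "Year", value = "fertilityRate", -Country)$Year)

# convert = TRUE lets gather() turn year-like names into numbers
toy_long <- gather(toy_wide, key = "Year", value = "fertilityRate",
                   -Country, convert = TRUE)
class(toy_long$Year)
```

With a numeric Year, comparisons such as `Year >= 1860` behave as expected, rather than relying on how text happens to sort.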

Making one column into many: spread()

Sometimes, you will need to go the other direction: take a long format dataset and make it into a more matrix-like format. This is necessary for functions such as heatmap().

Let’s change things around and make the Country column into the variables (columns) in the dataset.

fertility_wide <- fertility_tidy %>% 
  # spread() takes a key (Country) and value (fertilityRate) argument
  # Note that we don't quote here, whereas we do in gather()
  spread(key = Country, value = fertilityRate) 

fertility_wide
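
To make the heatmap() connection concrete, here is a minimal sketch on made-up data. The `toy_long` table and its values are hypothetical. It spreads the long data, then moves Year into the row names so that only numeric columns remain, since `heatmap()` expects a numeric matrix.

```r
library(tidyr)
library(tibble)

# Hypothetical long-format data: one row per country-year pair
toy_long <- tibble(
  Country       = c("A", "A", "A", "B", "B", "B"),
  Year          = c(1800L, 1801L, 1802L, 1800L, 1801L, 1802L),
  fertilityRate = c(7.0, 6.9, 6.8, 6.5, 6.4, 6.2)
)

# One column per country, one row per year
toy_wide <- spread(toy_long, key = Country, value = fertilityRate)

# Keep only the numeric country columns and label the rows by year
toy_matrix <- as.matrix(toy_wide[, c("A", "B")])
rownames(toy_matrix) <- toy_wide$Year

# Rowv = NA and Colv = NA skip the clustering so the original order is kept;
# scale = "none" plots the raw values
heatmap(toy_matrix, Rowv = NA, Colv = NA, scale = "none")
```

The same steps should carry over to `fertility_wide`, assuming every country column is numeric and any remaining missing values are acceptable for your plot.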

Your Task - Who is the most democratic?

Exercise 3.2

Load the dem_score.csv dataset in the data folder. Tidy it up. Which countries had the highest democracy score in 2007?

Hint: you’ll have to use your dplyr skills as well.

#enter your answer here
# ONE SOLUTION
dem_score <- read_csv("data/dem_score.csv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   `1952` = col_integer(),
##   `1957` = col_integer(),
##   `1962` = col_integer(),
##   `1967` = col_integer(),
##   `1972` = col_integer(),
##   `1977` = col_integer(),
##   `1982` = col_integer(),
##   `1987` = col_integer(),
##   `1992` = col_integer(),
##   `1997` = col_integer(),
##   `2002` = col_integer(),
##   `2007` = col_integer()
## )
dem_score_tidy <- dem_score %>% 
  gather(key = "year", value = "democracy_score", -country)
dem_score_tidy %>% 
  filter(year == 2007) %>% 
  top_n(1, democracy_score)

What you learned in this section

How to convert


What’s Next?

We’ve shown you the bare basics of data wrangling in the tidyverse. There’s a ton more!

Where to go next?


Closing project


Conclusion

Data importing, wrangling, and tidying are often overlooked as important parts of the data analysis pipeline. The tidyverse packages are designed to work together so that you can import, tidy, and wrangle data frames within one consistent framework.

More resources

  • Ted and Jessica Minnier created a free DataCamp course that covers many of the topics from this workshop, if you’d like to go back and practice on your own.
  • Chester and Albert Kim wrote a free introductory textbook to help beginners get going with R.
  • We’re biased but we also highly recommend Dave Robinson’s Introduction to the Tidyverse course on DataCamp that Chester helped to author in his role at DataCamp.
  • Alison Hill will also be launching a follow-up DataCamp course on data importing, data taming, and data tidying tentatively titled “Working with Data in the Tidyverse” later this summer. You can track its progress here.

Post-session survey

We appreciate and yearn for your constructive and descriptive feedback so that we can improve as educators. To help us with this, please fill out this brief survey.

