In this section, we’ll step back into theory for a bit to talk about where the “tidy” in tidyverse comes from and why it is an important feature of data. We’ll also see how to transform data from “messy”/wide to “tidy”/long and vice versa.

What is Tidy Data?

Why Tidy Data?

Tidy data enables us to do lots of things!

  1. Great ggplots
  2. Summarize/slice the data in multiple ways
  3. Enable Exploratory Data Analysis
  4. Check that the assumptions of your statistical methods are met
  5. Enable Confirmatory Data Analysis
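
As a small illustration of point 2, here is a minimal sketch on a made-up tidy table. The `toy_tidy` tibble, its countries, and its values are invented for this example and are not taken from the fertility data. Because each row holds one country-year observation, the same table can be summarized by country or by year without any reshaping in between.

```r
library(dplyr)
library(tibble)

# Hypothetical tidy data: one row per country-year observation
toy_tidy <- tibble(
  Country       = c("A", "A", "B", "B"),
  Year          = c(1800, 1801, 1800, 1801),
  fertilityRate = c(7.0, 6.8, 5.0, 5.2)
)

# Summarize by country...
by_country <- toy_tidy %>%
  group_by(Country) %>%
  summarize(meanRate = mean(fertilityRate))

# ...or slice the very same table by year
by_year <- toy_tidy %>%
  group_by(Year) %>%
  summarize(meanRate = mean(fertilityRate))
```

The same long layout is also what ggplot2 expects when a variable such as `Year` is mapped to an axis or a color.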

Beware of columns masquerading as variables!

fertility_data <- read_csv("data/total_fertility.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Total fertility rate` = col_character()
## )
## See spec(...) for full column specifications.

These columns are actually categories! The header 1800 is not the name of a variable that describes the values below it. It is itself a value: the year in which those fertility rates were measured. The same is true of every other year column here.

Ask yourself: do these columns go together as a single observation for your analysis?

Also ask yourself: What is the unit of observation?
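
One way to make the unit-of-observation question concrete is to check that each intended observation appears exactly once. The sketch below assumes the unit is "one country in one year" and uses an invented `toy_obs` table; the names and values are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical long table; intended unit of observation: one country in one year
toy_obs <- tibble(
  Country       = c("A", "A", "B", "B"),
  Year          = c(1800, 1801, 1800, 1801),
  fertilityRate = c(7.0, 6.8, 5.0, 5.2)
)

# Any Country-Year pair that appears more than once would break that assumption
duplicates <- toy_obs %>%
  count(Country, Year) %>%
  filter(n > 1)

# TRUE when every Country-Year pair is unique
nrow(duplicates) == 0
```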

Making data tidy: gather()

Use gather() when you need to make a bunch of columns into one column. In other words, when you want to convert “wide data” to “long data.”
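
Before turning to the fertility data, here is a minimal sketch of the wide-to-long reshape on a tiny invented table (`toy_wide` and its values are hypothetical). It also shows the same reshape with `pivot_longer()`, which superseded `gather()` in tidyr 1.0.0; `gather()` still works, so the rest of this section keeps using it.

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one column per year
toy_wide <- tibble(
  Country = c("A", "B"),
  `1800`  = c(7.0, 5.0),
  `1801`  = c(6.8, 5.2)
)

# Wide to long with gather(): every column except Country is gathered
toy_long <- toy_wide %>%
  gather(key = "Year", value = "fertilityRate", -Country)

# The same reshape written with pivot_longer()
toy_long2 <- toy_wide %>%
  pivot_longer(cols = -Country, names_to = "Year", values_to = "fertilityRate")
```

Both results have one row per country-year pair, although the rows come out in a different order. In both cases `Year` is a character column, because it was built from column names.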

# gather() takes three main arguments: data, key, and value
# data is usually piped in with %>%
# key is the name of the new categorical column that will hold the old column names
# value is the name of the new column that will hold the values from those columns

# We don't want the `Total fertility rate` column to be included as part of the
# gather() operation, so we use the `-` notation to exclude it.

fertility_tidy <- fertility_data %>% 
  gather(key = "Year", value = "fertilityRate", -`Total fertility rate`) %>% 
  # Re-arrange and rename columns
  select(Country = `Total fertility rate`, Year, fertilityRate) %>% 
  # Remove rows with missing values 
  # (there are countries that have little to no information)