In this section, we’ll step back into theory for a bit to talk about where the “tidy” in tidyverse comes from and why it is an important feature of data. We’ll also see how to transform data from “messy”/wide to “tidy”/long and vice versa.
What is Tidy Data?
- each row corresponds to an observation
- each variable is a column
- each type of observation is in a different table
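To make the three rules concrete, here is a minimal sketch that stores the same made-up measurements twice, once in a wide layout and once in a tidy layout. The object names `wide_example` and `tidy_example` and all of the values are hypothetical, chosen only for illustration.

```r
library(tibble)

# Hypothetical toy values, invented purely for illustration.
# Wide layout: the years are spread across the column headers.
wide_example <- tibble(
  country = c("A", "B"),
  `1800`  = c(7.0, 5.5),
  `1801`  = c(6.9, 5.4)
)

# Tidy layout: one row per country-year observation,
# one column per variable (country, year, rate).
tidy_example <- tibble(
  country = c("A", "A", "B", "B"),
  year    = c(1800, 1801, 1800, 1801),
  rate    = c(7.0, 6.9, 5.5, 5.4)
)

wide_example
tidy_example
```

Both tables hold the same four measurements; only the tidy one gives each observation its own row.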
Why Tidy Data?
Tidy data enables us to do lots of things!
- Make great ggplots
- Summarize and slice the data in multiple ways
- Carry out Exploratory Data Analysis
- Check that the assumptions of our methods are met
- Carry out Confirmatory Data Analysis
Beware of columns masquerading as variables!
library(tidyverse)
library(readr)
fertility_data <- read_csv("data/total_fertility.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Total fertility rate` = col_character()
## )
## See spec(...) for full column specifications.
fertility_data
These column headers are actually categories! The header 1800 does not describe what the values beneath it measure: those values are fertility rates, and 1800 is the year in which they were measured. The same is true for every other year column here.
Ask yourself: do these columns go together as a single observation for your analysis?
Also ask yourself: What is the unit of observation?
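One quick way to spot headers that are really values is to look at the column names themselves. The sketch below does this on a small made-up table; `toy_wide` and `looks_like_year` are hypothetical names, and the four-digit pattern is only an assumption about how year headers look.

```r
library(tibble)

# Hypothetical toy data; the values are made up for illustration.
toy_wide <- tibble(
  Country = c("A", "B"),
  `1800`  = c(7.0, 5.5),
  `1801`  = c(6.9, 5.4)
)

# Flag column names that consist of exactly four digits.
looks_like_year <- grepl("^[0-9]{4}$", names(toy_wide))

# These headers are candidates for becoming values of a single Year column.
names(toy_wide)[looks_like_year]
```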
Making data tidy: gather()
Use gather() when you need to make a bunch of columns into one column. In other words, when you want to convert “wide data” to “long data.”
# gather() has three standard arguments: data, key, and value
# data is usually piped in with %>%
# key is the name you want for the new categorical column
# value is the name of the new column that holds the actual values
# We don't want the `Total fertility rate` column to be included as part of the
# gather() operation, so we use the `-` notation to exclude it.
fertility_tidy <- fertility_data %>%
  gather(key = "Year", value = "fertilityRate", -`Total fertility rate`) %>%
  # Re-arrange and rename columns
  select(Country = `Total fertility rate`, Year, fertilityRate) %>%
  # Remove rows with missing values
  # (there are countries that have little to no information)
  na.omit()
fertility_tidy
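The same pattern can be tried on a table small enough to check by eye. This is a minimal sketch on made-up data, not the workshop's dataset; `toy_wide` and `toy_tidy` are hypothetical names. It also uses two optional gather() arguments: `na.rm = TRUE`, which drops rows whose gathered value is missing, and `convert = TRUE`, which turns the year strings taken from the headers into integers.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data; the values are made up for illustration.
toy_wide <- tibble(
  Country = c("A", "B"),
  `1800`  = c(7.0, NA),
  `1801`  = c(6.9, 5.4)
)

toy_tidy <- gather(
  toy_wide,
  key = "Year", value = "fertilityRate",
  -Country,
  na.rm = TRUE,    # drop rows whose gathered value is missing
  convert = TRUE   # convert the Year strings to integers
)

toy_tidy
```

Note that `na.rm = TRUE` only looks at the gathered value column, whereas na.omit() drops a row if any of its columns is missing. Having Year stored as a number rather than as text makes comparisons such as `Year >= 1860` behave as expected.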
Your Task: using this tidy data
Exercise 3.1
As a refresher from earlier in the workshop, how would we find the average fertility for each year? How about from 1860 on?
# Write and check your answer here
Making one column into many: spread()
Sometimes, you will need to go the other direction: take a long format dataset and make it into a more matrix-like format. This is necessary for functions such as heatmap(), which expect a numeric matrix.
Let’s change things around and make the Country column into the variables (columns) in the dataset.
fertility_wide <- fertility_tidy %>%
  # spread() takes a key (Country) and value (fertilityRate) argument
  # Note that we don't quote the names here because they refer to existing
  # columns, whereas in gather() we quoted them because they named new columns
  spread(key = Country, value = fertilityRate)
fertility_wide
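As a small illustration of the "matrix-like" step, the sketch below spreads a made-up long table and then converts the result into a numeric matrix of the kind heatmap() accepts. The names `toy_tidy`, `toy_wide`, and `toy_matrix` are hypothetical, and the values are invented.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data in long (tidy) form; the values are made up.
toy_tidy <- tibble(
  Country       = c("A", "A", "B", "B"),
  Year          = c(1800L, 1801L, 1800L, 1801L),
  fertilityRate = c(7.0, 6.9, 5.5, 5.4)
)

# One row per Year, one column per Country.
toy_wide <- spread(toy_tidy, key = Country, value = fertilityRate)

# heatmap() needs a numeric matrix, so keep only the country columns
# and store the years as row names instead of as a column.
toy_matrix <- as.matrix(toy_wide[, c("A", "B")])
rownames(toy_matrix) <- toy_wide$Year

toy_matrix
# heatmap(toy_matrix)  # uncomment to draw the plot
```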
Your Task - Who is the most democratic?
Exercise 3.2
Load the dem_score.csv dataset in the data folder. Tidy it up. Which countries had the highest democracy score in 2007?
Hint: you’ll have to use your dplyr skills as well.
#enter your answer here
What you learned in this section
How to convert
- wide/messy data into long/tidy data using the gather() function in the tidyr package
- long data into wide data using the spread() function in the tidyr package
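If you work with a newer release of tidyr (1.0.0 or later), the same two conversions are also available as pivot_longer() and pivot_wider(). The sketch below assumes such a version is installed and uses made-up data; `toy_wide`, `toy_long`, and `toy_back` are hypothetical names.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data; the values are made up for illustration.
toy_wide <- tibble(
  Country = c("A", "B"),
  `1800`  = c(7.0, 5.5),
  `1801`  = c(6.9, 5.4)
)

# Wide to long: the counterpart of gather().
toy_long <- pivot_longer(
  toy_wide,
  cols = -Country,
  names_to = "Year",
  values_to = "fertilityRate"
)

# Long to wide: the counterpart of spread().
toy_back <- pivot_wider(
  toy_long,
  names_from = Year,
  values_from = fertilityRate
)

toy_long
toy_back
```

gather() and spread() continue to work, so everything in this section still applies.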
What’s Next?
We’ve shown you the bare basics of data wrangling in the tidyverse. There’s a ton more!
Where to go next?
Closing project
- Try to load in your own data and, if needed, use tidyr to get it into the right format so that you can use dplyr to do some data wrangling. If you don’t have your own data, use dplyr to do some analyses on the periodic_table data you loaded in before. We’ll be around to answer questions. Thanks much!
Conclusion
Data importing, wrangling, and tidying are often overlooked as important parts of the data analysis pipeline. The tidyverse packages are designed to work together to import, tidy, and wrangle data, all in a consistent framework built around data frames.
More resources
- Ted and Jessica Minnier created a free DataCamp course on many of the topics covered here if you’d like to go back and practice on your own.
- Chester and Albert Kim wrote a free introductory textbook to help beginners get going with R.
- We’re biased but we also highly recommend Dave Robinson’s Introduction to the Tidyverse course on DataCamp that Chester helped to author in his role at DataCamp.
- Alison Hill will also be launching a follow-up DataCamp course on data importing, data taming, and data tidying tentatively titled “Working with Data in the Tidyverse” later this summer. You can track its progress here.
Post-session survey
We appreciate and yearn for your constructive and descriptive feedback so that we can improve as educators. To further support this, please fill out this brief survey.