In this section, we’ll discuss Data Wrangling/Transformation via the dplyr
package. We’ll explore ways to
- choose subsets of data,
- aggregate data to create summaries,
- make new variables,
- and sort your data frames.
It is recommended you also explore the RStudio Cheatsheet on Data Transformation as we discuss this content.
Back to gapminder
Here is a look at the gapminder
data frame in the gapminder
package.
library(tidyverse)
library(gapminder)
gapminder
We recommend always checking out your data first before you hop into your analyses. Simply running the name of the object like we have with gapminder
above is helpful. The View()
command (note the capital “V”! R is case sensitive!) is also helpful. Try it out with View(gapminder)
.
Say we wanted mean life expectancy across all years for Asia
# Base R
asia <- gapminder[gapminder$continent == "Asia", ]
mean(asia$lifeExp)
## [1] 60.0649
library(dplyr)
gapminder %>%
filter(continent == "Asia") %>%
summarize(mean_exp = mean(lifeExp))
The pipe %>%
- A way to chain together commands
- This helps with making your analyses readable and exploratory at the same time
The Five Main Verbs (5MV) of data wrangling
filter()
summarize()
group_by()
mutate()
arrange()
filter()
- Select a subset of the rows of a data frame.
- The arguments are the “filters” that you’d like to apply.
library(gapminder); library(dplyr)
gap_2007 <- gapminder %>% filter(year == 2007)
gap_2007
- Use
==
to compare a variable to a value.
- Remember to include quotes around strings, but they are not needed around numbers.
Your turn
Exercise 1.1
Choose only Asian rows in the gapminder
data and provide the name of asian_countries
to that result.
# Enter and try your answer here.
# ONE SOLUTION
asian_countries <- gapminder %>%
filter(continent == "Asia")
asian_countries
Logical operators
- Use
|
to check for any in multiple filters being true:
gapminder %>%
filter(year == 2002 | continent == "Europe")
- Use
&
or ,
to check for all of multiple filters being true:
gapminder %>%
filter(year == 2002, continent == "Europe")
- Use
%in%
to check for any being true (shortcut to using |
repeatedly with ==
)
gapminder %>%
filter(country %in% c("Argentina", "Belgium", "Mexico"),
year %in% c(1987, 1992))
Your turn
Exercise 1.2
Choose only rows corresponding to African or European nations.
# Enter and try your answer here.
# ONE SOLUTION
gapminder %>%
filter(continent %in% c("Africa", "Europe"))
summarize()
- Any numerical summary that you want to apply to a column of a data frame is specified within
summarize()
.
max_exp_1997 <- gapminder %>%
filter(year == 1997) %>%
summarize(max_exp = max(lifeExp))
max_exp_1997
Combining summarize()
with group_by()
When you’d like to determine a numerical summary for all levels of a different categorical variable
max_exp_1997_by_cont <- gapminder %>%
filter(year == 1997) %>%
group_by(continent) %>%
summarize(max_exp = max(lifeExp))
max_exp_1997_by_cont
Without the %>%
It’s hard to appreciate the %>%
without seeing what the code would look like without it:
max_exp_1997_by_cont <-
summarize(
group_by(
filter(
gapminder,
year == 1997),
continent),
max_exp = max(lifeExp))
max_exp_1997_by_cont
mutate()
- Allows you to
- create a new variable based on other variables OR
- change the contents of an existing variable
- create a new variable based on other variables
gap_w_gdp <- gapminder %>% mutate(gdp = pop * gdpPercap)
gap_w_gdp
mutate()
- change the contents of an existing variable
gap_weird <- gapminder %>% mutate(pop = pop + 1000)
gap_weird
arrange()
- Reorders the rows in a data frame based on the values of one or more variables
gapminder %>%
arrange(year, country)
- Can also put into descending order
gapminder %>%
filter(year > 2000) %>%
arrange(desc(lifeExp))
Other useful dplyr
verbs
select
top_n
sample_n
slice
glimpse
rename
Your turn
Exercise 1.3
Determine which African country had the highest GDP per capita in 1982 using the gapminder
data in the gapminder
package.
#Space for your answer here.
gapminder %>%
filter(continent == "Africa", year == 1982) %>%
arrange(desc(gdpPercap))
Challenge - Exercise 1.3C
For both of these tasks below, use the bechdel
data frame in the fivethirtyeight
package:
- Use the
count
function in the dplyr
package to determine how many movies in 2013 fell into each of the different categories for clean_test
.
- Determine the percentage of movies that received the value of
"ok"
across all years.
library(fivethirtyeight)
bechdel %>%
filter(year == 2013) %>%
count(clean_test)
bechdel %>%
summarize(perc_ok = mean(clean_test == "ok") * 100)
Your turn
Exercise 1.4
Determine the top five movies in terms of domestic return on investment for 2013 scaled data using the bechdel
data frame in the fivethirtyeight
package.
#Space for your answer here.
bechdel %>%
mutate(roi_2013 = domgross_2013 / budget_2013) %>%
top_n(5, roi_2013)
What you learned in this section
Different ways in the dplyr
package to
- choose subsets of data using the
filter()
verb
- aggregate data to create summaries using the
summarize()
and group_by
verbs
- make new variables using the
mutate()
verb
- sort your data frames using the
arrange()
verb
LS0tCnRpdGxlOiAiUGFydCAxIC0gSW50cm8gdG8gYGRwbHlyYCIKYXV0aG9yOiAiQ2hlc3RlciBJc21heSIKb3V0cHV0OiAKICBodG1sX2RvY3VtZW50OgogICAgY29kZV9kb3dubG9hZDogdHJ1ZQogICAgY29kZV9mb2xkaW5nOiBzaG93CiAgICBkZl9wcmludDogcGFnZWQKLS0tCgpJbiB0aGlzIHNlY3Rpb24sIHdlJ2xsIGRpc2N1c3MgRGF0YSBXcmFuZ2xpbmcvVHJhbnNmb3JtYXRpb24gdmlhIHRoZSBgZHBseXJgIHBhY2thZ2UuICBXZSdsbCBleHBsb3JlIHdheXMgdG8KCjEuIGNob29zZSBzdWJzZXRzIG9mIGRhdGEsIAoxLiBhZ2dyZWdhdGUgZGF0YSB0byBjcmVhdGUgc3VtbWFyaWVzLCAKMS4gbWFrZSBuZXcgdmFyaWFibGVzLCAKMS4gYW5kIHNvcnQgeW91ciBkYXRhIGZyYW1lcy4gIAoKSXQgaXMgcmVjb21tZW5kZWQgeW91IGFsc28gZXhwbG9yZSB0aGUgUlN0dWRpbyBDaGVhdHNoZWV0IG9uIFtEYXRhIFRyYW5zZm9ybWF0aW9uXShodHRwczovL2dpdGh1Yi5jb20vcnN0dWRpby9jaGVhdHNoZWV0cy9yYXcvbWFzdGVyL2RhdGEtdHJhbnNmb3JtYXRpb24ucGRmKSBhcyB3ZSBkaXNjdXNzIHRoaXMgY29udGVudC4KCiMjIyBCYWNrIHRvIGBnYXBtaW5kZXJgCgpIZXJlIGlzIGEgbG9vayBhdCB0aGUgYGdhcG1pbmRlcmAgZGF0YSBmcmFtZSBpbiB0aGUgYGdhcG1pbmRlcmAgcGFja2FnZS4KCmBgYHtyIG1lc3NhZ2U9RkFMU0V9CmxpYnJhcnkodGlkeXZlcnNlKQpsaWJyYXJ5KGdhcG1pbmRlcikKZ2FwbWluZGVyCmBgYAoKV2UgcmVjb21tZW5kIGFsd2F5cyBjaGVja2luZyBvdXQgeW91ciBkYXRhIGZpcnN0IGJlZm9yZSB5b3UgaG9wIGludG8geW91ciBhbmFseXNlcy4gU2ltcGx5IHJ1bm5pbmcgdGhlIG5hbWUgb2YgdGhlIG9iamVjdCBsaWtlIHdlIGhhdmUgd2l0aCBgZ2FwbWluZGVyYCBhYm92ZSBpcyBoZWxwZnVsLiBUaGUgYFZpZXcoKWAgY29tbWFuZCAobm90ZSB0aGUgY2FwaXRhbCAiViIhIFIgaXMgY2FzZSBzZW5zaXRpdmUhKSBpcyBhbHNvIGhlbHBmdWwuIFRyeSBpdCBvdXQgd2l0aCBgVmlldyhnYXBtaW5kZXIpYC4KClNheSB3ZSB3YW50ZWQgbWVhbiBsaWZlIGV4cGVjdGFuY3kgYWNyb3NzIGFsbCB5ZWFycyBmb3IgQXNpYQoKYGBge3J9CiMgQmFzZSBSCmFzaWEgPC0gZ2FwbWluZGVyW2dhcG1pbmRlciRjb250aW5lbnQgPT0gIkFzaWEiLCBdCm1lYW4oYXNpYSRsaWZlRXhwKQpgYGAKCmBgYHtyfQpsaWJyYXJ5KGRwbHlyKQpnYXBtaW5kZXIgJT4lIAogIGZpbHRlcihjb250aW5lbnQgPT0gIkFzaWEiKSAlPiUKICBzdW1tYXJpemUobWVhbl9leHAgPSBtZWFuKGxpZmVFeHApKQpgYGAKCiMjIFRoZSBwaXBlIGAlPiVgCgpgciBrbml0cjo6aW5jbHVkZV9ncmFwaGljcygiZmlncy9waXBlLnBuZyIpYCAmZW1zcDsgJmVtc3A7YHIga25pdHI6OmluY2x1ZGVfZ3JhcGhpY3MoImZpZ3MvTWFncml0dGVQaXBlLmpwZyIpYAoKLSBBIHdheSB0byBjaGFpbiB0b2dldGhlciBjb21tYW5kcwotIFRoaXMgaGVscHMgd2l0aCBtYWtpbmcgeW91ciBhbmFseXNlcyByZWFkYWJsZSBhbmQgZXhwbG9yYXRvcnkgYXQgdGhlIHNhbWUgdGltZQoKCiMgVGhlIEZpdmUgTWFpbiBWZXJicyAoNU1WKSBvZiBkYXRhIHdyYW5nbGluZwoKIyMjIGBmaWx0ZXIoKWAgPGJyPiBgc3VtbWFyaXplKClgIDxicj4gYGdyb3VwX2J5KClgIDxicj4gYG11dGF0ZSgpYCA8YnI+IGBhcnJhbmdlKClgCgotLS0KCiMjIGBmaWx0ZXIoKWAKCi0gU2VsZWN0IGEgc3Vic2V0IG9mIHRoZSByb3dzIG9mIGEgZGF0YSBmcmFtZS4gCi0gVGhlIGFyZ3VtZW50cyBhcmUgdGhlICJmaWx0ZXJzIiB0aGF0IHlvdSdkIGxpa2UgdG8gYXBwbHkuCgpgYGB7cn0KbGlicmFyeShnYXBtaW5kZXIpOyBsaWJyYXJ5KGRwbHlyKQpnYXBfMjAwNyA8LSBnYXBtaW5kZXIgJT4lIGZpbHRlcih5ZWFyID09IDIwMDcpCmdhcF8yMDA3CmBgYAoKLSBVc2UgYD09YCB0byBjb21wYXJlIGEgdmFyaWFibGUgdG8gYSB2YWx1ZS4KLSBSZW1lbWJlciB0byBpbmNsdWRlIHF1b3RlcyBhcm91bmQgc3RyaW5ncywgYnV0IHRoZXkgYXJlIG5vdCBuZWVkZWQgYXJvdW5kIG51bWJlcnMuCgojIyBZb3VyIHR1cm4KCiMjIyBFeGVyY2lzZSAxLjEKCkNob29zZSBvbmx5IEFzaWFuIHJvd3MgaW4gdGhlIGBnYXBtaW5kZXJgIGRhdGEgYW5kIHByb3ZpZGUgdGhlIG5hbWUgb2YgYGFzaWFuX2NvdW50cmllc2AgdG8gdGhhdCByZXN1bHQuCgpgYGB7cn0KIyBFbnRlciBhbmQgdHJ5IHlvdXIgYW5zd2VyIGhlcmUuCiMgT05FIFNPTFVUSU9OCmFzaWFuX2NvdW50cmllcyA8LSBnYXBtaW5kZXIgJT4lIAogIGZpbHRlcihjb250aW5lbnQgPT0gIkFzaWEiKQphc2lhbl9jb3VudHJpZXMKYGBgCgoKIyMgTG9naWNhbCBvcGVyYXRvcnMKCi0gVXNlIGB8YCB0byBjaGVjayBmb3IgYW55IGluIG11bHRpcGxlIGZpbHRlcnMgYmVpbmcgdHJ1ZToKCmBgYHtyfQpnYXBtaW5kZXIgJT4lIAogIGZpbHRlcih5ZWFyID09IDIwMDIgfCBjb250aW5lbnQgPT0gIkV1cm9wZSIpCmBgYAoKLSBVc2UgYCZgIG9yIGAsYCB0byBjaGVjayBmb3IgYWxsIG9mIG11bHRpcGxlIGZpbHRlcnMgYmVpbmcgdHJ1ZToKCmBgYHtyfQpnYXBtaW5kZXIgJT4lIAogIGZpbHRlcih5ZWFyID09IDIwMDIsIGNvbnRpbmVudCA9PSAiRXVyb3BlIikKYGBgCgotIFVzZSBgJWluJWAgdG8gY2hlY2sgZm9yIGFueSBiZWluZyB0cnVlIChzaG9ydGN1dCB0byB1c2luZyBgfGAgcmVwZWF0ZWRseSB3aXRoIGA9PWApCgpgYGB7cn0KZ2FwbWluZGVyICU+JSAKICBmaWx0ZXIoY291bnRyeSAlaW4lIGMoIkFyZ2VudGluYSIsICJCZWxnaXVtIiwgIk1leGljbyIpLAogICAgICAgICB5ZWFyICVpbiUgYygxOTg3LCAxOTkyKSkKYGBgCgojIyBZb3VyIHR1cm4KCiMjIyBFeGVyY2lzZSAxLjIKCkNob29zZSBvbmx5IHJvd3MgY29ycmVzcG9uZGluZyB0byBBZnJpY2FuIG9yIEV1cm9wZWFuIG5hdGlvbnMuCgpgYGB7cn0KIyBFbnRlciBhbmQgdHJ5IHlvdXIgYW5zd2VyIGhlcmUuCiMgT05FIFNPTFVUSU9OCmdhcG1pbmRlciAlPiUgCiAgZmlsdGVyKGNvbnRpbmVudCAlaW4lIGMoIkFmcmljYSIsICJFdXJvcGUiKSkKYGBgCgoKIyMgYHN1bW1hcml6ZSgpYAoKLSBBbnkgbnVtZXJpY2FsIHN1bW1hcnkgdGhhdCB5b3Ugd2FudCB0byBhcHBseSB0byBhIGNvbHVtbiBvZiBhIGRhdGEgZnJhbWUgaXMgc3BlY2lmaWVkIHdpdGhpbiBgc3VtbWFyaXplKClgLgoKYGBge3J9Cm1heF9leHBfMTk5NyA8LSBnYXBtaW5kZXIgJT4lIAogIGZpbHRlcih5ZWFyID09IDE5OTcpICU+JSAKICBzdW1tYXJpemUobWF4X2V4cCA9IG1heChsaWZlRXhwKSkKbWF4X2V4cF8xOTk3CmBgYAoKIyMjIENvbWJpbmluZyBgc3VtbWFyaXplKClgIHdpdGggYGdyb3VwX2J5KClgCgpXaGVuIHlvdSdkIGxpa2UgdG8gZGV0ZXJtaW5lIGEgbnVtZXJpY2FsIHN1bW1hcnkgZm9yIGFsbApsZXZlbHMgb2YgYSBkaWZmZXJlbnQgY2F0ZWdvcmljYWwgdmFyaWFibGUKCmBgYHtyfQptYXhfZXhwXzE5OTdfYnlfY29udCA8LSBnYXBtaW5kZXIgJT4lIAogIGZpbHRlcih5ZWFyID09IDE5OTcpICU+JSAKICBncm91cF9ieShjb250aW5lbnQpICU+JQogIHN1bW1hcml6ZShtYXhfZXhwID0gbWF4KGxpZmVFeHApKQptYXhfZXhwXzE5OTdfYnlfY29udApgYGAKCgojIyMgV2l0aG91dCB0aGUgYCU+JWAKCkl0J3MgaGFyZCB0byBhcHByZWNpYXRlIHRoZSBgJT4lYCB3aXRob3V0IHNlZWluZyB3aGF0IHRoZSBjb2RlIHdvdWxkIGxvb2sgbGlrZSB3aXRob3V0IGl0OgoKYGBge3J9Cm1heF9leHBfMTk5N19ieV9jb250IDwtIAogIHN1bW1hcml6ZSgKICAgIGdyb3VwX2J5KAogICAgICBmaWx0ZXIoCiAgICAgICAgZ2FwbWluZGVyLCAKICAgICAgICAgIHllYXIgPT0gMTk5NyksIAogICAgICBjb250aW5lbnQpLAogICAgbWF4X2V4cCA9IG1heChsaWZlRXhwKSkKbWF4X2V4cF8xOTk3X2J5X2NvbnQKYGBgCgojIyBgbXV0YXRlKClgCgotIEFsbG93cyB5b3UgdG8gCiAgICAxLiBjcmVhdGUgYSBuZXcgdmFyaWFibGUgYmFzZWQgb24gb3RoZXIgdmFyaWFibGVzIE9SCiAgICAyLiBjaGFuZ2UgdGhlIGNvbnRlbnRzIG9mIGFuIGV4aXN0aW5nIHZhcmlhYmxlCgoKMS4gY3JlYXRlIGEgbmV3IHZhcmlhYmxlIGJhc2VkIG9uIG90aGVyIHZhcmlhYmxlcwoKYGBge3J9CmdhcF93X2dkcCA8LSBnYXBtaW5kZXIgJT4lIG11dGF0ZShnZHAgPSBwb3AgKiBnZHBQZXJjYXApCmdhcF93X2dkcApgYGAKCiMjIGBtdXRhdGUoKWAKCjIuIGNoYW5nZSB0aGUgY29udGVudHMgb2YgYW4gZXhpc3RpbmcgdmFyaWFibGUKCmBgYHtyfQpnYXBfd2VpcmQgPC0gZ2FwbWluZGVyICU+JSBtdXRhdGUocG9wID0gcG9wICsgMTAwMCkKZ2FwX3dlaXJkCmBgYAoKIyMgYGFycmFuZ2UoKWAKCi0gUmVvcmRlcnMgdGhlIHJvd3MgaW4gYSBkYXRhIGZyYW1lIGJhc2VkIG9uIHRoZSB2YWx1ZXMgb2Ygb25lIG9yIG1vcmUgdmFyaWFibGVzCgpgYGB7cn0KZ2FwbWluZGVyICU+JQogIGFycmFuZ2UoeWVhciwgY291bnRyeSkKYGBgCgotIENhbiBhbHNvIHB1dCBpbnRvIGRlc2NlbmRpbmcgb3JkZXIKCmBgYHtyIGRlc2N9CmdhcG1pbmRlciAlPiUKICBmaWx0ZXIoeWVhciA+IDIwMDApICU+JQogIGFycmFuZ2UoZGVzYyhsaWZlRXhwKSkKYGBgCgojIyBPdGhlciB1c2VmdWwgYGRwbHlyYCB2ZXJicwoKLSBgc2VsZWN0YAotIGB0b3BfbmAKLSBgc2FtcGxlX25gCi0gYHNsaWNlYAotIGBnbGltcHNlYAotIGByZW5hbWVgCgojIyBZb3VyIHR1cm4KCiMjIyBFeGVyY2lzZSAxLjMKCkRldGVybWluZSB3aGljaCBBZnJpY2FuIGNvdW50cnkgaGFkIHRoZSBoaWdoZXN0IEdEUCBwZXIgY2FwaXRhIGluIDE5ODIgdXNpbmcgdGhlIGBnYXBtaW5kZXJgIGRhdGEgaW4gdGhlIGBnYXBtaW5kZXJgIHBhY2thZ2UuCgpgYGB7cn0KI1NwYWNlIGZvciB5b3VyIGFuc3dlciBoZXJlLgpnYXBtaW5kZXIgJT4lCiAgZmlsdGVyKGNvbnRpbmVudCA9PSAiQWZyaWNhIiwgeWVhciA9PSAxOTgyKSAlPiUgCiAgYXJyYW5nZShkZXNjKGdkcFBlcmNhcCkpCmBgYAoKIyMjIENoYWxsZW5nZSAtIEV4ZXJjaXNlIDEuM0MKCkZvciBib3RoIG9mIHRoZXNlIHRhc2tzIGJlbG93LCB1c2UgdGhlIGBiZWNoZGVsYCBkYXRhIGZyYW1lIGluIHRoZSBgZml2ZXRoaXJ0eWVpZ2h0YCBwYWNrYWdlOgoKLSBVc2UgdGhlIGBjb3VudGAgZnVuY3Rpb24gaW4gdGhlIGBkcGx5cmAgcGFja2FnZSB0byBkZXRlcm1pbmUgaG93IG1hbnkgbW92aWVzCmluIDIwMTMgZmVsbCBpbnRvIGVhY2ggb2YgdGhlIGRpZmZlcmVudCBjYXRlZ29yaWVzIGZvciBgY2xlYW5fdGVzdGAuCi0gRGV0ZXJtaW5lIHRoZSBwZXJjZW50YWdlIG9mIG1vdmllcyB0aGF0IHJlY2VpdmVkIHRoZSB2YWx1ZSBvZiBgIm9rImAgYWNyb3NzIGFsbCB5ZWFycy4KCmBgYHtyfQpsaWJyYXJ5KGZpdmV0aGlydHllaWdodCkKYmVjaGRlbCAlPiUgCiAgZmlsdGVyKHllYXIgPT0gMjAxMykgJT4lIAogIGNvdW50KGNsZWFuX3Rlc3QpCmBgYAoKYGBge3J9CmJlY2hkZWwgJT4lIAogIHN1bW1hcml6ZShwZXJjX29rID0gbWVhbihjbGVhbl90ZXN0ID09ICJvayIpICogMTAwKQpgYGAKCgotLS0KCgojIyBZb3VyIHR1cm4KCiMjIyBFeGVyY2lzZSAxLjQKCkRldGVybWluZSB0aGUgdG9wIGZpdmUgbW92aWVzIGluIHRlcm1zIG9mIGRvbWVzdGljIHJldHVybiBvbiBpbnZlc3RtZW50IGZvciAyMDEzIHNjYWxlZCBkYXRhIHVzaW5nIHRoZSBgYmVjaGRlbGAgZGF0YSBmcmFtZSBpbiB0aGUgYGZpdmV0aGlydHllaWdodGAgcGFja2FnZS4KCgpgYGB7cn0KI1NwYWNlIGZvciB5b3VyIGFuc3dlciBoZXJlLgpiZWNoZGVsICU+JSAKICBtdXRhdGUocm9pXzIwMTMgPSBkb21ncm9zc18yMDEzIC8gYnVkZ2V0XzIwMTMpICU+JSAKICB0b3Bfbig1LCByb2lfMjAxMykKYGBgCgoKIyMgV2hhdCB5b3UgbGVhcm5lZCBpbiB0aGlzIHNlY3Rpb24KCkRpZmZlcmVudCB3YXlzIGluIHRoZSBgZHBseXJgIHBhY2thZ2UgdG8KCi0gY2hvb3NlIHN1YnNldHMgb2YgZGF0YSB1c2luZyB0aGUgYGZpbHRlcigpYCB2ZXJiCi0gYWdncmVnYXRlIGRhdGEgdG8gY3JlYXRlIHN1bW1hcmllcyB1c2luZyB0aGUgYHN1bW1hcml6ZSgpYCBhbmQgYGdyb3VwX2J5YCB2ZXJicwotIG1ha2UgbmV3IHZhcmlhYmxlcyB1c2luZyB0aGUgYG11dGF0ZSgpYCB2ZXJiCi0gc29ydCB5b3VyIGRhdGEgZnJhbWVzIHVzaW5nIHRoZSBgYXJyYW5nZSgpYCB2ZXJi