In this section, we’ll load in external data using several built-in packages to the tidyverse
. In particular, we’ll cover:
- Loading excel spreadsheets with
readxl
- Loading in flat text file data, comma separated.
- Loading datafiles from SAS/SPSS with
haven
Some things to keep in mind
For the most part, we will be loading in our data as data.frame
s or tbl
s with these packages.
One thing to keep in mind is that oftentimes, files have a bit of explanatory text and the header is not always going to be on the first line in our files.
Important note: The most common problem encountered when reading in data from a file is not reading specifying where to tell RStudio to look for the file. RStudio Projects were created to help with this process.
Look down to just below the words Console in the tabs in the bottom left of RStudio. Here you will find the “current working directory.” This is where R is working from at the moment. You can assess all files relative to this directory without doing anything extra at all in terms of specifying the path to get to the file. We’ll see examples of this in a bit. If you want to read in files from somewhere outside of this current working directory, you’d need to tell R exactly where to find the file. More on this too below.
Directories and file management is often really tricky for beginners. If you struggle with this, that’s OK. It will get better as you get more practice. Pay careful attention to the error messages and learn to love the Files tab in the bottom right of RStudio. This will help you track down where your files are located relative to the current working directory.
readxl
package
The data we will use here comes from the movies
data frame in the ggplot2movies
package.
The readxl
package handles loading both 2008 (xlsx) and earlier excel files (xls). To read in Excel files, use the read_excel()
function:
library(readxl)
movies_from_excel <- read_excel("data/movies.xlsx", sheet = 1)
movies_from_excel
readr
package
Flat files, such as comma-separated-value (.csv), and other types of simple rectangular data.
read_csv
: Reading in comma separated value files
library(readr)
movies_from_csv <- read_csv("data/movies.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## title = col_character(),
## year = col_integer(),
## length = col_integer(),
## budget = col_integer(),
## votes = col_integer(),
## mpaa = col_character(),
## Action = col_integer(),
## Animation = col_integer(),
## Comedy = col_integer(),
## Drama = col_integer(),
## Documentary = col_integer(),
## Romance = col_integer(),
## Short = col_integer()
## )
## See spec(...) for full column specifications.
movies_from_csv
read_delim
: Your general purpose flat file reader. For example, if you have a tab delimited file, you can specify delim
= “” (“” is how you specify tabs).
library(readr)
movies <- read_delim("data/movies.csv", delim = ",")
## Parsed with column specification:
## cols(
## .default = col_double(),
## title = col_character(),
## year = col_integer(),
## length = col_integer(),
## budget = col_integer(),
## votes = col_integer(),
## mpaa = col_character(),
## Action = col_integer(),
## Animation = col_integer(),
## Comedy = col_integer(),
## Drama = col_integer(),
## Documentary = col_integer(),
## Romance = col_integer(),
## Short = col_integer()
## )
## See spec(...) for full column specifications.
movies
Be sure to check the help for all of these (via ?read_delim
) to see all of the other options available. Some that come up frequently in our own experience are
na
- If your missing values are coded differently than NA
, you can set the value here.
skip
- if there’s explanatory text before the header, use this argument to skip these lines. For example, if your header starts at line 4, use skip = 3
col_types
- this is where you have more control on the data types for each column.
haven
package
Reading in “external” data from inferior, I mean, other software packages like SPSS (.sav
), Stata (.dta
), SAS, etc. Check the haven package website for assistance.
Your turn
Exercise 2.1
Load the periodic table data stored in the data
folder. (Hint: you’ll have to use the haven
package - look for a read_
function that can read the .sav
file format.)
# Write and check your answer here.
What you learned in this section
Read in data from files into data.frame
s from
- Excel files using the
read_excel()
function in the readxl
package,
- CSV files using the
read_csv()
and read_delim()
functions in the readr
package
- data from proprietary statistical software using the
read_spss()
, read_stata()
, and read_sas()
functions in the haven
package
LS0tCnRpdGxlOiAiUGFydCAyIC0gTG9hZGluZyBkYXRhIHdpdGggYHJlYWR4bGAsIGBoYXZlbmAgYW5kIGByZWFkcmAiCmF1dGhvcjogIkNoZXN0ZXIgSXNtYXkgYW5kIFRlZCBMYWRlcmFzIgpvdXRwdXQ6IAogIGh0bWxfZG9jdW1lbnQ6CiAgICBjb2RlX2Rvd25sb2FkOiB0cnVlCiAgICBjb2RlX2ZvbGRpbmc6IGhpZGUKICAgIGRmX3ByaW50OiBwYWdlZAotLS0KCkluIHRoaXMgc2VjdGlvbiwgd2UnbGwgbG9hZCBpbiBleHRlcm5hbCBkYXRhIHVzaW5nIHNldmVyYWwgYnVpbHQtaW4gcGFja2FnZXMgdG8gdGhlIGB0aWR5dmVyc2VgLiBJbiBwYXJ0aWN1bGFyLCB3ZSdsbCBjb3ZlcjoKCjEuIExvYWRpbmcgZXhjZWwgc3ByZWFkc2hlZXRzIHdpdGggYHJlYWR4bGAKMi4gTG9hZGluZyBpbiBmbGF0IHRleHQgZmlsZSBkYXRhLCBjb21tYSBzZXBhcmF0ZWQuCjMuIExvYWRpbmcgZGF0YWZpbGVzIGZyb20gU0FTL1NQU1Mgd2l0aCBgaGF2ZW5gCgoKIyMgU29tZSB0aGluZ3MgdG8ga2VlcCBpbiBtaW5kCgpGb3IgdGhlIG1vc3QgcGFydCwgd2Ugd2lsbCBiZSBsb2FkaW5nIGluIG91ciBkYXRhIGFzIGBkYXRhLmZyYW1lYHMgb3IgYHRibGBzIHdpdGggdGhlc2UgcGFja2FnZXMuIAoKT25lIHRoaW5nIHRvIGtlZXAgaW4gbWluZCBpcyB0aGF0IG9mdGVudGltZXMsIGZpbGVzIGhhdmUgYSBiaXQgb2YgZXhwbGFuYXRvcnkgdGV4dCBhbmQgdGhlIGhlYWRlciBpcyBub3QgYWx3YXlzIGdvaW5nIHRvIGJlIG9uIHRoZSBmaXJzdCBsaW5lIGluIG91ciBmaWxlcy4KCioqSW1wb3J0YW50IG5vdGUqKjogVGhlIG1vc3QgY29tbW9uIHByb2JsZW0gZW5jb3VudGVyZWQgd2hlbiByZWFkaW5nIGluIGRhdGEgZnJvbSBhIGZpbGUgaXMgbm90IHJlYWRpbmcgc3BlY2lmeWluZyB3aGVyZSB0byB0ZWxsIFJTdHVkaW8gdG8gbG9vayBmb3IgdGhlIGZpbGUuIFJTdHVkaW8gUHJvamVjdHMgd2VyZSBjcmVhdGVkIHRvIGhlbHAgd2l0aCB0aGlzIHByb2Nlc3MuCgpMb29rIGRvd24gdG8ganVzdCBiZWxvdyB0aGUgd29yZHMgKipDb25zb2xlKiogaW4gdGhlIHRhYnMgaW4gdGhlIGJvdHRvbSBsZWZ0IG9mIFJTdHVkaW8uIEhlcmUgeW91IHdpbGwgZmluZCB0aGUgImN1cnJlbnQgd29ya2luZyBkaXJlY3RvcnkuIiBUaGlzIGlzIHdoZXJlIFIgaXMgd29ya2luZyBmcm9tIGF0IHRoZSBtb21lbnQuIFlvdSBjYW4gYXNzZXNzIGFsbCBmaWxlcyByZWxhdGl2ZSB0byB0aGlzIGRpcmVjdG9yeSB3aXRob3V0IGRvaW5nIGFueXRoaW5nIGV4dHJhIGF0IGFsbCBpbiB0ZXJtcyBvZiBzcGVjaWZ5aW5nIHRoZSBwYXRoIHRvIGdldCB0byB0aGUgZmlsZS4gV2UnbGwgc2VlIGV4YW1wbGVzIG9mIHRoaXMgaW4gYSBiaXQuIElmIHlvdSB3YW50IHRvIHJlYWQgaW4gZmlsZXMgZnJvbSBzb21ld2hlcmUgb3V0c2lkZSBvZiB0aGlzIGN1cnJlbnQgd29ya2luZyBkaXJlY3RvcnksIHlvdSdkIG5lZWQgdG8gdGVsbCBSIGV4YWN0bHkgd2hlcmUgdG8gZmluZCB0aGUgZmlsZS4gTW9yZSBvbiB0aGlzIHRvbyBiZWxvdy4KCkRpcmVjdG9yaWVzIGFuZCBmaWxlIG1hbmFnZW1lbnQgaXMgb2Z0ZW4gcmVhbGx5IHRyaWNreSBmb3IgYmVnaW5uZXJzLiBJZiB5b3Ugc3RydWdnbGUgd2l0aCB0aGlzLCB0aGF0J3MgT0suIEl0IHdpbGwgZ2V0IGJldHRlciBhcyB5b3UgZ2V0IG1vcmUgcHJhY3RpY2UuIFBheSBjYXJlZnVsIGF0dGVudGlvbiB0byB0aGUgZXJyb3IgbWVzc2FnZXMgYW5kIGxlYXJuIHRvIGxvdmUgdGhlICoqRmlsZXMqKiB0YWIgaW4gdGhlIGJvdHRvbSByaWdodCBvZiBSU3R1ZGlvLiBUaGlzIHdpbGwgaGVscCB5b3UgdHJhY2sgZG93biB3aGVyZSB5b3VyIGZpbGVzIGFyZSBsb2NhdGVkIHJlbGF0aXZlIHRvIHRoZSBjdXJyZW50IHdvcmtpbmcgZGlyZWN0b3J5LgoKIyMgYHJlYWR4bGAgcGFja2FnZQoKVGhlIGRhdGEgd2Ugd2lsbCB1c2UgaGVyZSBjb21lcyBmcm9tIHRoZSBgbW92aWVzYCBkYXRhIGZyYW1lIGluIHRoZSBgZ2dwbG90Mm1vdmllc2AgcGFja2FnZS4KClRoZSBgcmVhZHhsYCBwYWNrYWdlIGhhbmRsZXMgbG9hZGluZyBib3RoIDIwMDggKHhsc3gpIGFuZCBlYXJsaWVyIGV4Y2VsIGZpbGVzICh4bHMpLiBUbyByZWFkIGluIEV4Y2VsIGZpbGVzLCB1c2UgdGhlIGByZWFkX2V4Y2VsKClgIGZ1bmN0aW9uOgoKYGBge3J9CmxpYnJhcnkocmVhZHhsKQptb3ZpZXNfZnJvbV9leGNlbCA8LSByZWFkX2V4Y2VsKCJkYXRhL21vdmllcy54bHN4Iiwgc2hlZXQgPSAxKQptb3ZpZXNfZnJvbV9leGNlbApgYGAKCiMjIGByZWFkcmAgcGFja2FnZQoKRmxhdCBmaWxlcywgc3VjaCBhcyBjb21tYS1zZXBhcmF0ZWQtdmFsdWUgKC5jc3YpLCBhbmQgb3RoZXIgdHlwZXMgb2Ygc2ltcGxlIHJlY3Rhbmd1bGFyIGRhdGEuCgotIGByZWFkX2NzdmA6IFJlYWRpbmcgaW4gY29tbWEgc2VwYXJhdGVkIHZhbHVlIGZpbGVzCgogICAgYGBge3J9CiAgICBsaWJyYXJ5KHJlYWRyKQogICAgbW92aWVzX2Zyb21fY3N2IDwtIHJlYWRfY3N2KCJkYXRhL21vdmllcy5jc3YiKQogICAgbW92aWVzX2Zyb21fY3N2CiAgICBgYGAKCi0gYHJlYWRfZGVsaW1gOiBZb3VyIGdlbmVyYWwgcHVycG9zZSBmbGF0IGZpbGUgcmVhZGVyLiBGb3IgZXhhbXBsZSwgaWYgeW91IGhhdmUgYSAqdGFiIGRlbGltaXRlZCBmaWxlKiwgeW91IGNhbiBzcGVjaWZ5IGBkZWxpbWAgPSAiXHQiICgiXHQiIGlzIGhvdyB5b3Ugc3BlY2lmeSB0YWJzKS4gCgogICAgYGBge3J9CiAgICBsaWJyYXJ5KHJlYWRyKQogICAgbW92aWVzIDwtIHJlYWRfZGVsaW0oImRhdGEvbW92aWVzLmNzdiIsIGRlbGltID0gIiwiKQogICAgbW92aWVzCiAgICBgYGAKCkJlIHN1cmUgdG8gY2hlY2sgdGhlIGhlbHAgZm9yIGFsbCBvZiB0aGVzZSAodmlhIGA/cmVhZF9kZWxpbWApIHRvIHNlZSBhbGwgb2YgdGhlIG90aGVyIG9wdGlvbnMgYXZhaWxhYmxlLiBTb21lIHRoYXQgY29tZSB1cCBmcmVxdWVudGx5IGluIG91ciBvd24gZXhwZXJpZW5jZSBhcmUgCgotIGBuYWAgLSBJZiB5b3VyIG1pc3NpbmcgdmFsdWVzIGFyZSBjb2RlZCBkaWZmZXJlbnRseSB0aGFuIGBOQWAsIHlvdSBjYW4gc2V0IHRoZSB2YWx1ZSBoZXJlLgotIGBza2lwYCAtIGlmIHRoZXJlJ3MgZXhwbGFuYXRvcnkgdGV4dCBiZWZvcmUgdGhlIGhlYWRlciwgdXNlIHRoaXMgYXJndW1lbnQgdG8gc2tpcCB0aGVzZSBsaW5lcy4gRm9yIGV4YW1wbGUsIGlmIHlvdXIgaGVhZGVyIHN0YXJ0cyBhdCBsaW5lIDQsIHVzZSBgc2tpcCA9IDNgIAotIGBjb2xfdHlwZXNgIC0gdGhpcyBpcyB3aGVyZSB5b3UgaGF2ZSBtb3JlIGNvbnRyb2wgb24gdGhlIGRhdGEgdHlwZXMgZm9yIGVhY2ggY29sdW1uLgoKIyMgYGhhdmVuYCBwYWNrYWdlCgpSZWFkaW5nIGluICJleHRlcm5hbCIgZGF0YSBmcm9tIH5+aW5mZXJpb3J+fiwgSSBtZWFuLCBvdGhlciBzb2Z0d2FyZSBwYWNrYWdlcyBsaWtlIFNQU1MgKGAuc2F2YCksIFN0YXRhIChgLmR0YWApLCBTQVMsIGV0Yy4gQ2hlY2sgdGhlIFtoYXZlbiBwYWNrYWdlIHdlYnNpdGVdKGh0dHA6Ly9oYXZlbi50aWR5dmVyc2Uub3JnLykgZm9yIGFzc2lzdGFuY2UuCgojIyBZb3VyIHR1cm4KCiMjIyBFeGVyY2lzZSAyLjEKCkxvYWQgdGhlIHBlcmlvZGljIHRhYmxlIGRhdGEgc3RvcmVkIGluIHRoZSBgZGF0YWAgZm9sZGVyLiAoSGludDogeW91J2xsIGhhdmUgdG8gdXNlIHRoZSBgaGF2ZW5gIHBhY2thZ2UgLSBsb29rIGZvciBhIGByZWFkX2AgZnVuY3Rpb24gdGhhdCBjYW4gcmVhZCB0aGUgYC5zYXZgIGZpbGUgZm9ybWF0LikKCmBgYHtyfQojIFdyaXRlIGFuZCBjaGVjayB5b3VyIGFuc3dlciBoZXJlLgpgYGAKCiMjIFdoYXQgeW91IGxlYXJuZWQgaW4gdGhpcyBzZWN0aW9uCgpSZWFkIGluIGRhdGEgZnJvbSBmaWxlcyBpbnRvIGBkYXRhLmZyYW1lYHMgZnJvbQoKLSBFeGNlbCBmaWxlcyB1c2luZyB0aGUgYHJlYWRfZXhjZWwoKWAgZnVuY3Rpb24gaW4gdGhlIGByZWFkeGxgIHBhY2thZ2UsCi0gQ1NWIGZpbGVzIHVzaW5nIHRoZSBgcmVhZF9jc3YoKWAgYW5kIGByZWFkX2RlbGltKClgIGZ1bmN0aW9ucyBpbiB0aGUgYHJlYWRyYCBwYWNrYWdlCi0gZGF0YSBmcm9tIHByb3ByaWV0YXJ5IHN0YXRpc3RpY2FsIHNvZnR3YXJlIHVzaW5nIHRoZSBgcmVhZF9zcHNzKClgLCBgcmVhZF9zdGF0YSgpYCwgYW5kIGByZWFkX3NhcygpYCBmdW5jdGlvbnMgaW4gdGhlIGBoYXZlbmAgcGFja2FnZQoKCg==