1 Introduction
Prerequisites
In this and other chapters we will make use of data from the qss package, which is available on GitHub. Install it using the install_github() function from the devtools package.
devtools::install_github("kosukeimai/qss-package")
library("qss")
In the prerequisites section of each chapter, we’ll load any packages needed for the chapter, possibly define some functions, and possibly load data. It is good practice to load necessary libraries at the start of an R markdown file or script.
library("tidyverse")
We also load the readr package to load csv
files,
library("readr")
the haven package to load Stata dta
files,
library("haven")
and the rio package to load multiple types of files
library("rio")
1.1 Overview of the Book
This section contains no code to translate; see the QSS text.
1.2 How to use the Book
This section contains no code to translate; see the QSS text.
1.3 Introduction to R
These notes do not aim to completely teach R and the tidyverse. However, there are many other resources for that.
R for Data Science is a comprehensive introduction to R using the tidyverse.
Data Camp has interactive courses. In particular, I recommend starting with the following two courses.
1.3.1 Arithmetic Operations
This section contains no code to translate; see the QSS text.
1.3.2 Objects
This section contains no code to translate; see the QSS text.
Also see R4DS: Workflow basics.
1.3.3 Vectors
This section contains no code to translate; see the QSS text.
Also see R4DS: Vectors. In R for Data Science vectors are introduced much later, after data frames.
1.3.5 Data Files
Rather than calling setwd() in scripts, organize your data analysis in projects. Read the introduction to RStudio projects in R4DS.¹
Datasets used in R are accessed in two ways.
First, datasets can be distributed with R packages. These are often smaller datasets used in package examples and tutorials. They are loaded with the data() function. For example, you can load UN data on demographic statistics from the qss package, which distributes the data sets used in the QSS textbook. (The function data() called without any arguments will list all the datasets distributed with installed packages.)
data("UNpop", package = "qss")
Second, datasets can be loaded from external files, including both stored R objects (.RData, .rda) and other formats (.csv, .dta, .sav). To read a csv file into R, use the read_csv() function from the readr package, part of the tidyverse.
UNpop_URL <- "https://raw.githubusercontent.com/kosukeimai/qss/master/INTRO/UNpop.csv"
UNpop <- read_csv(UNpop_URL)
#> Parsed with column specification:
#> cols(
#> year = col_integer(),
#> world.pop = col_integer()
#> )
We use the readr function read_csv() instead of the base R function read.csv() used in the QSS text. It is slightly faster and returns a tibble rather than a plain data frame. Check this by calling class() on the new object.
class(UNpop)
#> [1] "tbl_df" "tbl" "data.frame"
UNpop
#> # A tibble: 7 x 2
#> year world.pop
#> <int> <int>
#> 1 1950 2525779
#> 2 1960 3026003
#> 3 1970 3691173
#> 4 1980 4449049
#> 5 1990 5320817
#> 6 2000 6127700
#> # ... with 1 more row
See R for Data Science Ch 11: Data Import for more discussion.
Note that in the previous code we loaded the file directly from a URL, but we could also read a local file on our computer, e.g.
UNpop <- read_csv("INTRO/UNpop.csv")
See R for Data Science Ch 10: Tibbles for a deeper discussion of data frames.
The single bracket, [
, is useful to select rows and columns in simple cases.
UNpop[c(1, 2, 3), ]
#> # A tibble: 3 x 2
#> year world.pop
#> <int> <int>
#> 1 1950 2525779
#> 2 1960 3026003
#> 3 1970 3691173
There are dplyr functions to select rows by number, to select rows by certain criteria, or to select columns.
To select rows 1–3, use slice().
slice(UNpop, 1:3)
#> # A tibble: 3 x 2
#> year world.pop
#> <int> <int>
#> 1 1950 2525779
#> 2 1960 3026003
#> 3 1970 3691173
Base R allows you to select the world.pop column from the UNpop data frame:
UNpop[, "world.pop"]
#> # A tibble: 7 x 1
#> world.pop
#> <int>
#> 1 2525779
#> 2 3026003
#> 3 3691173
#> 4 4449049
#> 5 5320817
#> 6 6127700
#> # ... with 1 more row
UNpop$world.pop
#> [1] 2525779 3026003 3691173 4449049 5320817 6127700 6916183
UNpop[["world.pop"]]
#> [1] 2525779 3026003 3691173 4449049 5320817 6127700 6916183
select(UNpop, world.pop)
#> # A tibble: 7 x 1
#> world.pop
#> <int>
#> 1 2525779
#> 2 3026003
#> 3 3691173
#> 4 4449049
#> 5 5320817
#> 6 6127700
#> # ... with 1 more row
Unlike [, the [[ and $ operators can only select a single column and return a vector.² The dplyr function select() always returns a tibble (data frame), and never a vector, even if only one column is selected.
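The difference between the extraction operators can be checked directly. The following is a minimal sketch on a made-up two-row data set (the object names df and tb and the values are hypothetical, not part of the qss data); it assumes the tibble package is installed.

```r
library(tibble)

# Hypothetical toy data, not from the qss package
df <- data.frame(year = c(1950, 1960), pop = c(10, 20))
tb <- as_tibble(df)

# Base data frame: selecting one column with [ drops to a plain vector
class(df[, "pop"])

# Tibble: [ keeps the result as a tibble, even for a single column
class(tb[, "pop"])

# [[ and $ return the underlying vector for both kinds of object
identical(tb[["pop"]], df$pop)
identical(tb$pop, df[["pop"]])
```

This is one reason code written for base data frames can behave differently when handed a tibble: anything that relies on [ returning a vector should use [[ or $ instead.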
Select rows 1–3 of the year
column:
UNpop[1:3, "year"]
#> # A tibble: 3 x 1
#> year
#> <int>
#> 1 1950
#> 2 1960
#> 3 1970
or,
select(slice(UNpop, 1:3), year)
#> # A tibble: 3 x 1
#> year
#> <int>
#> 1 1950
#> 2 1960
#> 3 1970
The same sequence of operations can be written using the pipe operator, %>%.
UNpop %>%
  slice(1:3) %>%
  select(year)
#> # A tibble: 3 x 1
#> year
#> <int>
#> 1 1950
#> 2 1960
#> 3 1970
This example may seem verbose, but later we can produce more complicated transformations of the data by chaining together simple functions.
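To make the equivalence concrete, here is a small sketch comparing the nested call with the piped version. The tibble toy and its values are invented for illustration; only dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical toy data standing in for UNpop
toy <- tibble(year = c(1950, 1960, 1970, 1980), pop = c(1, 2, 3, 4))

# Nested form: read from the inside out
nested <- select(slice(toy, 1:3), year)

# Piped form: read top to bottom, one step per line
piped <- toy %>%
  slice(1:3) %>%
  select(year)

# Both forms produce the same tibble
identical(nested, piped)
```

The pipe passes the left-hand side as the first argument of the function on the right, so x %>% f(y) is evaluated as f(x, y).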
Select every other row from UNpop:
UNpop$world.pop[seq(from = 1, to = nrow(UNpop), by = 2)]
#> [1] 2525779 3691173 5320817 6916183
or
UNpop %>%
  slice(seq(1, n(), by = 2)) %>%
  select(world.pop)
#> # A tibble: 4 x 1
#> world.pop
#> <int>
#> 1 2525779
#> 2 3691173
#> 3 5320817
#> 4 6916183
or
UNpop %>%
  filter(row_number() %% 2 == 1)
#> # A tibble: 4 x 2
#> year world.pop
#> <int> <int>
#> 1 1950 2525779
#> 2 1970 3691173
#> 3 1990 5320817
#> 4 2010 6916183
When used inside a dplyr function, n() returns the number of rows in the data frame (or the number of rows in the current group if used with group_by()). The function row_number() returns the row number of each observation. The %% operator returns the modulus, i.e. the remainder after division.
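A short sketch of these three pieces on a grouped tibble may help. The data and the column names g, x, row, size, and odd are hypothetical; the example assumes only dplyr.

```r
library(dplyr)

# The modulus operator: remainder after integer division
7 %% 2

# Hypothetical toy data with two groups of unequal size
toy <- tibble(g = c("a", "a", "a", "b", "b"), x = 1:5)

out <- toy %>%
  group_by(g) %>%
  mutate(
    row  = row_number(),   # position of the row within its group
    size = n(),            # number of rows in the group
    odd  = row %% 2 == 1   # TRUE for the 1st, 3rd, ... row of each group
  ) %>%
  ungroup()

out
```

Because the tibble is grouped, row_number() restarts at 1 in each group and n() gives the group size rather than the total number of rows.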
1.3.6 Saving Objects
It is not recommended that you save the entire R workspace using save.image() due to the negative and unexpected impacts it can have on reproducibility. See the R for Data Science chapter Workflow Projects.
You should uncheck the options in RStudio to avoid saving and restoring from .RData files (go to Tools > Global Options > General). This will help ensure that your R code runs the way you think it does, instead of depending on some long-forgotten code that is only saved in the workspace image. Everything important should be in a script. Anything saved or loaded from file should be done explicitly.
Your motto should be that the source is real, not the objects created by it.
The source code is real. The objects are realizations of the source code. Source for EVERY user modified object is placed in a particular directory or directories, for later editing and retrieval. – from the ESS manual
This means that while you should not save the entire workspace, it is perfectly fine practice to run a script and explicitly save R objects to files or load them from files.
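As an illustration of saving and loading a single object explicitly, here is a minimal sketch using base R's saveRDS() and readRDS(). The choice of these two functions is an assumption made for the example, since the text above does not name specific functions; the object x and the temporary path are also hypothetical.

```r
# Hypothetical object created by a script
x <- data.frame(year = c(1950, 1960), pop = c(2.5, 3.0))

# Write the object to an explicit file (a temporary file here)
path <- tempfile(fileext = ".rds")
saveRDS(x, path)

# Later, or in another script, read it back under any name you like
y <- readRDS(path)

# The restored object matches the original
identical(x, y)
```

Unlike a workspace image, the file holds exactly one object, and the script shows where it came from and where it is read.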
As with reading CSV files, use the readr package functions. In this case, write_csv() writes a csv file and takes at least two arguments: the data that you want to write to a csv and the name that you want to give the file.
write_csv(UNpop, "UNpop.csv")
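A quick way to confirm that the file was written as expected is to read it back. The sketch below uses a made-up data frame and a temporary file rather than the real UNpop data; col_types = cols() is passed only to let readr guess the column types without printing a message.

```r
library(readr)

# Hypothetical toy data standing in for UNpop
toy <- data.frame(year = c(1950L, 1960L), pop = c(2525779L, 3026003L))

# Write to a temporary file so the working directory is left untouched
path <- tempfile(fileext = ".csv")
write_csv(toy, path)

# The first line of the file is the header row
readLines(path)[1]

# Read the file back; the values should match what was written
back <- read_csv(path, col_types = cols())
all(back$year == toy$year)
all(back$pop == toy$pop)
```

Note that a csv file stores only text, so column types are guessed again on reading and may not be identical to the types in the original object.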
1.3.7 Programming and Learning Tips
Use the haven package to read and write Stata (.dta) and SPSS (.sav) files. Stata and SPSS are two other statistical programs commonly used in social science. Even if you don’t ever use them, you’ll almost certainly encounter data stored in their native formats.
UNpop_dta_url <- "https://github.com/kosukeimai/qss/raw/master/INTRO/UNpop.dta"
UNpop <- read_dta(UNpop_dta_url)
UNpop
#> # A tibble: 7 x 2
#> year world_pop
#> <dbl> <dbl>
#> 1 1950 2526.
#> 2 1960 3026.
#> 3 1970 3691.
#> 4 1980 4449.
#> 5 1990 5321.
#> 6 2000 6128.
#> # ... with 1 more row
There is also the equivalent write_dta()
function to create Stata datasets.
write_dta(UNpop, "UNpop.dta")
While Stata and SPSS data sets are quite similar to data frames, they differ slightly in definitions of acceptable data types of columns and what metadata they store with the data. Be careful when reading and writing from these formats to ensure that information is not lost.
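One place where the formats differ is categorical data. The following sketch, on an invented data frame, assumes that haven writes an R factor as a labelled numeric variable, so the column is no longer a factor after a round trip through a .dta file. The names toy, id, and party are hypothetical.

```r
library(haven)

# Hypothetical toy data with a factor column
toy <- data.frame(
  id    = c(1, 2, 3),
  party = factor(c("dem", "rep", "dem"))
)

# Round trip through a temporary Stata file
path <- tempfile(fileext = ".dta")
write_dta(toy, path)
back <- read_dta(path)

# The column comes back as a labelled vector, not a factor
is.factor(back$party)
is.labelled(back$party)

# as_factor() converts the value labels back into factor levels
as_factor(back$party)
```

Checking column classes after reading or writing one of these formats is a cheap way to catch this kind of silent change.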
Also see the rio package which makes loading data even easier with smart defaults.
You can use the import()
function to load many types of files:
import("https://github.com/kosukeimai/qss/raw/master/INTRO/UNpop.csv")
#> year world.pop
#> 1 1950 2525779
#> 2 1960 3026003
#> 3 1970 3691173
#> 4 1980 4449049
#> 5 1990 5320817
#> 6 2000 6127700
#> 7 2010 6916183
import("https://github.com/kosukeimai/qss/raw/master/INTRO/UNpop.RData")
#> year world.pop
#> 1 1950 2525779
#> 2 1960 3026003
#> 3 1970 3691173
#> 4 1980 4449049
#> 5 1990 5320817
#> 6 2000 6127700
#> 7 2010 6916183
import("https://github.com/kosukeimai/qss/raw/master/INTRO/UNpop.dta")
#> year world_pop
#> 1 1950 2526
#> 2 1960 3026
#> 3 1970 3691
#> 4 1980 4449
#> 5 1990 5321
#> 6 2000 6128
#> 7 2010 6916
R also includes the foreign package, which contains functions for reading and writing files in these formats. One reason to use haven and rio instead is that they are better maintained. For example, the foreign function read.dta() does not read files created by the most recent versions of Stata (13+), whereas haven does.
1.3.8 Style Guide
Following a consistent coding style is important for your code to be readable by you and others. The preferred style is the tidyverse style guide, which differs slightly from Google’s R style guide.
The lintr package will check files for style errors.
The styler package provides functions for automatically formatting R code according to style guides.
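As a small illustration, the sketch below passes a deliberately badly formatted line to styler's style_text() function. It assumes the styler package is installed and that its default settings follow the tidyverse style guide; the input string is made up.

```r
library(styler)

# A line with = for assignment and no spaces after commas
messy <- "x=c(1,2)"

# style_text() returns the restyled code; as.character() gives plain text
styled <- as.character(style_text(messy))
styled
```

With the default style, the assignment should be rewritten with <- and spaces added after the commas.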
In RStudio, go to the Tools > Global Options > Code > Diagnostics pane and check the box to activate style warnings. On this pane, there are other options that can be set to increase or decrease the number of warnings shown while writing R code in RStudio.
¹ For more on using projects, read Project-oriented workflow.
² See the discussion in R for Data Science on how tibble objects differ from base data.frame objects in how the single bracket [ is handled.