2 Causality

Prerequisites

library("tidyverse")
library("stringr")

2.1 Racial Discrimination in the Labor Market

Load the data from the qss package.

data("resume", package = "qss")

In addition to the dim(), summary(), and head() functions shown in the text,

dim(resume)
#> [1] 4870    4
summary(resume)
#>   firstname             sex                race                call     
#>  Length:4870        Length:4870        Length:4870        Min.   :0.00  
#>  Class :character   Class :character   Class :character   1st Qu.:0.00  
#>  Mode  :character   Mode  :character   Mode  :character   Median :0.00  
#>                                                           Mean   :0.08  
#>                                                           3rd Qu.:0.00  
#>                                                           Max.   :1.00
head(resume)
#>   firstname    sex  race call
#> 1   Allison female white    0
#> 2   Kristen female white    0
#> 3   Lakisha female black    0
#> 4   Latonya female black    0
#> 5    Carrie female white    0
#> 6       Jay   male white    0

we can also use glimpse() to get a quick understanding of the variables in the data frame:

glimpse(resume)
#> Observations: 4,870
#> Variables: 4
#> $ firstname <chr> "Allison", "Kristen", "Lakisha", "Latonya", "Carrie"...
#> $ sex       <chr> "female", "female", "female", "female", "female", "m...
#> $ race      <chr> "white", "white", "black", "black", "white", "white"...
#> $ call      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

The code in QSS uses table() and addmargins() to construct the table. However, this can be done easily with the dplyr package using grouping and summarizing.

Use group_by() to identify each combination of race and call, and then count() the observations:

race_call_tab <-
  resume %>%
  group_by(race, call) %>%
  count()
race_call_tab
#> # A tibble: 4 x 3
#> # Groups:   race, call [4]
#>   race   call     n
#>   <chr> <int> <int>
#> 1 black     0  2278
#> 2 black     1   157
#> 3 white     0  2200
#> 4 white     1   235

If we want to calculate callback rates by race, we can use the mutate() function from dplyr.

race_call_rate <-
  race_call_tab %>%
  group_by(race) %>%
  mutate(call_rate =  n / sum(n)) %>%
  filter(call == 1) %>%
  select(race, call_rate)
race_call_rate
#> # A tibble: 2 x 2
#> # Groups:   race [2]
#>   race  call_rate
#>   <chr>     <dbl>
#> 1 black    0.0645
#> 2 white    0.0965

If we want the overall callback rate, we can calculate it from the original data. Use the summarise() function from dplyr.

resume %>%
  summarise(call_back = mean(call))
#>   call_back
#> 1    0.0805

2.2 Subsetting Data in R

2.2.1 Subsetting

Create a new object of all individuals whose race variable equals black in the resume data:

resumeB <-
  resume %>%
  filter(race == "black")
glimpse(resumeB)
#> Observations: 2,435
#> Variables: 4
#> $ firstname <chr> "Lakisha", "Latonya", "Kenya", "Latonya", "Tyrone", ...
#> $ sex       <chr> "female", "female", "female", "female", "male", "fem...
#> $ race      <chr> "black", "black", "black", "black", "black", "black"...
#> $ call      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

Calculate the callback rate for black individuals:

resumeB %>%
  summarise(call_rate = mean(call))
#>   call_rate
#> 1    0.0645

You can combine the filter() and select() functions with multiple conditions. For example, to keep the call and first name variables for female individuals with stereotypically black names:

resumeBf <-
  resume %>%
  filter(race == "black", sex == "female") %>%
  select(call, firstname)
head(resumeBf)
#>   call firstname
#> 1    0   Lakisha
#> 2    0   Latonya
#> 3    0     Kenya
#> 4    0   Latonya
#> 5    0     Aisha
#> 6    0     Aisha

Now we can calculate the gender gap by group.

Now we can calculate the gender gap by group. Doing so may seem to require a little more code, but we will not duplicate as much as in QSS, and this would easily scale to more than two categories.

First, group by race and sex and calculate the callback rate for each group:

resume_race_sex <-
  resume %>%
  group_by(race, sex) %>%
  summarise(call = mean(call))
head(resume_race_sex)
#> # A tibble: 4 x 3
#> # Groups:   race [2]
#>   race  sex      call
#>   <chr> <chr>   <dbl>
#> 1 black female 0.0663
#> 2 black male   0.0583
#> 3 white female 0.0989
#> 4 white male   0.0887

Use spread() from the tidyr package to make each value of race a new column:


resume_sex <-
  resume_race_sex %>%
  ungroup() %>%
  spread(race, call)
resume_sex
#> # A tibble: 2 x 3
#>   sex     black  white
#>   <chr>   <dbl>  <dbl>
#> 1 female 0.0663 0.0989
#> 2 male   0.0583 0.0887

Now we can calculate the race wage differences by sex as before,

resume_sex %>%
  mutate(call_diff = white - black)
#> # A tibble: 2 x 4
#>   sex     black  white call_diff
#>   <chr>   <dbl>  <dbl>     <dbl>
#> 1 female 0.0663 0.0989    0.0326
#> 2 male   0.0583 0.0887    0.0304

This could be combined into a single chain with only six lines of code:

resume %>%
  group_by(race, sex) %>%
  summarise(call = mean(call)) %>%
  ungroup() %>%
  spread(race, call) %>%
  mutate(call_diff = white - black)
#> # A tibble: 2 x 4
#>   sex     black  white call_diff
#>   <chr>   <dbl>  <dbl>     <dbl>
#> 1 female 0.0663 0.0989    0.0326
#> 2 male   0.0583 0.0887    0.0304

For more information on a way to do this using the spread and gather functions from tidyr package, see the R for Data Science chapter “Tidy Data”.

WARNING The function ungroup removes the groupings in group_by. The function spread will not allow a grouping variable to be reshaped. Since many dplyr functions work differently depending on whether the data frame is grouped or not, I find that I can encounter many errors due to forgetting that a data frame is grouped. As such, I tend to ungroup data frames as soon as I am no longer are using the groupings.

Alternatively, we could have used summarise and the diff function:

resume %>%
  group_by(race, sex) %>%
  summarise(call = mean(call)) %>%
  group_by(sex) %>%
  arrange(race) %>%
  summarise(call_diff = diff(call))
#> # A tibble: 2 x 2
#>   sex    call_diff
#>   <chr>      <dbl>
#> 1 female    0.0326
#> 2 male      0.0304

I find the spread code preferable since the individual race callback rates are retained in the data, and since there is no natural ordering of the race variable (unlike if it were a time-series), it is not obvious from reading the code whether call_diff is black - white or white - black.

2.2.2 Simple conditional statements

dlpyr has three conditional statement functions if_else, recode and case_when.

The function if_else is like ifelse but corrects inconsistent behavior that ifelse exhibits in certain cases.

Create a variable BlackFemale using if_else() and confirm it is only equal to 1 for black and female observations:

resume %>%
  mutate(BlackFemale = if_else(race == "black" & sex == "female", 1, 0)) %>%
  group_by(BlackFemale, race, sex) %>%
  count()
#> # A tibble: 4 x 4
#> # Groups:   BlackFemale, race, sex [4]
#>   BlackFemale race  sex        n
#>         <dbl> <chr> <chr>  <int>
#> 1        0    black male     549
#> 2        0    white female  1860
#> 3        0    white male     575
#> 4        1.00 black female  1886

Warning The function if_else is more strict about the variable types than ifelse. While most R functions are forgiving about variables types, and will automatically convert integers to numeric or vice-versa, they are distinct. For example, these examples will produce errors:

resume %>%
  mutate(BlackFemale = if_else(race == "black" & sex == "female", TRUE, 0))
#> Error in mutate_impl(.data, dots): Evaluation error: `false` must be type logical, not double.

because TRUE is logical and 0 is numeric.

resume %>%
  mutate(BlackFemale = if_else(race == "black" & sex == "female", 1L, 0))
#> Error in mutate_impl(.data, dots): Evaluation error: `false` must be type integer, not double.

because 1L is an integer and 0 is numeric vector (floating-point number). The distinction between integers and numeric variables is often invisible because most functions coerce variables between integer and numeric vectors.

class(1)
#> [1] "numeric"
class(1L)
#> [1] "integer"

The : operator returns integers and as.integer coerces numeric vectors to integer vectors:

class(1:5)
#> [1] "integer"
class(c(1, 2, 3))
#> [1] "numeric"
class(as.integer(c(1, 2, 3)))
#> [1] "integer"

2.2.3 Factor Variables

For more on factors see the R for Data Science chapter “Factors” and the package forcats. Also see the R for Data Science chapter “Strings” for working with strings.

The function case_when is a generalization of the if_else function to multiple conditions. For example, to create categories for all combinations of race and sex,

resume %>%
  mutate(
    race_sex = case_when(
      race == "black" & sex == "female" ~ "black, female",
      race == "white" & sex == "female" ~ "white female",
      race == "black" & sex == "male" ~ "black male",
      race == "white" & sex == "male" ~ "white male"
    )
  ) %>%
  head()
#>   firstname    sex  race call      race_sex
#> 1   Allison female white    0  white female
#> 2   Kristen female white    0  white female
#> 3   Lakisha female black    0 black, female
#> 4   Latonya female black    0 black, female
#> 5    Carrie female white    0  white female
#> 6       Jay   male white    0    white male

Each condition is a formula (an R object created with the “tilde” ~). You will see formulas used extensively in the modeling section. The condition is on the left-hand side of the formula. The value to assign to observations meeting that condition is on the right-hand side. Observations are given the value of the first matching condition, so the order of these can matter.

The case_when function also supports a default value by using a condition TRUE as the last condition. This will match anything not already matched. For example, if you wanted three categories (“black male”, “black female”, “white”):

resume %>%
  mutate(
    race_sex = case_when(
      race == "black" & sex == "female" ~ "black female",
      race == "black" & sex == "male" ~ "black male",
      TRUE ~ "white"
    )
  ) %>%
  head()
#>   firstname    sex  race call     race_sex
#> 1   Allison female white    0        white
#> 2   Kristen female white    0        white
#> 3   Lakisha female black    0 black female
#> 4   Latonya female black    0 black female
#> 5    Carrie female white    0        white
#> 6       Jay   male white    0        white

Alternatively, we could have created this variable using string manipulation functions. Use mutate() to create a new variable, type, str_to_title to capitalize sex and race, and str_c to concatenate these vectors.

resume <-
  resume %>%
  mutate(type = str_c(str_to_title(race), str_to_title(sex)))

Some of the reasons given in QSS for using factors in this chapter are less important due to the functionality of modern tidyverse packages. For example, there is no reason to use tapply, as you can use group_by and summarise,

resume %>%
  group_by(type) %>%
  summarise(call = mean(call))
#> # A tibble: 4 x 2
#>   type          call
#>   <chr>        <dbl>
#> 1 BlackFemale 0.0663
#> 2 BlackMale   0.0583
#> 3 WhiteFemale 0.0989
#> 4 WhiteMale   0.0887

or,

resume %>%
  group_by(race, sex) %>%
  summarise(call = mean(call))
#> # A tibble: 4 x 3
#> # Groups:   race [?]
#>   race  sex      call
#>   <chr> <chr>   <dbl>
#> 1 black female 0.0663
#> 2 black male   0.0583
#> 3 white female 0.0989
#> 4 white male   0.0887

What’s nice about this approach is that we wouldn’t have needed to create the factor variable first as in QSS.

We can use that same approach to calculate the mean of first names, and use arrange() to sort in ascending order.

resume %>%
  group_by(firstname) %>%
  summarise(call = mean(call)) %>%
  arrange(call)
#> # A tibble: 36 x 2
#>   firstname   call
#>   <chr>      <dbl>
#> 1 Aisha     0.0222
#> 2 Rasheed   0.0299
#> 3 Keisha    0.0383
#> 4 Tremayne  0.0435
#> 5 Kareem    0.0469
#> 6 Darnell   0.0476
#> # ... with 30 more rows

**Tip:** General advice for working (or not) with factors:

  • Use character vectors instead of factors. They are easier to manipulate with string functions.
  • Use factor vectors only when you need a specific ordering of string values in a variable, e.g. in a model or a plot.

2.3 Causal Affects and the Counterfactual

Load the social dataset included in the qss package.

data("social", package = "qss")
summary(social)
#>      sex             yearofbirth    primary2004      messages        
#>  Length:305866      Min.   :1900   Min.   :0.000   Length:305866     
#>  Class :character   1st Qu.:1947   1st Qu.:0.000   Class :character  
#>  Mode  :character   Median :1956   Median :0.000   Mode  :character  
#>                     Mean   :1956   Mean   :0.401                     
#>                     3rd Qu.:1965   3rd Qu.:1.000                     
#>                     Max.   :1986   Max.   :1.000                     
#>   primary2006        hhsize    
#>  Min.   :0.000   Min.   :1.00  
#>  1st Qu.:0.000   1st Qu.:2.00  
#>  Median :0.000   Median :2.00  
#>  Mean   :0.312   Mean   :2.18  
#>  3rd Qu.:1.000   3rd Qu.:2.00  
#>  Max.   :1.000   Max.   :8.00

Calculate the mean turnout by message:

turnout_by_message <-
  social %>%
  group_by(messages) %>%
  summarize(turnout = mean(primary2006))
turnout_by_message
#> # A tibble: 4 x 2
#>   messages   turnout
#>   <chr>        <dbl>
#> 1 Civic Duty   0.315
#> 2 Control      0.297
#> 3 Hawthorne    0.322
#> 4 Neighbors    0.378

Since we want to calculate the difference by group, spread() the data set so each group is a column, then use mutate() to calculate the difference of each from the control group. Finally, use select() and matches() to return a dataframe with only those new variables that you have created:

turnout_by_message %>%
  spread(messages, turnout) %>%
  mutate(diff_civic_duty = `Civic Duty` - Control,
         diff_Hawthorne = Hawthorne - Control,
         diff_Neighbors = Neighbors - Control) %>%
  select(matches("diff_"))
#> # A tibble: 1 x 3
#>   diff_civic_duty diff_Hawthorne diff_Neighbors
#>             <dbl>          <dbl>          <dbl>
#> 1          0.0179         0.0257         0.0813

Find the mean values of age, 2004 turnout, and household size for each group:

social %>%
  mutate(age = 2006 - yearofbirth) %>%
  group_by(messages) %>%
  summarise(primary2004 = mean(primary2004),
            age = mean(age),
            hhsize = mean(hhsize))
#> # A tibble: 4 x 4
#>   messages   primary2004   age hhsize
#>   <chr>            <dbl> <dbl>  <dbl>
#> 1 Civic Duty       0.399  49.7   2.19
#> 2 Control          0.400  49.8   2.18
#> 3 Hawthorne        0.403  49.7   2.18
#> 4 Neighbors        0.407  49.9   2.19

The function summarise_at allows you to summarize multiple variables, using multiple functions, or both.

social %>%
  mutate(age = 2006 - yearofbirth) %>%
  group_by(messages) %>%
  summarise_at(vars(primary2004, age, hhsize), funs(mean))
#> # A tibble: 4 x 4
#>   messages   primary2004   age hhsize
#>   <chr>            <dbl> <dbl>  <dbl>
#> 1 Civic Duty       0.399  49.7   2.19
#> 2 Control          0.400  49.8   2.18
#> 3 Hawthorne        0.403  49.7   2.18
#> 4 Neighbors        0.407  49.9   2.19

2.4 Observational Studies

Load and inspect the minimum wage data from the qss package:

data("minwage", package = "qss")
glimpse(minwage)
#> Observations: 358
#> Variables: 8
#> $ chain      <chr> "wendys", "wendys", "burgerking", "burgerking", "kf...
#> $ location   <chr> "PA", "PA", "PA", "PA", "PA", "PA", "PA", "PA", "PA...
#> $ wageBefore <dbl> 5.00, 5.50, 5.00, 5.00, 5.25, 5.00, 5.00, 5.00, 5.0...
#> $ wageAfter  <dbl> 5.25, 4.75, 4.75, 5.00, 5.00, 5.00, 4.75, 5.00, 4.5...
#> $ fullBefore <dbl> 20.0, 6.0, 50.0, 10.0, 2.0, 2.0, 2.5, 40.0, 8.0, 10...
#> $ fullAfter  <dbl> 0.0, 28.0, 15.0, 26.0, 3.0, 2.0, 1.0, 9.0, 7.0, 18....
#> $ partBefore <dbl> 20.0, 26.0, 35.0, 17.0, 8.0, 10.0, 20.0, 30.0, 27.0...
#> $ partAfter  <dbl> 36, 3, 18, 9, 12, 9, 25, 32, 39, 10, 20, 4, 13, 20,...
summary(minwage)
#>     chain             location           wageBefore     wageAfter   
#>  Length:358         Length:358         Min.   :4.25   Min.   :4.25  
#>  Class :character   Class :character   1st Qu.:4.25   1st Qu.:5.05  
#>  Mode  :character   Mode  :character   Median :4.50   Median :5.05  
#>                                        Mean   :4.62   Mean   :4.99  
#>                                        3rd Qu.:4.99   3rd Qu.:5.05  
#>                                        Max.   :5.75   Max.   :6.25  
#>    fullBefore     fullAfter      partBefore     partAfter   
#>  Min.   : 0.0   Min.   : 0.0   Min.   : 0.0   Min.   : 0.0  
#>  1st Qu.: 2.1   1st Qu.: 2.0   1st Qu.:11.0   1st Qu.:11.0  
#>  Median : 6.0   Median : 6.0   Median :16.2   Median :17.0  
#>  Mean   : 8.5   Mean   : 8.4   Mean   :18.8   Mean   :18.7  
#>  3rd Qu.:12.0   3rd Qu.:12.0   3rd Qu.:25.0   3rd Qu.:25.0  
#>  Max.   :60.0   Max.   :40.0   Max.   :60.0   Max.   :60.0

First, calculate the proportion of restaurants by state whose hourly wages were less than the minimum wage in NJ, $5.05, for wageBefore and wageAfter:

Since the NJ minimum wage was $5.05, we’ll define a variable with that value. Even if you use them only once or twice, it is a good idea to put values like this in variables. It makes your code closer to self-documenting, i.e. easier for others (including you, in the future) to understand what the code does.

NJ_MINWAGE <- 5.05

Later, it will be easier to understand wageAfter < NJ_MINWAGE without any comments than it would be to understand wageAfter < 5.05. In the latter case you’d have to remember that the new NJ minimum wage was 5.05 and that’s why you were using that value. Using 5.05 in your code, instead of assigning it to an object called NJ_MINWAGE, is an example of a magic number; try to avoid them.

Note that the variable location has multiple values: PA and four regions of NJ. So we’ll add a state variable to the data.

minwage %>%
  count(location)
#> # A tibble: 5 x 2
#>   location      n
#>   <chr>     <int>
#> 1 centralNJ    45
#> 2 northNJ     146
#> 3 PA           67
#> 4 shoreNJ      33
#> 5 southNJ      67

We can extract the state from the final two characters of the location variable using thestringr function str_sub:

minwage <-
  mutate(minwage, state = str_sub(location, -2L))

Alternatively, since "PA" is the only value that an observation in Pennsylvania takes in location, and since all other observations are in New Jersey:

minwage <-
  mutate(minwage, state = if_else(location == "PA", "PA", "NJ"))

Let’s confirm that the restaurants followed the law:

minwage %>%
  group_by(state) %>%
  summarise(prop_after = mean(wageAfter < NJ_MINWAGE),
            prop_Before = mean(wageBefore < NJ_MINWAGE))
#> # A tibble: 2 x 3
#>   state prop_after prop_Before
#>   <chr>      <dbl>       <dbl>
#> 1 NJ       0.00344       0.911
#> 2 PA       0.955         0.940

Create a variable for the proportion of full-time employees in NJ and PA after the increase:

minwage <-
  minwage %>%
  mutate(totalAfter = fullAfter + partAfter,
        fullPropAfter = fullAfter / totalAfter)

Now calculate the average proportion of full-time employees for each state:

full_prop_by_state <-
  minwage %>%
  group_by(state) %>%
  summarise(fullPropAfter = mean(fullPropAfter))
full_prop_by_state
#> # A tibble: 2 x 2
#>   state fullPropAfter
#>   <chr>         <dbl>
#> 1 NJ            0.320
#> 2 PA            0.272

We could compute the difference in means between NJ and PA by

(filter(full_prop_by_state, state == "NJ")[["fullPropAfter"]] -
  filter(full_prop_by_state, state == "PA")[["fullPropAfter"]])
#> [1] 0.0481

or

spread(full_prop_by_state, state, fullPropAfter) %>%
  mutate(diff = NJ - PA)
#> # A tibble: 1 x 3
#>      NJ    PA   diff
#>   <dbl> <dbl>  <dbl>
#> 1 0.320 0.272 0.0481

2.4.1 Confounding Bias

We can calculate the proportion of each chain out of all fast-food restaurants in each state:

chains_by_state <-
  minwage %>%
  group_by(state) %>%
  count(chain) %>%
  mutate(prop = n / sum(n))

We can easily compare these using a dot-plot:

ggplot(chains_by_state, aes(x = chain, y = prop, colour = state)) +
  geom_point() +
  coord_flip()

In the QSS text, only Burger King restaurants are compared. However, dplyr makes comparing all restaurants not much more complicated than comparing two. All we have to do is change the group_by statement we used previously so that we group by chain restaurants and states:

full_prop_by_state_chain <-
  minwage %>%
  group_by(state, chain) %>%
  summarise(fullPropAfter = mean(fullPropAfter))
full_prop_by_state_chain
#> # A tibble: 8 x 3
#> # Groups:   state [?]
#>   state chain      fullPropAfter
#>   <chr> <chr>              <dbl>
#> 1 NJ    burgerking         0.358
#> 2 NJ    kfc                0.328
#> 3 NJ    roys               0.283
#> 4 NJ    wendys             0.260
#> 5 PA    burgerking         0.321
#> 6 PA    kfc                0.236
#> # ... with 2 more rows

We can plot and compare the proportions easily in this format. In general, ordering categorical variables alphabetically is useless, so we’ll order the chains by the average of the NJ and PA fullPropAfter, using fct_reorder function:

ggplot(full_prop_by_state_chain,
       aes(x = forcats::fct_reorder(chain, fullPropAfter),
           y = fullPropAfter,
           colour = state)) +
  geom_point() +
  coord_flip() +
  labs(x = "chains")

To calculate the difference between states in the proportion of full-time employees after the change:

full_prop_by_state_chain %>%
  spread(state, fullPropAfter) %>%
  mutate(diff = NJ - PA)
#> # A tibble: 4 x 4
#>   chain         NJ    PA   diff
#>   <chr>      <dbl> <dbl>  <dbl>
#> 1 burgerking 0.358 0.321 0.0364
#> 2 kfc        0.328 0.236 0.0918
#> 3 roys       0.283 0.213 0.0697
#> 4 wendys     0.260 0.248 0.0117

2.4.2 Before and After and Difference-in-Difference Designs

To compute the estimates in the before and after design first create an additional variable for the proportion of full-time employees before the minimum wage increase.

minwage <-
  minwage %>%
  mutate(totalBefore = fullBefore + partBefore,
         fullPropBefore = fullBefore / totalBefore)

The before-and-after analysis is the difference between the full-time employment before and after the minimum wage law passed looking only at NJ:

minwage %>%
  filter(state == "NJ") %>%
  summarise(diff = mean(fullPropAfter) - mean(fullPropBefore))
#>     diff
#> 1 0.0239

The difference-in-differences design uses the difference in the before-and-after differences for each state.

minwage %>%
  group_by(state) %>%
  summarise(diff = mean(fullPropAfter) - mean(fullPropBefore)) %>%
  spread(state, diff) %>%
  mutate(diff_in_diff = NJ - PA)
#> # A tibble: 1 x 3
#>       NJ      PA diff_in_diff
#>    <dbl>   <dbl>        <dbl>
#> 1 0.0239 -0.0377       0.0616

Let’s create a single dataset with the mean values of each state before and after to visually look at each of these designs:

full_prop_by_state <-
  minwage %>%
  group_by(state) %>%
  summarise_at(vars(fullPropAfter, fullPropBefore), mean) %>%
  gather(period, fullProp, -state) %>%
  mutate(period = recode(period, fullPropAfter = 1, fullPropBefore = 0))
full_prop_by_state
#> # A tibble: 4 x 3
#>   state period fullProp
#>   <chr>  <dbl>    <dbl>
#> 1 NJ         1    0.320
#> 2 PA         1    0.272
#> 3 NJ         0    0.297
#> 4 PA         0    0.310

Now plot this new dataset:

ggplot(full_prop_by_state, aes(x = period, y = fullProp, colour = state)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = c(0, 1), labels = c("Before", "After"))

2.5 Descriptive Statistics for a Single Variable

To calculate the summary for the variables wageBefore and wageAfter for New Jersey only:

minwage %>%
  filter(state == "NJ") %>%
  select(wageBefore, wageAfter) %>%
  summary()
#>    wageBefore     wageAfter   
#>  Min.   :4.25   Min.   :5.00  
#>  1st Qu.:4.25   1st Qu.:5.05  
#>  Median :4.50   Median :5.05  
#>  Mean   :4.61   Mean   :5.08  
#>  3rd Qu.:4.87   3rd Qu.:5.05  
#>  Max.   :5.75   Max.   :5.75

We calculate the interquartile range for each state’s wages after the passage of the law using the same grouped summarize as we used before:

minwage %>%
  group_by(state) %>%
  summarise(wageAfter = IQR(wageAfter),
            wageBefore = IQR(wageBefore))
#> # A tibble: 2 x 3
#>   state wageAfter wageBefore
#>   <chr>     <dbl>      <dbl>
#> 1 NJ        0           0.62
#> 2 PA        0.575       0.75

Calculate the variance and standard deviation of wageAfter and wageBefore for each state:

minwage %>%
  group_by(state) %>%
  summarise(wageAfter_sd = sd(wageAfter),
               wageAfter_var = var(wageAfter),
               wageBefore_sd = sd(wageBefore),
               wageBefore_var = var(wageBefore))
#> # A tibble: 2 x 5
#>   state wageAfter_sd wageAfter_var wageBefore_sd wageBefore_var
#>   <chr>        <dbl>         <dbl>         <dbl>          <dbl>
#> 1 NJ           0.106        0.0112         0.343          0.118
#> 2 PA           0.359        0.129          0.358          0.128

Here we can see again how using summarise_at allows for more compact code to specify variables and summary statistics that would be the case using just summarise:

minwage %>%
  group_by(state) %>%
  summarise_at(vars(wageAfter, wageBefore), funs(sd, var))
#> # A tibble: 2 x 5
#>   state wageAfter_sd wageBefore_sd wageAfter_var wageBefore_var
#>   <chr>        <dbl>         <dbl>         <dbl>          <dbl>
#> 1 NJ           0.106         0.343        0.0112          0.118
#> 2 PA           0.359         0.358        0.129           0.128