15 Factors

15.1 Introduction

Functions and packages:

The forcats package does not need to be explicitly loaded, since the recent versions of the tidyverse package now attach it.
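As a quick check, and assuming a recent version of the tidyverse is installed, loading it should be enough to make forcats available:

```r
library(tidyverse)

# forcats should be listed among the attached packages
"package:forcats" %in% search()
```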

15.2 Creating factors

No exercises

15.3 General Social Survey

Exercise 15.3.1

Explore the distribution of rincome (reported income). What makes the default bar chart hard to understand? How could you improve the plot?

My first attempt is to use geom_bar() with the default settings.
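A minimal sketch of that first attempt (the object name `rincome_plot` is my own choice for illustration):

```r
library(tidyverse)

# Default bar chart of reported income; gss_cat ships with forcats
rincome_plot <- ggplot(gss_cat, aes(x = rincome)) +
  geom_bar()
rincome_plot
```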

The problem with the default bar chart settings is that the labels overlap and are impossible to read. I’ll try changing the angle of the x-axis labels to vertical so that they will not overlap.
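One way to rotate the labels is through `theme()`; this sketch rebuilds the bar chart and assumes a 90 degree rotation is what "vertical" means here:

```r
library(tidyverse)

# Rotate the x-axis labels by 90 degrees and right-align them
rincome_vertical <- ggplot(gss_cat, aes(x = rincome)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
rincome_vertical
```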

This is better because the labels no longer overlap, but vertical labels are still difficult to read. I could instead angle the labels so that they are easier to read but still do not overlap.
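A sketch of the angled version, where the 45 degree angle is an assumption on my part and any intermediate angle could be tried:

```r
library(tidyverse)

# Angle the x-axis labels by 45 degrees instead of 90
rincome_angled <- ggplot(gss_cat, aes(x = rincome)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
rincome_angled
```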

But the solution I prefer for bar charts with long labels is to flip the axes, so that the bars are horizontal. Then the category labels are also horizontal, and easy to read.
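Flipping the axes can be done with `coord_flip()`, for example:

```r
library(tidyverse)

# Horizontal bars: the income categories are now on the vertical axis
rincome_flipped <- ggplot(gss_cat, aes(x = rincome)) +
  geom_bar() +
  coord_flip()
rincome_flipped
```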

Though more than asked for in this question, I could further improve this plot by

  1. removing the “Not applicable” responses,
  2. renaming “Lt $1000” to “Less than $1000”,
  3. using color to distinguish non-response categories (“Refused”, “Don’t know”, and “No answer”) from income levels (“Lt $1000”, …),
  4. adding meaningful y- and x-axis titles, and
  5. formatting the counts axis labels to use commas.
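The improvements listed above could be combined roughly as follows. This is one possible sketch: the helper column `rincome_na`, the axis titles, and the colors are illustrative choices, and `scales::comma` assumes the scales package (installed alongside ggplot2) is available.

```r
library(tidyverse)

rincome_clean <- gss_cat %>%
  # 1. remove the "Not applicable" responses
  filter(!rincome %in% c("Not applicable")) %>%
  # 2. rename "Lt $1000" to "Less than $1000"
  mutate(rincome = fct_recode(rincome, "Less than $1000" = "Lt $1000")) %>%
  # 3. flag the non-response categories so they can be colored differently
  mutate(rincome_na = rincome %in% c("Refused", "Don't know", "No answer"))

rincome_improved <- ggplot(rincome_clean, aes(x = rincome, fill = rincome_na)) +
  geom_bar() +
  coord_flip() +
  # 4. and 5. axis titles, with commas in the count labels
  scale_y_continuous("Number of Respondents", labels = scales::comma) +
  scale_x_discrete("Respondent's Income") +
  scale_fill_manual(values = c("FALSE" = "black", "TRUE" = "gray")) +
  theme(legend.position = "none")
rincome_improved
```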

If I were only interested in non-missing responses, then I could drop all respondents who answered “Not applicable”, “Refused”, “Don’t know”, or “No answer”.
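A sketch of that variant, dropping all four non-response categories before plotting (the axis titles are again my own choices):

```r
library(tidyverse)

non_responses <- c("Not applicable", "Don't know", "No answer", "Refused")

rincome_known <- gss_cat %>%
  # keep only respondents who reported an income level
  filter(!rincome %in% non_responses) %>%
  mutate(rincome = fct_recode(rincome, "Less than $1000" = "Lt $1000"))

rincome_known_plot <- ggplot(rincome_known, aes(x = rincome)) +
  geom_bar() +
  coord_flip() +
  scale_y_continuous("Number of Respondents", labels = scales::comma) +
  scale_x_discrete("Respondent's Income")
rincome_known_plot
```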

A side effect of coord_flip() is that the income categories now run from lowest (top) to highest (bottom), which is counterintuitive. The next section introduces a function, fct_reorder(), which can help with this.

Exercise 15.3.2

What is the most common relig in this survey? What’s the most common partyid?
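One way to answer this is to count each variable and sort by frequency. In the copy of `gss_cat` that ships with forcats, I would expect “Protestant” and “Independent” to come out on top.

```r
library(tidyverse)

# Count the levels of each variable, most frequent first
relig_counts <- gss_cat %>% count(relig, sort = TRUE)
partyid_counts <- gss_cat %>% count(partyid, sort = TRUE)

head(relig_counts, 1)
head(partyid_counts, 1)
```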

Exercise 15.3.3

Which relig does denom (denomination) apply to? How can you find out with a table? How can you find out with a visualization?

From the context it is clear that denom applies to “Protestant” (which is unsurprising, given that it is the largest category of relig). To confirm this with a table, filter out the responses that are not denominations (no answer, don’t know, other, not applicable, and no denomination), and count the remaining responses by relig. After doing that, the only remaining relig is “Protestant”.
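A sketch of the table approach, assuming the non-denomination levels of `denom` are spelled as in the vector below (this can be checked with `levels(gss_cat$denom)`):

```r
library(tidyverse)

not_denominations <- c(
  "No answer", "Other", "Don't know", "Not applicable", "No denomination"
)

# Keep only rows with an actual denomination, then see which relig remain
denom_relig <- gss_cat %>%
  filter(!denom %in% not_denominations) %>%
  count(relig)
denom_relig
```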

This is also clear in a scatter plot of relig vs. denom, where the size of each point is proportional to the number of responses (since otherwise there would be overplotting).
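Such a plot could be drawn along these lines; the rotated x-axis labels are an addition of mine to keep the religion names legible:

```r
library(tidyverse)

# One point per (relig, denom) pair, sized by the number of responses
relig_denom_plot <- gss_cat %>%
  count(relig, denom) %>%
  ggplot(aes(x = relig, y = denom, size = n)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90))
relig_denom_plot
```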

15.4 Modifying factor order

Exercise 15.4.1

There are some suspiciously high numbers in tvhours. Is the mean a good summary?

Whether the mean is the best summary depends on what you are using it for :-), i.e. your objective. But the median is probably what most people would prefer, since it is less sensitive to a few very high values. And the hours of TV don’t look that surprising to me.
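To compare the two summaries and look at the distribution, something like the following could be used (the one-hour bin width is my own choice):

```r
library(tidyverse)

# Five-number summary plus the mean; missing values are reported separately
summary(gss_cat[["tvhours"]])

# Histogram of the non-missing values
tvhours_plot <- gss_cat %>%
  filter(!is.na(tvhours)) %>%
  ggplot(aes(x = tvhours)) +
  geom_histogram(binwidth = 1)
tvhours_plot
```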

Exercise 15.4.2

For each factor in gss_cat identify whether the order of the levels is arbitrary or principled.

The following piece of code uses functions introduced in Ch 21 to print out the names of only the factors.
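The original code is not shown here; a plausible version uses `purrr::keep()` with `is.factor` as the predicate:

```r
library(tidyverse)

# Keep only the columns of gss_cat that are factors, then list their names
factor_names <- gss_cat %>%
  keep(is.factor) %>%
  names()
factor_names
```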

There are six categorical variables: marital, race, rincome, partyid, relig, and denom.

The ordering of marital is “somewhat principled”. There is some sort of logic in that the levels are grouped into “never married”, married at some point (separated, divorced, widowed), and “married”; though an ordering of “Never married”, “Divorced”, “Widowed”, “Separated”, “Married” might seem more natural. I find that the question of ordering can depend on the level of aggregation in a categorical variable, and there can be more “partially ordered” factors than one would expect.

The ordering of race is principled in that the categories are in increasing order of the number of observations in the data.
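This can be checked by counting the observations in each level. The sketch below uses `.drop = FALSE` so that levels with no observations are also listed:

```r
library(tidyverse)

# Counts are returned in level order; empty levels get a count of 0
race_counts <- gss_cat %>%
  count(race, .drop = FALSE)
race_counts
```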

The levels of rincome are in decreasing order of income; however, the placement of “No answer”, “Don’t know”, and “Refused” before the income levels, and “Not applicable” after them, is arbitrary. It would be better to place all the missing income level categories either before or after all the known values.

The ordering of the levels of relig is arbitrary: there is no natural ordering, and they don’t appear to be ordered by any statistic within the dataset.

The same goes for denom.

Ignoring “No answer”, “Don’t know”, and “Other party”, the levels of partyid are ordered from “Strong Republican” to “Strong Democrat”.
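The level orderings discussed in this exercise can be inspected directly. As a sketch, this lists the levels of every factor in `gss_cat`:

```r
library(tidyverse)

# A named list with the levels of each factor column, in their stored order
factor_levels <- gss_cat %>%
  keep(is.factor) %>%
  map(levels)
factor_levels
```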

Exercise 15.4.3

Why did moving “Not applicable” to the front of the levels move it to the bottom of the plot?

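Because that gives the level “Not applicable” an integer value of 1. ggplot2 positions the levels of a discrete axis by their integer values, so the first level is drawn closest to the origin, which is the bottom of the vertical axis.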
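The effect of `fct_relevel()` on the integer codes can be seen without plotting anything. In this sketch, `rincome_moved` is just an illustrative name:

```r
library(tidyverse)

# Move "Not applicable" to the front of the levels
rincome_moved <- fct_relevel(gss_cat$rincome, "Not applicable")

levels(gss_cat$rincome)  # "Not applicable" is the last level
levels(rincome_moved)    # "Not applicable" is now the first level

# The integer code behind every "Not applicable" value after releveling
unique(as.integer(rincome_moved[rincome_moved == "Not applicable"]))
```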

15.5 Modifying factor levels