
12 Tidy data

12.1 Introduction
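
The introduction presumably just loads the packages used throughout the chapter; a minimal setup, assuming the tidyverse is what the chapter relies on:

```r
library(tidyverse)
```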

12.2 Tidy data

Exercise 12.2.1

Using prose, describe how the variables and observations are organized in each of the sample tables.

In table1, each row represents a (country, year) combination. The columns cases and population contain the values for those variables.

In table2, each row represents a (country, year, variable) combination. The column count contains the values of variables cases and population in separate rows.

In table3, each row represents a (country, year) combination. The column rate provides the values of both cases and population in a string formatted like cases / population.

Table 4 is split into two tables, one table for each variable. The table table4a contains the values of cases and table4b contains the values of population. Within each table, each row represents a country, each column represents a year, and the cells are the value of the table’s variable for that country and year.
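
These tables ship with the tidyr package, so printing them is a quick way to make the descriptions above concrete:

```r
library(tidyr)

table1
table2
table3
table4a
table4b
```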

Exercise 12.2.2

Compute the rate for table2, and table4a + table4b. You will need to perform four operations:

  1. Extract the number of TB cases per country per year.
  2. Extract the matching population per country per year.
  3. Divide cases by population, and multiply by 10000.
  4. Store back in the appropriate place.

Which representation is easiest to work with? Which is hardest? Why?

To calculate cases per person, we need to divide cases by population for each country and year. This is easiest if the cases and population variables are two columns in a data frame in which rows represent (country, year) combinations.

For table2: first, create separate tables for cases and population and ensure that they are sorted in the same order.
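
A sketch of that step; the names t2_cases and t2_population follow the footnote at the end of this chapter:

```r
t2_cases <- table2 %>%
  filter(type == "cases") %>%
  rename(cases = count) %>%
  arrange(country, year)

t2_population <- table2 %>%
  filter(type == "population") %>%
  rename(population = count) %>%
  arrange(country, year)
```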

Then create a new data frame with the population and cases columns, and calculate the cases per capita in a new column.
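
A sketch of that calculation (the name t2_cases_per_cap is illustrative):

```r
t2_cases_per_cap <- tibble(
  country = t2_cases$country,
  year = t2_cases$year,
  cases = t2_cases$cases,
  population = t2_population$population
) %>%
  mutate(cases_per_cap = (cases / population) * 10000) %>%
  select(country, year, cases_per_cap)
```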

To store this new variable in the appropriate location, we will add new rows to table2.
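
One way to do that is to reshape the new rows to match table2's columns and then bind them on:

```r
t2_cases_per_cap <- t2_cases_per_cap %>%
  mutate(type = "cases_per_cap") %>%
  rename(count = cases_per_cap)

bind_rows(table2, t2_cases_per_cap) %>%
  arrange(country, year, type, count)
```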

Note that after adding the cases_per_cap rows, the type of count is coerced to numeric (double) because cases_per_cap is not an integer.

For table4a and table4b, create a new table for cases per capita, which we’ll name table4c, with country rows and year columns.
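
A sketch of table4c, computing the rate year by year:

```r
table4c <- tibble(
  country = table4a$country,
  `1999` = table4a[["1999"]] / table4b[["1999"]] * 10000,
  `2000` = table4a[["2000"]] / table4b[["2000"]] * 10000
)
table4c
```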

Neither table is particularly easy to work with. Since table2 has separate rows for cases and population, we needed to generate a table with columns for cases and population in which we could calculate cases per capita. table4a and table4b split the cases and population variables into different tables with the same layout, which made it easy to divide cases by population. However, we had to repeat this calculation for each year column.

The ideal format of a data frame to answer this question is one with columns country, year, cases, and population. Then the problem could be answered with a single mutate() call.

Exercise 12.2.3

Recreate the plot showing change in cases over time using table2 instead of table1. What do you need to do first?
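
One possible answer, sketched here: table2 must first be filtered down to the rows where type == "cases" before the chapter's plotting code can be reused, with count on the y-axis.

```r
table2 %>%
  filter(type == "cases") %>%
  ggplot(aes(year, count)) +
  geom_line(aes(group = country), colour = "grey50") +
  geom_point(aes(colour = country)) +
  labs(y = "cases")
```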

12.3 Pivoting

This code is reproduced from the chapter because it is needed by the exercises.

Exercise 12.3.1

Why are pivot_longer() and pivot_wider() not perfectly symmetrical? Carefully consider the following example:
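
The example is a round trip through pivot_wider() and pivot_longer() on a small stocks data frame along these lines (the values here are illustrative; see the exercise for the original data):

```r
stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half   = c(   1,    2,    1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
)

stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(c(`2015`, `2016`), names_to = "year", values_to = "return")
```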


(Hint: look at the variable types and think about column names.)

pivot_longer() has a names_ptypes argument, e.g. names_ptypes = list(year = double()). What does it do?

The functions pivot_longer() and pivot_wider() are not perfectly symmetrical because column type information is lost when a data frame is converted from wide to long. The function pivot_longer() stacks multiple columns, which may have had different data types, into a single column with a single data type; this transformation throws away the individual data types of the original columns. The function pivot_wider() creates column names from the values in a column. Those column names are always treated as character values when pivot_longer() pivots them back, so if the original variable used to create the column names did not have a character data type, the round trip will not reproduce the same dataset.

In the provided example, columns have the following data types:
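
One quick way to check, assuming the stocks tibble above: all three columns, including year, are numeric.

```r
sapply(stocks, class)
#>      year      half    return
#> "numeric" "numeric" "numeric"
```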

The pivot_wider() expression pivots the table to create a data frame with years as column names, and the values in return as the column values.
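
With the stocks data above (stocks_wide is an illustrative name):

```r
stocks_wide <- stocks %>%
  pivot_wider(names_from = year, values_from = return)
stocks_wide
```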

The pivot_longer() expression unpivots the table, returning it to a tidy data frame with columns for half, year, and return.
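
Pivoting the widened data back:

```r
stocks_wide %>%
  pivot_longer(c(`2015`, `2016`), names_to = "year", values_to = "return")
```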

There is one difference: in the new data frame, year has a data type of character rather than numeric. The names_to column created from the column names by pivot_longer() is character by default, which is usually a safe assumption, since syntactically valid column names can only be character values.

The original data type of the column which pivot_wider() used to create the column names was not stored, so pivot_longer() has no way to know that the column names in this case should be numeric values. In the current version of tidyr, the names_ptypes argument does not convert the year column to a numeric vector; it only checks the type of the names column, and it will raise an error here.

Instead, use the names_transform argument to pivot_longer(), which provides a function to coerce the column to a different data type.
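
For example, with the stocks data above:

```r
stocks_wide %>%
  pivot_longer(
    c(`2015`, `2016`),
    names_to = "year",
    values_to = "return",
    names_transform = list(year = as.numeric)
  )
```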

Exercise 12.3.3

Widening this data frame using pivot_wider() produces columns that are lists of numeric vectors because the name and key columns do not uniquely identify rows. In particular, there are two rows with values for the age of “Phillip Woods”.

We could solve the problem by adding a column with a distinct observation count for each combination of name and key.
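
A sketch of that step, assuming people is the tibble from the exercise with columns name, key, and value; obs and people2 are illustrative names:

```r
people2 <- people %>%
  group_by(name, key) %>%
  mutate(obs = row_number()) %>%
  ungroup()
```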

We can make people2 wider because the combination of name and obs will uniquely identify the rows in the wide data frame.
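
With the obs column in place:

```r
people2 %>%
  pivot_wider(names_from = key, values_from = value)
```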

Another way to solve this problem is by keeping only distinct rows of the name and key values, and dropping duplicate rows.
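
For example:

```r
people %>%
  distinct(name, key, .keep_all = TRUE) %>%
  pivot_wider(names_from = key, values_from = value)
```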

However, before doing this, understand why there are duplicates in the data. The duplicate values may not be just a nuisance; they may indicate deeper problems with the data.

Exercise 12.3.4

Tidy the simple tibble below. Do you need to make it wider or longer? What are the variables?

To tidy the preg table, use pivot_longer() to create a long table (see the sketch below). The variables in this data are:

  • sex (“female”, “male”)
  • pregnant (“yes”, “no”)
  • count, which is a non-negative integer representing the number of observations.

The observations in this data are unique combinations of sex and pregnancy status.
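
A minimal sketch of the pivot, assuming preg is the wide tibble given in the exercise, with a pregnant column and separate female and male count columns:

```r
preg_tidy <- preg %>%
  pivot_longer(c(male, female), names_to = "sex", values_to = "count")
preg_tidy
```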

Remove the (male, pregnant) row with a missing value to simplify the tidied data frame.
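
Using values_drop_na = TRUE drops that row during the pivot:

```r
preg_tidy2 <- preg %>%
  pivot_longer(
    c(male, female),
    names_to = "sex",
    values_to = "count",
    values_drop_na = TRUE
  )
preg_tidy2
```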

This is an example of turning an explicit missing value into an implicit missing value, which is discussed in the upcoming Missing Values section. The missing (male, pregnant) row represents an implicit missing value because the value of count can be inferred from its absence. In the tidy data, we can represent rows with missing values of count either explicitly with an NA (as in preg_tidy) or implicitly by the absence of a row (as in preg_tidy2). But in the wide data, the missing values can only be represented explicitly.

Though we have already done enough to make the data tidy, there are some other transformations that can clean the data further. If a variable takes two values, like pregnant and sex, it is often preferable to store them as logical vectors.
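
A sketch of that conversion (preg_tidy3 is an illustrative name):

```r
preg_tidy3 <- preg_tidy2 %>%
  mutate(
    female = sex == "female",
    pregnant = pregnant == "yes"
  ) %>%
  select(female, pregnant, count)
preg_tidy3
```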

In the previous data frame, I named the logical variable representing the sex female, not sex. This makes the meaning of the variable self-documenting. If the variable were named sex with values TRUE and FALSE, without reading the documentation, we wouldn’t know whether TRUE means male or female.

Apart from some minor memory savings, representing these variables as logical vectors results in more clear and concise code. Compare the filter() calls to select non-pregnant females from preg_tidy2 and preg_tidy3.
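
For example:

```r
filter(preg_tidy2, sex == "female", pregnant == "no")
filter(preg_tidy3, female, !pregnant)
```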

12.4 Separating and uniting

Exercise 12.4.1

The extra argument tells separate() what to do if there are too many pieces, and the fill argument tells it what to do if there aren’t enough. By default, separate() drops extra values with a warning.
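
A sketch of the default behavior, using a toy data frame consistent with the discussion below (the exercise's exact values may differ; d1 is an illustrative name):

```r
d1 <- tibble(x = c("a,b,c", "d,e,f,g", "h,i,j"))

d1 %>%
  separate(x, into = c("one", "two", "three"))
# warns that additional pieces were discarded in row 2
```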

Adding the argument, extra = "drop", produces the same result as above but without the warning.
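
For example:

```r
d1 %>%
  separate(x, into = c("one", "two", "three"), extra = "drop")
```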

Setting extra = "merge" means the extra values are not split, so "f,g" appears in column three.
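
For example:

```r
d1 %>%
  separate(x, into = c("one", "two", "three"), extra = "merge")
```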

In this example, one of the values, "d,e", has too few elements. The default behavior of fill is similar to that of extra: it fills the missing columns with NA but emits a warning. In this example, the 2nd row of column three is NA.
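
A toy data frame with too few pieces in one value (again, illustrative values):

```r
d2 <- tibble(x = c("a,b,c", "d,e", "f,g,i"))

d2 %>%
  separate(x, into = c("one", "two", "three"))
# warns that missing pieces were filled with NA in row 2
```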

An alternative option for fill is "right", which also fills with missing values from the right, but without a warning.
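
For example:

```r
d2 %>%
  separate(x, into = c("one", "two", "three"), fill = "right")
```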

The option fill = "left" also fills with missing values without emitting a warning, but this time from the left side. Now, the 2nd row of column one will be missing, and the other values in that row are shifted right.
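
For example:

```r
d2 %>%
  separate(x, into = c("one", "two", "three"), fill = "left")
```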

Exercise 12.4.2

Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?

The remove argument discards input columns in the result data frame. You would set it to FALSE if you want to create a new variable, but keep the old one.
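
For example, with the table5 data from tidyr, remove = FALSE keeps the original century and year columns alongside the new one:

```r
table5 %>%
  unite(new, century, year, sep = "", remove = FALSE)
```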

Exercise 12.4.3

Compare and contrast separate() and extract(). Why are there three variations of separation (by position, by separator, and with groups), but only one unite()?

The function separate() splits a column into multiple columns by separator, if the sep argument is a character vector, or by character positions, if sep is numeric.

The function extract() uses a regular expression to specify groups in a character vector and splits that single character vector into multiple columns. This is more flexible than separate() because it does not require a common separator or specific column positions.

Both separate() and extract() convert a single column to many columns. However, unite() converts many columns to one, with a choice of a separator to include between column values.

In other words, with extract() and separate() only one column can be chosen, but there are many choices how to split that single column into different columns. With unite(), there are many choices as to which columns to include, but only one choice as to how to combine their contents into a single vector.
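
A small illustration of the three splitting styles and the single unite(); the data frame and column names here are hypothetical:

```r
df <- tibble(x = c("X_1", "X_2", "AA_1", "AA_2"))

# split by separator
df %>% separate(x, into = c("variable", "id"), sep = "_")

# split by position (after the first character)
df %>% separate(x, into = c("prefix", "rest"), sep = 1)

# split using regular-expression groups
df %>% extract(x, into = c("variable", "id"), regex = "([A-Z]+)_([0-9]+)")

# unite many columns into one
df %>%
  separate(x, into = c("variable", "id"), sep = "_") %>%
  unite(x, variable, id, sep = "_")
```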

12.5 Missing values

Exercise 12.5.1

Compare and contrast the fill arguments to pivot_wider() and complete().

The values_fill argument of pivot_wider() and the fill argument of complete() both specify values to replace NA. Both arguments accept named lists to set a different replacement value for each column; values_fill in pivot_wider() also accepts a single value to use for all columns. In complete(), fill replaces both explicit and implicit missing values by default, while values_fill in pivot_wider() only fills cells that are missing because a combination of names does not appear in the data.

For example, this will fill in the missing values of the long data frame with 0 using complete():

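A minimal sketch, assuming stocks here refers to the quarterly-returns data frame defined in the chapter's missing-values section (with one explicit NA and one implicit missing quarter):

```r
stocks %>%
  complete(year, qtr, fill = list(return = 0))
```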

Exercise 12.5.2

What does the direction argument to fill() do?

With fill(), the .direction argument determines whether NA values should be replaced by the previous non-missing value ("down") or the next non-missing value ("up").
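
A small hypothetical example:

```r
df <- tibble(
  person    = c("Derrick Whitmore", NA, NA, "Katherine Burke"),
  treatment = c(1, 2, 3, 1)
)

# carry the last non-missing value downward
df %>% fill(person, .direction = "down")

# or fill upward from the next non-missing value
df %>% fill(person, .direction = "up")
```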

12.6 Case Study

This code is repeated from the chapter because it is needed by the exercises.
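
The pipeline repeated there looks roughly like the following; this is a reconstruction, and the chapter's exact code may differ. The result is called who_tidy here and is referred to by the sketches below.

```r
who_tidy <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = "key",
    values_to = "cases",
    values_drop_na = TRUE
  ) %>%
  mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
  separate(key, into = c("new", "var", "sexage"), sep = "_") %>%
  select(-new, -iso2, -iso3) %>%
  separate(sexage, into = c("sex", "age"), sep = 1)
```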

Exercise 12.6.1

In this case study, I set na.rm = TRUE just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What’s the difference between an NA and zero?

The reasonableness of using na.rm = TRUE depends on how missing values are represented in this dataset. The main concern is whether a missing value means that there were no cases of TB or whether it means that the WHO does not have data on the number of TB cases. Here are some things we should look for to help distinguish between these cases.

  • If there are no 0 values in the data, then missing values may be used to indicate no cases.

  • If there are both explicit and implicit missing values, then it suggests that missing values are being used differently. In that case, it is likely that explicit missing values would mean no cases, and implicit missing values would mean no data on the number of cases.

First, I’ll check for the presence of zeros in the data.
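
A sketch, using the who_tidy data frame from above:

```r
who_tidy %>%
  filter(cases == 0) %>%
  nrow()
```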

There are zeros in the data, so it appears that zero cases of TB are explicitly indicated, and the value of NA is used to indicate missing data.

Second, I should check whether all values for a (country, year) are missing or whether it is possible for only some columns to be missing.
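
One way to check this is to pivot the original wide who data without dropping NAs and compute, for each (country, year), the proportion of case columns that are missing:

```r
who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = "key",
    values_to = "cases"
  ) %>%
  group_by(country, year) %>%
  summarise(prop_missing = mean(is.na(cases)), .groups = "drop") %>%
  filter(prop_missing > 0, prop_missing < 1)
```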

From the results above, it looks like it is possible for a (country, year) row to contain some, but not all, missing values in its columns.

Finally, I will check for implicit missing values. Implicit missing values are (year, country) combinations that do not appear in the data.
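
A sketch of that check: complete() all (country, year) combinations and compare the row counts.

```r
nrow(who)

who %>%
  complete(country, year) %>%
  nrow()
```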

Since the number of complete cases of (country, year) is greater than the number of rows in who, there are some implicit missing values. But that doesn’t tell us what those implicit missing values are. To find them, I will use the anti_join() function introduced in the later Relational Data chapter.
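
A sketch of that step, counting the missing (country, year) combinations per country:

```r
anti_join(
  expand(who, country, year),
  who,
  by = c("country", "year")
) %>%
  count(country)
```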

All of these refer to (country, year) combinations for years prior to the existence of the country. For example, Timor-Leste achieved independence in 2002, so years prior to that are not included in the data.

To summarize:

  • 0 is used to represent no cases of TB.
  • Explicit missing values (NAs) are used to represent missing data for (country, year) combinations in which the country existed in that year.
  • Implicit missing values are used to represent missing data because a country did not exist in that year.

Exercise 12.6.3

I claimed that iso2 and iso3 were redundant with country. Confirm this claim.

If iso2 and iso3 are redundant with country, then, within each country, there should only be one distinct combination of iso2 and iso3 values, which is the case.
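
A sketch of the check: within each country, count the distinct (iso2, iso3) combinations and confirm that none exceed one.

```r
who %>%
  group_by(country) %>%
  summarise(n_iso = n_distinct(iso2, iso3), .groups = "drop") %>%
  filter(n_iso > 1)
```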

This makes sense, since iso2 and iso3 contain the two- and three-letter abbreviations for the country: the iso2 variable contains each country’s ISO 3166 alpha-2 abbreviation, and the iso3 variable contains each country’s ISO 3166 alpha-3 abbreviation. You may recognize the two-letter ISO 3166 abbreviations, since they are almost identical to internet country-code top-level domains, such as .uk (United Kingdom), .ly (Libya), .tv (Tuvalu), and .io (British Indian Ocean Territory).

Exercise 12.6.4

For each country, year, and sex compute the total number of cases of TB. Make an informative visualization of the data.
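
One possible approach, sketched with the who_tidy data frame defined earlier: aggregate cases by country, year, and sex, then draw one line per country and sex.

```r
who_tidy %>%
  group_by(country, year, sex) %>%
  summarise(cases = sum(cases), .groups = "drop") %>%
  unite(country_sex, country, sex, remove = FALSE) %>%
  ggplot(aes(x = year, y = cases, group = country_sex, colour = sex)) +
  geom_line()
```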

A small multiples plot faceting by country is difficult given the number of countries. Another option is to focus on the countries with the largest changes or absolute magnitudes after providing the overall context with a plot like the one above.

12.7 Non-tidy data

No exercises

[ex-12.2.2]: It would be better to join these tables using the methods covered in the Relational Data chapter. We could use inner_join(t2_cases, t2_population, by = c("country", "year")).

[non-syntactic]: See the Creating Tibbles section.