I need your help!

If you find any typos, errors, or places where the text may be improved, please let me know. The best ways to provide feedback are by GitHub or hypothes.is annotations.

Opening an issue or submitting a pull request on GitHub

Hypothesis Adding an annotation using hypothes.is. To add an annotation, select some text and then click the on the pop-up menu. To see the annotations of others, click the in the upper right-hand corner of the page.

11 Data import

11.1 Introduction

11.2 Getting started

Exercise 11.2.1

What function would you use to read a file where fields were separated with “|”?

Use the read_delim() function with the argument delim="|".

Exercise 11.2.2

Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?

They have the following arguments in common:

  • col_names and col_types are used to specify the column names and how to parse the columns
  • locale is important for determining things like the encoding and whether “.” or “,” is used as a decimal mark.
  • na and quoted_na control which strings are treated as missing values when parsing vectors
  • trim_ws trims whitespace before and after cells before parsing
  • n_max sets how many rows to read
  • guess_max sets how many rows to use when guessing the column type
  • progress determines whether a progress bar is shown.

In fact, the two functions have the exact same arguments:

Exercise 11.2.3

What are the most important arguments to read_fwf()?

The most important argument to read_fwf() which reads “fixed-width formats”, is col_positions which tells the function where data columns begin and end.

Exercise 11.2.4

Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or '. By convention, read_csv() assumes that the quoting character will be ", and if you want to change it you’ll need to use read_delim() instead. What arguments do you need to specify to read the following text into a data frame?

"x,y\n1,'a,b'"

For read_delim(), we will will need to specify a delimiter, in this case ",", and a quote argument.

However, this question is out of date. read_csv() now supports a quote argument, so the following code works.

Exercise 11.2.5

Identify what is wrong with each of the following inline CSV files. What happens when you run the code?

Only two columns are specified in the header “a” and “b”, but the rows have three columns, so the last column is dropped.

The numbers of columns in the data do not match the number of columns in the header (three). In row one, there are only two values, so column c is set to missing. In row two, there is an extra value, and that value is dropped.

It’s not clear what the intent was here. The opening quote "1 is dropped because it is not closed, and a is treated as an integer.

Both “a” and “b” are treated as character vectors since they contain non-numeric strings. This may have been intentional, or the author may have intended the values of the columns to be “1,2” and “a,b”.

The values are separated by “;” rather than “,”. Use read_csv2() instead:

11.3 Parsing a vector

Exercise 11.3.1

What are the most important arguments to locale()?

The locale object has arguments to set the following:

  • date and time formats: date_names, date_format, and time_format
  • time zone: tz
  • numbers: decimal_mark, grouping_mark
  • encoding: encoding

Exercise 11.3.2

What happens if you try and set decimal_mark and grouping_mark to the same character? What happens to the default value of grouping_mark when you set decimal_mark to ","? What happens to the default value of decimal_mark when you set the grouping_mark to "."?

Exercise 11.3.3

I didn’t discuss the date_format and time_format options to locale(). What do they do? Construct an example that shows when they might be useful.

They provide default date and time formats. The readr vignette discusses using these to parse dates: since dates can include languages specific weekday and month names, and different conventions for specifying AM/PM

Examples from the readr vignette of parsing French dates

Both the date format and time format are used for guessing column types. Thus if you were often parsing data that had non-standard formats for the date and time, you could specify custom values for date_format and time_format.

Exercise 11.3.4

If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.

Read the help page for locale() using ?locale to learn about the different variables that can be set.

As an example, consider Australia. Most of the defaults values are valid, except that the date format is “(d)d/mm/yyyy”, meaning that January 2, 2006 is written as 02/01/2006.

However, default locale will parse that date as February 1, 2006.

To correctly parse Australian dates, define a new locale object.

Using parse_date() with the au_locale as its locale will correctly parse our example date.

Exercise 11.3.5

What’s the difference between read_csv() and read_csv2()?

The delimiter. The function read_csv() uses a comma, while read_csv2() uses a semi-colon (;). Using a semi-colon is useful when commas are used as the decimal point (as in Europe).

Exercise 11.3.6

What are the most common encodings used in Europe? What are the most common encodings used in Asia? Do some googling to find out.

UTF-8 is standard now, and ASCII has been around forever.

For the European languages, there are separate encodings for Romance languages and Eastern European languages using Latin script, Cyrillic, Greek, Hebrew, Turkish: usually with separate ISO and Windows encoding standards. There is also Mac OS Roman.

For Asian languages Arabic and Vietnamese have ISO and Windows standards. The other major Asian scripts have their own:

  • Japanese: JIS X 0208, Shift JIS, ISO-2022-JP
  • Chinese: GB 2312, GBK, GB 18030
  • Korean: KS X 1001, EUC-KR, ISO-2022-KR

The list in the documentation for stringi::stri_enc_detect() is a good list of encodings since it supports the most common encodings.

  • Western European Latin script languages: ISO-8859-1, Windows-1250 (also CP-1250 for code-point)
  • Eastern European Latin script languages: ISO-8859-2, Windows-1252
  • Greek: ISO-8859-7
  • Turkish: ISO-8859-9, Windows-1254
  • Hebrew: ISO-8859-8, IBM424, Windows 1255
  • Russian: Windows 1251
  • Japanese: Shift JIS, ISO-2022-JP, EUC-JP
  • Korean: ISO-2022-KR, EUC-KR
  • Chinese: GB18030, ISO-2022-CN (Simplified), Big5 (Traditional)
  • Arabic: ISO-8859-6, IBM420, Windows 1256

For more information on character encodings see the following sources.

Programs that identify the encoding of text include:

  • readr::guess_encoding()
  • stringi::str_enc_detect()
  • iconv
  • chardet (Python)

11.4 Parsing a file

No exercises

11.5 Writing to a file

No exercises

11.6 Other types of data

No exercises