11 Data import
11.2 Getting started
What function would you use to read a file where fields were separated with “|”?
Use the read_delim() function with the argument delim = "|".
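As a minimal sketch of that answer, the following reads a small pipe-delimited string. The sample text and the variable names are made up for illustration, and wrapping literal data in I() assumes a reasonably recent version of readr.

```r
library(readr)

# Hypothetical pipe-delimited data, supplied as literal text
pipe_text <- I("a|b\n1|2\n")

# delim tells read_delim() which character separates the fields
pipe_df <- read_delim(pipe_text, delim = "|")
print(pipe_df)
```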
Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?
They have the following arguments in common:
- col_names and col_types are used to specify the column names and how to parse the columns.
- locale is important for determining things like the encoding and whether “.” or “,” is used as a decimal mark.
- na and quoted_na control which strings are treated as missing values when parsing vectors.
- trim_ws trims whitespace before and after cells before parsing.
- n_max sets how many rows to read.
- guess_max sets how many rows to use when guessing the column type.
- progress determines whether a progress bar is shown.
In fact, the two functions have the exact same arguments:
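One way to check this is to compare the formal argument names of the two functions directly. This is only a sketch: the exact argument lists depend on the installed readr version, so no output is shown.

```r
library(readr)

csv_args <- names(formals(read_csv))
tsv_args <- names(formals(read_tsv))

# Arguments shared by both functions
common_args <- intersect(csv_args, tsv_args)
print(common_args)

# Arguments that appear in only one of the two functions, if any
print(setdiff(csv_args, tsv_args))
print(setdiff(tsv_args, csv_args))
```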
What are the most important arguments to read_fwf()?
The most important argument to read_fwf(), which reads “fixed-width formats”, is col_positions, which tells the function where data columns begin and end.
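To illustrate col_positions, here is a small sketch that uses the fwf_widths() helper to describe two columns by their widths. The sample data and column names are hypothetical.

```r
library(readr)

# Hypothetical fixed-width data: a 2-character code followed by a 3-digit value
fixed_text <- I("AB123\nCD456\n")

# fwf_widths() builds the col_positions specification from column widths
fixed_df <- read_fwf(
  fixed_text,
  col_positions = fwf_widths(c(2, 3), col_names = c("code", "value"))
)
print(fixed_df)
```

Other helpers such as fwf_positions() describe the same information by start and end positions instead of widths.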
Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or '. By default, read_csv() assumes that the quoting character will be ", and if you want to change it you’ll need to use read_delim() instead.
What arguments do you need to specify to read the following text into a data frame?
For read_delim(), we will need to specify a delimiter, in this case ",", and a quote argument.
However, this question is out of date.
read_csv() now supports a quote argument, so the following code works.
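The text from the exercise is not reproduced above. Assuming it is the string from the corresponding R for Data Science exercise, a sketch of the read_csv() call looks like this:

```r
library(readr)

# Assumed exercise input: the second field is wrapped in single quotes
x <- "x,y\n1,'a,b'"

# quote = "'" makes read_csv() treat single quotes as the quoting character
quoted_df <- read_csv(I(x), quote = "'")
print(quoted_df)
```

With the single quote set as the quoting character, the comma inside 'a,b' is kept as part of the value rather than treated as a field separator.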
Identify what is wrong with each of the following inline CSV files. What happens when you run the code?
Only two columns are specified in the header “a” and “b”, but the rows have three columns, so the last column is dropped.
The number of columns in the data does not match the number of columns in the header (three).
In row one, there are only two values, so column c is set to missing.
In row two, there is an extra value, and that value is dropped.
It’s not clear what the intent was here.
The opening quote "1 is dropped because it is not closed, and a is treated as an integer.
Both “a” and “b” are treated as character vectors since they contain non-numeric strings. This may have been intentional, or the author may have intended the values of the columns to be “1,2” and “a,b”.
The values are separated by “;” rather than “,”. Use read_csv2() instead.
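A short sketch of the semicolon case, using an assumed input string, shows the difference between the two readers:

```r
library(readr)

# Assumed input: values separated by semicolons
semicolon_text <- "a;b\n1;3\n"

# read_csv() finds no commas, so everything lands in a single column
as_csv <- read_csv(I(semicolon_text))
print(ncol(as_csv))

# read_csv2() splits on ";" and recovers both columns
as_csv2 <- read_csv2(I(semicolon_text))
print(as_csv2)
```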
11.3 Parsing a vector
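What are the most important arguments to locale()?
The locale object has arguments to set the following:
- date and time formats: date_names, date_format, and time_format
- time zone: tz
- numbers: decimal_mark, grouping_mark
- encoding: encoding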
What happens if you try and set decimal_mark and grouping_mark to the same character?
What happens to the default value of grouping_mark when you set decimal_mark to ","?
What happens to the default value of decimal_mark when you set the grouping_mark to "."?
If the decimal and grouping marks are set to the same character, locale() throws an error:
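A minimal sketch of that failure, wrapped in tryCatch() so the script keeps running after the error:

```r
library(readr)

# Setting both marks to the same character is rejected by locale()
same_mark <- tryCatch(
  locale(decimal_mark = ".", grouping_mark = "."),
  error = function(e) e
)

# TRUE if locale() signalled an error; the message explains why
print(inherits(same_mark, "error"))
print(conditionMessage(same_mark))
```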
If the decimal_mark is set to the comma ",", then the grouping mark is set to the period ".":
    locale(decimal_mark = ",")
    #> <locale>
    #> Numbers:  123.456,78
    #> Formats:  %AD / %AT
    #> Timezone: UTC
    #> Encoding: UTF-8
    #> <date_names>
    #> Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday
    #>         (Thu), Friday (Fri), Saturday (Sat)
    #> Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May),
    #>         June (Jun), July (Jul), August (Aug), September (Sep), October
    #>         (Oct), November (Nov), December (Dec)
    #> AM/PM:  AM/PM
If the grouping mark is set to a period, then the decimal mark is set to a comma:
    locale(grouping_mark = ".")
    #> <locale>
    #> Numbers:  123.456,78
    #> Formats:  %AD / %AT
    #> Timezone: UTC
    #> Encoding: UTF-8
    #> <date_names>
    #> Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday
    #>         (Thu), Friday (Fri), Saturday (Sat)
    #> Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May),
    #>         June (Jun), July (Jul), August (Aug), September (Sep), October
    #>         (Oct), November (Nov), December (Dec)
    #> AM/PM:  AM/PM
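The effect of these paired defaults on parsing can be sketched as follows. The input numbers are made up for illustration, and the sketch reads the marks back from the locale object's decimal_mark and grouping_mark fields.

```r
library(readr)

# Setting only the decimal mark flips the default grouping mark to "."
comma_decimal <- locale(decimal_mark = ",")
print(comma_decimal$grouping_mark)
comma_value <- parse_number("1.234,56", locale = comma_decimal)
print(comma_value)

# Setting only the grouping mark flips the default decimal mark to ","
period_grouping <- locale(grouping_mark = ".")
print(period_grouping$decimal_mark)
period_value <- parse_number("123.456.789", locale = period_grouping)
print(period_value)
```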
I didn’t discuss the date_format and time_format options to locale().
What do they do?
Construct an example that shows when they might be useful.
They provide default date and time formats. The readr vignette discusses using these to parse dates, since dates can include language-specific weekday and month names and different conventions for specifying AM/PM.
    locale()
    #> <locale>
    #> Numbers:  123,456.78
    #> Formats:  %AD / %AT
    #> Timezone: UTC
    #> Encoding: UTF-8
    #> <date_names>
    #> Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday
    #>         (Thu), Friday (Fri), Saturday (Sat)
    #> Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May),
    #>         June (Jun), July (Jul), August (Aug), September (Sep), October
    #>         (Oct), November (Nov), December (Dec)
    #> AM/PM:  AM/PM
Examples from the readr vignette of parsing French dates
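The vignette's examples are not reproduced above. A sketch along the same lines, using the built-in French date names, might look like this:

```r
library(readr)

# Full French month name, parsed with %B and the "fr" locale
french_full <- parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
print(french_full)

# Abbreviated French month name, parsed with %b
french_abbr <- parse_date("14 oct. 1979", "%d %b %Y", locale = locale("fr"))
print(french_abbr)
```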
Both the date format and time format are used for guessing column types.
Thus if you were often parsing data that had non-standard formats for the date and time, you could specify custom values for date_format and time_format.
    # %m is the month; %M would be read as minutes
    locale_custom <- locale(date_format = "Day %d Mon %m Year %y",
                            time_format = "Sec %S Min %M Hour %H")
    date_custom <- c("Day 01 Mon 02 Year 03", "Day 03 Mon 01 Year 01")
    parse_date(date_custom)
    #> Warning: 2 parsing failures.
    #> row col expected actual
    #>   1  -- date like Day 01 Mon 02 Year 03
    #>   2  -- date like Day 03 Mon 01 Year 01
    #> [1] NA NA
    parse_date(date_custom, locale = locale_custom)
    #> [1] "2003-02-01" "2001-01-03"
    time_custom <- c("Sec 01 Min 02 Hour 03", "Sec 03 Min 02 Hour 01")
    parse_time(time_custom)
    #> Warning: 2 parsing failures.
    #> row col expected actual
    #>   1  -- time like Sec 01 Min 02 Hour 03
    #>   2  -- time like Sec 03 Min 02 Hour 01
    #> NA
    #> NA
    parse_time(time_custom, locale = locale_custom)
    #> 03:02:01
    #> 01:02:03
If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.
Read the help page for locale() with ?locale to learn about the different variables that can be set.
As an example, consider Australia.
Most of the default values are valid, except that the date format is “(d)d/mm/yyyy”, meaning that January 2, 2006 is written as 02/01/2006.
However, the default locale expects dates in year-month-day order, so it will not read that string as January 2, 2006.
To correctly parse Australian dates, define a new locale object, au_locale, with a day-first date format such as "%d/%m/%Y".
Using parse_date() with au_locale as its locale will correctly parse our example date.
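A minimal sketch of such a locale, assuming the only setting that needs to change is the date format:

```r
library(readr)

# Day-first date format, as commonly written in Australia
au_locale <- locale(date_format = "%d/%m/%Y")

# January 2, 2006 written day-first
au_date <- parse_date("02/01/2006", locale = au_locale)
print(au_date)
```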
What’s the difference between read_csv() and read_csv2()?
The delimiter. The function read_csv() uses a comma, while read_csv2() uses a semi-colon (;). Using a semi-colon is useful when commas are used as the decimal point (as in Europe).
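The decimal-comma case can be sketched with a small made-up input, where each value uses a comma as its decimal mark:

```r
library(readr)

# Hypothetical data in the European convention: ";" separates fields,
# "," is the decimal mark
euro_text <- I("x;y\n1,5;2,25\n")

# read_csv2() parses the fields as numbers with a decimal comma
euro_df <- read_csv2(euro_text)
print(euro_df)
```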
What are the most common encodings used in Europe? What are the most common encodings used in Asia? Do some googling to find out.
UTF-8 is standard now, and ASCII has been around forever.
For European languages, there are separate encodings for Romance languages and Eastern European languages using Latin script, as well as for Cyrillic, Greek, Hebrew, and Turkish, usually with separate ISO and Windows encoding standards. There is also Mac OS Roman.
For Asian languages, Arabic and Vietnamese have ISO and Windows standards. The other major Asian scripts have their own:
- Japanese: JIS X 0208, Shift JIS, ISO-2022-JP
- Chinese: GB 2312, GBK, GB 18030
- Korean: KS X 1001, EUC-KR, ISO-2022-KR
The documentation for stringi::stri_enc_detect() provides a good list of encodings, since the function supports the most common ones:
- Western European Latin script languages: ISO-8859-1, Windows-1252 (also CP-1252, for code page)
- Eastern European Latin script languages: ISO-8859-2, Windows-1250
- Greek: ISO-8859-7
- Turkish: ISO-8859-9, Windows-1254
- Hebrew: ISO-8859-8, IBM424, Windows-1255
- Russian: Windows-1251
- Japanese: Shift JIS, ISO-2022-JP, EUC-JP
- Korean: ISO-2022-KR, EUC-KR
- Chinese: GB18030, ISO-2022-CN (Simplified), Big5 (Traditional)
- Arabic: ISO-8859-6, IBM420, Windows-1256
For more information on character encodings see the following sources.
- The Wikipedia page Character encoding has a good list of encodings.
- Unicode CLDR project
- What is the most common encoding of each language (Stack Overflow)
- “What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text”, http://kunststube.net/encoding/.
Programs that identify the encoding of text include:
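Within R, one option is readr's guess_encoding(). The sketch below uses a Latin-1 encoded string as an assumed test input. The guess_encoding() call is guarded because it relies on the stringi package being installed, and its guesses are heuristic, so no output is shown for it.

```r
library(readr)

# A string whose bytes are Latin-1, not UTF-8 (\xf1 is "n with tilde")
latin1_text <- "El Ni\xf1o was particularly bad this year"

# Ask readr to guess the encoding from the raw bytes (needs stringi)
if (requireNamespace("stringi", quietly = TRUE)) {
  print(guess_encoding(charToRaw(latin1_text)))
}

# Once the encoding is known, declare it in the locale when parsing
fixed_text <- parse_character(latin1_text, locale = locale(encoding = "Latin1"))
print(fixed_text)
```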
Generate the correct format string to parse each of the following dates and times:
The correct formats are:
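The exercise inputs are not reproduced above. Assuming they are the strings d1 to d5, t1, and t2 from the corresponding R for Data Science exercise, a sketch of the format strings is:

```r
library(readr)

# Assumed exercise inputs
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
t1 <- "1705"
t2 <- "11:15:10.12 PM"

p1 <- parse_date(d1, "%B %d, %Y")    # full month name, day, 4-digit year
p2 <- parse_date(d2, "%Y-%b-%d")     # abbreviated month name
p3 <- parse_date(d3, "%d-%b-%Y")
p4 <- parse_date(d4, "%B %d (%Y)")   # literal parentheses in the format
p5 <- parse_date(d5, "%m/%d/%y")     # 2-digit year
pt1 <- parse_time(t1, "%H%M")        # hours and minutes with no separator
pt2 <- parse_time(t2, "%H:%M:%OS %p") # fractional seconds plus AM/PM
```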
t2 uses real seconds,