4 Importing Data


Summary

  • Data import refers to reading in data from storage into the main memory of a programming environment.

  • Common ways to organize data include comma separated files, fixed width files, and specialty formats such as the Rdata format and JSON.

  • Data export is the process of writing data from the programming environment to permanent storage.


While the data sets we have used so far have been built into R, most real data sets are not so easily available. The process of bringing data into a computer so that it can be analyzed by a programming environment is called data import.

Data import is the process of reading in data from storage to a programming environment.

The other direction of moving data from the programming environment to storage is known as data export.

Data export is the process of writing out data from a programming environment to storage.

There are of course many types of storage that computers can read, corresponding to the different levels of memory in a modern computer. The fastest is the main memory built into the computer itself, which can be accessed practically instantly. Hard drive storage comes next, but is much slower than main memory. Slower still is remote storage through a cloud service such as Dropbox or AWS. The methods introduced here can draw files from any of these sources and bring them into the main environment of your computer.

Of course, some data sets are just too large to fit in the main memory of a computer. These are referred to as big data.

A data set is big data if it cannot be stored in the fast main memory of a computer.

Different techniques are needed to handle data sets of this size, and they will not be discussed here. The good news is that what constitutes big data changes all the time as hardware improves. In 2021, laptops often come with several terabytes of local storage and dozens of gigabytes of main memory, so data has to be very large indeed to qualify as big data these days.

Since even a non-big data set can have millions of entries, it is helpful to have standard formats for recording entries. The gold standard is what is known as tidy data. In a tidy data set:

  1. Each row corresponds to an observation.
  2. Each column corresponds to a variable (something that can be written down).
  3. Each entry only contains a single value.

The tidyverse includes several functions for reading in files.

library(tidyverse)

The following sample of data from the International Comprehensive Ocean Atmosphere Data Set (ICOADS) is in tidy form. It is stored as a tibble.

## # A tibble: 6 × 4
##   Latitude Longitude `Time of Observation` `Air Temperature`
##      <dbl>     <dbl> <dttm>                            <dbl>
## 1    -35.8       2.9 2015-01-12 00:00:00                70.5
## 2    -35.9       6.9 2015-01-12 12:00:00                66.2
## 3    -35.9       8.6 2015-01-12 18:00:00                61.2
## 4    -35.7      11.9 2015-01-13 06:00:00                64.6
## 5    -35.6      13.7 2015-01-13 12:00:00                66.7
## 6    -35.4      15.5 2015-01-13 18:00:00                68.4

Each row in the table contains exactly one observation, and each column consists of things that are being measured at that observation.

4.1 Comma separated file

How might tidy data be stored in a text file? One way is to use a comma separated file, or CSV format.

A text document is a comma separated file or CSV if it stores data in a table using plain text where the entries for a row are separated by commas.

The above data in a CSV file looks like this:

Latitude,Longitude,Time of Observation,Air Temperature
-35.8,2.9,2015-01-12T00:00:00Z,70.5
-35.9,6.9,2015-01-12T12:00:00Z,66.2
-35.9,8.6,2015-01-12T18:00:00Z,61.2
-35.7,11.9,2015-01-13T06:00:00Z,64.6
-35.6,13.7,2015-01-13T12:00:00Z,66.7
-35.4,15.5,2015-01-13T18:00:00Z,68.4

Note that the variable names make up the first row, and then after that each row of data contains exactly one observation, where the columns are separated by commas.

To see this in practice, several commands from the tidyverse are needed, so start by loading the tidyverse.

library(tidyverse)

To make a CSV file with the above data on your local computer, the following command could be used.

write_lines(c("Latitude,Longitude,Time of Observation,Air Temperature",
              "-35.8,2.9,2015-01-12T00:00:00Z,70.5",
              "-35.9,6.9,2015-01-12T12:00:00Z,66.2",
              "-35.9,8.6,2015-01-12T18:00:00Z,61.2",
              "-35.7,11.9,2015-01-13T06:00:00Z,64.6",
              "-35.6,13.7,2015-01-13T12:00:00Z,66.7",
              "-35.4,15.5,2015-01-13T18:00:00Z,68.4"),
            "sample_icoads.csv"
            )

This writes out the data as the file sample_icoads.csv. Now this file could be read into R using the read_csv command.

sample1 <- read_csv("sample_icoads.csv")
## Rows: 6 Columns: 4
## ── Column specification ───────────────────────────────────
## Delimiter: ","
## dbl  (3): Latitude, Longitude, Air Temperature
## dttm (1): Time of Observation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

You can see that for each of the four variables in the data set, Latitude, Longitude, Time of Observation, and Air Temperature, the read_csv command tried to figure out what type of variable the data should be.

For the Latitude, it used col_double(). That means that it made the data a double data type, which is what is typically used for floating point numbers. In early computers, the float data type was stored using 4 bytes of storage. More precision was quickly needed, so a new type of variable called double was created that uses 8 bytes of storage. So today, floating point numbers such as -35.8 or 13.7 are typically stored as the double type.
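A quick way to see this type in action is the typeof function, which reports how R stores a value:

```r
# Numeric literals in R are doubles by default.
typeof(-35.8)
## [1] "double"

# Even a literal without a decimal point is a double,
# unless the L suffix marks it as an integer.
typeof(70)
## [1] "double"
typeof(70L)
## [1] "integer"
```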

The Time of Observation variable looks to R like a date, and so it was placed in a datetime variable. If the data set is examined by typing its name, you see abbreviations for these data types.

sample1
## # A tibble: 6 × 4
##   Latitude Longitude `Time of Observation` `Air Temperature`
##      <dbl>     <dbl> <dttm>                            <dbl>
## 1    -35.8       2.9 2015-01-12 00:00:00                70.5
## 2    -35.9       6.9 2015-01-12 12:00:00                66.2
## 3    -35.9       8.6 2015-01-12 18:00:00                61.2
## 4    -35.7      11.9 2015-01-13 06:00:00                64.6
## 5    -35.6      13.7 2015-01-13 12:00:00                66.7
## 6    -35.4      15.5 2015-01-13 18:00:00                68.4

Here you see <dbl> right below Latitude, indicating that it is a double type variable, and <dttm> right below Time of Observation, indicating that it is a date and time variable.

4.1.1 Drawing data from the web

The sample_icoads.csv file was stored on your local file directory, but the read_csv command can also draw from a web address, typically given by a Uniform Resource Locator (URL).

For example, (as of 2021-06-04), a larger sample from the same data set was stored on the National Centers for Environmental Information website, run by the National Oceanic and Atmospheric Administration at https://www1.ncdc.noaa.gov/pub/data/cdo/samples/Marine_CSV_sample.csv. The following loads this data directly from the website into a variable.

sample2 <- 
  read_csv("https://www1.ncdc.noaa.gov/pub/data/cdo/samples/Marine_CSV_sample.csv")

This data set contains many more variables and observations than the small sample given above. The first few observations are as follows, written here in comma separated form because the table is too wide to print.

Identification,Latitude,Longitude,Time of Observation,Ice Accretion On Ship,Thickness of Ice Accretion On Ship,Rate of Ice Accretion on Ship,Sea Level Pressure,Characteristics of Pressure Tendency,Pressure Tendency,Air Temperature,Wet Bulb Temperature,Dew Point Temperature,Sea Surface Temperature,Wave Direction,Wave Period,Wave Height,Swell Direction,Swell Period,Swell Height,Total Cloud Amount,Low Cloud Amount,Low Cloud Type,Cloud Height Indicator,Cloud Height,Middle Cloud Type,High Cloud Type,Visibility,Visibility Indicator,Present Weather,Past Weather,Wind Direction,Wind Speed
D5GN6,-35.8,2.9,2015-01-12 00:00:00,NA,NA,NA,29.83,8,0,70.5,NA,65.5,NA,NA,NA,NA,NA,NA,NA,7,NA,NA,NA,5,NA,NA,97,NA,3,2,300,139
D5GN6,-35.9,6.9,2015-01-12 12:00:00,NA,NA,NA,29.93,1,0,66.2,NA,56.1,NA,NA,9,10,NA,NA,NA,8,NA,NA,NA,2,1,A,97,NA,2,2,180,165
D5GN6,-35.9,8.6,2015-01-12 18:00:00,NA,NA,NA,30.02,1,0,61.2,NA,48.9,NA,NA,7,7,24,12,8,8,NA,NA,NA,2,0,0,97,NA,2,2,170,154
D5GN6,-35.7,11.9,2015-01-13 06:00:00,NA,NA,NA,30.14,0,0,64.6,NA,45.0,NA,NA,8,7,22,9,8,7,NA,NA,NA,3,0,0,98,NA,1,2,140,118
D5GN6,-35.6,13.7,2015-01-13 12:00:00,NA,NA,NA,30.12,8,NA,66.7,NA,50.5,NA,NA,8,7,NA,NA,NA,4,NA,NA,NA,5,7,A,98,NA,2,1,160,123
D5GN6,-35.4,15.5,2015-01-13 18:00:00,NA,NA,NA,30.08,5,0,68.4,NA,48.9,NA,NA,NA,NA,NA,NA,NA,6,NA,NA,NA,3,NA,NA,97,NA,NA,NA,120,118

4.2 Human readable formats

The CSV format is an example of a human readable format.

A format for storing information is human readable if it is presented as a text format that a human can decipher by looking at it.

In general, human readable formats are preferable whenever possible. If the file format becomes outdated or the software that reads it is lost, the data in a human readable file can still be recovered by inspection.

There are many human readable variants on csv files, and functions in R to deal with them.

  • If the values are separated by a semicolon (;) instead of a comma (,), use read_csv2 instead of read_csv. This convention is more common in Europe, where the comma often serves as the decimal mark, than in the US.

  • If a different symbol is used to separate values, use read_delim with the delim parameter to read such data. For instance,

    read_delim(filename, delim = "|")

    will read a file where values are separated by the vertical bar |.

  • One way to separate values is with a tab. Tab separation was more common on early typewriters and computers, but still appears occasionally. To get a tab in R, you can use an escape character, namely, \t. Escape characters always start with a backslash followed by a single letter or symbol, and they will be discussed in more detail later in the text. So

    read_delim(filename, delim = "\t")

    will read a tab separated file.

  • Instead of using a separator, a file might use a fixed number of characters to represent each value. This is called a fixed width file, and it can be read with the command read_fwf.
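To see two of these variants in action, the following sketch writes and then reads back a small semicolon separated file and a small fixed width file. The file names sample_semi.csv and sample_fixed.txt are made up for this example.

```r
library(tidyverse)

# Semicolon separated data: read_csv2 also treats the comma as the
# decimal mark, matching the common European convention.
write_lines(c("Latitude;Air Temperature",
              "-35,8;70,5"),
            "sample_semi.csv")
read_csv2("sample_semi.csv")

# Fixed width data: Latitude occupies the first 6 characters of each
# line and Air Temperature the next 4; fwf_widths supplies the column
# widths and names.
write_lines(c("-35.8 70.5",
              "-35.9 66.2"),
            "sample_fixed.txt")
read_fwf("sample_fixed.txt",
         fwf_widths(c(6, 4), c("Latitude", "Air Temperature")))
```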

4.3 Fixing file read mistakes

The above formats all share something in common: the type of each variable is not explicitly given, and so the reader has to make a guess as to what type of variable is being used. Normally that is not a problem, especially when the file is very long. However, occasionally mistakes are made by the automatic reader.

Recall that when you first read in a file using read_csv (or any other read_ function) that part of the output is the specification used by the reader.

sample1 <- read_csv("sample_icoads.csv")
## Rows: 6 Columns: 4
## ── Column specification ───────────────────────────────────
## Delimiter: ","
## dbl  (3): Latitude, Longitude, Air Temperature
## dttm (1): Time of Observation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

If you need to get that specification back, just use the spec command on the variable.

spec(sample1)
## cols(
##   Latitude = col_double(),
##   Longitude = col_double(),
##   `Time of Observation` = col_datetime(format = ""),
##   `Air Temperature` = col_double()
## )

The specification for reading files can be set exactly as you wish using the col_types parameter. For instance, the following reads the file exactly as before.

sample1 <- read_csv(
  "sample_icoads.csv",
  col_types = cols(
    Latitude = col_double(),
    Longitude = col_double(),
    `Time of Observation` = col_datetime(format = ""),
    `Air Temperature` = col_double()
  )
)

Now suppose that instead the Air Temperature should be read in as a string of characters rather than as a number. Then just change its specified value in col_types.

sample3 <- read_csv(
  "sample_icoads.csv",
  col_types = cols(
    Latitude = col_double(),
    Longitude = col_double(),
    `Time of Observation` = col_datetime(format = ""),
    `Air Temperature` = col_character()
  )
)

Now a quick look at the data:

sample3
## # A tibble: 6 × 4
##   Latitude Longitude `Time of Observation` `Air Temperature`
##      <dbl>     <dbl> <dttm>                <chr>            
## 1    -35.8       2.9 2015-01-12 00:00:00   70.5             
## 2    -35.9       6.9 2015-01-12 12:00:00   66.2             
## 3    -35.9       8.6 2015-01-12 18:00:00   61.2             
## 4    -35.7      11.9 2015-01-13 06:00:00   64.6             
## 5    -35.6      13.7 2015-01-13 12:00:00   66.7             
## 6    -35.4      15.5 2015-01-13 18:00:00   68.4

reveals <chr> underneath the Air Temperature variable.

The only problem with this approach is that the specification can get long when there are many variables. For brevity, the types can be abbreviated with a single letter and the names omitted. For instance, d stands for double.

sample4 <- read_csv(
  "sample_icoads.csv",
  col_types = cols("d", "d", "T", "c")
)
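Even more compactly, the readr functions accept the single letter abbreviations pasted together into one string, one letter per column in order. The following reads the same sample_icoads.csv file created earlier.

```r
# "ddTc" means: double, double, datetime (T), character.
sample4 <- read_csv("sample_icoads.csv", col_types = "ddTc")
```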

The list of possible data types with their abbreviations is as follows.

  • col_logical() [l], containing only T, F, TRUE or FALSE.

  • col_integer() [i], integers.

  • col_double() [d], doubles.

  • col_character() [c], everything else.

  • col_factor(levels, ordered) [f], a fixed set of values.

  • col_date(format = "") [D]: with the locale’s date_format.

  • col_time(format = "") [t]: with the locale’s time_format.

  • col_datetime(format = "") [T]: ISO8601 date times

  • col_number() [n], numbers containing the grouping_mark

  • col_skip() [_, -], don’t import this column.

  • col_guess() [?], parse using the “best” type based on the input.

4.4 Commercial formats

There are a number of commercial software programs for doing data science, many of which have their own proprietary non-human readable storage methods.

4.4.1 Microsoft Excel

One of the most popular tools for data science is the spreadsheet program Excel. Part of the tidyverse is the readxl package, and this contains several functions for reading files in .xls or .xlsx format.

library(readxl)

The read_excel command cannot read directly from a URL like read_csv can. Suppose the goal is to download the file https://s3-us-west-1.amazonaws.com/markhuber-datascience-resources.org/Data_sets/sample_icoads.xlsx. Then one way to do so is to use an Internet browser to go to the URL and then use the browser to download the file. Then the file can be read into your R session.

However, if the plan is to keep track of exactly how the data was obtained, it is best to download it using code. This can be done with the download.file function. The default behavior of download.file is to download text files. To download binary files like Excel spreadsheets use the parameter mode = "wb". This forces the target file to be treated as a binary file.

The first argument to download.file is the URL of the file, the second is what the file should be called locally, and the third is the mode = "wb" mentioned earlier that indicates the download is a binary file.

Because the URL is very long here, the str_c function from the stringr package will be used to combine two strings before sending the result to download.file. The str_c function combines strings through concatenation, which is a fancy way of saying that the strings are put together with the end of the first string attached to the start of the second.

"https://s3-us-west-1.amazonaws.com/markhuber-datascience-resources.org" |>
  stringr::str_c("/Data_sets/sample_icoads.xlsx") |>
  download.file("sample_icoads.xlsx", mode = "wb")

Once the file is downloaded, it can be read into R with read_xlsx.

sample5 <- read_xlsx("sample_icoads.xlsx")

4.4.2 Stata

The file format for Stata is called DTA. The package haven in the tidyverse contains methods for reading these files.

library(haven)

The functions of this package are still growing in usefulness. As of 2021-06-04, the read_dta function can be used to read DTA files with logical, integer, numeric, character, and factor variables. Then write_dta can be used to export the files.
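As a sketch of how this works, here is a round trip through the DTA format using a small made-up tibble. Note that Stata variable names cannot contain spaces, which is why the columns here use underscores rather than the spaced names seen earlier.

```r
library(tidyverse)
library(haven)

# A small data frame with Stata-friendly column names.
dat <- tibble(latitude = c(-35.8, -35.9),
              air_temperature = c(70.5, 66.2))

# Write it out as a DTA file, then read it back in.
write_dta(dat, "sample_icoads.dta")
dat2 <- read_dta("sample_icoads.dta")
```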

4.4.3 SPSS

The haven package also contains a command for reading SPSS files, read_spss.
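The companion export function is write_sav. A minimal round trip through the SPSS .sav format, again using a small made-up tibble, might look like this:

```r
library(tidyverse)
library(haven)

dat <- tibble(latitude = c(-35.8, -35.9),
              air_temperature = c(70.5, 66.2))

# Write out an SPSS .sav file, then read it back in.
write_sav(dat, "sample_icoads.sav")
dat3 <- read_spss("sample_icoads.sav")
```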

4.5 R format

R also has its own format for reading and writing data. It is most definitely not human readable. The write_rds function creates such a file, after which read_rds reads it back in. The following will create an .rds file in your local file directory and then read it back in.

write_rds(sample4, "sample_icoads.rds")
sample4 <- NULL
sample4 <- read_rds("sample_icoads.rds")
sample4
## # A tibble: 6 × 4
##   Latitude Longitude `Time of Observation` `Air Temperature`
##      <dbl>     <dbl> <dttm>                <chr>            
## 1    -35.8       2.9 2015-01-12 00:00:00   70.5             
## 2    -35.9       6.9 2015-01-12 12:00:00   66.2             
## 3    -35.9       8.6 2015-01-12 18:00:00   61.2             
## 4    -35.7      11.9 2015-01-13 06:00:00   64.6             
## 5    -35.6      13.7 2015-01-13 12:00:00   66.7             
## 6    -35.4      15.5 2015-01-13 18:00:00   68.4

Questions

The newline character \n can be used to tell R that a new line is starting in a string. That allows us to directly create a comma separated file to play with. Consider the following code.

df <- 
  read_csv(
    "1, Mole, Moly\n
     2, Rat, Ratty\n
     3, Mr. Toad, Toady\n
     4, Mr. Badger, Badger"
  )
## Rows: 3 Columns: 3
## ── Column specification ───────────────────────────────────
## Delimiter: ","
## chr (2): Mole, Moly
## dbl (1): 1
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Modify this code by changing parameters to read_csv so that the first row of data does not get turned into header names, and remove the first column of data.

This can be done with

df <- 
  read_csv(
    "1, Mole, Moly\n
     2, Rat, Ratty\n
     3, Mr. Toad, Toady\n
     4, Mr. Badger, Badger",
    col_names = FALSE
  ) |>
  select(-1)
## Rows: 4 Columns: 3
## ── Column specification ───────────────────────────────────
## Delimiter: ","
## chr (2): X2, X3
## dbl (1): X1
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df
## # A tibble: 4 × 2
##   X2         X3    
##   <chr>      <chr> 
## 1 Mole       Moly  
## 2 Rat        Ratty 
## 3 Mr. Toad   Toady 
## 4 Mr. Badger Badger

What command would you give R to read a file where fields are separated with “|”?

Explain why each of the following strings does not give a valid inline CSV file.

read_csv("a,b\n1,2,3\n4,5,6")
read_csv("a,b,c\n1,2\n1,2,3,4")
read_csv("a,b\n\"1")
read_csv("a,b\n1,2\na,b")
read_csv("a;b\n1;3")

Consider the following command:

aes <- read_csv("datasets/annual-enterprise-survey-2019-financial-year-provisional-csv.csv")
## Rows: 32445 Columns: 10
## ── Column specification ───────────────────────────────────
## Delimiter: ","
## chr (9): Industry_aggregation_NZSIOC, Industry_code_NZSIOC, Industry_name_NZ...
## dbl (1): Year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

  1. What type of variable was Year parsed as?

  2. Modify the above command to read in the Year variable as an integer.

Consider the file datasets/example.xlsx on the course website.

  1. Give a command to read this into the tibble data using read_excel from the readxl library.

  2. How many data points are there in the resulting dataset?