2 - Loading Data Into R

27 November 2016

Importing Data

A stumbling block for many learning R is the Error Messages
We often see them while we're loading data
R is very strict about data formats
We can load .xlsx, xls, csv, txt, gtf/gff files + many more
The structure of the spreadsheet is vital

Importing Data

Things we do to make it "look nice" in Excel can create problems
In data, open RealTimeData.xlsx in Excel (or Libre Office)

Which sheet do you think will be the most problematic to load?

Importing Data

Sheet 1

This is actually the type of format R loves to see
- Simple column structure, with column and row names
- No blank rows at the top or separating sub-tables
- No blank columns
- No merged cells

Importing Data

Sheet 2

What about all those missing values?

Importing Data

Sheet 2

What about all those missing values?

R can happily deal with missing values: $\implies$ will load as NA
The missing column names may give strange behaviours
Otherwise no problems

Importing Data

Sheet 3

Here we effectively have 2 tables on the same sheet
- Will this cause error messages?
- R guesses the number of columns from the first row
- The plot will simply be ignored

Always think in terms of columns

Using the GUI To Load Data

In today's data folder is the file toothData.csv
Clicking on it will open it in the Script Window as a text file
We can use the button as shown

Using the GUI To Load Data

The Preview Window

Using the GUI To Load Data

Try changing a few settings & note changes in the Preview
Once you're happy, click "Import"

You will see two lines of code in the Console $\implies$ two things have just happened

toothData <- read.csv("data/toothData.csv")
View(toothData)

Using the GUI To Load Data

toothData <- read.csv("data/toothData.csv")

ALWAYS copy the first line into your script!

This is the exact command we've used to load the file
Anything we've set by clicking will also be in this line

Using the GUI To Load Data

View(toothData)

The second line has opened a preview of our R object

By default, the R object will be named using the file-name before the .csv
Also look at the toothData object in the Environment tab (click the arrow)

Data Frame Objects

The object toothData is known as a data.frame
R equivalent to a spreadsheet
Three possible ways to inspect this are:

View(toothData)
toothData
head(toothData)

What were the differences between each method?

Data Frame Objects

In a data.frame each column is a vector:
- Each column has a single type of data
- Different columns can be different types of data

Data Frame Objects

Factors

By default R assumes that a column of text is a categorical variable (i.e. a factor)
We can change this by un-checking the stringsAsFactors button during import

Data Frame Objects

Subsetting

We can use the square braces to select rows and columns [row, column]

toothData[1:2,] # Select the first 2 rows, all columns
toothData[1:5, 1] # The first 5 entries in the first column

Data Frame Objects

Subsetting

Because each column is a vector, we can use an alternative method ($)

toothData$len #Print the entire column vector called 'len'
toothData$len[1:5] # Now just the first 5 entries

Data Frame Objects

Subsetting

Calling by name:

toothData[1:5, "len"]
toothData$len[1:5]

When would you use either method?

Data Frame Objects

Subsetting

One easy trap to fall into - forgetting the comma

toothData["len"]
toothData[["len"]]

R Functions and Packages

What have we really done?

In the above section we called the R function read.csv()
This is in the utils package which is one of the default packages
A package is just a collection of related functions, e.g.
- median() & sd() are in the stats package
- max(), sum(), + are in the base package

A Better Alternative

The package readr has a similar, but slightly superior version called read_csv()
No GUI for this method

How do we install packages?

There is a central repository known as CRAN

Installing Packages

Tools > Install Packages ...

Installing Packages

Enter the package name readr then Install
It should auto-complete once you start typing
Now try the following packages: readxl, dplyr, ggplot2, reshape2, stringr

(This may take a few minutes - use the coloured post-it notes)

Loading Packages

library(dplyr)
library(readr)

You might see lots of friendly messages…

toothData <- read_csv("data/toothData.csv")

A Better Alternative

Why is this better?

Much faster
Character columns are left as plain text
It also runs some checks on your data $\implies$ can give more error messages
Output is a local data frame (i.e. a tibble)
$\implies$ display in the Console is more convenient

toothData

Reading Help Pages

?read_csv

Once again we have four functions but stick to read_csv()
This function has numerous arguments given as names
(e.g. file, col_names)
- If just the name is given, we need to specify something
- Otherwise the argument is given a default value
  (e.g. col_names = TRUE)
- If specifying in order, we don't need to name arguments

Reading Help Pages

toothData <- read_csv("data/toothData.csv")

Is equivalent to:

toothData <- read_csv(file = "data/toothData.csv")

Reading Help Pages

If we had a file with 3 blank lines to start, is there an argument that can help?

Reading Help Pages

If we had a file with 3 blank lines to start, is there an argument that can help?

read_csv("path/to/file.csv", skip = 3)

Reading Help Pages

Bonus Slide

The bottom three functions are simplified wrappers to read_delim()
read_csv() calls read_delim() using delim = ","
read_csv2() calls read_delim() using delim = ";"
read_tsv() calls read_delim() using delim = "\t"

What function would we call for space-delimited files?

Loading Excel Files

R also has a package for loading .xls and xlsx files.

library(readxl)

The main function is read_excel()

?read_excel

Dealing With Difficult Data

Data Cleaning

What if the data we have isn't nice?

Missing values might be given a value (e.g. 9999, "NA")
Column names might be missing
File may have comments
May be structural errors in the file
White-space in cells
Merged Cells (these are difficult…)

Dealing With Column Names

read_csv() has an argument called col_names
NB:read.csv() refers to column names as a header
By default, the first row is assumed to contain the names of the variables/columns, i.e. col_names = TRUE
This tells R how many columns you have

What happens if we get this wrong?

no_header <- read_csv("data/no_header.csv")

Dealing With Column Names

We can easily fix this

no_header <- read_csv("data/no_header.csv", col_names = FALSE)

What about that first column?

Dealing With Column Names

We can specify what is loaded or skipped using col_types

?read_csv

We can skip columns using the - symbol
character and numbers use c and n respectively

no_header <- read_csv("data/no_header.csv", col_names = FALSE,
                      col_types = "-ccnnc")

What if we get that wrong?

Dealing With Column Names

Getting it wrong

Let's mis-specify the third column as a number

no_header <- read_csv("data/no_header.csv", col_names = FALSE,
                      col_types = "-cnnnc")

Did the error message make any sense?
Did the file load?
What happened to the third column?

Dealing With Comments

Let's get it wrong first

comments <- read_csv("data/comments.csv")

Now we can get it right

comments <- read_csv("data/comments.csv", comment = "#")

This will work if there are comments in any rows

Structural Problems

What happens when you try to load the file bad_colnames_.csv

bad_colnames <- read_csv("data/bad_colnames.csv")

How could we fix this?

By editing the file, and
Without editing the file

Structural Problems

Here's my fix

bad_colnames <- read_csv("data/bad_colnames.csv", 
                             skip =  1, col_names = FALSE)
colnames(bad_colnames) <- c("rowname", "gender", "name",
                                "weight", "height", "transport")

Here we have set the column names manually using a character vector
Note that all words were quoted.

Removing Column Names

The most common function in R is c()

This stands for combine
Combines all values into a single R object, or vector
If left empty, it is equivalent to NULL

c()

## NULL

colnames(bad_colnames) <- c()

Encoded Missing Values

What if missing values have been set to "-"?

Let's get it wrong first

missing_data <- read_csv("data/missing_data.csv")

Where have the errors appeared?

Now we can get it right

missing_data <- read_csv("data/missing_data.csv", na = "-")

Exporting Data

After we've edited a file, we might wish to export it

?write_csv

This is a wrapper for write_delim()
Can export .csv, .txt, .tsv etc
Individual R objects can be exported using write_rds()