20 July 2016

Data Cleaning

Setup

If you've started a new session since last time:

library(dplyr)
library(readr)

Data Cleaning

What if the data we have isn't nice?

  • Missing values might be encoded as a value (e.g. 9999, "NA")
  • Column names might be missing
  • The file may contain comments
  • There may be structural errors in the file
  • Cells may contain stray white-space

Dealing With Column Names

  • Base R's read.csv() refers to column names as a header; read_csv() uses the col_names argument instead
  • By default, the first row is assumed to contain the names of the variables/columns
  • This tells R how many columns you have

What happens if we get this wrong?

no_header <- read_csv("data/no_header.csv")

Dealing With Column Names

We can easily fix this

no_header <- read_csv("data/no_header.csv", col_names = FALSE)
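To see the difference without the workshop file, here is a minimal sketch that writes a small header-less file to a temporary location and loads it both ways. The file contents and the names path, wrong and right are made up for illustration; they are not the contents of data/no_header.csv.

```r
library(readr)

# Hypothetical stand-in for a header-less file: three data rows, no names
path <- tempfile(fileext = ".csv")
writeLines(c("1,Ann,60", "2,Bob,72", "3,Cat,55"), path)

# Default (col_names = TRUE): the first data row is used up as column names
wrong <- read_csv(path)
nrow(wrong)

# col_names = FALSE: every row is kept and columns are named X1, X2, X3
right <- read_csv(path, col_names = FALSE)
nrow(right)
names(right)
```

With the default, one observation silently disappears into the column names, so it is worth checking nrow() after loading.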

What about that first column?

Dealing With Column Names

We can specify what is loaded or skipped using col_types

?read_csv
no_header <- read_csv("data/no_header.csv", col_names = FALSE,
                      col_types = "-ccnnc")
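Each character in the col_types string describes one column, in order: "-" skips the column, "c" reads it as character and "n" reads it as a number. The sketch below applies a shorter string to a made-up three-column file; the data and the names path and kept are assumptions for illustration only.

```r
library(readr)

# Hypothetical three-column file: row number, name, weight
path <- tempfile(fileext = ".csv")
writeLines(c("1,Ann,60.5", "2,Bob,72"), path)

# "-cn": skip column 1, read column 2 as character, column 3 as number
kept <- read_csv(path, col_names = FALSE, col_types = "-cn")

ncol(kept)           # the skipped column is not loaded at all
sapply(kept, class)  # one class per remaining column
```

The string needs one character for every column in the file, including the ones being skipped.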

What if we get that wrong?

Dealing With Column Names

Getting it wrong

Let's mis-specify the third column as a number

no_header <- read_csv("data/no_header.csv", col_names = FALSE,
                      col_types = "-cnnnc")

  • Did the error message make any sense?
  • Did the file load?
  • What happened to the third column?
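One way to explore these questions on a small scale is sketched below, using invented data rather than the workshop file. The call is wrapped in suppressWarnings() only so the script runs quietly; drop the wrapper to see the warning itself.

```r
library(readr)

# Hypothetical file: name, transport, weight
path <- tempfile(fileext = ".csv")
writeLines(c("Ann,car,60", "Bob,train,72"), path)

# Column 2 holds words but is (wrongly) declared as a number with "n"
bad <- suppressWarnings(
  read_csv(path, col_names = FALSE, col_types = "cnn")
)

bad[[2]]       # values that could not be parsed as numbers become NA
problems(bad)  # lists the values that failed to parse
```

The file still loads, but the information in the mis-specified column is lost, so the column type needs correcting before the data is usable.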

Dealing With Comments

Let's get it wrong first

comments <- read_csv("data/comments.csv")

Now we can get it right

comments <- read_csv("data/comments.csv", comment = "#")

This will work if there are comments in any rows
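As a small self-contained sketch, the file below starts with two comment lines before the header row. The contents are invented for illustration and are not those of data/comments.csv.

```r
library(readr)

# Hypothetical file with two leading comment lines
path <- tempfile(fileext = ".csv")
writeLines(c("# exported from a hypothetical survey tool",
             "# weights are in kg",
             "name,weight",
             "Ann,60",
             "Bob,72"), path)

# Lines beginning with "#" are ignored, so the header is found correctly
clean <- read_csv(path, comment = "#")
names(clean)
nrow(clean)
```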

Structural Problems

What happens when you try to load the file bad_colnames.csv?

bad_colnames <- read_csv("data/bad_colnames.csv")

How could we fix this?

  1. By editing the file, and
  2. Without editing the file

Structural Problems

Here's my fix

bad_colnames <- read_csv("data/bad_colnames.csv",
                         skip = 1, col_names = FALSE)
colnames(bad_colnames) <- c("rowname", "gender", "name",
                            "weight", "height", "transport")

We can set column names manually…
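An alternative sketch, not necessarily the approach intended here, is to pass the names straight to col_names while skipping the faulty header row. The example assumes a made-up file whose header has fewer names than there are columns; the real problem in bad_colnames.csv may differ.

```r
library(readr)

# Hypothetical file: the header row has three names for four columns
path <- tempfile(fileext = ".csv")
writeLines(c("gender,name,weight",
             "1,F,Ann,60",
             "2,M,Bob,72"), path)

# Skip the faulty header and supply a complete set of names in one step
fixed <- read_csv(path, skip = 1,
                  col_names = c("rowname", "gender", "name", "weight"))
names(fixed)
```

This gives the same result as loading with col_names = FALSE and assigning colnames() afterwards, but keeps the fix inside a single call.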

The c() function

One of the most commonly used functions in R is c()

  • This stands for combine
  • Combines all values into a single R object, or vector
  • If left empty, it is equivalent to NULL

c()
## NULL
colnames(bad_colnames) <- c()
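A few short calls illustrate how c() behaves. The object names are arbitrary and used only for this sketch.

```r
# Combine individual values into one vector
weights <- c(60, 72, 55)
length(weights)

# Vectors passed to c() are flattened, not nested
more_weights <- c(weights, c(80, 91))
length(more_weights)

# Mixing types coerces everything to the most general type (here character)
mixed <- c(1, "a")
class(mixed)

# With no arguments, c() returns NULL
is.null(c())
```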

Encoded Missing Values

What if missing values have been set to "-"?

Let's get it wrong first

missing_data <- read_csv("data/missing_data.csv")

Where have the errors appeared?

Now we can get it right

missing_data <- read_csv("data/missing_data.csv", na = "-")
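The effect of the na argument can be checked on a small invented file, sketched below; the data are not those of data/missing_data.csv.

```r
library(readr)

# Hypothetical file where a missing weight is written as "-"
path <- tempfile(fileext = ".csv")
writeLines(c("name,weight",
             "Ann,60",
             "Bob,-",
             "Cat,55"), path)

# Without na, the "-" forces the whole column to be read as character
wrong <- read_csv(path)
class(wrong$weight)

# With na = "-", the column is numeric and the gap becomes NA
right <- read_csv(path, na = "-")
class(right$weight)
right$weight
```

Setting na replaces the default missing-value strings rather than adding to them, so na = c("", "NA", "-") may be the safer choice when a file mixes encodings.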

Exporting Data

After we've edited a file, we might wish to export it

?write_csv

  • write_csv() is a wrapper for write_delim()
  • write_delim() can export other delimited text formats (.txt, .tsv etc) by changing the delimiter
  • Individual R objects can be exported using write_rds()
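A minimal round trip is sketched below. It writes to temporary files, and the data frame people is invented for the example.

```r
library(readr)

# Hypothetical data to export
people <- data.frame(name = c("Ann", "Bob"),
                     weight = c(60, 72),
                     stringsAsFactors = FALSE)

# Export as comma-separated text, then read it back in
csv_path <- tempfile(fileext = ".csv")
write_csv(people, csv_path)
from_csv <- read_csv(csv_path)

# Export the R object itself, keeping column types exactly as they were
rds_path <- tempfile(fileext = ".rds")
write_rds(people, rds_path)
from_rds <- read_rds(rds_path)
```

A .csv file can be opened by almost any other program, while an .rds file is only readable from R but restores the object unchanged.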