1.2 Loading Difficult Data

15 April 2019

Setup

If you've started a new session since last time:

library(tidyverse)

What if the data we have isn't nice?

Missing values might be given a value (e.g. 9999, "NA")
Column names might be missing
File may have comments
Irrelevant columns
May be structural issues in the file (Different column numbers)
White-space in cells

What if the data we have isn't nice?

Make sure the file transport.csv is in your data folder
Navigate to the file in your Files pane
Click on the file and choose View File

What problems do we face here?

What if the data we have isn't nice?

What problems do we face here?

A comment in the first line
No column names
Missing data encoded as "-" in the 4th line
A redundant column

What if the data we have isn't nice?

Let's try writing code for this instead of using the GUI

This will fail

transport <- read_csv("data/transport.csv")

R uses the first row to guess how many columns there are
- The comment line is read as a single column
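To see why the first line matters, we can count the fields on each line. This is a minimal sketch using a small invented stand-in file (the real transport.csv may differ); count_fields() and tokenizer_csv() come from readr, which is loaded with the tidyverse.

```r
library(readr)

# Hypothetical stand-in for transport.csv: a comment line, no header row,
# and six comma-separated fields per record (all values are invented)
example_lines <- c(
  "# A comment describing the file",
  "1,Male,Alex,80,180,car",
  "2,Female,Sam,65,170,bike",
  "3,Male,Jo,75,-,bus"
)
example_file <- tempfile(fileext = ".csv")
writeLines(example_lines, example_file)

# Count the comma-separated fields found on each line of the file
n_fields <- count_fields(example_file, tokenizer_csv())
n_fields
```

The comment line has a single field while the data lines have six, so a guess based on the first line alone will be wrong.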

Removing Comments

We can tell R to ignore any lines beginning with #
Set the comment argument using comment = "#"

transport <- read_csv("data/transport.csv", comment = "#")
transport

Now R is guessing the correct number of columns, so the file will load

What does all that red (or blue) stuff mean?
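The coloured text is most likely the column specification message, which reports the type R guessed for each column. As a sketch on an invented stand-in file (not the real transport.csv), the same information can be retrieved afterwards with spec():

```r
library(readr)

# Hypothetical stand-in for transport.csv (all values are invented)
example_file <- tempfile(fileext = ".csv")
writeLines(c(
  "# A comment describing the file",
  "1,Male,Alex,80,180,car",
  "2,Female,Sam,65,170,bike",
  "3,Male,Jo,75,-,bus"
), example_file)

# Loading prints a message describing the guessed column types
example_data <- read_csv(example_file, comment = "#")

# The same column specification can be inspected at any time
example_spec <- spec(example_data)
example_spec
```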

Data without column names

R has assumed the first row contains column names
We can tell R that the file has no header row using: col_names = FALSE

transport <- read_csv("data/transport.csv", 
                      comment = "#", 
                      col_names = FALSE)
transport

What has R used for column names?

Missing Data

What impact has the missing data in X5 had?

We can tell R to read "-" as missing data (NA) by setting na = "-"

transport <- read_csv("data/transport.csv", 
                      comment = "#", 
                      col_names = FALSE, 
                      na = "-")
transport
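The effect of the na argument can be checked by comparing column types before and after. This is a sketch on an invented stand-in file, assuming (as in the slides) that the "-" sits in the fifth column:

```r
library(readr)

# Hypothetical stand-in for transport.csv (all values are invented)
example_file <- tempfile(fileext = ".csv")
writeLines(c(
  "# A comment describing the file",
  "1,Male,Alex,80,180,car",
  "2,Female,Sam,65,170,bike",
  "3,Male,Jo,75,-,bus"
), example_file)

# Without na = "-", the "-" is ordinary text, so X5 is read as character
without_na <- read_csv(example_file, comment = "#", col_names = FALSE)
class(without_na$X5)

# With na = "-", the "-" becomes NA and X5 can be read as numeric
with_na <- read_csv(example_file, comment = "#", col_names = FALSE, na = "-")
class(with_na$X5)
sum(is.na(with_na$X5))
```

Note that na = "-" replaces the default of c("", "NA"). To keep those as well, use na = c("", "NA", "-").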

Ignoring Columns

We can also use the code - to skip a column
We can leave R to guess any remaining columns using ?
Use the col_types argument, with one character per column in order: col_types = "-?????"

transport <- read_csv("data/transport.csv",
                      comment = "#", 
                      col_names = FALSE, 
                      na = "-", 
                      col_types = "-?????")
transport

Specifying Columns

Or we can specify the exact type of data in each column
Numeric columns can be specified as n
Text or character columns can be specified as c

transport <- read_csv("data/transport.csv", 
                      comment = "#", 
                      col_names = FALSE, 
                      na = "-", 
                      col_types = "-ccnnc")
transport
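The compact string is shorthand for a longer column specification. As a sketch on an invented stand-in file, the same types can be written out with cols(), using the default names X1 to X6 that R assigns when col_names = FALSE:

```r
library(readr)

# Hypothetical stand-in for transport.csv (all values are invented)
example_file <- tempfile(fileext = ".csv")
writeLines(c(
  "# A comment describing the file",
  "1,Male,Alex,80,180,car",
  "2,Female,Sam,65,170,bike",
  "3,Male,Jo,75,-,bus"
), example_file)

# Compact form: one character per column
compact <- read_csv(example_file, comment = "#", col_names = FALSE,
                    na = "-", col_types = "-ccnnc")

# Long form: the same specification, one col_*() function per column
verbose <- read_csv(example_file, comment = "#", col_names = FALSE,
                    na = "-",
                    col_types = cols(
                      X1 = col_skip(),
                      X2 = col_character(),
                      X3 = col_character(),
                      X4 = col_number(),
                      X5 = col_number(),
                      X6 = col_character()
                    ))

# Both approaches should give the same column types
sapply(compact, class)
sapply(verbose, class)
```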

Specifying Columns

Let's accidentally get the final column wrong (use n instead of c)

transport <- read_csv("data/transport.csv",
                      comment = "#", 
                      col_names = FALSE, 
                      na = "-",
                      col_types = "-ccnnn")
transport

NB: n parses each cell as a number and silently drops any surrounding non-numeric characters, so no warning is given for a value like "80kg"
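The n code corresponds to readr's number parser, which can be tried directly with parse_number(). The values below are invented for illustration:

```r
library(readr)

# Invented values: units and currency symbols around a number are dropped
messy_numbers <- c("80", "80kg", "$1,200")
parsed <- parse_number(messy_numbers)
parsed
```

The extra characters are removed without any message, which is convenient but can hide data-entry problems. A value containing no number at all cannot be parsed this way and is expected to come back as NA.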

Specifying Columns

Let's change that back to the correct code:

transport <- read_csv("data/transport.csv", 
                      comment = "#", 
                      col_names = FALSE, 
                      na = "-", 
                      col_types = "-ccnnc")
transport

Setting Column Names

We can fix the column names by supplying a vector of names to col_names

myNames <- c("gender", "name", "weight", "height", "method")
transport <- read_csv("data/transport.csv", 
                      comment = "#", 
                      col_names = myNames, 
                      na = "-", 
                      col_types = "-ccnnc")
transport

The `c()` function

One of the most commonly used functions in R is c()

This stands for combine
Combines all values into a single R object, or vector
If left empty, it is equivalent to NULL

c()
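A few short examples of c() in base R, using invented values:

```r
# An empty call returns NULL
empty <- c()
is.null(empty)

# Combine individual values into a single vector
heights <- c(180, 170, 175)
length(heights)

# Existing vectors can be combined with further values
more_heights <- c(heights, 165, 190)
more_heights
```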

The `c()` function

I have used this to create a vector of column names
I then used the assignment operator <- to store this vector as myNames

What would happen if I gave too many or too few names?

We need to be careful here…
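One way to be careful is to check that the number of names matches the number of columns we are keeping before loading the file. This is a minimal sketch in base R; colCodes and nKept are hypothetical helper names, not part of the original slides.

```r
myNames <- c("gender", "name", "weight", "height", "method")
colCodes <- "-ccnnc"  # the same string we pass to col_types

# Split the string into single characters, one per column in the file
codes <- strsplit(colCodes, "")[[1]]

# Columns marked "-" (or "_") are skipped, so they need no name
nKept <- sum(!codes %in% c("-", "_"))

# Stop with an error if the names and kept columns do not line up
stopifnot(length(myNames) == nKept)
nKept
```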

Exporting Data

After we've edited a file, we might also wish to export it

?write_csv

This is a wrapper for write_delim()
Can export .csv, .txt, .tsv etc.
Individual R objects can be exported using write_rds()
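As a sketch of these options, the same small invented data frame can be written as .csv, .tsv and .rds, and the .rds file read back in with read_rds(). Temporary files are used here so nothing in the data folder is touched.

```r
library(readr)

# Invented example data
example_data <- data.frame(name = c("Alex", "Sam"), height = c(180, 170))

csv_file <- tempfile(fileext = ".csv")
tsv_file <- tempfile(fileext = ".tsv")
rds_file <- tempfile(fileext = ".rds")

# Plain-text exports: comma-separated and tab-separated
write_csv(example_data, csv_file)
write_tsv(example_data, tsv_file)

# Export a single R object, keeping its column types intact
write_rds(example_data, rds_file)

# Read the R object back in and compare with the original
restored <- read_rds(rds_file)
identical(restored, example_data)
```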

Exporting Data

A straightforward way to export this is:

write_csv(transport, "data/transport_clean.csv")

Bonus Challenge

Download the file geneCounts.out (output from featureCounts)

Try to import this using the GUI
Try to import by using read_delim()
- Set this to skip the Chr, Start, End and Strand columns
- Try to modify the column names using the function basename()
- Remove _Fem_hisat2_sorted.bam from the end of the column names
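As a hint for the last two steps, here is a sketch in base R on invented column names. The directory path and sample name are hypothetical, and the real names in geneCounts.out may look different.

```r
# Invented column names of the kind featureCounts can produce
example_names <- c("Geneid", "Length",
                   "/hypothetical/path/SampleA_Fem_hisat2_sorted.bam")

# basename() removes the directory part of each path
short_names <- basename(example_names)

# sub() removes the shared suffix; "\\." matches a literal dot
# and "$" anchors the pattern to the end of the name
short_names <- sub("_Fem_hisat2_sorted\\.bam$", "", short_names)
short_names
```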
