- A stumbling block for many learning R is the error messages
- We often see them while we're loading data
- R is very strict about data formats
- We can load .xlsx, .xls, .csv, .txt, .gtf/.gff files + many more
- The structure of the spreadsheet is vital
27 November 2016
- In the data folder, open RealTimeData.xlsx in Excel (or LibreOffice)
- Which sheet do you think will be the most problematic to load?
R loves to see
What about all those missing values?
- R can happily deal with missing values: \(\implies\) they will load as NA
- R guesses the number of columns from the first row
- Always think in terms of columns
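A minimal sketch of how empty cells are handled, using base R's read.csv() on a small inline table instead of RealTimeData.xlsx. The column names and values are made up for illustration.

```r
# Hypothetical three-column table; two cells are left empty
csv_text <- "gene,control,treated
A,1.2,3.4
B,,2.8
C,0.9,"

# text = lets read.csv() parse a string instead of a file
small <- read.csv(text = csv_text)

print(small)                  # the empty cells are shown as NA
print(colSums(is.na(small)))  # number of missing values in each column
```

Because the header row has three fields, every later row is expected to have three as well.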
- In the data folder is the file toothData.csv
- Open it in the Script Window as a text file
- After importing it, you will see two lines of code in the Console \(\implies\) two things have just happened
toothData <- read.csv("data/toothData.csv")
View(toothData)
toothData <- read.csv("data/toothData.csv")
ALWAYS copy the first line into your script!
View(toothData)
The second line has opened a preview of our R object
- The R object will be named using the file name before the .csv
- Look for the toothData object in the Environment tab (click the arrow)
- toothData is known as a data.frame
- This is the R equivalent to a spreadsheet
View(toothData)
toothData
head(toothData)
What were the differences between each method?
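A rough sketch of the non-interactive ways to inspect a data.frame. It assumes the built-in ToothGrowth dataset is an acceptable stand-in for toothData.csv, which may not match the workshop file exactly.

```r
# ToothGrowth ships with R; used here only as a stand-in for toothData
toothData <- ToothGrowth

dim(toothData)      # number of rows and columns
head(toothData)     # first six rows only
head(toothData, 3)  # first three rows
str(toothData)      # column types at a glance

# View(toothData) opens the spreadsheet-style preview,
# so it needs an interactive session such as RStudio
```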
In a data.frame each column is a vector:
- R assumes that a column of text is a categorical variable (i.e. a factor); this was the read.csv() default before R 4.0.0
- This can be changed with the stringsAsFactors button during import
- Rows and columns are selected using [row, column]
toothData[1:2,] # Select the first 2 rows, all columns
toothData[1:5, 1] # The first 5 entries in the first column
Because each column is a vector, we can use an alternative method ($)
toothData$len # Print the entire column vector called 'len'
toothData$len[1:5] # Now just the first 5 entries
Calling by name:
toothData[1:5, "len"]
toothData$len[1:5]
One easy trap to fall into - forgetting the comma
toothData["len"]
toothData[["len"]]
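The difference between these forms can be sketched as follows, again assuming the built-in ToothGrowth dataset as a stand-in for toothData. The object names on the left are only for illustration.

```r
toothData <- ToothGrowth  # built-in stand-in for the workshop file

by_name        <- toothData[1:5, "len"]  # rows 1-5 of one column: a plain vector
by_dollar      <- toothData$len[1:5]     # the same values via $
no_comma       <- toothData["len"]       # no comma: a one-column data.frame, all rows
double_bracket <- toothData[["len"]]     # double brackets: the whole column as a vector

identical(by_name, by_dollar)
class(no_comma)
class(double_bracket)
```

Forgetting the comma does not raise an error, which is what makes it an easy trap: the result is simply a different kind of object.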
- We have just used the R function read.csv()
- It is in the utils package, which is one of the default packages
- median() & sd() are in the stats package
- max(), sum(), + are in the base package
- The package readr has a similar, but slightly superior version called read_csv()
- Packages are installed from CRAN
- Tools > Install Packages ...
Enter the package name readr then Install
It should auto-complete once you start typing
Now try the following packages: readxl, dplyr, ggplot2, reshape2, stringr
(This may take a few minutes - use the coloured post-it notes)
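For those who prefer the Console to the menu, here is a hedged sketch that checks which of these packages are already present. The install line is left commented out so nothing is downloaded by accident.

```r
pkgs <- c("readr", "readxl", "dplyr", "ggplot2", "reshape2", "stringr")

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
print(missing_pkgs)

# Uncomment to install whatever is missing from CRAN
# install.packages(missing_pkgs)
```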
library(dplyr)
library(readr)
You might see lots of friendly messages…
toothData <- read_csv("data/toothData.csv")
- The object is now a local data frame (i.e. a tibble)
- Printing to the Console is more convenient
toothData
?read_csv
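- This is the help page for read_csv()
- Functions take named arguments (e.g. file, col_names)
- Some arguments have default values (e.g. col_names = TRUE)
toothData <- read_csv("data/toothData.csv")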
Is equivalent to:
toothData <- read_csv(file = "data/toothData.csv")
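A small sketch of the same equivalence on a throwaway file, so it runs without the workshop data. The file contents are invented for illustration.

```r
library(readr)

# Write a tiny stand-in file so the sketch runs anywhere
path <- tempfile(fileext = ".csv")
writeLines(c("len,supp,dose", "4.2,VC,0.5", "11.5,VC,0.5"), path)

by_position  <- read_csv(path)                           # file matched by position
by_name      <- read_csv(file = path)                    # file matched by name
with_default <- read_csv(file = path, col_names = TRUE)  # default value spelled out

identical(names(by_position), names(with_default))
identical(by_position$len, by_name$len)
```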
If we had a file with 3 blank lines to start, is there an argument that can help?
read_csv("path/to/file.csv", skip = 3)
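As a sketch of skip in action, the file below starts with three lines that are not data. Non-blank preamble lines are used instead of blank ones only because they make the effect easier to see. All of the contents are hypothetical.

```r
library(readr)

path <- tempfile(fileext = ".csv")
writeLines(c("Exported by a hypothetical instrument",
             "Operator: unknown",
             "Run: example",
             "sample,len",
             "s1,4.2",
             "s2,11.5"), path)

# Skip the three preamble lines; line 4 then supplies the column names
skipped <- read_csv(path, skip = 3)
names(skipped)
nrow(skipped)
```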
- The general function is read_delim()
- read_csv() calls read_delim() using delim = ","
- read_csv2() calls read_delim() using delim = ";"
- read_tsv() calls read_delim() using delim = "\t"
What function would we call for space-delimited files?
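One possible answer to the question above, as a sketch on an invented file: call read_delim() directly with a single space as the delimiter.

```r
library(readr)

path <- tempfile(fileext = ".txt")
writeLines(c("sample len dose", "s1 4.2 0.5", "s2 11.5 1"), path)

# A single space separates the fields
spaced <- read_delim(path, delim = " ")
spaced
```

readr also provides read_table() for columns separated by varying amounts of whitespace, which may suit some files better.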
R also has a package for loading .xls and .xlsx files.
library(readxl)
The main function is read_excel()
?read_excel
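A short sketch of read_excel() using a workbook that is bundled with readxl, since RealTimeData.xlsx may not be available outside the workshop. Which sheets it contains is not assumed here, so the code lists them first.

```r
library(readxl)

# readxl_example() returns the path to a workbook shipped with the package
path <- readxl_example("datasets.xlsx")

sheets <- excel_sheets(path)  # names of every sheet in the workbook
sheets

first_sheet <- read_excel(path, sheet = sheets[1])  # choose a sheet by name
head(first_sheet)
```

Listing the sheets first is a handy habit for workbooks with several sheets, where only some may load cleanly.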
- read_csv() has an argument called col_names
- read.csv() refers to column names as a header
- By default, col_names = TRUE
- The column names also tell R how many columns you have
What happens if we get this wrong?
no_header <- read_csv("data/no_header.csv")
We can easily fix this
no_header <- read_csv("data/no_header.csv", col_names = FALSE)
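The effect can be sketched on a small invented file that has no header row. The values are hypothetical and are not those in no_header.csv.

```r
library(readr)

path <- tempfile(fileext = ".csv")
writeLines(c("1,Alice,60", "2,Bob,80", "3,Carol,70"), path)

wrong <- read_csv(path)                     # first record is used up as column names
right <- read_csv(path, col_names = FALSE)  # columns are named X1, X2, X3 instead

nrow(wrong)
nrow(right)
names(right)
```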
What about that first column?
We can specify what is loaded or skipped using col_types
?read_csv
- Columns are skipped using the - symbol
- character and numbers use c and n respectively
no_header <- read_csv("data/no_header.csv", col_names = FALSE,
                      col_types = "-ccnnc")
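A minimal sketch of the compact col_types string on a three-column invented file: one letter or symbol per column, read left to right.

```r
library(readr)

path <- tempfile(fileext = ".csv")
writeLines(c("1,Alice,60.5", "2,Bob,80"), path)

# "-" skips column 1, "c" reads column 2 as character, "n" reads column 3 as a number
typed <- read_csv(path, col_names = FALSE, col_types = "-cn")

ncol(typed)
sapply(typed, class)
```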
What if we get that wrong?
Let's mis-specify the third column as a number
no_header <- read_csv("data/no_header.csv", col_names = FALSE,
                      col_types = "-cnnnc")
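What a mis-specified column looks like can be sketched on the same kind of invented file. Here the names column is deliberately declared as a number.

```r
library(readr)

path <- tempfile(fileext = ".csv")
writeLines(c("1,Alice,60.5", "2,Bob,80"), path)

# Column 2 holds names, but "n" asks for numbers;
# suppressWarnings() only quiets the parsing warning
bad <- suppressWarnings(read_csv(path, col_names = FALSE, col_types = "-nn"))

bad[[1]]       # the names could not be parsed, so they become NA
problems(bad)  # lists the values that failed to parse
```

The file still loads, so checking problems() is the way to find out that something went wrong.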
Let's get it wrong first
comments <- read_csv("data/comments.csv")
Now we can get it right
comments <- read_csv("data/comments.csv", comment = "#")
This will work if there are comments in any rows
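A sketch with comment lines both above the header and between records. The file contents are made up and are not those of comments.csv.

```r
library(readr)

path <- tempfile(fileext = ".csv")
writeLines(c("# hypothetical note at the top of the file",
             "name,weight",
             "Alice,60",
             "# a note between records",
             "Bob,80"), path)

# Everything from "#" to the end of a line is ignored
commented <- read_csv(path, comment = "#")
commented
```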
What happens when you try to load the file bad_colnames.csv?
bad_colnames <- read_csv("data/bad_colnames.csv")
How could we fix this?
Here's my fix
bad_colnames <- read_csv("data/bad_colnames.csv",
                         skip = 1, col_names = FALSE)
colnames(bad_colnames) <- c("rowname", "gender", "name",
                            "weight", "height", "transport")
The most common function in R is c()
- c() stands for combine
- It combines values into a single R object, or vector
- Called with no values, it returns NULL
c()
## NULL
colnames(bad_colnames) <- c()
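A few base R lines sketching what c() does. The values and object names are arbitrary.

```r
heights    <- c(1.62, 1.75, 1.80)     # combine numbers into one numeric vector
col_labels <- c("rowname", "gender")  # combine strings into a character vector
longer     <- c(col_labels, "name")   # an existing vector can be extended

length(longer)
is.null(c())  # c() with no arguments returns NULL
```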
What if missing values have been set to "-"?
Let's get it wrong first
missing_data <- read_csv("data/missing_data.csv")
Where have the errors appeared?
Now we can get it right
missing_data <- read_csv("data/missing_data.csv", na = "-")
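The same idea on a tiny invented file, as a sketch: a single "-" in a numeric column is enough to change how the whole column is read.

```r
library(readr)

path <- tempfile(fileext = ".csv")
writeLines(c("name,weight", "Alice,60", "Bob,-", "Carol,70"), path)

wrong <- read_csv(path)            # "-" forces weight to be read as text
right <- read_csv(path, na = "-")  # "-" becomes NA, so weight is numeric

class(wrong$weight)
class(right$weight)
right$weight
```

Setting na replaces the default missing-value strings, so na = c("", "NA", "-") may be the safer choice when empty cells should also count as missing.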
After we've edited a file, we might wish to export it
?write_csv
- The general function is write_delim()
- We can write .csv, .txt, .tsv etc
- R objects can be exported using write_rds()
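A hedged sketch of both export routes, written to temporary files and using the built-in ToothGrowth dataset as a stand-in for an edited object.

```r
library(readr)

toothData <- ToothGrowth  # built-in stand-in for an edited data.frame

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(toothData, csv_path)  # plain text: factor levels are written as strings
write_rds(toothData, rds_path)  # R's own format: column types are preserved

from_csv <- read_csv(csv_path)
from_rds <- read_rds(rds_path)

class(from_csv$supp)  # character after the round trip through text
class(from_rds$supp)  # still a factor
```

Text formats are the better choice for sharing with other software, while write_rds() suits objects that will only be read back into R.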