“tidyr” is a package that is used to make data tidy. It is an evolution of “reshape2” and is designed specifically for data tidying (not general reshaping or aggregating) and works well with “dplyr” data pipelines. Important properties of tidy data are:
. Each column is a variable.
. Each row is an observation.
. Each cell is a value.
Arranging data in this way makes it easier to work with as it provides a consistent way of referring to variables (as column names) and observations (as row indices).
“tidyr” was a developed by Hadley Wickham.
install.packages("tidyr")
library(tidyr)
tidyr
provides four main functions. Let us elaborate on the functions with the help of the following example.
gather()
gather()
converts data from wide format to long format. It takes multiple columns and collapses into key-value pairs. key
is the name of new key column (made from names of data columns) value
is the name of new column.
stud_data<-read.csv("stud_data.csv",header=TRUE)
gather(stud_data)
key value
1 Student_ID 1
2 Student_ID 2
3 Student_ID 3
4 Student_ID 4
5 Student_ID 5
6 Names Rohan
7 Names John
8 Names Anisha
9 Names Agatha
10 Names Ashima
11 Maths 67
12 Maths 89
13 Maths 69
14 Maths 79
15 Maths 77
16 Economics 56
17 Economics 88
18 Economics <NA>
19 Economics 92
20 Economics 67
21 Statistics <NA>
22 Statistics 79
23 Statistics 88
24 Statistics 89
25 Statistics 89
gather()
converts data from wide to long format with a warning by treating all columns as key.
Note: In Wide Format data each column represents a different variable.In Long Format data, one column contains all the possible variables, another column contains their respective values.
longformat<-gather(stud_data,Subjects,Marks,Maths,Economics,Statistics)
longformat
Student_ID Names Subjects Marks
1 1 Rohan Maths 67
2 2 John Maths 89
3 3 Anisha Maths 69
4 4 Agatha Maths 79
5 5 Ashima Maths 77
6 1 Rohan Economics 56
7 2 John Economics 88
8 3 Anisha Economics NA
9 4 Agatha Economics 92
10 5 Ashima Economics 67
11 1 Rohan Statistics NA
12 2 John Statistics 79
13 3 Anisha Statistics 88
14 4 Agatha Statistics 89
15 5 Ashima Statistics 89
Here, we are adding new variables Subjects and Marks. In the syntax of gather()
, after the data is mentioned: 1st mentioned variable is key, 2nd mentioned variable is value and the subsequent variables are the ones’ to be gathered. Note that this data has 2 missing values To remove rows from output where the value column is NA, include na.rm=TRUE
spread()
spread()
converts data from long format to wide format. The spread()
function spreads a key-value pair across multiple columns. It’s a complement of gather()
.
spread(longformat,Subjects,Marks,fill=0)
Student_ID Names Economics Maths Statistics
1 1 Rohan 56 67 0
2 2 John 88 89 79
3 3 Anisha 0 69 88
4 4 Agatha 92 79 89
5 5 Ashima 67 77 89
fill=
is used to replace NA’s with the value provided to it. Note that there are two types of missingness in the input: explicit missing values (i.e. NA), and implicit missings, rows that simply aren’t present. Both types of missing value will be replaced by fill.
separate()
separate()
splits a single character column into multiple columns.
Let us create a data frame empdata with columns as empid, location and date and then split the column date into 3 columns.
empid<-c(101,102,103,104)
location<-c("Mumbai","Delhi","Delhi","Mumbai")
address<-c("4/Churchgate","12/Rohini","8/Pitampura", "21/Andheri")
date<-c("2016-10-09","2010-11-01","2009-09-23","1990-02-30")
empdata<-data.frame(empid,location,address,date)
empdata
empid location address date
1 101 Mumbai 4/Churchgate 2016-10-09
2 102 Delhi 12/Rohini 2010-11-01
3 103 Delhi 8/Pitampura 2009-09-23
4 104 Mumbai 21/Andheri 1990-02-30
sep_date<-separate(empdata,date,into=c("Year","Month","Date"))
sep_date
empid location address Year Month Date
1 101 Mumbai 4/Churchgate 2016 10 09
2 102 Delhi 12/Rohini 2010 11 01
3 103 Delhi 8/Pitampura 2009 09 23
4 104 Mumbai 21/Andheri 1990 02 30
class(sep_date$Year)
[1] "character"
By default, new columns created will be of the type of original column. Here, since date is of type character, columns Year, Month and Date will of the same type.
sep_date<-separate(empdata,date,into=c("Year","Month","Date"),convert=TRUE)
class(sep_date$Year)
[1] "integer"
convert=TRUE
will run type.convert with as.is=TRUE
on new columns.
This is useful if the component columns are integer, numeric or logical.
sep_address<-separate(empdata,address,into=c("sector","area"),sep="/", convert=TRUE)
sep_address
empid location sector area date
1 101 Mumbai 4 Churchgate 2016-10-09
2 102 Delhi 12 Rohini 2010-11-01
3 103 Delhi 8 Pitampura 2009-09-23
4 104 Mumbai 21 Andheri 1990-02-30
Here, we have split the column address into two new columns: sector and area with separator as ‘/’.
sep=
is used to specify separator between columns.
unite()
unite()
is a complement of separate()
. It unites multiple columns into single column.
Let us now unite the three date column into one.
unite_date<-unite(sep_date,date,c(Year,Month,Date),sep="/")
unite_date
empid location address date
1 101 Mumbai 4/Churchgate 2016/10/9
2 102 Delhi 12/Rohini 2010/11/1
3 103 Delhi 8/Pitampura 2009/9/23
4 104 Mumbai 21/Andheri 1990/2/30
unite()
takes the dataframe, name of the column to add, vector of columns to combine and a separator to use between the values as arguments