Data Management with tidyr

“tidyr” is a package that is used to make data tidy. It is an evolution of “reshape2” and is designed specifically for data tidying (not general reshaping or aggregating) and works well with “dplyr” data pipelines. Important properties of tidy data are:

. Each column is a variable.

. Each row is an observation.

. Each cell is a value.

Arranging data in this way makes it easier to work with as it provides a consistent way of referring to variables (as column names) and observations (as row indices).

“tidyr” was a developed by Hadley Wickham.

install.packages("tidyr")
library(tidyr)

tidyr provides four main functions. Let us elaborate on the functions with the help of the following example.

1. `gather()`

gather() converts data from wide format to long format. It takes multiple columns and collapses into key-value pairs. key is the name of new key column (made from names of data columns) value is the name of new column.

stud_data<-read.csv("stud_data.csv",header=TRUE)
gather(stud_data)

          key  value
1  Student_ID      1
2  Student_ID      2
3  Student_ID      3
4  Student_ID      4
5  Student_ID      5
6       Names  Rohan
7       Names   John
8       Names Anisha
9       Names Agatha
10      Names Ashima
11      Maths     67
12      Maths     89
13      Maths     69
14      Maths     79
15      Maths     77
16  Economics     56
17  Economics     88
18  Economics   <NA>
19  Economics     92
20  Economics     67
21 Statistics   <NA>
22 Statistics     79
23 Statistics     88
24 Statistics     89
25 Statistics     89

gather() converts data from wide to long format with a warning by treating all columns as key.

Note: In Wide Format data each column represents a different variable.In Long Format data, one column contains all the possible variables, another column contains their respective values.

longformat<-gather(stud_data,Subjects,Marks,Maths,Economics,Statistics)
longformat

   Student_ID  Names   Subjects Marks
1           1  Rohan      Maths    67
2           2   John      Maths    89
3           3 Anisha      Maths    69
4           4 Agatha      Maths    79
5           5 Ashima      Maths    77
6           1  Rohan  Economics    56
7           2   John  Economics    88
8           3 Anisha  Economics    NA
9           4 Agatha  Economics    92
10          5 Ashima  Economics    67
11          1  Rohan Statistics    NA
12          2   John Statistics    79
13          3 Anisha Statistics    88
14          4 Agatha Statistics    89
15          5 Ashima Statistics    89

Here, we are adding new variables Subjects and Marks. In the syntax of gather(), after the data is mentioned: 1st mentioned variable is key, 2nd mentioned variable is value and the subsequent variables are the ones’ to be gathered. Note that this data has 2 missing values To remove rows from output where the value column is NA, include na.rm=TRUE

2. `spread()`

spread() converts data from long format to wide format. The spread() function spreads a key-value pair across multiple columns. It’s a complement of gather().

spread(longformat,Subjects,Marks,fill=0)

  Student_ID  Names Economics Maths Statistics
1          1  Rohan        56    67          0
2          2   John        88    89         79
3          3 Anisha         0    69         88
4          4 Agatha        92    79         89
5          5 Ashima        67    77         89

fill= is used to replace NA’s with the value provided to it. Note that there are two types of missingness in the input: explicit missing values (i.e. NA), and implicit missings, rows that simply aren’t present. Both types of missing value will be replaced by fill.

3.`separate()`

separate() splits a single character column into multiple columns.

Let us create a data frame empdata with columns as empid, location and date and then split the column date into 3 columns.

empid<-c(101,102,103,104)
location<-c("Mumbai","Delhi","Delhi","Mumbai")
address<-c("4/Churchgate","12/Rohini","8/Pitampura", "21/Andheri")
date<-c("2016-10-09","2010-11-01","2009-09-23","1990-02-30")
empdata<-data.frame(empid,location,address,date)
empdata

  empid location      address       date
1   101   Mumbai 4/Churchgate 2016-10-09
2   102    Delhi    12/Rohini 2010-11-01
3   103    Delhi  8/Pitampura 2009-09-23
4   104   Mumbai   21/Andheri 1990-02-30

sep_date<-separate(empdata,date,into=c("Year","Month","Date"))
sep_date

  empid location      address Year Month Date
1   101   Mumbai 4/Churchgate 2016    10   09
2   102    Delhi    12/Rohini 2010    11   01
3   103    Delhi  8/Pitampura 2009    09   23
4   104   Mumbai   21/Andheri 1990    02   30

class(sep_date$Year)

[1] "character"

By default, new columns created will be of the type of original column. Here, since date is of type character, columns Year, Month and Date will of the same type.

sep_date<-separate(empdata,date,into=c("Year","Month","Date"),convert=TRUE)
class(sep_date$Year)

[1] "integer"

convert=TRUE will run type.convert with as.is=TRUE on new columns.

This is useful if the component columns are integer, numeric or logical.

sep_address<-separate(empdata,address,into=c("sector","area"),sep="/", convert=TRUE)
sep_address

  empid location sector       area       date
1   101   Mumbai      4 Churchgate 2016-10-09
2   102    Delhi     12     Rohini 2010-11-01
3   103    Delhi      8  Pitampura 2009-09-23
4   104   Mumbai     21    Andheri 1990-02-30

Here, we have split the column address into two new columns: sector and area with separator as ‘/’.

sep= is used to specify separator between columns.

4.`unite()`

unite() is a complement of separate(). It unites multiple columns into single column.

Let us now unite the three date column into one.

unite_date<-unite(sep_date,date,c(Year,Month,Date),sep="/")
unite_date

  empid location      address      date
1   101   Mumbai 4/Churchgate 2016/10/9
2   102    Delhi    12/Rohini 2010/11/1
3   103    Delhi  8/Pitampura 2009/9/23
4   104   Mumbai   21/Andheri 1990/2/30

unite() takes the dataframe, name of the column to add, vector of columns to combine and a separator to use between the values as arguments

Data Management with “tidyr”

1. gather()

2. spread()

3.separate()

4.unite()

1. `gather()`

2. `spread()`

3.`separate()`

4.`unite()`