Reshape and Tidy Data: reshape2 vs tidyr

There are two main formats of data; in Wide Format data each column represents a different variable whereas in Long Format data, one column contains all the possible variables, another column contains their respective values. In practical scenario, while wide format is more readable, long format is easier to analyse. Therefore, it is useful to know how to convert between the two.

“reshape2” and “tidyr” are two packages that are used to restructure data. In our previous analytics lab we have elaborated on functions of the package “tidyr” . In this article we will focus on the functions of package “reshape2” and will see the similarities and differences between the two packages while performing the following tasks:

From wide format to long format : `melt()` Vs `gather()`

Refer to the example in the previous section Now let us convert the data in wide format to long format using the function melt() in reshape2

install.packages("reshape2")
library(reshape2)

stud_data<-read.csv("stud_data.csv",header=TRUE)
melt(stud_data)

    Names   variable value
1   Rohan Student_ID     1
2    John Student_ID     2
3  Anisha Student_ID     3
4  Agatha Student_ID     4
5  Ashima Student_ID     5
6   Rohan      Maths    67
7    John      Maths    89
8  Anisha      Maths    69
9  Agatha      Maths    79
10 Ashima      Maths    77
11  Rohan  Economics    56
12   John  Economics    88
13 Anisha  Economics    NA
14 Agatha  Economics    92
15 Ashima  Economics    67
16  Rohan Statistics    NA
17   John Statistics    79
18 Anisha Statistics    88
19 Agatha Statistics    89
20 Ashima Statistics    89

The function in tidyr used to perform similar operation is gather(). We see a different behaviour: gather() had brought stud_data into a long data format with a warning by treating all columns as variable, while melt() has treated name as an “id variables”. Id columns are the columns that contain the identifier of the observation that is represented as a row in our data set. If melt() does not receive any id.variables specification, then it will use the factor or character columns as id variables. gather() requires the columns that needs to be treated as ids, all the other columns are to be used as key-value pairs.

There can be multiple Id variables and can be specified separately.

melt(stud_data,id.vars=c('Student_ID','Names'))

   Student_ID  Names   variable value
1           1  Rohan      Maths    67
2           2   John      Maths    89
3           3 Anisha      Maths    69
4           4 Agatha      Maths    79
5           5 Ashima      Maths    77
6           1  Rohan  Economics    56
7           2   John  Economics    88
8           3 Anisha  Economics    NA
9           4 Agatha  Economics    92
10          5 Ashima  Economics    67
11          1  Rohan Statistics    NA
12          2   John Statistics    79
13          3 Anisha Statistics    88
14          4 Agatha Statistics    89
15          5 Ashima Statistics    89

With id.vars= we can give multiple id variables. For multiple variables we can specify the list of variables enclosed in c().

We can also change the names of new columns created after melting

long_format<-melt(stud_data,id.vars=c('Student_ID','Names'), variable.name='Subjects', value.name='Marks')
long_format

   Student_ID  Names   Subjects Marks
1           1  Rohan      Maths    67
2           2   John      Maths    89
3           3 Anisha      Maths    69
4           4 Agatha      Maths    79
5           5 Ashima      Maths    77
6           1  Rohan  Economics    56
7           2   John  Economics    88
8           3 Anisha  Economics    NA
9           4 Agatha  Economics    92
10          5 Ashima  Economics    67
11          1  Rohan Statistics    NA
12          2   John Statistics    79
13          3 Anisha Statistics    88
14          4 Agatha Statistics    89
15          5 Ashima Statistics    89

Using variable.name= and value.name= arguments we can change the names of the columns variable and value.

In gather() we used bare variable names to specify the names of the new columns.

Note: both these commands of tidyr and reshape2 can be performed on dataframes. However, gather() cannot handle matrices or arrays but melt() can.

From long format to wide format : `dcast()` Vs `spread()`

cast function in reshape2 converts data from long format to wide format. It is a complement of melt() function. There are two types of cast functions: dcast() and acast()

dcast() - returns a dataframe as the output

acast() - returns a vector/matrix/array as the output

Since dataframe objects are the most common, we will explore the dcast() function.

Now let us reshape the long format object into wide format

dcast(long_format,Student_ID+Names~Subjects)

  Student_ID  Names Maths Economics Statistics
1          1  Rohan    67        56         NA
2          2   John    89        88         79
3          3 Anisha    69        NA         88
4          4 Agatha    79        92         89
5          5 Ashima    77        67         89

spread() from the package tidyr performs the same task as dcast().dcast() uses a formula to describe the shape of the data.

Try out the following commands :

dcast(long_format, Names+Student_ID~Subjects)
dcast(long_format,Student_ID~Subjects)
dcast(long_format,Names ~ Subjects)

We can also subset variables while converting data from long to wide format.

install.packages("plyr")
library(plyr)

dcast(long_format,Student_ID+Names~Subjects,subset=.(Subjects=="Maths"))

  Student_ID  Names Maths
1          1  Rohan    67
2          2   John    89
3          3 Anisha    69
4          4 Agatha    79
5          5 Ashima    77

Here, we have subsetted Maths from the variable Subjects. subset= is used to subset data using .() . Package plyr is needed to access '.'

Aggregating with `dcast()`

Let us check some aggregations that are possible with dcast()

dcast(long_format, Student_ID+Names ~ .,fun.aggregate=sum,na.rm=TRUE)

  Student_ID  Names   .
1          1  Rohan 123
2          2   John 256
3          3 Anisha 157
4          4 Agatha 260
5          5 Ashima 233

fun.aggregate is used to specify the aggregation function. na.rm=TRUE is used to remove NA values.

tidyr package has no function that allow us to perform a similar operation.

Conclusion:

tidyr and reshape2 are similar in a way as they have some functions that perform the same operation. reshape2’s functions can do data aggregation that is not possible with tidyr.

tidyr’s aim is data tidying while reshape2 is used for data reshaping and aggregating. tidyr syntax is easy to understand and to work with, but its functionalities are limited.

Reshape and Tidy Data: reshape2 vs tidyr

From wide format to long format : melt() Vs gather()

From long format to wide format : dcast() Vs spread()

Aggregating with dcast()

Conclusion:

From wide format to long format : `melt()` Vs `gather()`

From long format to wide format : `dcast()` Vs `spread()`

Aggregating with `dcast()`