There are two main formats of data; in Wide Format data each column represents a different variable whereas in Long Format data, one column contains all the possible variables, another column contains their respective values. In practical scenario, while wide format is more readable, long format is easier to analyse. Therefore, it is useful to know how to convert between the two.
“reshape2” and “tidyr” are two packages that are used to restructure data. In our previous analytics lab we have elaborated on functions of the package “tidyr” . In this article we will focus on the functions of package “reshape2” and will see the similarities and differences between the two packages while performing the following tasks:
melt()
Vs gather()
Refer to the example in the previous section Now let us convert the data in wide format to long format using the function melt()
in reshape2
install.packages("reshape2")
library(reshape2)
stud_data<-read.csv("stud_data.csv",header=TRUE)
melt(stud_data)
Names variable value
1 Rohan Student_ID 1
2 John Student_ID 2
3 Anisha Student_ID 3
4 Agatha Student_ID 4
5 Ashima Student_ID 5
6 Rohan Maths 67
7 John Maths 89
8 Anisha Maths 69
9 Agatha Maths 79
10 Ashima Maths 77
11 Rohan Economics 56
12 John Economics 88
13 Anisha Economics NA
14 Agatha Economics 92
15 Ashima Economics 67
16 Rohan Statistics NA
17 John Statistics 79
18 Anisha Statistics 88
19 Agatha Statistics 89
20 Ashima Statistics 89
The function in tidyr used to perform similar operation is gather()
. We see a different behaviour: gather()
had brought stud_data into a long data format with a warning by treating all columns as variable, while melt()
has treated name as an “id variables”. Id columns are the columns that contain the identifier of the observation that is represented as a row in our data set. If melt()
does not receive any id.variables specification, then it will use the factor or character columns as id variables. gather()
requires the columns that needs to be treated as ids, all the other columns are to be used as key-value pairs.
There can be multiple Id variables and can be specified separately.
melt(stud_data,id.vars=c('Student_ID','Names'))
Student_ID Names variable value
1 1 Rohan Maths 67
2 2 John Maths 89
3 3 Anisha Maths 69
4 4 Agatha Maths 79
5 5 Ashima Maths 77
6 1 Rohan Economics 56
7 2 John Economics 88
8 3 Anisha Economics NA
9 4 Agatha Economics 92
10 5 Ashima Economics 67
11 1 Rohan Statistics NA
12 2 John Statistics 79
13 3 Anisha Statistics 88
14 4 Agatha Statistics 89
15 5 Ashima Statistics 89
With id.vars=
we can give multiple id variables. For multiple variables we can specify the list of variables enclosed in c()
.
We can also change the names of new columns created after melting
long_format<-melt(stud_data,id.vars=c('Student_ID','Names'), variable.name='Subjects', value.name='Marks')
long_format
Student_ID Names Subjects Marks
1 1 Rohan Maths 67
2 2 John Maths 89
3 3 Anisha Maths 69
4 4 Agatha Maths 79
5 5 Ashima Maths 77
6 1 Rohan Economics 56
7 2 John Economics 88
8 3 Anisha Economics NA
9 4 Agatha Economics 92
10 5 Ashima Economics 67
11 1 Rohan Statistics NA
12 2 John Statistics 79
13 3 Anisha Statistics 88
14 4 Agatha Statistics 89
15 5 Ashima Statistics 89
Using variable.name=
and value.name=
arguments we can change the names of the columns variable and value.
In gather()
we used bare variable names to specify the names of the new columns.
Note: both these commands of tidyr and reshape2 can be performed on dataframes. However, gather()
cannot handle matrices or arrays but melt()
can.
dcast()
Vs spread()
cast function in reshape2 converts data from long format to wide format. It is a complement of melt()
function. There are two types of cast functions: dcast()
and acast()
dcast()
- returns a dataframe as the output
acast()
- returns a vector/matrix/array as the output
Since dataframe objects are the most common, we will explore the dcast()
function.
Now let us reshape the long format object into wide format
dcast(long_format,Student_ID+Names~Subjects)
Student_ID Names Maths Economics Statistics
1 1 Rohan 67 56 NA
2 2 John 89 88 79
3 3 Anisha 69 NA 88
4 4 Agatha 79 92 89
5 5 Ashima 77 67 89
spread()
from the package tidyr performs the same task as dcast()
.dcast()
uses a formula to describe the shape of the data.
Try out the following commands :
dcast(long_format, Names+Student_ID~Subjects)
dcast(long_format,Student_ID~Subjects)
dcast(long_format,Names ~ Subjects)
We can also subset variables while converting data from long to wide format.
install.packages("plyr")
library(plyr)
dcast(long_format,Student_ID+Names~Subjects,subset=.(Subjects=="Maths"))
Student_ID Names Maths
1 1 Rohan 67
2 2 John 89
3 3 Anisha 69
4 4 Agatha 79
5 5 Ashima 77
Here, we have subsetted Maths from the variable Subjects. subset=
is used to subset data using .()
. Package plyr is needed to access '.'
dcast()
Let us check some aggregations that are possible with dcast()
dcast(long_format, Student_ID+Names ~ .,fun.aggregate=sum,na.rm=TRUE)
Student_ID Names .
1 1 Rohan 123
2 2 John 256
3 3 Anisha 157
4 4 Agatha 260
5 5 Ashima 233
fun.aggregate
is used to specify the aggregation function. na.rm=TRUE
is used to remove NA values.
tidyr package has no function that allow us to perform a similar operation.
tidyr and reshape2 are similar in a way as they have some functions that perform the same operation. reshape2’s functions can do data aggregation that is not possible with tidyr.
tidyr’s aim is data tidying while reshape2 is used for data reshaping and aggregating. tidyr syntax is easy to understand and to work with, but its functionalities are limited.