R users (mostly beginners) struggle helplessly while dealing with large data sets. When your machine fails to work on large data sets, it gives repetitive warnings, error messages of insufficient memory usage.
Some of the practices like using read.csv()
to load large files and aggregate()
to perform aggregation on large datasets, impedes R’s performance.
The R package data.table
is written by Matt Dowle in year 2008. data.table
is an enhanced version of data.frame
that allows you to do
Let’s learn few data management tasks from the above list using package data.table
#Install and load data.table package
install.packages("data.table")
library(data.table)
fread()
of package data.table
loads large datasets faster and more conveniently than other base functions in R.
data<-fread("data.csv")
Read 17.4% of 7000000 rows
Read 34.1% of 7000000 rows
Read 53.3% of 7000000 rows
Read 69.6% of 7000000 rows
Read 88.9% of 7000000 rows
Read 7000000 rows and 4 (of 4) columns from 0.144 GB file in 00:00:07
#Number of rows and columns
dim(data)
[1] 7000000 4
head(data)
ID Capacity Code State
1: 1 216 D Alabama
2: 2 687 C Indiana
3: 3 598 D Texas
4: 4 468 D Nevada
5: 5 932 B Alabama
6: 6 688 D Indiana
#Adding a new column 'new_capacity' to data
data[,new_capacity:=Capacity+5]
head(data,4)
ID Capacity Code State new_capacity
1: 1 216 D Alabama 221
2: 2 687 C Indiana 692
3: 3 598 D Texas 603
4: 4 468 D Nevada 473
data[State=="Alabama",State:="Al"]
head(data,10)
ID Capacity Code State new_capacity
1: 1 216 D Al 221
2: 2 687 C Indiana 692
3: 3 598 D Texas 603
4: 4 468 D Nevada 473
5: 5 932 B Al 937
6: 6 688 D Indiana 693
7: 7 294 D Texas 299
8: 8 420 B Nevada 425
9: 9 924 A Al 929
10: 10 341 C Indiana 346
#Using Indices
data[4:6]
ID Capacity Code State new_capacity
1: 4 468 D Nevada 473
2: 5 932 B Al 937
3: 6 688 D Indiana 693
#Subsetting Observations by giving conditions
sub_rows<-data[Code=="C" & State=="Al"]
head(sub_rows,3)
ID Capacity Code State new_capacity
1: 69 393 C Al 398
2: 73 444 C Al 449
3: 81 174 C Al 179
sub_columns<-data[,.(ID,Capacity)]
head(sub_columns, 3)
ID Capacity
1: 1 216
2: 2 687
3: 3 598
setkey(data,Code,State)
sub_key<-data[.("C"),.(Code,State)]
head(sub_key, 2)
Code State
1: C Al
2: C Al
setkey()
reorders the column observations in ascending order. Keys in data table delivers incredibly fast results. You can set keys on any type of column i.e. numeric, factor, integer, character. Once the key is set, we no longer need to provide the column name again and again.
dt_order1<-data[order(Code)]
head(dt_order1,4)
ID Capacity Code State new_capacity
1: 9 924 A Al 929
2: 13 234 A Al 239
3: 17 531 A Al 536
4: 21 751 A Al 756
dt_order2<-data[order(-State)]
head(dt_order2,4)
ID Capacity Code State new_capacity
1: 27 105 A Texas 110
2: 35 903 A Texas 908
3: 47 138 A Texas 143
4: 59 924 A Texas 929
dt_order3<-data[order(Code,-new_capacity)]
head(dt_order3,4)
ID Capacity Code State new_capacity
1: 37 961 A Al 966
2: 137 961 A Al 966
3: 237 961 A Al 966
4: 337 961 A Al 966
#Calculate sum of variable 'new_capacity' by variable 'State' and name the columns as 'TotalCapacity'
setkey(data,new_capacity,State)
DT_agg<-data[,.(TotalCapacity=sum(new_capacity)),by=State]
DT_agg
State TotalCapacity
1: Al 932050000
2: Texas 932050000
3: Indiana 933730000
4: Nevada 933730000
new_capacity and State are set as keys for faster aggregation.
.()
notation allows you to rename the columns inside datatable.
by=
takes the factors by which data will be aggregated. Multiple factors can be specified with list()
.