Data Management Tasks Using Package data.table in R

R users (mostly beginners) struggle helplessly while dealing with large data sets. When your machine fails to work on large data sets, it gives repetitive warnings, error messages of insufficient memory usage.
Some of the practices like using read.csv() to load large files and aggregate() to perform aggregation on large datasets, impedes R’s performance.
The R package data.table is written by Matt Dowle in year 2008. data.table is an enhanced version of data.frame that allows you to do

Fast aggregation of large data
Fast ordered joins
Fast add/modify/delete columns by group using no copies at all
List columns
Fast file import.

Let’s learn few data management tasks from the above list using package data.table

#Install and load data.table package
install.packages("data.table")
library(data.table)

Importing Data:

fread() of package data.table loads large datasets faster and more conveniently than other base functions in R.

data<-fread("data.csv")


Read 17.4% of 7000000 rows
Read 34.1% of 7000000 rows
Read 53.3% of 7000000 rows
Read 69.6% of 7000000 rows
Read 88.9% of 7000000 rows
Read 7000000 rows and 4 (of 4) columns from 0.144 GB file in 00:00:07

#Number of rows and columns
dim(data)

[1] 7000000       4

head(data)

   ID Capacity Code   State
1:  1      216    D Alabama
2:  2      687    C Indiana
3:  3      598    D   Texas
4:  4      468    D  Nevada
5:  5      932    B Alabama
6:  6      688    D Indiana

Modifying Data:

Deriving a New Column

#Adding a new column 'new_capacity' to data
data[,new_capacity:=Capacity+5]
head(data,4)

   ID Capacity Code   State new_capacity
1:  1      216    D Alabama          221
2:  2      687    C Indiana          692
3:  3      598    D   Texas          603
4:  4      468    D  Nevada          473

Updating Values

data[State=="Alabama",State:="Al"]
head(data,10)

    ID Capacity Code   State new_capacity
 1:  1      216    D      Al          221
 2:  2      687    C Indiana          692
 3:  3      598    D   Texas          603
 4:  4      468    D  Nevada          473
 5:  5      932    B      Al          937
 6:  6      688    D Indiana          693
 7:  7      294    D   Texas          299
 8:  8      420    B  Nevada          425
 9:  9      924    A      Al          929
10: 10      341    C Indiana          346

Creating Subsets:

Subsetting Observations

#Using Indices
data[4:6]

   ID Capacity Code   State new_capacity
1:  4      468    D  Nevada          473
2:  5      932    B      Al          937
3:  6      688    D Indiana          693

#Subsetting Observations by giving conditions
sub_rows<-data[Code=="C" & State=="Al"]
head(sub_rows,3)

   ID Capacity Code State new_capacity
1: 69      393    C    Al          398
2: 73      444    C    Al          449
3: 81      174    C    Al          179

Subsetting Variables

sub_columns<-data[,.(ID,Capacity)]
head(sub_columns, 3)

   ID Capacity
1:  1      216
2:  2      687
3:  3      598

Subsetting Both Observations and Variables

setkey(data,Code,State)
sub_key<-data[.("C"),.(Code,State)]
head(sub_key, 2)

   Code State
1:    C    Al
2:    C    Al

setkey() reorders the column observations in ascending order. Keys in data table delivers incredibly fast results. You can set keys on any type of column i.e. numeric, factor, integer, character. Once the key is set, we no longer need to provide the column name again and again.

Sorting Data:

Sorting Data in Ascending Order

dt_order1<-data[order(Code)]
head(dt_order1,4)

   ID Capacity Code State new_capacity
1:  9      924    A    Al          929
2: 13      234    A    Al          239
3: 17      531    A    Al          536
4: 21      751    A    Al          756

Sorting Data in Descending Order

dt_order2<-data[order(-State)]
head(dt_order2,4)

   ID Capacity Code State new_capacity
1: 27      105    A Texas          110
2: 35      903    A Texas          908
3: 47      138    A Texas          143
4: 59      924    A Texas          929

Sorting Data in Multiple Ordering Levels

dt_order3<-data[order(Code,-new_capacity)]
head(dt_order3,4)

    ID Capacity Code State new_capacity
1:  37      961    A    Al          966
2: 137      961    A    Al          966
3: 237      961    A    Al          966
4: 337      961    A    Al          966

Aggregating Data:

#Calculate sum of variable 'new_capacity' by variable 'State' and name the columns as 'TotalCapacity'
setkey(data,new_capacity,State)
DT_agg<-data[,.(TotalCapacity=sum(new_capacity)),by=State]
DT_agg

     State TotalCapacity
1:      Al     932050000
2:   Texas     932050000
3: Indiana     933730000
4:  Nevada     933730000

new_capacity and State are set as keys for faster aggregation.
.() notation allows you to rename the columns inside datatable.
by= takes the factors by which data will be aggregated. Multiple factors can be specified with list().

Data Management Tasks Using Package ‘data.table’ in R

Importing Data:

Modifying Data:

Creating Subsets:

Sorting Data:

Aggregating Data: