Exploratory data analysis (EDA) is part of every data science project, and R provides a number of ways to perform it. The base R aggregate() function and the summarize() function in dplyr are commonly used for this. The skimr package is relatively new but produces compact, readable summary reports.

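skimr reports a set of summary statistics for every column: the number of missing values and the completion rate for all types, plus the mean, standard deviation (sd) and quantiles for numeric columns and level counts for factors. For numeric columns it also prints a small inline histogram.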
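To make the numeric statistics concrete, here is a minimal sketch that recomputes them by hand for the uptake column of R's built-in CO2 dataset, using base R only. It illustrates what the reported columns mean; it is not a description of how skimr computes them internally, and the variable names are chosen for this example.

```r
# Recompute the per-column statistics by hand for one numeric column.
# Base R only; all variable names here are illustrative.
data(CO2)
x <- CO2$uptake

n_missing     <- sum(is.na(x))    # number of NA values
complete_rate <- mean(!is.na(x))  # share of non-missing values (1 means no NAs)
avg           <- mean(x, na.rm = TRUE)
std_dev       <- sd(x, na.rm = TRUE)

# p0, p25, p50, p75 and p100 are the minimum, the quartiles and the maximum
quartiles <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)

print(c(n_missing = n_missing, complete_rate = complete_rate))
print(round(c(mean = avg, sd = std_dev), 2))
print(quartiles)
```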

 

Load the skimr library and R's built-in CO2 dataset

library(skimr)
data(CO2)
head(CO2)
  Plant   Type  Treatment conc uptake
1   Qn1 Quebec nonchilled   95   16.0
2   Qn1 Quebec nonchilled  175   30.4
3   Qn1 Quebec nonchilled  250   34.8
4   Qn1 Quebec nonchilled  350   37.2
5   Qn1 Quebec nonchilled  500   35.3
6   Qn1 Quebec nonchilled  675   39.2

 

Summarize the entire dataset using the skim() function

skim(CO2)
Data summary
Name CO2
Number of rows 84
Number of columns 5
_______________________
Column type frequency:
factor 3
numeric 2
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Plant                 0             1 TRUE          12 Qn1: 7, Qn2: 7, Qn3: 7, Qc1: 7
Type                  0             1 FALSE          2 Que: 42, Mis: 42
Treatment             0             1 FALSE          2 non: 42, chi: 42

Variable type: numeric

skim_variable n_missing complete_rate   mean     sd   p0   p25   p50    p75   p100 hist
conc                  0             1 435.00 295.92 95.0 175.0 350.0 675.00 1000.0 ▇▂▂▂▂
uptake                0             1  27.21  10.81  7.7  17.9  28.3  37.12   45.5 ▇▇▅▇▇
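
The printed report is also a regular table that can be queried. The sketch below assumes skimr version 2 or later, where skim() returns an object of class skim_df whose statistic columns are prefixed by column type (for example numeric.mean) and where yank() extracts the sub-table for one type. The object names are illustrative.

```r
library(skimr)

# skim() returns a data-frame-like object, so the report can be
# stored and queried like any other table.
skimmed <- skim(CO2)

class(skimmed)
names(skimmed)  # statistics are prefixed by column type, e.g. numeric.mean

# Pull a single statistic for a single variable
uptake_mean <- skimmed$numeric.mean[skimmed$skim_variable == "uptake"]
round(uptake_mean, 2)

# yank() keeps only the rows and columns that belong to one column type
numeric_part <- yank(skimmed, "numeric")
numeric_part
```

Storing the result this way is useful when a statistic needs to feed into a later step instead of only being read off the console.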

 

Summarize Specific Variables

skim(CO2, Plant, conc)
Data summary
Name CO2
Number of rows 84
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Plant                 0             1 TRUE          12 Qn1: 7, Qn2: 7, Qn3: 7, Qc1: 7

Variable type: numeric

skim_variable n_missing complete_rate mean     sd p0 p25 p50 p75 p100 hist
conc                  0             1  435 295.92 95 175 350 675 1000 ▇▂▂▂▂
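
Listing columns by name is not the only option. As a hedged sketch, assuming skimr version 2 or later (which evaluates the extra arguments of skim() with tidyselect rules), columns can also be picked by a name pattern or by exclusion. dplyr is attached here only to make starts_with() available; the object names are illustrative.

```r
library(skimr)
library(dplyr)  # attaches starts_with()

# Select by name pattern: this picks Type and Treatment
by_prefix <- skim(CO2, starts_with("T"))
by_prefix

# Select by exclusion: every column except Plant
without_plant <- skim(CO2, -Plant)
without_plant
```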

 

Get a grouped summary by combining skim() with group_by() from dplyr

library(dplyr)
CO2 %>% group_by(Type) %>% skim() 
Data summary
Name Piped data
Number of rows 84
Number of columns 5
_______________________
Column type frequency:
factor 2
numeric 2
________________________
Group variables Type

Variable type: factor

skim_variable Type        n_missing complete_rate ordered n_unique top_counts
Plant         Quebec              0             1 TRUE           6 Qn1: 7, Qn2: 7, Qn3: 7, Qc1: 7
Plant         Mississippi         0             1 TRUE           6 Mn3: 7, Mn2: 7, Mn1: 7, Mc2: 7
Treatment     Quebec              0             1 FALSE          2 non: 21, chi: 21
Treatment     Mississippi         0             1 FALSE          2 non: 21, chi: 21

Variable type: numeric

skim_variable Type        n_missing complete_rate   mean     sd   p0    p25    p50    p75   p100 hist
conc          Quebec              0             1 435.00 297.72 95.0 175.00 350.00 675.00 1000.0 ▇▂▂▂▂
conc          Mississippi         0             1 435.00 297.72 95.0 175.00 350.00 675.00 1000.0 ▇▂▂▂▂
uptake        Quebec              0             1  33.54   9.67  9.3  30.33  37.15  40.15   45.5 ▂▁▂▅▇
uptake        Mississippi         0             1  20.88   7.82  7.7  13.87  19.30  28.05   35.5 ▇▆▇▅▇
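
For comparison, the grouped mean and sd of uptake can be cross-checked with the base R aggregate() function mentioned at the start. This is a minimal sketch using base R only; the object names are illustrative.

```r
# Cross-check the grouped mean and sd of uptake with base R
data(CO2)

grouped_mean <- aggregate(uptake ~ Type, data = CO2, FUN = mean)
grouped_sd   <- aggregate(uptake ~ Type, data = CO2, FUN = sd)

# Combine both results into one small table, rounded like the skim() report
check <- data.frame(
  Type = grouped_mean$Type,
  mean = round(grouped_mean$uptake, 2),
  sd   = round(grouped_sd$uptake, 2)
)
print(check)
```

aggregate() needs one call per statistic here, whereas skim() returns all of them, for every column, in a single call.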

 

Group by more than one variable to get a summary for each combination of Type and Treatment

CO2 %>% group_by(Type,Treatment) %>% skim() 
Data summary
Name Piped data
Number of rows 84
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 2
________________________
Group variables Type, Treatment

Variable type: factor

skim_variable Type        Treatment  n_missing complete_rate ordered n_unique top_counts
Plant         Quebec      nonchilled         0             1 TRUE           3 Qn1: 7, Qn2: 7, Qn3: 7, Qc1: 0
Plant         Quebec      chilled            0             1 TRUE           3 Qc1: 7, Qc3: 7, Qc2: 7, Qn1: 0
Plant         Mississippi nonchilled         0             1 TRUE           3 Mn3: 7, Mn2: 7, Mn1: 7, Qn1: 0
Plant         Mississippi chilled            0             1 TRUE           3 Mc2: 7, Mc3: 7, Mc1: 7, Qn1: 0

Variable type: numeric

skim_variable Type        Treatment  n_missing complete_rate   mean     sd   p0   p25   p50   p75   p100 hist
conc          Quebec      nonchilled         0             1 435.00 301.42 95.0 175.0 350.0 675.0 1000.0 ▇▂▂▂▂
conc          Quebec      chilled            0             1 435.00 301.42 95.0 175.0 350.0 675.0 1000.0 ▇▂▂▂▂
conc          Mississippi nonchilled         0             1 435.00 301.42 95.0 175.0 350.0 675.0 1000.0 ▇▂▂▂▂
conc          Mississippi chilled            0             1 435.00 301.42 95.0 175.0 350.0 675.0 1000.0 ▇▂▂▂▂
uptake        Quebec      nonchilled         0             1  35.33   9.60 13.6  32.4  39.2  41.8   45.5 ▂▁▂▃▇
uptake        Quebec      chilled            0             1  31.75   9.64  9.3  27.3  35.0  38.7   42.4 ▂▁▂▅▇
uptake        Mississippi nonchilled         0             1  25.95   7.40 10.6  22.0  28.1  31.1   35.5 ▃▂▁▇▇
uptake        Mississippi chilled            0             1  15.81   4.06  7.7  12.5  17.9  18.9   22.2 ▃▅▃▇▅
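
A grouped report can be post-processed like any other table, for example to rank the groups. The sketch below assumes skimr version 2 or later, where yank() returns the numeric sub-table with unprefixed column names such as mean and sd. The object names are illustrative.

```r
library(skimr)
library(dplyr)

# Same grouped summary as above, stored instead of printed
grouped <- CO2 %>%
  group_by(Type, Treatment) %>%
  skim()

# Keep the numeric sub-table as a plain data frame, then only the uptake rows
numeric_stats <- as.data.frame(yank(grouped, "numeric"))
uptake_stats  <- numeric_stats[numeric_stats$skim_variable == "uptake", ]

# Rank the four groups from lowest to highest mean uptake
ranked <- uptake_stats[order(uptake_stats$mean),
                       c("Type", "Treatment", "mean", "sd")]
print(ranked)
```

With the values reported above, the chilled Mississippi plants come first, since they have the lowest mean uptake of the four groups.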