Data Profiling Made Simple: Using describeBy() in R

Introduction:

When analysing data, understanding the patterns and characteristics of your dataset is essential for making informed decisions. R provides numerous packages to help with this. One of them is psych, a package of procedures for psychological, psychometric, and personality research that also includes general-purpose descriptive tools. Among its many functions, describeBy stands out as a convenient way to summarize data by one or more grouping variables.

Understanding the basics:

The psych package has a function, describe, which provides overall descriptive statistics for a vector or for every column of a data frame. The describeBy function extends this by reporting the same statistics separately for each level of one or more grouping variables. This is particularly helpful when a dataset contains several groups, because the key statistics for each group can be compared side by side.

Getting started:

The first step is to install and load the required package, psych.

install.packages("psych")
library(psych)
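As a quick check that the package has loaded, and to see the difference between describe and describeBy in practice, here is a minimal sketch. It uses R's built-in mtcars dataset rather than the data from this post, so the choice of columns (mpg, hp, cyl) is only an illustrative assumption.

```r
library(psych)

# describe(): one row of statistics per column, over all rows
overall <- describe(mtcars[, c("mpg", "hp")])
print(overall)

# describeBy(): the same statistics for mpg, computed separately
# for each value of the grouping variable cyl (4, 6 and 8 cylinders)
by_cyl <- describeBy(mtcars$mpg, group = mtcars$cyl)
print(by_cyl)
```

The first call returns a single table, while the second returns one table per group.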

The data used in this post contains four variables, two numerical and two categorical: the students' study hours, their score in the previous exam, the result of the current exam (Pass or Fail), and whether or not they take part in extracurricular activities. Here is a snapshot:

  StudyHours PreviousExamScore CurrentExamResult Extracurricular
1          4                82              Fail              No
2         10                72              Pass             Yes
3          8                59              Fail              No
4          6                89              Pass              No
5          2                81              Fail             Yes
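If you want to follow along without the original file, the sketch below simulates a data frame with the same four column names. Everything in it is an assumption made for illustration: the seed, the value ranges, and especially the pass rule are hypothetical, so the statistics you get will not match the outputs shown in this post.

```r
set.seed(42)  # arbitrary seed, only so the simulation is reproducible
n <- 500      # the group sizes in the outputs of this post add up to 500

df <- data.frame(
  StudyHours        = sample(1:10, n, replace = TRUE),
  PreviousExamScore = sample(40:100, n, replace = TRUE),
  Extracurricular   = sample(c("No", "Yes"), n, replace = TRUE)
)

# Hypothetical pass rule, used only to create a grouping column
df$CurrentExamResult <- ifelse(
  df$StudyHours >= 5 & df$PreviousExamScore >= 60, "Pass", "Fail"
)

# Put the columns in the same order as the snapshot above
df <- df[, c("StudyHours", "PreviousExamScore",
             "CurrentExamResult", "Extracurricular")]

head(df, 5)
```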

Using the function:

# Summarize study hours (x), grouped by whether the student passed the current exam
describeBy(x = df$StudyHours, group = df$CurrentExamResult)

 Descriptive statistics by group 
group: Fail
   vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 316 4.31 2.55      4     4.1 2.97   1  10     9 0.67    -0.61 0.14
------------------------------------------------------------ 
group: Pass
   vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 184 7.56 1.48      7    7.55 1.48   5  10     5 0.05    -1.04 0.11

 

The summary statistics reported by this function, alongside the number of observations (n), are the mean, standard deviation, median, trimmed mean, mad (median absolute deviation), minimum, maximum, range, skew, kurtosis and standard error. If you pass skew = FALSE, the output is reduced to the mean, standard deviation, minimum, maximum, range and standard error.
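To make the less familiar columns concrete, here is a small base R sketch that computes several of them by hand on a made-up vector. It assumes psych's documented default of trim = 0.1 for the trimmed column and the default scaling constant of R's mad() function (1.4826); skew and kurtosis are left out because their exact formula depends on the type argument.

```r
# Made-up sample of ten values, for illustration only
x <- c(1, 2, 3, 4, 4, 5, 6, 7, 8, 10)

stats <- c(
  mean    = mean(x),
  sd      = sd(x),
  median  = median(x),
  trimmed = mean(x, trim = 0.1),       # drops the lowest and highest 10% of values
  mad     = mad(x),                    # median absolute deviation, scaled by 1.4826
  min     = min(x),
  max     = max(x),
  range   = max(x) - min(x),
  se      = sd(x) / sqrt(length(x))    # standard error of the mean
)

print(stats)
```

With ten values, a 10% trim removes one value from each end, so the trimmed mean here is the mean of the middle eight values.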

Some other arguments that you can pass are mat (TRUE if you want the output as a single table with one row per group, FALSE if you want a separate table for each group) and digits (the number of decimal places reported when mat = TRUE).

describeBy(x = df$StudyHours, group = df$CurrentExamResult, mat = TRUE, digits = 2)
    item group1 vars   n mean   sd median trimmed  mad min max range skew
X11    1   Fail    1 316 4.31 2.55      4    4.10 2.97   1  10     9 0.67
X12    2   Pass    1 184 7.56 1.48      7    7.55 1.48   5  10     5 0.05
    kurtosis   se
X11    -0.61 0.14
X12    -1.04 0.11
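The practical difference between the two output forms is how you pull values out of them afterwards. The sketch below uses a tiny made-up data frame (not the data from this post) and assumes, as in the psych versions I have used, that mat = TRUE returns a data frame with a group1 column while the default returns a list with one element per group.

```r
library(psych)

# Tiny made-up example data, for illustration only
toy <- data.frame(
  StudyHours        = c(4, 10, 8, 6, 2, 7, 3, 9),
  CurrentExamResult = c("Fail", "Pass", "Fail", "Pass",
                        "Fail", "Pass", "Fail", "Pass")
)

# Matrix-style output: one row per group, easy to filter or export
by_mat <- describeBy(toy$StudyHours, group = toy$CurrentExamResult,
                     mat = TRUE, digits = 2)
print(by_mat[by_mat$group1 == "Pass", c("group1", "n", "mean", "sd")])

# Default list output: one table per group, accessed by group name
by_list <- describeBy(toy$StudyHours, group = toy$CurrentExamResult)
print(by_list$Pass$mean)
```

The matrix form is usually the more convenient one when you want to sort, filter, or write the summary to a file.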

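You can also group by more than one variable. With the formula interface, the variable to summarize goes on the left of the ~, the grouping variables go on the right, and the data frame is passed through the data argument: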

describeBy(StudyHours ~ CurrentExamResult + Extracurricular, data = df, mat = TRUE, digits = 2, skew = FALSE)
            item group1 group2 vars   n mean   sd min max range   se
StudyHours1    1   Fail     No    1 156 4.51 2.50   1  10     9 0.20
StudyHours2    2   Pass     No    1  90 7.43 1.39   5  10     5 0.15
StudyHours3    3   Fail    Yes    1 160 4.11 2.60   1  10     9 0.21
StudyHours4    4   Pass    Yes    1  94 7.68 1.55   5  10     5 0.16

 

You can also calculate statistics for more than one variable at once by passing either the whole data frame as x or a data frame with only the required columns. Without the formula interface, multiple grouping variables are supplied as a list:

describeBy(df[, c("StudyHours", "PreviousExamScore")], group = list(df$CurrentExamResult, df$Extracurricular), mat = TRUE, digits = 2)
                   item group1 group2 vars   n  mean    sd median trimmed   mad
StudyHours1           1   Fail     No    1 156  4.51  2.50    4.0    4.33  2.97
StudyHours2           2   Pass     No    1  90  7.43  1.39    7.0    7.40  1.48
StudyHours3           3   Fail    Yes    1 160  4.11  2.60    4.0    3.87  2.97
StudyHours4           4   Pass    Yes    1  94  7.68  1.55    8.0    7.70  1.48
PreviousExamScore1    5   Fail     No    2 156 63.48 17.41   59.5   62.17 17.05
PreviousExamScore2    6   Pass     No    2  90 79.43 11.76   76.5   79.29 14.08
PreviousExamScore3    7   Fail    Yes    2 160 62.78 17.27   56.0   61.22 13.34
PreviousExamScore4    8   Pass    Yes    2  94 78.32 10.88   77.5   78.24 14.08
                   min max range  skew kurtosis   se
StudyHours1          1  10     9  0.63    -0.61 0.20
StudyHours2          5  10     5  0.15    -0.78 0.15
StudyHours3          1  10     9  0.73    -0.61 0.21
StudyHours4          5  10     5 -0.06    -1.23 0.16
PreviousExamScore1  40 100    60  0.58    -0.92 1.39
PreviousExamScore2  60  99    39  0.13    -1.30 1.24
PreviousExamScore3  41 100    59  0.69    -0.84 1.37
PreviousExamScore4  60  98    38  0.08    -1.24 1.12

 

Conclusion:

The describeBy function is a valuable tool for exploring and summarizing data by categorical variables. Whether you are analyzing educational data, social science surveys, or any other dataset with grouping variables, it provides a convenient way to obtain key statistics for each group and to spot differences between groups. It is well worth experimenting with and adding to your toolkit as a data analyst or researcher.