install.packages("psych")
library(psych)
Data Profiling Made Simple: Using describeBy() in R
Introduction:
In data analysis, understanding the patterns and characteristics of your dataset is crucial for making informed decisions. R provides many packages to help with this. One of them is psych, a general-purpose toolbox originally developed for personality, psychometric and psychological research. Among its many functions, describeBy() stands out as a versatile tool for exploring and summarizing data by grouping variables.
Understanding the basics:
The psych package has a function, describe(), which provides descriptive statistics for a vector or for each column of a matrix or data frame. describeBy() extends this by reporting the same statistics separately for each level of one or more categorical grouping variables. This is particularly helpful with large datasets that contain multiple groups, because it summarizes the key statistics per group and makes differences between groups easy to spot.
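As a minimal sketch of the difference, the snippet below runs both functions on a small made-up data frame. The `toy` data and its column names are invented for illustration and are not the dataset used in this post. The calls are wrapped in a requireNamespace() check, so nothing runs if psych is not installed.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  hours  = c(2, 3, 4, 5, 6, 7, 8, 9),
  result = c("Fail", "Fail", "Fail", "Fail", "Pass", "Pass", "Pass", "Pass")
)

if (requireNamespace("psych", quietly = TRUE)) {
  # describe(): one row of statistics for the whole column
  overall <- psych::describe(toy$hours)
  print(overall)

  # describeBy(): the same statistics, computed separately for each group
  by_group <- psych::describeBy(toy$hours, group = toy$result)
  print(by_group)
}
```

The first call returns a single row for all eight values, while the second returns one block per level of `result`.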
Getting started:
The first step is to install and load the required package, psych.
The data used in this post contains four variables, two numerical and two categorical: the students' study hours, their scores in the previous exam, the result of the current exam, and whether they participate in extracurricular activities. The following is a snapshot:
  StudyHours PreviousExamScore CurrentExamResult Extracurricular
1          4                82              Fail              No
2         10                72              Pass             Yes
3          8                59              Fail              No
4          6                89              Pass              No
5          2                81              Fail             Yes
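The post does not show how df was created. If you want something to experiment with, one option is to rebuild the five snapshot rows by hand, as sketched below. This stand-in is an assumption for illustration: the outputs in this post come from the full dataset (316 + 184 = 500 students), so statistics computed on these five rows will not match them.

```r
# Stand-in data frame built from the five snapshot rows only;
# the full dataset used in the post is larger, so results will differ.
df <- data.frame(
  StudyHours        = c(4, 10, 8, 6, 2),
  PreviousExamScore = c(82, 72, 59, 89, 81),
  CurrentExamResult = factor(c("Fail", "Pass", "Fail", "Pass", "Fail")),
  Extracurricular   = factor(c("No", "Yes", "No", "No", "Yes"))
)

str(df)   # check the column types: two numeric, two factor
head(df)  # print the rows
```

Storing the two categorical columns as factors is a choice made here so that the group levels are explicit; character columns would also work as grouping variables.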
Using the function:
# Summarize the hours students spent studying, grouped by whether they passed the exam
describeBy(x = df$StudyHours, group = df$CurrentExamResult)
 Descriptive statistics by group
group: Fail
   vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 316 4.31 2.55      4     4.1 2.97   1  10     9 0.67    -0.61 0.14
------------------------------------------------------------
group: Pass
   vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 184 7.56 1.48      7    7.55 1.48   5  10     5 0.05    -1.04 0.11
The summary statistics returned by this function are the number of observations (n), mean, standard deviation, median, trimmed mean, mad (median absolute deviation), minimum, maximum, range, skew, kurtosis and standard error. If you specify skew = FALSE as an argument, you will get only n, mean, standard deviation, minimum, maximum, range and standard error.
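To make the less familiar columns concrete, here is a rough base R sketch of how the trimmed mean, mad and standard error can be computed by hand. The vector `x` is made up for illustration. The 10% trim matches the default of describe(), and the scaling constant 1.4826 used by mad() is consistent with the mad values of 2.97 and 1.48 in the output above (2 and 1 times 1.4826).

```r
# Made-up vector with one outlier, for illustration only
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

stats <- c(
  mean    = mean(x),                  # pulled upward by the outlier
  trimmed = mean(x, trim = 0.1),      # drops 10% of the values at each end first
  median  = median(x),
  mad     = mad(x),                   # median absolute deviation, scaled by 1.4826
  se      = sd(x) / sqrt(length(x))   # standard error of the mean
)
print(round(stats, 2))
```

The same relationship can be checked against the output above: for the Fail group, 2.55 / sqrt(316) is roughly 0.14, which is the reported se.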
Other arguments you can pass include mat (TRUE if you want the output as a single matrix-style table, FALSE if not) and digits (the number of digits to report in the matrix output).
describeBy(x = df$StudyHours, group = df$CurrentExamResult, mat = TRUE, digits = 2)
    item group1 vars   n mean   sd median trimmed  mad min max range skew
X11    1   Fail    1 316 4.31 2.55      4    4.10 2.97   1  10     9 0.67
X12    2   Pass    1 184 7.56 1.48      7    7.55 1.48   5  10     5 0.05
    kurtosis   se
X11    -0.61 0.14
X12    -1.04 0.11
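One practical advantage of the matrix form is that it has one row per group and can be indexed like an ordinary table. The sketch below assumes the column names shown in the output above (group1, n, mean, sd) and uses a made-up `toy` data frame rather than the post's dataset. The calls only run if psych is installed.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  hours  = c(2, 3, 4, 5, 6, 7, 8, 9),
  result = c("Fail", "Fail", "Fail", "Fail", "Pass", "Pass", "Pass", "Pass")
)

if (requireNamespace("psych", quietly = TRUE)) {
  res <- psych::describeBy(toy$hours, group = toy$result, mat = TRUE, digits = 2)

  # Keep only the row for the Pass group and a few columns of interest
  pass_row <- res[res$group1 == "Pass", c("group1", "n", "mean", "sd")]
  print(pass_row)

  # Order the groups by their mean, highest first
  print(res[order(res$mean, decreasing = TRUE), c("group1", "n", "mean")])
}
```

This makes it straightforward to pass the summary on to further steps such as sorting, filtering or exporting.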
You can also supply more than one grouping variable, for example through the formula interface:
describeBy(StudyHours ~ CurrentExamResult + Extracurricular, data = df, mat = TRUE, digits = 2, skew = FALSE)
            item group1 group2 vars   n mean   sd min max range   se
StudyHours1    1   Fail     No    1 156 4.51 2.50   1  10     9 0.20
StudyHours2    2   Pass     No    1  90 7.43 1.39   5  10     5 0.15
StudyHours3    3   Fail    Yes    1 160 4.11 2.60   1  10     9 0.21
StudyHours4    4   Pass    Yes    1  94 7.68 1.55   5  10     5 0.16
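If you want to sanity-check a two-way grouping like this, a rough cross-check in base R is aggregate() with the same kind of formula. The sketch below uses made-up data (the `toy` data frame and its column names are assumptions for illustration) and reproduces only n, mean and sd, not the full set of statistics from describeBy().

```r
# Made-up data for illustration; not the dataset used in this post
toy <- data.frame(
  hours  = c(2, 4, 3, 5, 7, 9, 6, 8),
  result = c("Fail", "Fail", "Fail", "Fail", "Pass", "Pass", "Pass", "Pass"),
  extra  = c("No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes")
)

# One row per result x extra combination, with n, mean and sd of hours
check <- aggregate(
  hours ~ result + extra,
  data = toy,
  FUN  = function(v) c(n = length(v), mean = mean(v), sd = sd(v))
)

# aggregate() stores the three statistics in one matrix column;
# flatten it into separate columns hours.n, hours.mean and hours.sd
check <- do.call(data.frame, check)
print(check)
```

Each row of `check` corresponds to one group1/group2 combination in the describeBy() matrix output, which makes mismatches easy to spot.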
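You can also calculate statistics for more than one variable at a time. Pass either the whole data frame as x or a data frame containing only the required columns. If you pass the whole data frame, the categorical columns are also summarized: they are converted to numeric codes and marked with an asterisk in the output.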
describeBy(df[, c("StudyHours", "PreviousExamScore")], group = list(df$CurrentExamResult, df$Extracurricular), mat = TRUE, digits = 2)
                   item group1 group2 vars   n  mean    sd median trimmed   mad
StudyHours1           1   Fail     No    1 156  4.51  2.50    4.0    4.33  2.97
StudyHours2           2   Pass     No    1  90  7.43  1.39    7.0    7.40  1.48
StudyHours3           3   Fail    Yes    1 160  4.11  2.60    4.0    3.87  2.97
StudyHours4           4   Pass    Yes    1  94  7.68  1.55    8.0    7.70  1.48
PreviousExamScore1    5   Fail     No    2 156 63.48 17.41   59.5   62.17 17.05
PreviousExamScore2    6   Pass     No    2  90 79.43 11.76   76.5   79.29 14.08
PreviousExamScore3    7   Fail    Yes    2 160 62.78 17.27   56.0   61.22 13.34
PreviousExamScore4    8   Pass    Yes    2  94 78.32 10.88   77.5   78.24 14.08
                   min max range  skew kurtosis   se
StudyHours1          1  10     9  0.63    -0.61 0.20
StudyHours2          5  10     5  0.15    -0.78 0.15
StudyHours3          1  10     9  0.73    -0.61 0.21
StudyHours4          5  10     5 -0.06    -1.23 0.16
PreviousExamScore1  40 100    60  0.58    -0.92 1.39
PreviousExamScore2  60  99    39  0.13    -1.30 1.24
PreviousExamScore3  41 100    59  0.69    -0.84 1.37
PreviousExamScore4  60  98    38  0.08    -1.24 1.12
Conclusion:
The describeBy() function is a valuable tool for exploring and summarizing data by categorical variables. Whether you are analyzing educational data, social science surveys, or any other dataset with grouping variables, it provides a convenient way to obtain key statistics per group and to compare groups at a glance. It is a useful addition to the toolkit of any data analyst or researcher working in R.