Summary statistics made easy with the vtable package in R
Introduction
Anyone who has worked with SAS is probably familiar with PROC CONTENTS, a quick way to view a dataset’s structure and metadata. In R, by contrast, the typical approach is to call several functions such as head(), dim(), class(), or summary() to understand the data (Python workflows are similar). The vtable package brings the same convenience to R by offering a single, unified function to inspect and summarize datasets, much like PROC CONTENTS in SAS.
Let us begin by installing and loading the required package:
# install.packages("vtable")
library(vtable)
A glance at the dataset
Let’s look at how to use the vtable package.
1. Getting Variable Names, Types, and Value Summaries
Here, we first convert the character variables to factors so that vtable can show their levels. The conversion uses dplyr.
library(dplyr)  # provides %>%, mutate(), across() and where()
insurance_df <- insurance_df %>% mutate(across(where(is.character), as.factor))
vtable(insurance_df)
Name | Class | Values |
---|---|---|
age | integer | Num: 18 to 64 |
gender | factor | 'female' 'male' |
bmi | numeric | Num: 15.96 to 53.13 |
children | integer | Num: 0 to 5 |
smoker | factor | 'no' 'yes' |
region | factor | 'northeast' 'northwest' 'southeast' 'southwest' |
charges | numeric | Num: 1062.385 to 63770.428 |
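The Values column above lists factor levels because the character columns were converted to factors first. If dplyr is not available, the same conversion can be sketched in base R. The data frame `toy_df` below is a made-up stand-in for `insurance_df`, used only for illustration.

```r
# Hypothetical toy data standing in for insurance_df (not the real dataset)
toy_df <- data.frame(
  age    = c(19L, 33L, 46L),
  smoker = c("yes", "no", "no"),
  region = c("southwest", "southeast", "northwest"),
  stringsAsFactors = FALSE
)

# Find the character columns and convert only those to factors
is_chr <- vapply(toy_df, is.character, logical(1))
toy_df[is_chr] <- lapply(toy_df[is_chr], factor)

sapply(toy_df, class)   # age stays integer; smoker and region become factor
levels(toy_df$smoker)   # "no" "yes"
```

Factor levels are sorted alphabetically by default, which is why 'no' appears before 'yes' in the table.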
2. Add description to understand the variables better
my_labels <- c("Insured person's age", "Gender", "BMI", "Number of dependents", "Smoker status", "Residential region", "Insurance cost")

vtable(insurance_df,
       labels = my_labels,                # adds a Label column (one label per column, in column order)
       data.title = "Insurance data",     # adds a title
       summ = c("mean(x)", "countNA(x)")  # adds a Summary column with these statistics
)
Name | Class | Label | Values | Summary |
---|---|---|---|---|
age | integer | Insured person's age | Num: 18 to 64 | mean: 39.207, countNA: 0 |
gender | factor | Gender | 'female' 'male' | countNA: 0 |
bmi | numeric | BMI | Num: 15.96 to 53.13 | mean: 30.663, countNA: 0 |
children | integer | Number of dependents | Num: 0 to 5 | mean: 1.095, countNA: 0 |
smoker | factor | Smoker status | 'no' 'yes' | countNA: 0 |
region | factor | Residential region | 'northeast' 'northwest' 'southeast' 'southwest' | countNA: 0 |
charges | numeric | Insurance cost | Num: 1062.385 to 63770.428 | mean: 13293.678, countNA: 0 |
3. Creating Balance Tables in R with sumtable()
Balance tables are used to check whether groups (e.g., smokers vs. non-smokers) are comparable in terms of key covariates like age, BMI, and number of children. This is especially important before modeling or in observational studies where group equivalence is critical.
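To make the idea concrete before turning to sumtable(), here is a minimal base R sketch of what a balance check computes: the mean of each covariate within each group. The data frame `toy` is invented for illustration and is not the insurance dataset.

```r
# Hypothetical toy data: two groups (smoker = no / yes) and two covariates
toy <- data.frame(
  smoker = c("no", "no", "no", "yes", "yes", "yes"),
  age    = c(25, 40, 55, 30, 40, 50),
  bmi    = c(22, 28, 31, 24, 27, 33)
)

# Mean of each covariate within each group: the core of a balance table
balance <- aggregate(cbind(age, bmi) ~ smoker, data = toy, FUN = mean)
balance
# smoker = no : age 40, bmi 27
# smoker = yes: age 40, bmi 28
```

Similar group means suggest the groups are comparable on that covariate. sumtable() adds counts, standard deviations, and formal tests on top of this.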
# Basic Variable summary Table
sumtable(insurance_df)
Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
---|---|---|---|---|---|---|---|
age | 1338 | 39 | 14 | 18 | 27 | 51 | 64 |
gender | 1338 | | | | | | |
... female | 662 | 49% | | | | | |
... male | 676 | 51% | | | | | |
bmi | 1338 | 31 | 6.1 | 16 | 26 | 35 | 53 |
children | 1338 | 1.1 | 1.2 | 0 | 0 | 2 | 5 |
smoker | 1338 | | | | | | |
... no | 1064 | 80% | | | | | |
... yes | 274 | 20% | | | | | |
region | 1338 | | | | | | |
... northeast | 324 | 24% | | | | | |
... northwest | 324 | 24% | | | | | |
... southeast | 365 | 27% | | | | | |
... southwest | 325 | 24% | | | | | |
charges | 1338 | 13294 | 12121 | 1062 | 4747 | 16747 | 63770 |
# Group-wise summary (e.g., by smoker status)
sumtable(insurance_df, group = "smoker", group.test = TRUE)
Variable | N (smoker = no) | Mean (smoker = no) | SD (smoker = no) | N (smoker = yes) | Mean (smoker = yes) | SD (smoker = yes) | Test |
---|---|---|---|---|---|---|---|
age | 1064 | 39 | 14 | 274 | 39 | 14 | F=0.837 |
gender | 1064 | | | 274 | | | X2=7.393*** |
... female | 547 | 51% | | 115 | 42% | | |
... male | 517 | 49% | | 159 | 58% | | |
bmi | 1064 | 31 | 6 | 274 | 31 | 6.3 | F=0.019 |
children | 1064 | 1.1 | 1.2 | 274 | 1.1 | 1.2 | F=0.079 |
region | 1064 | | | 274 | | | X2=7.157* |
... northeast | 257 | 24% | | 67 | 24% | | |
... northwest | 266 | 25% | | 58 | 21% | | |
... southeast | 274 | 26% | | 91 | 33% | | |
... southwest | 267 | 25% | | 58 | 21% | | |
charges | 1064 | 8464 | 6046 | 274 | 32050 | 11542 | F=2153.091*** |

Statistical significance markers: * p<0.1; ** p<0.05; *** p<0.01
# Optional: the same grouped table with custom summary statistics
# sumtable(insurance_df, group = "smoker", group.test = TRUE, summ = c('notNA(x)', 'mean(x)', 'median(x)', 'propNA(x)'))
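With group.test = TRUE, sumtable() automatically performs statistical tests to compare the variables between the groups.
Numeric variables: an F-test for a difference in group means, which is why the Test column shows F statistics (with only two groups, this is equivalent to a pooled-variance two-sample t-test)
Factor variables: a Chi-squared test to compare proportions across categories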
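The test statistics in the grouped table can be checked by hand. The sketch below is an illustration under stated assumptions, not vtable's internal code: it assumes the factor test behaves like R's default chisq.test() (which applies Yates' continuity correction to 2x2 tables) and that the numeric test is an F-test on group means. The first part rebuilds the gender-by-smoker counts from the table above. The second part uses the built-in mtcars data, not the insurance data, to show that with two groups the F statistic equals the squared pooled-variance t statistic.

```r
# Part 1: gender-by-smoker counts copied from the grouped table above
gender_by_smoker <- matrix(
  c(547, 517,    # smoker = no:  female, male
    115, 159),   # smoker = yes: female, male
  nrow = 2,
  dimnames = list(gender = c("female", "male"), smoker = c("no", "yes"))
)

# chisq.test() applies Yates' continuity correction to 2x2 tables by default
chi_res  <- chisq.test(gender_by_smoker)
chi_stat <- unname(chi_res$statistic)
round(chi_stat, 3)   # 7.393, matching the X2 value reported for gender

# Part 2: with two groups, F from a one-way ANOVA equals t^2 from a
# pooled-variance t-test (shown on built-in mtcars, grouping mpg by am)
t_res  <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)
f_res  <- anova(lm(mpg ~ factor(am), data = mtcars))
t_stat <- unname(t_res$statistic)
f_stat <- f_res[["F value"]][1]

isTRUE(all.equal(t_stat^2, f_stat))   # TRUE
```

Because the two tests coincide for a two-level grouping variable such as smoker, an F statistic in the table carries the same information as a t-test with equal variances assumed.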
Benefits of using vtable
- Beautiful formatting out of the box (great for articles and reports).
- Handles mixed variable types (numeric, factors) automatically.
- Quick grouped summaries (such as balance tables).
- Supports export to Word/HTML, great for sharing.
- Minimal code, maximum readability.
Drawbacks:
- Limited customization for statistical test options (e.g., no non-parametric tests).
- For deeper EDA, you may still need dplyr, ggplot2, or summarytools.