Summary statistics made easy with the vtable package in R

Introduction

Anyone who has worked with SAS is probably familiar with PROC CONTENTS, a quick way to view a dataset’s structure and metadata. In R (and similarly in Python), the typical approach instead combines several functions such as head(), dim(), class(), and summary() to understand the data. The vtable package brings similar convenience to R: a single function, vtable(), inspects and summarizes a dataset, much like PROC CONTENTS in SAS.
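For comparison, here is a minimal sketch of the multi-function approach in base R. It uses the built-in iris dataset rather than the insurance data from this post, and the col_classes variable is just an illustrative name.

```r
# Inspect a dataset the multi-function way, using base R only.
data(iris)

dim(iris)                           # number of rows and columns
head(iris, 3)                       # first few rows
col_classes <- sapply(iris, class)  # class of each column
col_classes
summary(iris)                       # per-column summaries
```

Each call answers one question about the data; the rest of this post shows how vtable() gathers most of this into one table.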

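Let us begin by installing and loading the required packages. Besides vtable, the examples below use dplyr for the %>% pipe and mutate().

# install.packages(c("vtable", "dplyr"))
library(vtable)
library(dplyr)  # provides %>%, mutate(), and across()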

A glance at the dataset


Let’s look at how to use the vtable package.

1. Getting Variable Names, Types, and Value Summaries

Here, we first convert the character variables to factors so that vtable() can list their levels in the Values column.

insurance_df <- insurance_df %>% mutate(across(where(is.character), as.factor))
vtable(insurance_df)
insurance_df
Name      Class    Values
age       integer  Num: 18 to 64
gender    factor   'female' 'male'
bmi       numeric  Num: 15.96 to 53.13
children  integer  Num: 0 to 5
smoker    factor   'no' 'yes'
region    factor   'northeast' 'northwest' 'southeast' 'southwest'
charges   numeric  Num: 1062.385 to 63770.428
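If you prefer not to depend on dplyr for this step, the same conversion can be sketched in base R. The toy data frame below is made up for illustration (it is not the insurance data), and the names toy and is_chr are hypothetical.

```r
# Toy data frame with one character column (made-up values).
toy <- data.frame(
  smoker = c("yes", "no", "no"),
  age    = c(23L, 41L, 35L),
  stringsAsFactors = FALSE
)

# Base R counterpart of mutate(across(where(is.character), as.factor)):
# find the character columns, then convert only those to factors.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], as.factor)

sapply(toy, class)   # smoker is now a factor, age is still an integer
levels(toy$smoker)   # the levels a table of values can list
```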

2. Adding Descriptions to Understand the Variables Better

# One label per column, in the same order as the columns of insurance_df
my_labels <- c("Insured person's age", "Gender", "BMI", "Number of dependents",
               "Smoker status", "Residential region", "Insurance cost")

vtable(insurance_df,
       labels = my_labels,                # adds a Label column describing each variable
       data.title = "Insurance data",     # sets the table title
       summ = c("mean(x)", "countNA(x)")  # adds a Summary column with these statistics
       )
Insurance data
Name      Class    Label                 Values                                           Summary
age       integer  Insured person's age  Num: 18 to 64                                    mean: 39.207, countNA: 0
gender    factor   Gender                'female' 'male'                                  countNA: 0
bmi       numeric  BMI                   Num: 15.96 to 53.13                              mean: 30.663, countNA: 0
children  integer  Number of dependents  Num: 0 to 5                                      mean: 1.095, countNA: 0
smoker    factor   Smoker status         'no' 'yes'                                       countNA: 0
region    factor   Residential region    'northeast' 'northwest' 'southeast' 'southwest'  countNA: 0
charges   numeric  Insurance cost        Num: 1062.385 to 63770.428                       mean: 13293.678, countNA: 0
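A plain label vector ties each label to a column position, so reordering the columns silently mislabels them. As I read the vtable documentation, the labels argument also accepts a two-column data frame (variable name, then label) that is matched by name. The sketch below assumes that behavior; toy_df and label_df are made-up names and values, and out = "return" is used so the table comes back as a data frame.

```r
library(vtable)

# Made-up data frame standing in for a real dataset.
toy_df <- data.frame(
  age    = c(23L, 41L, 35L),
  smoker = factor(c("yes", "no", "no")),
  bmi    = c(27.9, 33.8, 22.7)
)

# Labels keyed by variable name, deliberately in a different order
# from the columns of toy_df.
label_df <- data.frame(
  variable = c("bmi", "age", "smoker"),
  label    = c("Body mass index", "Age in years", "Smoker status")
)

# Assumption: a two-column labels table is matched on the variable name.
labelled <- vtable(toy_df, labels = label_df, out = "return")
labelled
```

If the labels do not line up as expected in your version of the package, fall back to the ordered vector shown earlier.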

3. Creating Balance Tables in R with sumtable()

Balance tables are used to check whether groups (e.g., smokers vs. non-smokers) are comparable in terms of key covariates like age, BMI, and number of children. This is especially important before modeling or in observational studies where group equivalence is critical.

# Basic variable summary table
sumtable(insurance_df)
Summary Statistics
Variable       N     Mean   Std. Dev.  Min   Pctl. 25  Pctl. 75  Max
age            1338  39     14         18    27        51        64
gender         1338
... female     662   49%
... male       676   51%
bmi            1338  31     6.1        16    26        35        53
children       1338  1.1    1.2        0     0         2         5
smoker         1338
... no         1064  80%
... yes        274   20%
region         1338
... northeast  324   24%
... northwest  324   24%
... southeast  365   27%
... southwest  325   24%
charges        1338  13294  12121      1062  4747      16747     63770
# Group-wise summary (e.g., by smoker status), with a test of differences between groups
sumtable(insurance_df, group = "smoker", group.test = TRUE)
Summary Statistics
               smoker: no         smoker: yes
Variable       N     Mean  SD     N    Mean   SD     Test
age            1064  39    14     274  39     14     F=0.837
gender         1064               274                X2=7.393***
... female     547   51%          115  42%
... male       517   49%          159  58%
bmi            1064  31    6      274  31     6.3    F=0.019
children       1064  1.1   1.2    274  1.1    1.2    F=0.079
region         1064               274                X2=7.157*
... northeast  257   24%          67   24%
... northwest  266   25%          58   21%
... southeast  274   26%          91   33%
... southwest  267   25%          58   21%
charges        1064  8464  6046   274  32050  11542  F=2153.091***
Statistical significance markers: * p<0.1; ** p<0.05; *** p<0.01
# To choose your own statistics, pass them through summ:
# sumtable(insurance_df, group = "smoker", group.test = TRUE, summ = c('notNA(x)', 'mean(x)', 'median(x)', 'propNA(x)'))

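With group.test = TRUE, sumtable() automatically runs statistical tests of whether each variable differs between the groups.

  • Numeric variables: an F-test comparing group means (shown as F= in the Test column); with only two groups, this is equivalent to a pooled-variance two-sample t-test

  • Factor variables: a chi-squared test of independence comparing proportions across categories (shown as X2=)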
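The Test column above reports F statistics for the numeric variables. With two groups, an F-test on the group means and a pooled-variance t-test are the same test, since F = t^2. The sketch below illustrates that relationship, and the chi-squared test for factors, with base R on the built-in mtcars data. It is a stand-in, not vtable's internal code, and the names cars, f_stat, t_stat, and chi are illustrative.

```r
# Built-in data as a stand-in: compare mpg and cyl across transmission types.
cars <- transform(mtcars, am = factor(am), cyl = factor(cyl))

# Numeric variable: F-test from a one-way ANOVA on the group means.
f_stat <- anova(lm(mpg ~ am, data = cars))[["F value"]][1]

# With two groups, the pooled-variance t-test gives the same answer: F = t^2.
t_stat <- unname(t.test(mpg ~ am, data = cars, var.equal = TRUE)$statistic)
all.equal(f_stat, t_stat^2)

# Factor variable: chi-squared test of independence on the cross-tabulation.
# suppressWarnings() hides the small-expected-count warning for this tiny dataset.
chi <- suppressWarnings(chisq.test(table(cars$cyl, cars$am)))
chi$statistic
chi$p.value
```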

Benefits of Using vtable

  • Beautiful formatting out-of-the-box (great for articles, reports).

  • Handles mixed variable types (numeric, factors) automatically.

  • Quick grouped summaries (like balance tables).

  • Supports export to HTML, LaTeX, and CSV, and can return a kable for R Markdown reports (which can be knitted to Word), great for sharing.

  • Minimal code, maximum readability.
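To make the sharing point concrete, here is one possible sketch. It assumes sumtable() is called with out = "return", which, per the package documentation, returns the table as a data frame instead of displaying it. The iris data and the temporary file path are stand-ins for your own data and destination.

```r
library(vtable)

# Return the summary table as a data frame instead of showing it in the viewer.
summary_df <- sumtable(iris, out = "return")

# Write it to a CSV file that can be opened in a spreadsheet or shared.
out_file <- tempfile(fileext = ".csv")  # stand-in path; use your own file name
write.csv(summary_df, out_file, row.names = FALSE)

file.exists(out_file)
```

Going through a data frame keeps the export step in ordinary base R, so the same table can also be passed to whatever reporting tool you already use.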

Drawbacks

  • Limited customization of the statistical tests (e.g., no non-parametric tests).

  • For deeper EDA, you may still need dplyr, ggplot2, or summarytools.
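When a non-parametric comparison is needed, one workaround is to run it yourself alongside the table. The sketch below uses base R's kruskal.test() on the built-in mtcars data as a stand-in; the names cars and kw are illustrative, and nothing here is part of vtable.

```r
# Built-in data as a stand-in: does mpg differ between transmission types?
cars <- transform(mtcars, am = factor(am))

# Kruskal-Wallis rank-sum test: a non-parametric alternative to the F-test
# that also works with more than two groups.
kw <- kruskal.test(mpg ~ am, data = cars)
kw$statistic
kw$p.value
```

The result can then be reported next to the sumtable() output for the same variable.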