install.packages("h2o")
library("h2o")
# Download dependency packages:
pkgs <- c("methods", "statmod", "stats", "graphics", "RCurl", "jsonlite", "tools", "utils")
for (pkg in pkgs) {
  if (!(pkg %in% rownames(installed.packages()))) install.packages(pkg)
}
# Initialize H2O on local machine
h2o.init()
# Check version (3.1.0 is incompatible, so check the version then proceed)
versions::installed.versions("h2o") # Requires the "versions" package
Exploring AutoML with the “h2o” package
Introduction
The h2o.automl function in R automates model training and tuning: it preprocesses the data, trains multiple models, and ranks them so that the best one can be selected. It requires minimal code and little machine learning knowledge. h2o.automl supports both classification and regression problems.
Typically used algorithms:
Generalized Linear Models (GLM)
Gradient Boosting Machines (GBM, including XGBoost)
Distributed Random Forest (DRF)
Deep Neural Networks (Deep Learning)
Stacked Ensembles
Note: The “h2o” package lets you drive the H2O machine learning platform from R. The actual work is done on the H2O server, so the data itself is not stored in R. R sends requests through a REST API, and the server returns JSON responses whose contents are then displayed in R.
Let us begin by installing and loading the required package.
Prerequisite: download and install the latest Java SE JDK.
H2O is not running yet, starting it now...
Note: In case of errors look at the following log files:
C:\Users\hp\AppData\Local\Temp\RtmpYp20R4\file6fcb18192e/h2o_hp_started_from_r.out
C:\Users\hp\AppData\Local\Temp\RtmpYp20R4\file6fce0a548d/h2o_hp_started_from_r.err
Starting H2O JVM and connecting: Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 4 seconds 807 milliseconds
H2O cluster timezone: Asia/Kolkata
H2O data parsing timezone: UTC
H2O cluster version: 3.44.0.3
H2O cluster version age: 9 months and 12 days
H2O cluster name: H2O_started_from_R_hp_zvh539
H2O cluster total nodes: 1
H2O cluster total memory: 1.96 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
R Version: R version 4.4.1 (2024-06-14 ucrt)
[1] "3.44.0.3"
A glance at the dataset
Here, we will be using Credit data, used to detect bad loans
Understanding the dataset
df <- as.data.frame(data)
dim(df)
[1] 6990 25
table(df$bad) # 0: Good loan, 1: bad loan
0 1
6065 925
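The counts above point to a clear class imbalance. A minimal base-R sketch of the arithmetic, using the counts copied from the table output (the variable names here are illustrative and not part of the original code):

```r
# Class counts copied from the table(df$bad) output above
counts <- c(good = 6065, bad = 925)

# Share of each class, in percent
share <- round(100 * counts / sum(counts), 1)
print(share)

# Accuracy of a trivial model that labels every loan as "good"
baseline_accuracy <- counts[["good"]] / sum(counts)
print(round(baseline_accuracy, 3))
```

Since a model that always predicts "good loan" already scores a high accuracy on this data, metrics such as AUC, AUCPR, and the per-class error are more informative than accuracy when comparing models.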
Converting the data types
cols_to_factor <- c("bad","preloan","veh","house", "selfemp","account","deposit", "branch", "ref", "age", "gender", "ms", "child", "zone", "emp_catg", "address_catg", "debtinc_catg", "creddebt_catg", "othdebt_catg")
data[cols_to_factor] <- as.factor(data[cols_to_factor])
# h2o.str(data) # View the structure of the dataset
Defining the model arguments
# Response column
y <- "bad"
# Predictor column names
x <- c("debtinc","creddebt","othdebt","preloan","veh","house", "selfemp","account","deposit","emp", "address", "branch", "ref", "age", "gender", "ms", "child", "zone", "emp_catg", "address_catg", "debtinc_catg", "creddebt_catg", "othdebt_catg" )
Running the model
Case 1: Run AutoML for 10 minutes (600 seconds)
- max_runtime_secs: the maximum time that the AutoML process will run for.
- The seed parameter may not ensure reproducibility when using max_runtime_secs, because the resources available during each run might differ, leading to variations in the results.
aml <- h2o.automl(x = x,                  # If missing, all columns except y are used
                  y = y,                  # Classification if y is a factor; otherwise regression
                  training_frame = data,  # Specifies the training set
                  max_runtime_secs = 600, # Defaults to 3600 seconds when neither this nor max_models is set
                  exclude_algos = "DeepLearning" # Options: "DRF", "GLM", "XGBoost", "GBM", "DeepLearning", "StackedEnsemble". Defaults to NULL (uses all algos).
)
09:46:45.695: AutoML: XGBoost is not available; skipping it.
# View the leaderboard
aml@leaderboard
model_id auc logloss
1 StackedEnsemble_BestOfFamily_6_AutoML_1_20241003_94645 0.7613174 0.3382096
2 StackedEnsemble_BestOfFamily_7_AutoML_1_20241003_94645 0.7610110 0.3383858
3 StackedEnsemble_BestOfFamily_4_AutoML_1_20241003_94645 0.7606597 0.3382553
4 StackedEnsemble_AllModels_1_AutoML_1_20241003_94645 0.7600385 0.3383988
5 StackedEnsemble_AllModels_5_AutoML_1_20241003_94645 0.7599243 0.3385273
6 StackedEnsemble_AllModels_2_AutoML_1_20241003_94645 0.7596834 0.3386844
aucpr mean_per_class_error rmse mse
1 0.3010454 0.3213903 0.3212951 0.1032306
2 0.2998955 0.3103933 0.3213547 0.1032688
3 0.3008490 0.3180442 0.3214169 0.1033088
4 0.3016766 0.3229745 0.3213194 0.1032462
5 0.3010226 0.3288211 0.3213613 0.1032731
6 0.3018481 0.3136263 0.3213904 0.1032918
[109 rows x 7 columns]
Checking which algorithms have been used
full_lb <- h2o.get_leaderboard(aml, extra_columns = "ALL")
algos <- as.data.frame(full_lb$algo)
print(unique(algos))
algo
1 StackedEnsemble
13 GLM
14 GBM
76 DRF
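Case 2: Run AutoML for 10 base models
- max_models = n, where n is the number of base models to build in the AutoML process (stacked ensembles are not counted). Defaults to NULL.
- seed = 1234. With a fixed seed, AutoML is reproducible when the run is limited by max_models or early stopping rather than by run time. H2O Deep Learning models are not reproducible by default, so exclude "DeepLearning" if exact reproducibility is required.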
# ---------------------------------------------------------------
# Splitting data into train and validation sets
splits <- h2o.splitFrame(data = data, ratios = 0.8, seed = 1234)
train_data <- splits[[1]] # 80% training data
valid_data <- splits[[2]] # 20% validation data
aml2 <- h2o.automl(x = x,
y = y,
training_frame = train_data, # Use the training frame
validation_frame = valid_data, # Specify the validation frame
max_models = 10,
balance_classes = FALSE, # Specify whether to oversample minority classes; Defaults to FALSE.
seed = 1234
)
10:00:55.586: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
10:00:55.592: AutoML: XGBoost is not available; skipping it.
# Viewing the best model:
best_model <- aml2@leader
best_model
Model Details:
==============
H2OBinomialModel: stackedensemble
Model ID: StackedEnsemble_AllModels_1_AutoML_2_20241003_100055
Model Summary for Stacked Ensemble:
key value
1 Stacking strategy cross_validation
2 Number of base models (used / total) 7/10
3 # GBM base models (used / total) 4/6
4 # DRF base models (used / total) 2/2
5 # GLM base models (used / total) 1/1
6 # DeepLearning base models (used / total) 0/1
7 Metalearner algorithm GLM
8 Metalearner fold assignment scheme Random
9 Metalearner nfolds 5
10 Metalearner fold_column NA
11 Custom metalearner hyperparameters None
H2OBinomialMetrics: stackedensemble
** Reported on training data. **
MSE: 0.04710521
RMSE: 0.2170374
LogLoss: 0.17241
Mean Per-Class Error: 0.07122993
AUC: 0.9915184
AUCPR: 0.954627
Gini: 0.9830367
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 4842 65 0.013246 =65/4907
1 92 620 0.129213 =92/712
Totals 4934 685 0.027941 =157/5619
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.323439 0.887616 155
2 max f2 0.267866 0.906013 180
3 max f0point5 0.376246 0.910904 133
4 max accuracy 0.325849 0.972059 154
5 max precision 0.850609 1.000000 0
6 max recall 0.108709 1.000000 283
7 max specificity 0.850609 1.000000 0
8 max absolute_mcc 0.323439 0.871882 155
9 max min_per_class_accuracy 0.246020 0.950843 190
10 max mean_per_class_accuracy 0.261254 0.952448 183
11 max tns 0.850609 4907.000000 0
12 max fns 0.850609 711.000000 0
13 max fps 0.002354 4907.000000 399
14 max tps 0.108709 712.000000 283
15 max tnr 0.850609 1.000000 0
16 max fnr 0.850609 0.998596 0
17 max fpr 0.002354 1.000000 399
18 max tpr 0.108709 1.000000 283
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
H2OBinomialMetrics: stackedensemble
** Reported on validation data. **
MSE: 0.1085061
RMSE: 0.3294026
LogLoss: 0.345963
Mean Per-Class Error: 0.2683476
AUC: 0.8129465
AUCPR: 0.4306114
Gini: 0.625893
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 966 192 0.165803 =192/1158
1 79 134 0.370892 =79/213
Totals 1045 326 0.197666 =271/1371
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.215229 0.497217 167
2 max f2 0.106424 0.615281 262
3 max f0point5 0.269611 0.458372 125
4 max accuracy 0.424735 0.854121 45
5 max precision 0.710278 1.000000 0
6 max recall 0.007884 1.000000 387
7 max specificity 0.710278 1.000000 0
8 max absolute_mcc 0.215229 0.394225 167
9 max min_per_class_accuracy 0.152360 0.727700 218
10 max mean_per_class_accuracy 0.142857 0.738847 225
11 max tns 0.710278 1158.000000 0
12 max fns 0.710278 212.000000 0
13 max fps 0.002443 1158.000000 399
14 max tps 0.007884 213.000000 387
15 max tnr 0.710278 1.000000 0
16 max fnr 0.710278 0.995305 0
17 max fpr 0.002443 1.000000 399
18 max tpr 0.007884 1.000000 387
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
H2OBinomialMetrics: stackedensemble
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 0.09546108
RMSE: 0.3089678
LogLoss: 0.3107406
Mean Per-Class Error: 0.2829704
AUC: 0.8041598
AUCPR: 0.3413911
Gini: 0.6083195
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 3846 1061 0.216222 =1061/4907
1 249 463 0.349719 =249/712
Totals 4095 1524 0.233138 =1310/5619
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.176402 0.414132 215
2 max f2 0.123801 0.572752 261
3 max f0point5 0.273286 0.384719 147
4 max accuracy 0.577190 0.874533 23
5 max precision 0.721615 0.666667 1
6 max recall 0.005837 1.000000 391
7 max specificity 0.786416 0.999796 0
8 max absolute_mcc 0.123801 0.327319 261
9 max min_per_class_accuracy 0.148034 0.726106 239
10 max mean_per_class_accuracy 0.123801 0.740592 261
11 max tns 0.786416 4906.000000 0
12 max fns 0.786416 711.000000 0
13 max fps 0.001792 4907.000000 399
14 max tps 0.005837 712.000000 391
15 max tnr 0.786416 0.999796 0
16 max fnr 0.786416 0.998596 0
17 max fpr 0.001792 1.000000 399
18 max tpr 0.005837 1.000000 391
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
accuracy 0.773577 0.058581 0.751489 0.848718 0.777285 0.799823
auc 0.804522 0.028432 0.803688 0.835867 0.788468 0.828019
err 0.226423 0.058581 0.248511 0.151282 0.222715 0.200177
err_count 252.200000 54.476600 292.000000 177.000000 251.000000 226.000000
f0point5 0.356544 0.049552 0.353503 0.395809 0.319321 0.415713
cv_5_valid
accuracy 0.690570
auc 0.766569
err 0.309430
err_count 315.000000
f0point5 0.298372
---
mean sd cv_1_valid cv_2_valid cv_3_valid
precision 0.322305 0.052393 0.315341 0.373626 0.286232
r2 0.137534 0.026650 0.147860 0.151705 0.121886
recall 0.651441 0.094227 0.685185 0.519084 0.593985
residual_deviance 696.627200 45.698280 766.717040 644.703800 687.270750
rmse 0.308583 0.012018 0.318258 0.290423 0.302323
specificity 0.790065 0.076651 0.762093 0.890279 0.801811
cv_4_valid cv_5_valid
precision 0.377163 0.259162
r2 0.166646 0.099573
recall 0.703226 0.755725
residual_deviance 709.805660 674.638730
rmse 0.314171 0.317741
specificity 0.815195 0.680947
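Two relationships in the printed metrics are worth checking by hand: the reported Gini values follow from the AUC as Gini = 2 * AUC - 1, and the training AUC is far higher than the validation and cross-validation AUC. A small base-R sketch, with the AUC values copied from the output above (variable names are illustrative):

```r
# AUC values copied from the leader's printed metrics above
auc <- c(training = 0.9915184, validation = 0.8129465, cross_validation = 0.8041598)

# Gini coefficient for a binary classifier: Gini = 2 * AUC - 1
gini <- 2 * auc - 1
print(round(gini, 6))

# Gap between the training AUC and the two held-out estimates
auc_gap <- auc[["training"]] - auc[c("validation", "cross_validation")]
print(round(auc_gap, 4))
```

The large gap suggests that the training metrics are optimistic, so the cross-validation and validation figures are the ones to rely on when judging the model.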
# Viewing the complete leaderboard (all rows and columns)
full_lb2 <- h2o.get_leaderboard(aml2, extra_columns = "ALL")
print(head(full_lb2, n = 100))
model_id auc logloss
1 StackedEnsemble_AllModels_1_AutoML_2_20241003_100055 0.8041598 0.3107406
2 StackedEnsemble_BestOfFamily_1_AutoML_2_20241003_100055 0.8035584 0.3107232
3 GBM_2_AutoML_2_20241003_100055 0.7969820 0.3181950
4 GBM_3_AutoML_2_20241003_100055 0.7956907 0.3193538
5 GBM_1_AutoML_2_20241003_100055 0.7906475 0.3184349
6 GBM_5_AutoML_2_20241003_100055 0.7899660 0.3209791
7 XRT_1_AutoML_2_20241003_100055 0.7838551 0.3236697
8 DRF_1_AutoML_2_20241003_100055 0.7817283 0.3471622
9 GBM_grid_1_AutoML_2_20241003_100055_model_1 0.7783961 0.3339926
10 GBM_4_AutoML_2_20241003_100055 0.7712734 0.3375627
11 GLM_1_AutoML_2_20241003_100055 0.7695979 0.3257758
12 DeepLearning_1_AutoML_2_20241003_100055 0.7236684 0.3551143
aucpr mean_per_class_error rmse mse training_time_ms
1 0.3413911 0.2829704 0.3089678 0.09546108 2464
2 0.3420521 0.2980535 0.3089722 0.09546383 2092
3 0.3295733 0.3009864 0.3125086 0.09766160 1554
4 0.3295498 0.2958531 0.3128368 0.09786684 429
5 0.3226722 0.2963379 0.3121005 0.09740675 461
6 0.3291834 0.2892274 0.3130270 0.09798590 264
7 0.3081259 0.2924907 0.3133731 0.09820268 903
8 0.3170896 0.3149846 0.3126784 0.09776778 1380
9 0.2962752 0.2901390 0.3186610 0.10154485 285
10 0.2842441 0.3174026 0.3205027 0.10272196 488
11 0.2993501 0.3255569 0.3156341 0.09962492 215
12 0.2400872 0.3339925 0.3255017 0.10595138 603
predict_time_per_row_ms algo
1 0.026530 StackedEnsemble
2 0.016157 StackedEnsemble
3 0.004990 GBM
4 0.004476 GBM
5 0.006429 GBM
6 0.004617 GBM
7 0.005900 DRF
8 0.006899 DRF
9 0.005292 GBM
10 0.004871 GBM
11 0.001973 GLM
12 0.002916 DeepLearning
The output has 12 rows: the 10 base models plus two stacked ensembles (Best of Family and All Models).
“Stacked Ensemble Best of Family”: This ensemble is built using only the best-performing model from each algorithm family (e.g., the best GLM, the best GBM, etc.).
“Stacked Ensemble All Models”: This ensemble uses predictions from all base models, regardless of their performance.
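The split between base models and ensembles can be verified by counting the leaderboard rows per algorithm. The sketch below is plain base R and uses the "algo" column copied from the leaderboard output above; the vector name is illustrative.

```r
# Algorithm labels copied from the "algo" column of the leaderboard above
algo <- c("StackedEnsemble", "StackedEnsemble", "GBM", "GBM", "GBM", "GBM",
          "DRF", "DRF", "GBM", "GBM", "GLM", "DeepLearning")

# Number of leaderboard rows per algorithm
algo_counts <- table(algo)
print(algo_counts)

# Base models are all rows that are not stacked ensembles
n_base <- sum(algo != "StackedEnsemble")
print(n_base)
```

The per-algorithm counts agree with the "total" figures in the stacked ensemble's model summary (6 GBM, 2 DRF, 1 GLM, 1 DeepLearning).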
Let’s look at some operations to perform on Test data
We will use the “valid_data” created earlier as test data on the best_model.
# Store model performance
performance <- h2o.performance(best_model, newdata = valid_data)
# ROC curve
# Extract the FPR (False Positive Rate) and TPR (True Positive Rate) from the model performance
fpr <- h2o.performance(best_model, valid = TRUE)@metrics$thresholds_and_metric_scores$fpr
tpr <- h2o.performance(best_model, valid = TRUE)@metrics$thresholds_and_metric_scores$tpr
# Get the AUC value
auc_value <- h2o.auc(performance)
# Plotting the ROC Curve using base R
plot(fpr, tpr, type = "l", col = "blue", lwd = 2, xlab = "False Positive Rate", ylab = "True Positive Rate", main = "ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red") # Add a diagonal line for reference
# Display the AUC value on the plot
text(x = 0.6, y = 0.2, labels = paste("AUC =", round(auc_value, 3)), col = "black", cex = 1.2, font = 2)
# Confusion Matrix
h2o.confusionMatrix(performance)
Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.21522863095099:
0 1 Error Rate
0 966 192 0.165803 =192/1158
1 79 134 0.370892 =79/213
Totals 1045 326 0.197666 =271/1371
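The headline metrics can be recomputed directly from the four cells of this confusion matrix. A minimal base-R sketch, with the counts copied from the output above (variable names are illustrative):

```r
# Cell counts copied from the validation confusion matrix above
tn <- 966  # actual 0, predicted 0
fp <- 192  # actual 0, predicted 1
fn <- 79   # actual 1, predicted 0
tp <- 134  # actual 1, predicted 1

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- (tp + tn) / (tp + tn + fp + fn)

print(round(c(precision = precision, recall = recall, f1 = f1, accuracy = accuracy), 6))
```

The F1 value obtained this way agrees with the "max f1" entry reported for the validation data, which is expected because the matrix is shown at the F1-optimal threshold.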
# Variable Importance:
h2o.varimp(aml2@leader) # No output in our case: the leader is a stacked ensemble, which has no variable importance of its own
# Model Predictions:
predictions <- h2o.predict(aml2@leader, newdata = valid_data)
# Convert H2OFrame to R data frame for manipulation
predictions_df <- as.data.frame(predictions)
# Get the optimal threshold based on the Accuracy
optimal_threshold <- h2o.find_threshold_by_max_metric(performance, "accuracy") # Other options: precision, f1, recall, etc.
print(optimal_threshold)
[1] 0.4247352
# Apply the optimal cutoff to create new binary predictions
predictions_df$custom_prediction <- ifelse(predictions_df$p1 >= optimal_threshold, 1, 0)
# View predictions
head(predictions_df)
predict p0 p1 custom_prediction
1 0 0.9946160 0.005383970 0
2 1 0.6541355 0.345864512 0
3 0 0.9831648 0.016835232 0
4 0 0.9907014 0.009298568 0
5 0 0.9589116 0.041088399 0
6 0 0.9678238 0.032176184 0
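Row 2 shows why the two label columns can disagree: H2O's own "predict" column is based on an F1-optimal threshold, which is lower than the accuracy-optimal cutoff applied here. The base-R sketch below reproduces the custom labels from the values printed above, so it runs without an H2O cluster; the vectors are copied from the output and their names are illustrative.

```r
# p1 values and H2O labels copied from head(predictions_df) above
p1 <- c(0.005383970, 0.345864512, 0.016835232,
        0.009298568, 0.041088399, 0.032176184)
h2o_label <- c(0, 1, 0, 0, 0, 0)

# Accuracy-optimal threshold printed earlier
optimal_threshold <- 0.4247352

# Label an observation as 1 (bad loan) only if p1 reaches the threshold
custom_prediction <- ifelse(p1 >= optimal_threshold, 1, 0)

# Rows where the custom label differs from H2O's label
differs <- which(custom_prediction != h2o_label)
print(differs)
```

A higher cutoff flags fewer loans as bad, which trades recall for precision; the right choice depends on the relative cost of missing a bad loan versus rejecting a good one.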
Check the class distribution in the H2O training frame
# Original data
table(as.data.frame(data)[[y]])
0 1
6065 925
# Training frame used by the leader (the 80% split; not rebalanced because balance_classes = FALSE)
leader_train_data <- h2o.getFrame(aml2@leader@parameters$training_frame)
table(as.data.frame(leader_train_data)[[y]])
0 1
4907 712
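Note: With balance_classes = FALSE the class ratio is unchanged (about 13% bad loans in both frames); the counts differ only because of the 80/20 split. On a small data set, balancing might not be possible or effective.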
View base models included in any particular stacked ensemble
se_model <- h2o.getModel(best_model@model_id)
print(se_model@model$base_models)
[1] "GBM_2_AutoML_2_20241003_100055"
[2] "GBM_3_AutoML_2_20241003_100055"
[3] "GBM_1_AutoML_2_20241003_100055"
[4] "GBM_5_AutoML_2_20241003_100055"
[5] "XRT_1_AutoML_2_20241003_100055"
[6] "DRF_1_AutoML_2_20241003_100055"
[7] "GBM_grid_1_AutoML_2_20241003_100055_model_1"
[8] "GBM_4_AutoML_2_20241003_100055"
[9] "GLM_1_AutoML_2_20241003_100055"
[10] "DeepLearning_1_AutoML_2_20241003_100055"
# View details of a base model:
# full_lb2
# se_model <- h2o.getModel(full_lb2[1, "model_id"]) # Replace 1 with the row index for stacked ensemble for which you want to view the base models
# base_model <- h2o.getModel(se_model@model$base_models[[5]]) # Access the fifth base model
# print(base_model)
Save model object to disk
# h2o.saveModel(
# object = base_model,
# path = "D:/My Documents/AutoML_R_h2o/",
# force = T,
# export_cross_validation_predictions = FALSE,
# filename = "RandomForest_basemodel"
# )
Load model object from disk
loadtest <- h2o.loadModel("D:/My Documents/AutoML_R_h2o/RandomForest_basemodel")
Benefits of using AutoML:
Automates most of the steps involved in machine learning, reducing time and effort.
Requires less manual intervention, minimal coding, and little machine learning knowledge.
Can improve predictive performance by comparing many candidate models.
Selects the best model based on a performance metric.
Creates stacked ensemble models that combine multiple models’ predictions to improve overall accuracy and robustness.
Supports parallel processing and larger datasets.