Automated Machine Learning with Auto-sklearn

Introduction

Auto-sklearn is an open-source Python library that automates model selection and hyperparameter tuning using techniques like Bayesian optimization and ensemble construction. It’s a powerful tool for both beginners and experts looking to streamline their AutoML workflows. This blog explores its key features through four configurations:
1. The default Auto-sklearn model for a baseline.
2. Time-based constraints to optimize model training.
3. Custom classifier selection, including Logistic Regression and Random Forest.
4. Changing the evaluation metric to ROC AUC.

Importing Libraries

import pandas as pd
import autosklearn
import autosklearn.classification
import autosklearn.metrics  # needed later for metric=autosklearn.metrics.roc_auc
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# ConfigSpace and the RandomForest component are only needed if you later
# restrict the Random Forest hyperparameter search space manually.
from ConfigSpace.configuration_space import ConfigurationSpace
from ConfigSpace.hyperparameters import UniformIntegerHyperparameter, UniformFloatHyperparameter
from autosklearn.pipeline.components.classification.random_forest import RandomForest

Data Description

The credit score datasets contain information about various financial metrics and demographic attributes of customers.

Importing data

import pandas as pd
customer_info=pd.read_csv('customer info.csv')
customer_details=pd.read_csv('customer details.csv')
customer_profiles=pd.read_csv('customer profiles.csv')

Merging data

grouped_customer_details = customer_details.groupby('custid').sum().reset_index()

customer_data = grouped_customer_details.merge(customer_info, on='custid', how='inner')\
                           .merge(customer_profiles, on='custid', how='inner')
customer_data.head()
custid debtinc creddebt othdebt preloan veh house selfemp account deposit emp address branch ref age gender ms child zone bad
0 1 12.34 13.26 5.88 2.0 2 1 2 1 2 18 12 1 2 2 1 2 2 7 1
1 2 18.65 2.12 5.13 1.0 2 1 1 2 1 11 7 1 2 2 2 1 1 5 0
2 3 7.22 3.31 3.65 1.0 1 2 1 1 1 16 15 1 1 1 1 2 1 15 0
3 4 6.15 2.95 2.34 1.0 1 2 1 1 2 15 14 2 1 2 1 1 1 3 0
4 5 20.64 2.67 4.07 2.0 2 1 1 1 1 2 1 2 2 2 2 1 1 20 0

Note - The dataset contains 5 numerical variables and 13 categorical variables, excluding custid.
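Before modeling, it can help to confirm which columns are the coded categorical attributes. The snippet below is a minimal sketch (the threshold of 10 distinct values is an assumption, not from the original analysis); casting the low-cardinality columns to pandas' category dtype should also let Auto-sklearn treat them as categorical, since it infers feature types from pandas dtypes when a DataFrame is passed.

# Sketch: columns with few distinct values are the coded categorical
# attributes; the rest are the continuous financial metrics.
feature_cols = [c for c in customer_data.columns if c not in ('custid', 'bad')]
print(customer_data[feature_cols].nunique().sort_values())

# Assumed threshold of 10 distinct values for treating a column as categorical.
categorical_cols = [c for c in feature_cols if customer_data[c].nunique() <= 10]
customer_data[categorical_cols] = customer_data[categorical_cols].astype('category')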

Preparing Features and Target Variable for Model Training

# Exclude 'custid' and 'bad' from the features; keep 'bad' as the target
X = customer_data.drop(columns=['custid', 'bad'])  # Features
y = customer_data['bad']  # Target variable

# Perform train-test split (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

1. Default configuration - all models evaluated, the ensemble shortlist, and final performance

1a) Initializing Auto-Sklearn Classifier with default parameters

The default AutoSklearnClassifier runs for one hour (time_left_for_this_task defaults to 3600 seconds), automatically selecting models and tuning their hyperparameters on the data. It serves as a baseline for model comparison.

automl_df = autosklearn.classification.AutoSklearnClassifier()

1b) Fitting the Auto-Sklearn Classifier to Training Data

automl_df.fit(X_train,y_train)

1c) Displaying Auto-Sklearn Training Statistics

automl_df.sprint_statistics() displays a summary of the performance metrics for the default Auto-sklearn classifier. It provides insights into the number of models evaluated and the performance of the best model, highlighting the efficiency of the automated machine learning process.

print(automl_df.sprint_statistics())
auto-sklearn results:
  Dataset name: 8e4b7b89-773c-11ef-8a93-0242ac1c000c
  Metric: accuracy
  Best validation score: 0.871827
  Number of target algorithm runs: 158
  Number of successful target algorithm runs: 131
  Number of crashed target algorithm runs: 21
  Number of target algorithms that exceeded the time limit: 6
  Number of target algorithms that exceeded the memory limit: 0

1d) Viewing the Auto-Sklearn Leaderboard (Displays All Models)

automl_df.leaderboard(ensemble_only=False) displays a leaderboard of all evaluated models. It presents key metrics such as rank, ensemble weight, model type, cost, and duration, making it easy to compare the performance of the different models.

automl_df.leaderboard(ensemble_only=False)
rank ensemble_weight type cost duration
model_id
5 1 0.30 random_forest 0.128173 26.505598
4 2 0.04 random_forest 0.128793 131.006729
142 3 0.04 adaboost 0.129412 4.034574
13 6 0.12 extra_trees 0.130031 12.743632
15 7 0.10 gradient_boosting 0.130031 3.650947
... ... ... ... ... ...
10 182 0.00 sgd 0.529412 1.196238
140 183 0.00 passive_aggressive 0.547988 1.416111
177 184 0.00 sgd 0.591950 3.825298
101 185 0.00 qda 0.598762 12.537381
36 186 0.00 decision_tree 0.823529 1.128428

186 rows × 5 columns

1e) Viewing the Auto-Sklearn Leaderboard (Shortlisted Models)

automl_df.leaderboard(ensemble_only=True) restricts the leaderboard to the models that were selected for the final ensemble.

automl_df.leaderboard(ensemble_only=True)
rank ensemble_weight type cost duration
model_id
5 1 0.30 random_forest 0.128173 26.505598
4 2 0.04 random_forest 0.128793 131.006729
142 3 0.04 adaboost 0.129412 4.034574
13 4 0.12 extra_trees 0.130031 12.743632
15 5 0.10 gradient_boosting 0.130031 3.650947
35 6 0.20 random_forest 0.130031 15.313135
71 7 0.10 random_forest 0.130031 9.515407
3 8 0.02 random_forest 0.130650 70.507130
54 9 0.02 adaboost 0.130650 8.681275
70 10 0.02 k_nearest_neighbors 0.130650 2.017158
90 11 0.02 k_nearest_neighbors 0.130650 1.736899
2 12 0.02 random_forest 0.131269 5.964926
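To look at the shortlisted pipelines in more detail than the leaderboard shows, the fitted models themselves can be inspected. Depending on the Auto-sklearn version, show_models() returns either a formatted string or a dictionary keyed by model_id, so printing it is the safest option.

# Inspect the pipelines that made it into the final ensemble.
print(automl_df.show_models())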

1f) Performance metrics of different models

The cv_results_ attribute in Auto-sklearn provides details on every evaluated configuration, including metrics like mean test scores and fit times, as well as the preprocessing choices (feature preprocessing, data cleaning, scaling, and balancing), showing how these steps affect model performance.

df_df = pd.DataFrame(automl_df.cv_results_)
df_df
mean_test_score rank_test_scores mean_fit_time params status budgets param_balancing:strategy param_classifier:__choice__ param_data_preprocessor:__choice__ param_feature_preprocessor:__choice__ ... param_data_preprocessor:feature_type:numerical_transformer:rescaling:robust_scaler:q_max param_data_preprocessor:feature_type:numerical_transformer:rescaling:robust_scaler:q_min param_feature_preprocessor:fast_ica:n_components param_feature_preprocessor:kernel_pca:coef0 param_feature_preprocessor:kernel_pca:degree param_feature_preprocessor:kernel_pca:gamma param_feature_preprocessor:nystroem_sampler:coef0 param_feature_preprocessor:nystroem_sampler:degree param_feature_preprocessor:nystroem_sampler:gamma param_feature_preprocessor:select_rates_classification:mode
0 0.868731 13 5.964926 {'balancing:strategy': 'none', 'classifier:__c... Success 0.0 none random_forest feature_type no_preprocessing ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 0.869350 8 70.507130 {'balancing:strategy': 'none', 'classifier:__c... Success 0.0 none random_forest feature_type polynomial ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 0.871207 2 131.006729 {'balancing:strategy': 'none', 'classifier:__c... Success 0.0 none random_forest feature_type polynomial ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 0.871827 1 26.505598 {'balancing:strategy': 'none', 'classifier:__c... Success 0.0 none random_forest feature_type polynomial ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0.861300 122 5.311892 {'balancing:strategy': 'none', 'classifier:__c... Success 0.0 none gradient_boosting feature_type fast_ica ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
206 0.868111 17 27.116169 {'balancing:strategy': 'none', 'classifier:__c... Success 0.0 none sgd feature_type fast_ica ... NaN NaN 1142.0 NaN NaN NaN NaN NaN NaN NaN
207 0.000000 187 360.118319 {'balancing:strategy': 'none', 'classifier:__c... Timeout 0.0 none libsvm_svc feature_type polynomial ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
208 0.868111 17 1.756671 {'balancing:strategy': 'none', 'classifier:__c... Success 0.0 none libsvm_svc feature_type extra_trees_preproc_for_classification ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
209 0.868111 17 2.885311 {'balancing:strategy': 'none', 'classifier:__c... Success 0.0 none liblinear_svc feature_type kitchen_sinks ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
210 0.000000 187 87.031239 {'balancing:strategy': 'weighting', 'classifie... Timeout 0.0 weighting gradient_boosting feature_type polynomial ... 0.789833 0.231894 NaN NaN NaN NaN NaN NaN NaN NaN

211 rows × 174 columns
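With 174 columns, the raw cv_results_ frame is hard to read. As a sketch, the most informative columns referenced above (score, fit time, status, and the chosen classifier and feature preprocessor) can be pulled out and sorted:

# Keep only the key columns and sort by validation score.
summary_cols = [
    'mean_test_score',
    'mean_fit_time',
    'status',
    'param_classifier:__choice__',
    'param_feature_preprocessor:__choice__',
]
print(df_df[summary_cols].sort_values('mean_test_score', ascending=False).head(10))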

1g) Score of the final ensemble

y_pred_df = automl_df.predict(X_test)
accuracy_df = accuracy_score(y_test, y_pred_df)
accuracy_df
0.865999046256557

2. Setting a time limit for each algorithm and for the overall run

Here the Auto-Sklearn Classifier is initialized with a total budget of 240 seconds for the entire AutoML process (time_left_for_this_task=240), while each individual model run is limited to a maximum of 50 seconds (per_run_time_limit=50). This helps manage computational resources by controlling both the overall task duration and the time spent on each model evaluation.

automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=240,  # Total time in seconds
                                                          per_run_time_limit=50,  # Time limit per model run
                                                          )
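The constrained run is fitted and scored the same way as the baseline. This is only a sketch of the evaluation step; with a 240-second budget the search will typically evaluate far fewer configurations than the one-hour default.

# Fit and evaluate the time-constrained run (same split as above).
automl.fit(X_train, y_train)
print(automl.sprint_statistics())
y_pred = automl.predict(X_test)
print(accuracy_score(y_test, y_pred))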

3. Specific methods - a linear classifier and Random Forest, with default tuning

Instead of relying on Auto-sklearn’s automatic model selection, we can restrict the search to specific classifier families via the include parameter, tailoring the pipeline to suit particular tasks or preferences. Here the search is limited to random_forest and liblinear_svc (a linear support vector classifier; Auto-sklearn does not ship a plain Logistic Regression component, so liblinear_svc plays that role).

automl_rflr = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,  # Time in seconds for the entire AutoML process
    per_run_time_limit=30,  # Time in seconds for each individual model run
    include={
        'classifier': ['random_forest', 'liblinear_svc']
    },
)
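Fitting this restricted search and printing the leaderboard should confirm that only the two included model families are evaluated (a sketch, reusing the same train/test split as before).

# Fit the restricted search; the leaderboard should now contain only
# random_forest and liblinear_svc pipelines.
automl_rflr.fit(X_train, y_train)
print(automl_rflr.leaderboard(ensemble_only=False))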

4. Changing the evaluation metric to roc_auc

Adjusting the evaluation metric, for example using ROC AUC for binary classification, lets Auto-sklearn optimize the most relevant measure of performance, which is especially important on imbalanced datasets. Some of the available metrics are:
  • accuracy
  • average_precision
  • f1_macro
  • roc_auc

Note - for the full list of metrics, see the official Auto-sklearn documentation.

automl_rocauc = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=240,  # Total time in seconds
                                                          per_run_time_limit=50,  # Time limit per model run
                                                          metric=autosklearn.metrics.roc_auc
                                                          )
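Since the search is now optimizing ROC AUC, it makes sense to evaluate the final ensemble with the same metric, using the predicted probability of the positive class ('bad' = 1). This is a sketch; roc_auc_score comes from scikit-learn.

from sklearn.metrics import roc_auc_score

# Fit with the ROC AUC objective and score with the same metric.
automl_rocauc.fit(X_train, y_train)
y_proba = automl_rocauc.predict_proba(X_test)[:, 1]  # probability of class 1 ('bad')
print(roc_auc_score(y_test, y_proba))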

Note - Only the first configuration (1) of the four has been implemented in detail; for the remaining configurations (2, 3, 4) only the model setup changes are highlighted, keeping all other aspects consistent.

There are additional points that can be explored using Auto-sklearn, such as:

  • Manual hyperparameter tuning for models like Random Forest, allowing fine-tuned control over parameters such as n_estimators, max_depth, and min_samples_split (a sketch of such a restricted search space follows this list).
  • Changing the type of resampling in Random Forest by adjusting the bootstrap parameter to control whether sampling is done with or without replacement.
  • Exploring different feature preprocessing techniques to optimize model performance.
For more details, visit the official Auto-sklearn site.
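As an illustration of the first bullet, the ConfigSpace imports at the top of the post are what such a manually restricted search space would be built from. The snippet below only sketches the space itself, with illustrative parameter ranges (not values tuned for this dataset); wiring it into Auto-sklearn additionally requires registering a custom classification component (for example, one based on the imported RandomForest class), for which the official extension examples are the best reference.

# Sketch: a restricted Random Forest search space (illustrative ranges only).
cs = ConfigurationSpace()
cs.add_hyperparameters([
    UniformIntegerHyperparameter("n_estimators", lower=100, upper=500, default_value=200),
    UniformIntegerHyperparameter("max_depth", lower=3, upper=20, default_value=10),
    UniformFloatHyperparameter("min_samples_split", lower=0.01, upper=0.5, default_value=0.05),
])
print(cs)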

Conclusion

Auto-sklearn simplifies automated machine learning by streamlining model selection, hyperparameter tuning, and allowing adjustments in feature and data preprocessing. By exploring various configurations, users can better understand their impact on model performance. This flexibility empowers both novices and experienced practitioners to enhance their AutoML processes effectively.