Automated Machine Learning with Auto-sklearn

Introduction

Auto-sklearn is an open-source Python library that automates model selection and hyperparameter tuning using techniques like Bayesian optimization and ensemble construction. It’s a powerful tool for both beginners and experts looking to streamline their AutoML workflows. This blog explores its key features through four configurations:
1. The default Auto-sklearn model for a baseline.
2. Time-based constraints to optimize model training.
3. Custom classifier selection, including Logistic Regression and Random Forest.
4. Changing the evaluation metric to ROC AUC.

Importing Libraries

import pandas as pd
import autosklearn
import autosklearn.classification
import autosklearn.metrics  # needed later for metric=autosklearn.metrics.roc_auc
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# ConfigSpace and the RandomForest component are only needed if you later
# restrict the Random Forest hyperparameter search space manually.
from ConfigSpace.configuration_space import ConfigurationSpace
from ConfigSpace.hyperparameters import UniformIntegerHyperparameter, UniformFloatHyperparameter
from autosklearn.pipeline.components.classification.random_forest import RandomForest

Data Description

The credit score datasets contain information about various financial metrics and demographic attributes of customers.

Importing data

import pandas as pd
customer_info=pd.read_csv('customer info.csv')
customer_details=pd.read_csv('customer details.csv')
customer_profiles=pd.read_csv('customer profiles.csv')

Merging data

grouped_customer_details = customer_details.groupby('custid').sum().reset_index()

customer_data = grouped_customer_details.merge(customer_info, on='custid', how='inner')\
                           .merge(customer_profiles, on='custid', how='inner')
customer_data.head()
custid debtinc creddebt othdebt preloan veh house selfemp account deposit emp address branch ref age gender ms child zone bad
0 1 12.34 13.26 5.88 2.0 2 1 2 1 2 18 12 1 2 2 1 2 2 7 1
1 2 18.65 2.12 5.13 1.0 2 1 1 2 1 11 7 1 2 2 2 1 1 5 0
2 3 7.22 3.31 3.65 1.0 1 2 1 1 1 16 15 1 1 1 1 2 1 15 0
3 4 6.15 2.95 2.34 1.0 1 2 1 1 2 15 14 2 1 2 1 1 1 3 0
4 5 20.64 2.67 4.07 2.0 2 1 1 1 1 2 1 2 2 2 2 1 1 20 0

Note - The dataset contains 5 numerical variables and 13 categorical variables, excluding custid.
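Before modeling, it can help to confirm which columns are the coded categorical attributes. The snippet below is a minimal sketch (the threshold of 10 distinct values is an assumption, not from the original analysis); casting the low-cardinality columns to pandas' category dtype should also let Auto-sklearn treat them as categorical, since it infers feature types from pandas dtypes when a DataFrame is passed.

# Sketch: columns with few distinct values are the coded categorical
# attributes; the rest are the continuous financial metrics.
feature_cols = [c for c in customer_data.columns if c not in ('custid', 'bad')]
print(customer_data[feature_cols].nunique().sort_values())

# Assumed threshold of 10 distinct values for treating a column as categorical.
categorical_cols = [c for c in feature_cols if customer_data[c].nunique() <= 10]
customer_data[categorical_cols] = customer_data[categorical_cols].astype('category')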

Preparing Features and Target Variable for Model Training

# Exclude 'custid' and 'bad' from the features; keep 'bad' as the target
X = customer_data.drop(columns=['custid', 'bad'])  # Features
y = customer_data['bad']  # Target variable

# Perform train-test split (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

1. Default configuration - all models evaluated, the ensemble shortlist, and final performance

1a) Initializing Auto-Sklearn Classifier with default parameters

The default AutoSklearnClassifier runs for one hour (time_left_for_this_task defaults to 3600 seconds), automatically selecting models and tuning their hyperparameters on the data. It serves as a baseline for model comparison.

automl_df = autosklearn.classification.AutoSklearnClassifier()

1b) Fitting the Auto-Sklearn Classifier to Training Data

automl_df.fit(X_train,y_train)

1c) Displaying Auto-Sklearn Training Statistics

automl_df.sprint_statistics() displays a summary of the performance metrics for the default Auto-sklearn classifier. It provides insights into the number of models evaluated and the performance of the best model, highlighting the efficiency of the automated machine learning process.

print(automl_df.sprint_statistics())
auto-sklearn results:
  Dataset name: 8e4b7b89-773c-11ef-8a93-0242ac1c000c
  Metric: accuracy
  Best validation score: 0.871827
  Number of target algorithm runs: 158
  Number of successful target algorithm runs: 131
  Number of crashed target algorithm runs: 21
  Number of target algorithms that exceeded the time limit: 6
  Number of target algorithms that exceeded the memory limit: 0

1d) Viewing the Auto-Sklearn Leaderboard (Displays All Models)

automl_df.leaderboard(ensemble_only=False) displays a leaderboard of all evaluated models. It presents key metrics such as rank, ensemble weight, model type, cost, and duration, making it easy to compare the performance of the different models.

automl_df.leaderboard(ensemble_only=False)
rank ensemble_weight type cost duration
model_id
5 1 0.30 random_forest 0.128173 26.505598
4 2 0.04 random_forest 0.128793 131.006729
142 3 0.04 adaboost 0.129412 4.034574
13 6 0.12 extra_trees 0.130031 12.743632
15 7 0.10 gradient_boosting 0.130031 3.650947
... ... ... ... ... ...
10 182 0.00 sgd 0.529412 1.196238
140 183 0.00 passive_aggressive 0.547988 1.416111
177 184 0.00 sgd 0.591950 3.825298
101 185 0.00 qda 0.598762 12.537381
36 186 0.00 decision_tree 0.823529 1.128428

186 rows × 5 columns

1e) Viewing the Auto-Sklearn Leaderboard (Shortlisted Models)

automl_df.leaderboard(ensemble_only=True) restricts the leaderboard to the models that were selected for the final ensemble.

automl_df.leaderboard(ensemble_only=True)
rank ensemble_weight type cost duration
model_id
5 1 0.30 random_forest 0.128173 26.505598
4 2 0.04 random_forest 0.128793 131.006729
142 3 0.04 adaboost 0.129412 4.034574
13 4 0.12 extra_trees 0.130031 12.743632
15 5 0.10 gradient_boosting 0.130031 3.650947
35 6 0.20 random_forest 0.130031 15.313135
71 7 0.10 random_forest 0.130031 9.515407
3 8 0.02 random_forest 0.130650 70.507130
54 9 0.02 adaboost 0.130650 8.681275
70 10 0.02 k_nearest_neighbors 0.130650 2.017158
90 11 0.02 k_nearest_neighbors 0.130650 1.736899
2 12 0.02 random_forest 0.131269 5.964926
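To look at the shortlisted pipelines in more detail than the leaderboard shows, the fitted models themselves can be inspected. Depending on the Auto-sklearn version, show_models() returns either a formatted string or a dictionary keyed by model_id, so printing it is the safest option.

# Inspect the pipelines that made it into the final ensemble.
print(automl_df.show_models())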

1f) Performance metrics of different models

The cv_results_ attribute in Auto-sklearn provides details on every evaluated configuration, including metrics like mean test scores and fit times, as well as the preprocessing choices (feature preprocessing, data cleaning, scaling, and balancing), showing how these steps affect model performance.

df_df = pd.DataFrame(automl_df.cv_results_)
df_df
mean_test_score rank_test_scores mean_fit_time params status budgets param_balancing:strategy param_classifier:__choice__ param_data_preprocessor:__choice__ param_feature_preprocessor:__choice__ ... param_data_preprocessor:feature_type:numerical_transformer:rescaling:robust_scaler:q_max param_data_preprocessor:feature_type:numerical_transformer:rescaling:robust_scaler:q_min param_feature_preprocessor:fast_ica:n_components param_feature_preprocessor:kernel_pca:coef0 param_feature_preprocessor:kernel_pca:degree param_feature_preprocessor:kernel_pca:gamma param_feature_preprocessor:nystroem_sampler:coef0 param_feature_preprocessor:nystroem_sampler:degree param_feature_preprocessor:nystroem_sampler:gamma param_feature_preprocessor:select_rates_classification:mode
0 0.868731 13 5.964926 {'balancing:strategy': 'none', 'classifier:__c... Success 0.0 none random_forest feature_type no_preprocessing ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 0.869350 8 70.507130 {'balancing:strategy': 'none', 'classifier:__c... Success 0.0 none random_forest feature_type polynomial ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 0.871207 2 131.006729 {'balancing:strategy': 'none', 'classifier:__c... Success 0.0 none random_forest feature_type polynomial ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 0.871827 1 26.505598 {'balancing:strategy': 'none', 'classifier:__c... Success 0.0 none random_forest feature_type polynomial ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0.861300 122 5.311892 {'balancing:strategy': 'none', 'classifier:__c... Success 0.0 none gradient_boosting feature_type fast_ica ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
206 0.868111 17 27.116169 {'balancing:strategy': 'none', 'classifier:__c... Success 0.0 none sgd feature_type fast_ica ... NaN NaN 1142.0 NaN NaN NaN NaN NaN NaN NaN
207 0.000000 187 360.118319 {'balancing:strategy': 'none', 'classifier:__c... Timeout 0.0 none libsvm_svc feature_type polynomial ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
208 0.868111 17 1.756671 {'balancing:strategy': 'none', 'classifier:__c... Success 0.0 none libsvm_svc feature_type extra_trees_preproc_for_classification ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
209 0.868111 17 2.885311 {'balancing:strategy': 'none', 'classifier:__c... Success 0.0 none liblinear_svc feature_type kitchen_sinks ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
210 0.000000 187 87.031239 {'balancing:strategy': 'weighting', 'classifie... Timeout 0.0 weighting gradient_boosting feature_type polynomial ... 0.789833 0.231894 NaN NaN NaN NaN NaN NaN NaN NaN

211 rows × 174 columns
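With 174 columns, the raw cv_results_ frame is hard to read. As a sketch, the most informative columns referenced above (score, fit time, status, and the chosen classifier and feature preprocessor) can be pulled out and sorted:

# Keep only the key columns and sort by validation score.
summary_cols = [
    'mean_test_score',
    'mean_fit_time',
    'status',
    'param_classifier:__choice__',
    'param_feature_preprocessor:__choice__',
]
print(df_df[summary_cols].sort_values('mean_test_score', ascending=False).head(10))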

1g) Score of the final ensemble

y_pred_df = automl_df.predict(X_test)
accuracy_df = accuracy_score(y_test, y_pred_df)
accuracy_df
0.865999046256557

2. Setting a time limit for each algorithm and for the overall run

Here the Auto-Sklearn Classifier is initialized with a total budget of 240 seconds for the entire AutoML process (time_left_for_this_task=240), while each individual model run is limited to a maximum of 50 seconds (per_run_time_limit=50). This helps manage computational resources by controlling both the overall task duration and the time spent on each model evaluation.

automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=240,  # Total time in seconds
                                                          per_run_time_limit=50,  # Time limit per model run
                                                          )
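The constrained run is fitted and scored the same way as the baseline. This is only a sketch of the evaluation step; with a 240-second budget the search will typically evaluate far fewer configurations than the one-hour default.

# Fit and evaluate the time-constrained run (same split as above).
automl.fit(X_train, y_train)
print(automl.sprint_statistics())
y_pred = automl.predict(X_test)
print(accuracy_score(y_test, y_pred))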

3. Specific methods - a linear classifier and Random Forest, with default tuning

Instead of relying on Auto-sklearn’s automatic model selection, we can restrict the search to specific classifier families via the include parameter, tailoring the pipeline to suit particular tasks or preferences. Here the search is limited to random_forest and liblinear_svc (a linear support vector classifier; Auto-sklearn does not ship a plain Logistic Regression component, so liblinear_svc plays that role).

automl_rflr = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,  # Time in seconds for the entire AutoML process
    per_run_time_limit=30,  # Time in seconds for each individual model run
    include={
        'classifier': ['random_forest', 'liblinear_svc']
    },
)
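Fitting this restricted search and printing the leaderboard should confirm that only the two included model families are evaluated (a sketch, reusing the same train/test split as before).

# Fit the restricted search; the leaderboard should now contain only
# random_forest and liblinear_svc pipelines.
automl_rflr.fit(X_train, y_train)
print(automl_rflr.leaderboard(ensemble_only=False))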

4. Changing the evaluation metric to roc_auc

Adjusting the evaluation metric, for example using ROC AUC for binary classification, lets Auto-sklearn optimize the most relevant measure of performance, which is especially important on imbalanced datasets. Some of the available metrics are:
  • accuracy
  • average_precision
  • f1_macro
  • roc_auc

Note - for the full list of metrics, see the official Auto-sklearn documentation.

automl_rocauc = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=240,  # Total time in seconds
                                                          per_run_time_limit=50,  # Time limit per model run
                                                          metric=autosklearn.metrics.roc_auc
                                                          )
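Since the search is now optimizing ROC AUC, it makes sense to evaluate the final ensemble with the same metric, using the predicted probability of the positive class ('bad' = 1). This is a sketch; roc_auc_score comes from scikit-learn.

from sklearn.metrics import roc_auc_score

# Fit with the ROC AUC objective and score with the same metric.
automl_rocauc.fit(X_train, y_train)
y_proba = automl_rocauc.predict_proba(X_test)[:, 1]  # probability of class 1 ('bad')
print(roc_auc_score(y_test, y_proba))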

Note - Only the first configuration (1) of the four has been implemented in detail; for the remaining configurations (2, 3, 4) only the model setup changes are highlighted, keeping all other aspects consistent.

There are additional points that can be explored using Auto-sklearn, such as:

  • Manual hyperparameter tuning for models like Random Forest, allowing fine-tuned control over parameters such as n_estimators, max_depth, and min_samples_split (a sketch of such a restricted search space follows this list).
  • Changing the type of resampling in Random Forest by adjusting the bootstrap parameter to control whether sampling is done with or without replacement.
  • Exploring different feature preprocessing techniques to optimize model performance.
For more details, visit the official Auto-sklearn site.
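As an illustration of the first bullet, the ConfigSpace imports at the top of the post are what such a manually restricted search space would be built from. The snippet below only sketches the space itself, with illustrative parameter ranges (not values tuned for this dataset); wiring it into Auto-sklearn additionally requires registering a custom classification component (for example, one based on the imported RandomForest class), for which the official extension examples are the best reference.

# Sketch: a restricted Random Forest search space (illustrative ranges only).
cs = ConfigurationSpace()
cs.add_hyperparameters([
    UniformIntegerHyperparameter("n_estimators", lower=100, upper=500, default_value=200),
    UniformIntegerHyperparameter("max_depth", lower=3, upper=20, default_value=10),
    UniformFloatHyperparameter("min_samples_split", lower=0.01, upper=0.5, default_value=0.05),
])
print(cs)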

Conclusion

Auto-sklearn simplifies automated machine learning by streamlining model selection, hyperparameter tuning, and allowing adjustments in feature and data preprocessing. By exploring various configurations, users can better understand their impact on model performance. This flexibility empowers both novices and experienced practitioners to enhance their AutoML processes effectively.