import pandas as pd
import autosklearn
import autosklearn.classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from ConfigSpace.configuration_space import ConfigurationSpace
from ConfigSpace.hyperparameters import UniformIntegerHyperparameter, UniformFloatHyperparameter
from autosklearn.pipeline.components.classification.random_forest import RandomForest
Automated Machine Learning with Auto-sklearn
Introduction
Auto-sklearn is an open-source Python library that automates model selection and hyperparameter tuning using techniques like ensemble learning and Bayesian optimization. It’s a powerful tool for both beginners and experts looking to streamline their AutoML workflows. This blog explores key features and configurations.
1. The default Auto-sklearn model for a baseline.
2. Time-based constraints to optimize model training.
3. Custom classifier selection, restricted here to Random Forest and a linear classifier.
4. Changing the evaluation metric to ROC AUC.
Importing Libraries
Data Description
The credit score datasets contain information about various financial metrics and demographic attributes of customers.
Importing data
import pandas as pd
customer_info = pd.read_csv('customer info.csv')
customer_details = pd.read_csv('customer details.csv')
customer_profiles = pd.read_csv('customer profiles.csv')
Merging data
grouped_customer_details = customer_details.groupby('custid').sum().reset_index()
customer_data = grouped_customer_details.merge(customer_info, on='custid', how='inner')\
                                        .merge(customer_profiles, on='custid', how='inner')
| | custid | debtinc | creddebt | othdebt | preloan | veh | house | selfemp | account | deposit | emp | address | branch | ref | age | gender | ms | child | zone | bad |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 12.34 | 13.26 | 5.88 | 2.0 | 2 | 1 | 2 | 1 | 2 | 18 | 12 | 1 | 2 | 2 | 1 | 2 | 2 | 7 | 1 |
1 | 2 | 18.65 | 2.12 | 5.13 | 1.0 | 2 | 1 | 1 | 2 | 1 | 11 | 7 | 1 | 2 | 2 | 2 | 1 | 1 | 5 | 0 |
2 | 3 | 7.22 | 3.31 | 3.65 | 1.0 | 1 | 2 | 1 | 1 | 1 | 16 | 15 | 1 | 1 | 1 | 1 | 2 | 1 | 15 | 0 |
3 | 4 | 6.15 | 2.95 | 2.34 | 1.0 | 1 | 2 | 1 | 1 | 2 | 15 | 14 | 2 | 1 | 2 | 1 | 1 | 1 | 3 | 0 |
4 | 5 | 20.64 | 2.67 | 4.07 | 2.0 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 2 | 2 | 2 | 1 | 1 | 20 | 0 |
Note
- Excluding custid and the target bad, the dataset contains 5 numerical variables and 13 categorical variables.
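The merge step can be tried on toy data before running it on the real files. The sketch below is a minimal example, assuming that the customer details table holds several rows per custid (which is why it is summed first) while the other two tables hold one row per custid. The tiny frames and all of their values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy tables; column names follow the article, values are invented.
customer_details = pd.DataFrame({
    "custid": [1, 1, 2, 3],
    "creddebt": [10.0, 3.0, 2.0, 5.0],
})
customer_info = pd.DataFrame({"custid": [1, 2, 3], "age": [2, 1, 2]})
customer_profiles = pd.DataFrame({"custid": [1, 2], "bad": [1, 0]})

# Collapse repeated custid rows into one row per customer.
grouped = customer_details.groupby("custid").sum().reset_index()

# Inner joins keep only customers present in all three tables.
customer_data = (
    grouped.merge(customer_info, on="custid", how="inner")
           .merge(customer_profiles, on="custid", how="inner")
)
print(customer_data)
```

Customer 3 drops out because it has no row in the profiles table, which is the behaviour to expect from how='inner'.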
Preparing Features and Target Variable for Model Training
# Exclude 'custid' and 'bad' from features, keep 'bad' as target
X = customer_data.drop(columns=['custid', 'bad'])  # Features
y = customer_data['bad']  # Target variable
# Perform train-test split (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
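Because the target bad is a binary flag, it can help to keep the class ratio the same in both splits. The following is a sketch on synthetic data, not the article's dataset. stratify=y is an optional argument of train_test_split that the original code does not use, so treat it as a possible variation rather than the author's setup.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 20% positive class (values are invented).
X = pd.DataFrame({"debtinc": np.linspace(1.0, 30.0, 100)})
y = pd.Series([1] * 20 + [0] * 80, name="bad")

# stratify=y keeps the share of positives equal in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())
```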
1. Default: all models evaluated, models in the ensemble, and final performance
1a) Initializing Auto-Sklearn Classifier with default parameters
The default AutoSklearnClassifier runs for 1 hour (time_left_for_this_task defaults to 3600 seconds), automatically selecting models and tuning their hyperparameters on the training data. It serves as the baseline for the other configurations.
automl_df = autosklearn.classification.AutoSklearnClassifier()
1b) Fitting the Auto-Sklearn Classifier to Training Data
automl_df.fit(X_train,y_train)
1c) Displaying Auto-Sklearn Training Statistics
automl_df.sprint_statistics() returns a text summary of the run for the default Auto-sklearn classifier: the metric, the best validation score, and how many candidate runs succeeded, crashed, or hit the time or memory limit.
print(automl_df.sprint_statistics())
auto-sklearn results:
Dataset name: 8e4b7b89-773c-11ef-8a93-0242ac1c000c
Metric: accuracy
Best validation score: 0.871827
Number of target algorithm runs: 158
Number of successful target algorithm runs: 131
Number of crashed target algorithm runs: 21
Number of target algorithms that exceeded the time limit: 6
Number of target algorithms that exceeded the memory limit: 0
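The report is a plain string, so its counts can be pulled out for a quick consistency check. The sketch below parses a copy of the output shown above with a regular expression. The helper name parse_statistics is hypothetical and not part of Auto-sklearn.

```python
import re

# Copy of the report text shown above.
report = """auto-sklearn results:
Dataset name: 8e4b7b89-773c-11ef-8a93-0242ac1c000c
Metric: accuracy
Best validation score: 0.871827
Number of target algorithm runs: 158
Number of successful target algorithm runs: 131
Number of crashed target algorithm runs: 21
Number of target algorithms that exceeded the time limit: 6
Number of target algorithms that exceeded the memory limit: 0
"""

def parse_statistics(text):
    """Return {label: number} for every 'label: number' line in the report."""
    pattern = r"^[ \t]*(.+?):[ \t]*([-+]?\d+(?:\.\d+)?)[ \t]*$"
    return {label: float(value)
            for label, value in re.findall(pattern, text, flags=re.M)}

stats = parse_statistics(report)
total = stats["Number of target algorithm runs"]
successful = stats["Number of successful target algorithm runs"]
accounted = (
    successful
    + stats["Number of crashed target algorithm runs"]
    + stats["Number of target algorithms that exceeded the time limit"]
    + stats["Number of target algorithms that exceeded the memory limit"]
)
success_rate = successful / total

print(f"runs accounted for: {accounted:.0f} of {total:.0f}")
print(f"success rate: {success_rate:.3f}")
```

In this run the successful, crashed, and timed-out runs add up to the total, so every candidate is accounted for.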
1d) Viewing the Auto-Sklearn Leaderboard (Displays All Models)
automl_df.leaderboard(ensemble_only=False) displays a leaderboard of all evaluated models, with each model's rank, ensemble weight, type, cost, and duration, so that the candidates can be compared side by side.
automl_df.leaderboard(ensemble_only=False)
model_id | rank | ensemble_weight | type | cost | duration |
---|---|---|---|---|---|
5 | 1 | 0.30 | random_forest | 0.128173 | 26.505598 |
4 | 2 | 0.04 | random_forest | 0.128793 | 131.006729 |
142 | 3 | 0.04 | adaboost | 0.129412 | 4.034574 |
13 | 6 | 0.12 | extra_trees | 0.130031 | 12.743632 |
15 | 7 | 0.10 | gradient_boosting | 0.130031 | 3.650947 |
... | ... | ... | ... | ... | ... |
10 | 182 | 0.00 | sgd | 0.529412 | 1.196238 |
140 | 183 | 0.00 | passive_aggressive | 0.547988 | 1.416111 |
177 | 184 | 0.00 | sgd | 0.591950 | 3.825298 |
101 | 185 | 0.00 | qda | 0.598762 | 12.537381 |
36 | 186 | 0.00 | decision_tree | 0.823529 | 1.128428 |
186 rows × 5 columns
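With accuracy as the metric, the cost column reads as 1 minus validation accuracy, which is why the top model's cost of 0.128173 matches the best validation score of 0.871827 reported by sprint_statistics. A small sketch of that conversion, using three rows copied from the table above:

```python
# (model_id, type, cost) copied from the leaderboard above.
rows = [
    (5, "random_forest", 0.128173),
    (142, "adaboost", 0.129412),
    (36, "decision_tree", 0.823529),
]

# Assumption: the metric is accuracy, so validation accuracy = 1 - cost.
scores = {model_id: round(1.0 - cost, 6) for model_id, _, cost in rows}

for model_id, model_type, cost in rows:
    print(model_id, model_type, scores[model_id])
```

For other metrics the relationship between cost and score can differ, so this reading applies to the accuracy run only.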
1e) Viewing the Auto-Sklearn Leaderboard (Shortlisted Models)
automl_df.leaderboard(ensemble_only=True) generates a leaderboard that displays only the models included in the final ensemble.
automl_df.leaderboard(ensemble_only=True)
model_id | rank | ensemble_weight | type | cost | duration |
---|---|---|---|---|---|
5 | 1 | 0.30 | random_forest | 0.128173 | 26.505598 |
4 | 2 | 0.04 | random_forest | 0.128793 | 131.006729 |
142 | 3 | 0.04 | adaboost | 0.129412 | 4.034574 |
13 | 4 | 0.12 | extra_trees | 0.130031 | 12.743632 |
15 | 5 | 0.10 | gradient_boosting | 0.130031 | 3.650947 |
35 | 6 | 0.20 | random_forest | 0.130031 | 15.313135 |
71 | 7 | 0.10 | random_forest | 0.130031 | 9.515407 |
3 | 8 | 0.02 | random_forest | 0.130650 | 70.507130 |
54 | 9 | 0.02 | adaboost | 0.130650 | 8.681275 |
70 | 10 | 0.02 | k_nearest_neighbors | 0.130650 | 2.017158 |
90 | 11 | 0.02 | k_nearest_neighbors | 0.130650 | 1.736899 |
2 | 12 | 0.02 | random_forest | 0.131269 | 5.964926 |
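The ensemble_weight column can be read as each model's share in the final prediction. The sketch below checks that the weights in the table sum to 1 and then forms a weighted average of class probabilities. The per-model probabilities are invented, and treating the ensemble as a weighted average of predicted probabilities is a simplified picture of what Auto-sklearn does internally.

```python
# model_id -> ensemble_weight, copied from the table above.
weights = {5: 0.30, 4: 0.04, 142: 0.04, 13: 0.12, 15: 0.10, 35: 0.20,
           71: 0.10, 3: 0.02, 54: 0.02, 70: 0.02, 90: 0.02, 2: 0.02}
total_weight = sum(weights.values())
print(round(total_weight, 6))

# Hypothetical P(bad = 1) from each model for one customer (made-up values).
proba = {model_id: 0.40 for model_id in weights}
proba[5] = 0.90  # pretend the top-weighted model is much more confident

# Weighted average: models with larger weights pull the result harder.
ensemble_proba = sum(weights[m] * proba[m] for m in weights)
print(round(ensemble_proba, 4))
```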
1f) Performance metrics of different models
The cv_results_ attribute in Auto-sklearn provides details on every evaluated configuration, including mean test scores, fit times, and run status, along with the preprocessing choices made for each run: feature preprocessing, data preprocessing, scaling methods, and balancing strategy.
df_df = pd.DataFrame(automl_df.cv_results_)
df_df
| | mean_test_score | rank_test_scores | mean_fit_time | params | status | budgets | param_balancing:strategy | param_classifier:__choice__ | param_data_preprocessor:__choice__ | param_feature_preprocessor:__choice__ | ... | param_data_preprocessor:feature_type:numerical_transformer:rescaling:robust_scaler:q_max | param_data_preprocessor:feature_type:numerical_transformer:rescaling:robust_scaler:q_min | param_feature_preprocessor:fast_ica:n_components | param_feature_preprocessor:kernel_pca:coef0 | param_feature_preprocessor:kernel_pca:degree | param_feature_preprocessor:kernel_pca:gamma | param_feature_preprocessor:nystroem_sampler:coef0 | param_feature_preprocessor:nystroem_sampler:degree | param_feature_preprocessor:nystroem_sampler:gamma | param_feature_preprocessor:select_rates_classification:mode |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.868731 | 13 | 5.964926 | {'balancing:strategy': 'none', 'classifier:__c... | Success | 0.0 | none | random_forest | feature_type | no_preprocessing | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 0.869350 | 8 | 70.507130 | {'balancing:strategy': 'none', 'classifier:__c... | Success | 0.0 | none | random_forest | feature_type | polynomial | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 0.871207 | 2 | 131.006729 | {'balancing:strategy': 'none', 'classifier:__c... | Success | 0.0 | none | random_forest | feature_type | polynomial | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 0.871827 | 1 | 26.505598 | {'balancing:strategy': 'none', 'classifier:__c... | Success | 0.0 | none | random_forest | feature_type | polynomial | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 0.861300 | 122 | 5.311892 | {'balancing:strategy': 'none', 'classifier:__c... | Success | 0.0 | none | gradient_boosting | feature_type | fast_ica | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
206 | 0.868111 | 17 | 27.116169 | {'balancing:strategy': 'none', 'classifier:__c... | Success | 0.0 | none | sgd | feature_type | fast_ica | ... | NaN | NaN | 1142.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
207 | 0.000000 | 187 | 360.118319 | {'balancing:strategy': 'none', 'classifier:__c... | Timeout | 0.0 | none | libsvm_svc | feature_type | polynomial | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
208 | 0.868111 | 17 | 1.756671 | {'balancing:strategy': 'none', 'classifier:__c... | Success | 0.0 | none | libsvm_svc | feature_type | extra_trees_preproc_for_classification | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
209 | 0.868111 | 17 | 2.885311 | {'balancing:strategy': 'none', 'classifier:__c... | Success | 0.0 | none | liblinear_svc | feature_type | kitchen_sinks | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
210 | 0.000000 | 187 | 87.031239 | {'balancing:strategy': 'weighting', 'classifie... | Timeout | 0.0 | weighting | gradient_boosting | feature_type | polynomial | ... | 0.789833 | 0.231894 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
211 rows × 174 columns
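With 174 columns, the frame is easier to read after narrowing it down. The sketch below is one possible approach, using a toy frame with a few of the column names and rows shown above rather than a real cv_results_ object. It drops unsuccessful runs and reports the best score per classifier.

```python
import pandas as pd

# Toy stand-in for pd.DataFrame(automl_df.cv_results_); rows copied from above.
df = pd.DataFrame({
    "mean_test_score": [0.868731, 0.871827, 0.861300, 0.000000, 0.868111],
    "status": ["Success", "Success", "Success", "Timeout", "Success"],
    "param_classifier:__choice__": [
        "random_forest", "random_forest", "gradient_boosting",
        "libsvm_svc", "libsvm_svc",
    ],
})

# Timed-out runs show a score of 0 in the table and would distort a summary.
ok = df[df["status"] == "Success"]

best_per_classifier = (
    ok.groupby("param_classifier:__choice__")["mean_test_score"]
      .max()
      .sort_values(ascending=False)
)
print(best_per_classifier)
```

The same pattern works with other grouping columns, for example param_feature_preprocessor:__choice__, to compare preprocessing choices.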
1g) Score of the final ensemble
y_pred_df = automl_df.predict(X_test)
accuracy_df = accuracy_score(y_test, y_pred_df)
accuracy_df
0.865999046256557
2. Specific time to run each algorithm and overall time
Here the classifier is initialized with time_left_for_this_task=240, which caps the entire AutoML process at 240 seconds, and per_run_time_limit=50, which caps each individual model run at 50 seconds. Together they manage computational resources by controlling both the overall task duration and the time spent on any single model evaluation.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=240,  # Total time in seconds
    per_run_time_limit=50,  # Time limit per model run
)
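How the two limits interact can be checked with simple arithmetic. The helper below is a hypothetical sketch, not an Auto-sklearn function: it rejects a per-run limit larger than the total budget and reports how many runs are guaranteed to fit if every run used its full limit, ignoring overhead such as ensemble building.

```python
def budget_summary(time_left_for_this_task, per_run_time_limit):
    """Return the keyword arguments plus a worst-case count of sequential runs."""
    if per_run_time_limit > time_left_for_this_task:
        raise ValueError("per_run_time_limit should not exceed the total budget")
    kwargs = {
        "time_left_for_this_task": time_left_for_this_task,
        "per_run_time_limit": per_run_time_limit,
    }
    # Worst case: every model run consumes its full per-run limit.
    worst_case_runs = time_left_for_this_task // per_run_time_limit
    return kwargs, worst_case_runs

kwargs, worst_case_runs = budget_summary(240, 50)
print(kwargs, worst_case_runs)
# The dict could then be passed on, e.g. AutoSklearnClassifier(**kwargs).
```

Most runs finish well before their limit, so the real number of evaluated models is usually much higher than this floor.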
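3. Specific methods: Random Forest and a linear classifier, with default tuning
Instead of relying on Auto-sklearn's automatic model selection, we can restrict the search to specific classifiers through the include argument, tailoring the pipeline to suit specific tasks or preferences. Note that the code lists random_forest and liblinear_svc: liblinear_svc is a linear support vector classifier rather than logistic regression, and it stands in here as the linear model. The hyperparameters of the included classifiers are still tuned automatically.
automl_rflr = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,  # Time in seconds for the entire AutoML process
    per_run_time_limit=30,  # Time in seconds for each individual model run
    include={
        'classifier': ['random_forest', 'liblinear_svc']
    },
)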
4. Changing the metric to ROC AUC
Adjusting the evaluation metric, for example using ROC AUC for binary classification, makes Auto-sklearn optimize the measure that matters most for the task, which is especially useful on imbalanced datasets. Some of the available metrics are:
- accuracy
- average_precision
- f1_macro
Note
- For more metrics, check the official Auto-sklearn site.
automl_rocauc = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=240,  # Total time in seconds
    per_run_time_limit=50,  # Time limit per model run
    metric=autosklearn.metrics.roc_auc,
)
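ROC AUC scores a ranking rather than hard labels: it is the share of (positive, negative) pairs in which the positive case receives the higher score. A minimal pure-Python sketch of that definition follows, with made-up labels and scores. The function name pairwise_auc is illustrative only.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # e.g. predicted probability of class 1
auc = pairwise_auc(y_true, y_score)
print(auc)
```

In practice sklearn.metrics.roc_auc_score computes the same quantity. It expects scores, such as the second column of predict_proba, rather than the hard labels returned by predict.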
Note
- Only the first configuration (1) of the four has been implemented in detail; the remaining configurations (2, 3, 4) show only the change to the model setup, with all other steps kept the same.
There are additional points that can be explored using Auto-sklearn, such as:
- Manual hyperparameter tuning for models like Random Forest, allowing fine-tuned control over parameters such as n_estimators, max_depth, and min_samples_split.
- Changing the type of resampling in Random Forest by adjusting the bootstrap parameter, which controls whether each tree is trained on a bootstrap sample (drawn with replacement) or on the full training set.
- Exploring different feature preprocessing techniques to optimize model performance.
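For orientation, the parameters named in this list map directly onto scikit-learn's RandomForestClassifier. The sketch below uses plain scikit-learn on synthetic data, not Auto-sklearn's configuration space, and the parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (not the credit dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for bootstrap in (True, False):
    # bootstrap=True: each tree sees a sample drawn with replacement.
    # bootstrap=False: each tree is trained on the full training set.
    clf = RandomForestClassifier(
        n_estimators=50, max_depth=5, min_samples_split=4,
        bootstrap=bootstrap, random_state=0,
    )
    clf.fit(X_train, y_train)
    scores[bootstrap] = clf.score(X_test, y_test)

print(scores)
```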
For more details, visit the official Auto-sklearn site.
Conclusion
Auto-sklearn simplifies automated machine learning by streamlining model selection, hyperparameter tuning, and allowing adjustments in feature and data preprocessing. By exploring various configurations, users can better understand their impact on model performance. This flexibility empowers both novices and experienced practitioners to enhance their AutoML processes effectively.