Saving and Reusing Machine Learning Models in Python: A Practical Guide

Introduction

Machine learning models can take significant time to train, especially with large datasets. Instead of retraining a model every time you need it, you can save it to disk once and reuse it across different notebooks. In this post, we train a logistic regression model, save it with joblib, and load it in another notebook to make predictions.

1. Import Necessary Libraries

To get started, we need to import the required libraries.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
import joblib

2. Load and Describe the Dataset

We will use the breast cancer dataset bundled with Scikit-learn (load_breast_cancer), which is commonly used for binary classification problems.

# Load the dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

Dataset Description

The dataset contains features extracted from digitized images of breast masses. It includes:

569 samples
30 numeric features
Two target classes: malignant (0) and benign (1)
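A quick way to confirm the sample count, the feature count, and how the integer labels map to class names is to inspect the loaded object directly. The snippet below is a minimal sketch that only uses attributes returned by load_breast_cancer.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

# (number of samples, number of features)
print(X.shape)

# target_names[i] is the class name for integer label i
label_names = data.target_names.tolist()
for label, name in enumerate(label_names):
    print(label, name)

# Number of samples per integer label
class_counts = y.value_counts().sort_index().to_dict()
print(class_counts)
```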

3. Train a Logistic Regression Model

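Now we split the data and train a logistic regression model on the training portion. We raise max_iter to 5000 because the default of 100 iterations is often too few for the solver to converge on these unscaled features.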

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the logistic regression model
log_reg = LogisticRegression(max_iter=5000)
log_reg.fit(X_train, y_train)

LogisticRegression(max_iter=5000)

Model Description

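Logistic regression is a linear model for binary classification. It computes a weighted sum of the features and passes it through the sigmoid function to obtain the probability of the positive class. It is widely used because its coefficients are easy to interpret and it is cheap to train.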
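Before saving the model, it helps to know how well it performs. The sketch below scores the model on the held-out test set with accuracy_score, and also checks the sigmoid relationship described above by applying the sigmoid to decision_function and comparing the result with predict_proba. It retrains the model so that it runs on its own; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

log_reg = LogisticRegression(max_iter=5000)
log_reg.fit(X_train, y_train)

# Accuracy on data the model did not see during training
y_pred = log_reg.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {test_accuracy:.3f}")

# The linear score z passed through the sigmoid gives the
# predicted probability of the positive class (label 1)
z = log_reg.decision_function(X_test)
sigmoid_of_z = 1.0 / (1.0 + np.exp(-z))
positive_proba = log_reg.predict_proba(X_test)[:, 1]
matches = np.allclose(sigmoid_of_z, positive_proba)
print("Sigmoid matches predict_proba:", matches)
```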

4. Save the Model

Once the model is trained, we save it for future use.

# Save the trained model
joblib.dump(log_reg, "logistic_regression_model.pkl")

['logistic_regression_model.pkl']

Why Save a Model?

Saving a trained model allows us to:
Reuse the model without retraining it.
Share it across different notebooks or environments.
Improve efficiency by saving computation time.
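One way to gain confidence in the saved file is a quick round trip: dump the model, load it straight back, and compare it with the original. The sketch below does this in a temporary directory so it leaves no files behind; in the notebook you would run the same comparison against logistic_regression_model.pkl.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = LogisticRegression(max_iter=5000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "logistic_regression_model.pkl"
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should carry identical fitted parameters
same_coefficients = np.array_equal(model.coef_, restored.coef_)
# ... and therefore produce identical predictions
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_coefficients, same_predictions)
```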

5. Load and Use the Model in Another Notebook

Now, let’s load the saved model in another notebook and make predictions.

Description

Once the model is loaded with joblib.load, we can call predict on new data without retraining. The loaded model holds the same fitted coefficients as the original, so it returns the same predictions for the same inputs.
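A minimal sketch of what the second notebook might contain is shown here. It assumes logistic_regression_model.pkl sits in the notebook's working directory, and it rebuilds the same test split (same test_size and random_state) to stand in for new data. The helper name load_model_and_score is hypothetical, introduced only for this example.

```python
import joblib
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def load_model_and_score(model_path="logistic_regression_model.pkl"):
    """Load a saved model and score it on the held-out split (hypothetical helper)."""
    # joblib.load unpickles the file, so only load files you trust
    loaded_model = joblib.load(model_path)

    data = load_breast_cancer()
    X = pd.DataFrame(data.data, columns=data.feature_names)
    y = pd.Series(data.target)

    # Same test_size and random_state as in training, so this
    # reproduces the test set the model never saw
    _, X_test, _, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    predictions = loaded_model.predict(X_test)
    return predictions, accuracy_score(y_test, predictions)
```

In the new notebook, calling predictions, accuracy = load_model_and_score() returns the predicted labels and the test accuracy without any call to fit.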

6. Avoiding Common Issues

When saving and loading models, consider the following:

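Version compatibility: Load the model with the same Scikit-learn version that was used to save it, and keep dependencies such as NumPy and joblib consistent. Saved estimators are not guaranteed to load or behave correctly across versions.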

Feature consistency: The data passed at inference time must have the same feature names, in the same order, as the data used for training.

File path issues: When loading the model, ensure that the correct file path is provided.

7. Conclusion

Saving and reusing machine learning models simplifies workflows by eliminating the need to retrain models repeatedly. Logistic regression is a simple yet powerful model that can be easily saved and loaded using joblib. By keeping library versions, feature names, and file paths consistent, you can reuse your models across different notebooks or applications.

Happy coding!