Saving and Reusing Machine Learning Models in Python: A Practical Guide
Introduction
Machine learning models can take significant time to train, especially with large datasets. Instead of retraining a model every time you need it, you can save and reuse it across different notebooks. In this blog, we will explore how to train a logistic regression model, save it, and load it in another notebook without any issues.
1. Import Necessary Libraries
To get started, we need to import the required libraries.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
import joblib
2. Load and Describe the Dataset
We will use the breast cancer dataset that ships with Scikit-learn (load_breast_cancer), which is commonly used for binary classification problems.
# Load the dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
Dataset Description
The dataset contains features extracted from digitized images of breast masses. It includes:
569 samples
30 numeric features
Two target classes: malignant (0) and benign (1)
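The figures above can be checked directly on the loaded data. The snippet below is a small sketch that reloads the dataset so it runs on its own; the names class_names and class_counts are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Reload the dataset so this snippet runs on its own
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Number of samples and number of features
print(X.shape)

# Class names, indexed by target value (position 0 is label 0, position 1 is label 1)
class_names = [str(name) for name in data.target_names]
print(class_names)

# Number of samples per class, keyed by class name
class_counts = y.map(lambda label: class_names[label]).value_counts().to_dict()
print(class_counts)
```

The position of each name in target_names is the integer label used in y, which is the quickest way to confirm which class is encoded as 0 and which as 1.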
3. Train a Logistic Regression Model
Now, we train a logistic regression model on the dataset.
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and train the logistic regression model
log_reg = LogisticRegression(max_iter=5000)
log_reg.fit(X_train, y_train)
LogisticRegression(max_iter=5000)
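Before saving the model, it is worth checking how it performs on the held-out test set. The imports already include accuracy_score, so one way to do this is sketched below. The snippet repeats the loading, splitting, and training steps so it runs on its own, and the names y_pred and accuracy are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same data, split, and model settings as in the steps above
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

log_reg = LogisticRegression(max_iter=5000)
log_reg.fit(X_train, y_train)

# Predict labels for the test set and compare them with the true labels
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.3f}")
```

Recording this number gives a baseline: after loading the saved model elsewhere, the same test data should produce the same accuracy.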
Model Description
Logistic regression is a statistical method for binary classification that uses a sigmoid function to predict probabilities. It is widely used due to its interpretability and efficiency.
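To make the role of the sigmoid concrete, the sketch below computes the class probabilities by hand from the model's linear score (decision_function) and compares them with predict_proba. The sigmoid helper and the names scores, manual_proba, and model_proba are defined here for illustration and are not part of Scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
log_reg = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Linear score (weights times features, plus intercept) for each test sample
scores = log_reg.decision_function(X_test)

# Probability of class 1, computed by hand from the linear score
manual_proba = sigmoid(scores)

# Probability of class 1 as reported by the model
model_proba = log_reg.predict_proba(X_test)[:, 1]

# The two should agree up to floating-point error
print(np.allclose(manual_proba, model_proba))
```

This is also where the interpretability comes from: each coefficient in log_reg.coef_ describes how one feature shifts the linear score, and the sigmoid only converts that score into a probability.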
4. Save the Model
Once the model is trained, we save it for future use.
# Save the trained model
joblib.dump(log_reg, "logistic_regression_model.pkl")
['logistic_regression_model.pkl']
Why Save a Model?
Saving a trained model allows us to:
Reuse the model without retraining it.
Share it across different notebooks or environments.
Improve efficiency by saving computation time.
5. Load and Use the Model in Another Notebook
Now, let’s load the saved model in another notebook and make predictions.
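A minimal sketch of this step follows. So that it runs on its own, the first part recreates and saves the model as in steps 2 to 4; in a real second notebook only the part under the "Second notebook" comment is needed. It assumes logistic_regression_model.pkl is in the current working directory, and the names loaded_model and predictions are illustrative.

```python
import joblib
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# --- First notebook (recap): train and save the model ---
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
log_reg = LogisticRegression(max_iter=5000).fit(X_train, y_train)
joblib.dump(log_reg, "logistic_regression_model.pkl")

# --- Second notebook: load the saved model and predict, with no retraining ---
loaded_model = joblib.load("logistic_regression_model.pkl")
predictions = loaded_model.predict(X_test)
print("First five predictions:", predictions[:5].tolist())
print(f"Accuracy of the loaded model: {accuracy_score(y_test, predictions):.3f}")
```

Here X_test stands in for whatever new data the second notebook needs to score; it must contain the same columns the model was trained on.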
Description
Once the model is loaded, we use it to make predictions on new data without needing to retrain it. This ensures consistency and efficiency in a machine learning workflow.
6. Avoiding Common Issues
When saving and loading models, consider the following:
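Version compatibility: Load the model with the same version of Scikit-learn (and of its dependencies, such as NumPy) that was used to save it. A model saved with one version is not guaranteed to load or behave correctly in another.
Feature consistency: The data passed to the model at inference time must have the same feature names, in the same order, as the data used for training.
File path issues: When loading the model, make sure the path points to the saved file. A relative path is resolved against the current working directory, which may differ between notebooks.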
7. Conclusion
Saving and reusing machine learning models simplifies workflows by eliminating the need to retrain models repeatedly. Logistic regression is a simple yet powerful model that can be easily saved and loaded using joblib. By following best practices, you can seamlessly deploy your models across different notebooks or applications.
Happy coding!