In the previous two articles, we saw how to fit distributions to continuous and discrete variables in R. Let's now see how to do the same in Python for continuous variables.

Identifying and fitting distributions to the given data

Data: The given data is retail data consisting of two variables, performance index and growth, for 100 retailers.

1. Importing required libraries and data

import pandas as pd
import scipy.stats as st

data = pd.read_csv("salesdata dist fitting continuous.csv")
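
Before fitting anything, it is worth confirming the file loaded as expected. A minimal sanity check, assuming the CSV carries the two columns perindex and growth used below:

print(data.shape)                               # should be (100, 2): 100 retailers, 2 variables
print(data[['perindex', 'growth']].describe())  # summary statistics for both variables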

2. Checking which distribution is the best fit using Maximum Likelihood Estimation (MLE)

In R, we used the skewness-kurtosis plot to shortlist the distribution(s) to be fitted. In Python, we instead fit every candidate distribution by Maximum Likelihood Estimation (MLE) and compare the resulting negative log-likelihoods.
We first list the distributions we want to check and then compute the negative log-likelihood of each fitted distribution.

dist1 = [st.norm, st.uniform, st.expon, st.logistic, st.lognorm, st.gamma]
mles1 = []

for distribution in dist1:
    pars = distribution.fit(data['perindex'])        # MLE parameter estimates
    mle = distribution.nnlf(pars, data['perindex'])  # negative log-likelihood at those estimates
    mles1.append(mle)

results1 = [(distribution.name, mle) for distribution, mle in zip(dist1, mles1)]
results1
## [('norm', 369.7062661051115), ('uniform', 382.7771452980555), ('expon', 415.7225927636508), ('logistic', 372.1867588291285), ('lognorm', 369.78844724307504), ('gamma', 369.9837714720151)]

The distribution with the lowest negative log-likelihood is the best candidate for fitting. In this case, it is the normal distribution.
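
To pick the winner programmatically instead of scanning the list by eye, we can simply take the pair with the smallest negative log-likelihood. A minimal sketch, reusing results1 from above:

best1 = min(results1, key=lambda pair: pair[1])  # pair = (distribution name, negative log-likelihood)
best1
## ('norm', 369.7062661051115)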

3. Fitting the normal distribution to performance index and verifying the fit
We use the Kolmogorov-Smirnov test to verify whether the identified distribution is indeed a good fit.

args1 = st.norm.fit(data['perindex'])
st.kstest(data['perindex'], 'norm', args1)
## KstestResult(statistic=0.07740397792764364, pvalue=0.5736738798956433)

Since the p-value is greater than 0.05, we fail to reject the null hypothesis, i.e. the normal distribution is a good fit for the variable.
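
The fit-and-test steps above will be repeated for every variable, so they can be wrapped in a small helper. The sketch below is illustrative only; pick_and_test is a hypothetical name and the candidate list is the same dist1 defined earlier.

def pick_and_test(series, candidates):
    # Fit each candidate by MLE and score it by its negative log-likelihood
    scored = [(d, d.fit(series)) for d in candidates]
    scored = [(d, pars, d.nnlf(pars, series)) for d, pars in scored]
    # Keep the distribution with the lowest negative log-likelihood
    best, best_pars, _ = min(scored, key=lambda t: t[2])
    # Verify the chosen fit with the Kolmogorov-Smirnov test
    return best.name, st.kstest(series, best.name, best_pars)

pick_and_test(data['perindex'], dist1)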

Let's check the same for the growth variable.

4. Checking which distribution is the best fit using Maximum Likelihood Estimation (MLE)

dist2 = [st.norm, st.uniform, st.expon, st.logistic, st.lognorm, st.gamma]
mles2 = []

for distribution in dist2:
    pars = distribution.fit(data['growth'])
    mle = distribution.nnlf(pars, data['growth'])
    mles2.append(mle)

results2 = [(distribution.name, mle) for distribution, mle in zip(dist2, mles2)]
results2
## [('norm', 237.72880738639458), ('uniform', 277.4461966621462), ('expon', 230.36733324315537), ('logistic', 232.7570537947693), ('lognorm', 219.71722041186786), ('gamma', 219.6962856412584)]

The distribution with the lowest negative log-likelihood is the best candidate for fitting. In this case, it is the gamma distribution.

5. Fitting the gamma distribution to growth and verifying the fit

args2 = st.gamma.fit(data['growth'])
st.kstest(data['growth'], 'gamma', args2)
## KstestResult(statistic=0.061780480537573484, pvalue=0.839866188988607)

Since the p-value is greater than 0.05, we fail to reject the null hypothesis, i.e. the gamma distribution is a good fit for the variable.
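
As an optional visual check (assuming matplotlib is installed), the fitted gamma density can be overlaid on a histogram of the growth variable:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(data['growth'].min(), data['growth'].max(), 200)
plt.hist(data['growth'], bins=20, density=True, alpha=0.5, label='growth')
plt.plot(x, st.gamma.pdf(x, *args2), label='fitted gamma')  # args2 = (shape, loc, scale) from st.gamma.fit
plt.legend()
plt.show()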