Weight of Evidence and Information Value

About Weight of Evidence :

Weight of Evidence (WoE) estimates the predictive power of an independent variable in relation to the dependent variable.
WoE is originally used in credit risk analytics, as the method of separation of “good” and “bad” customers(Non-defaulters as Y=0 and Defaulters as Y=1).
WoE is defined as ln(Distribution of Good_i/Distribution of Bad_i).
Here, Distribution of Good is the proportion of good customers in a category to total good customers. Similarly, Distribution of Bad is the proportion of bad customers in a category to total bad customers.
Information Value (IV) is a highly useful tool for variable selection.
IV of an independent variable expresses the amount of diagnostic information of that variable for separating the Goods from the Bads.
IV is calculated as ∑(Distribution of Good_i−Distribution of Bad_i) × ln(Distribution of Good_i/Distribution of Bad_i)
IV helps in ranking variables based on their importance.

Data Description: The bank possesses demographic and transactional data of its loan customers. If the bank has a robust model to predict defaulters it can undertake better resource allocation.

Objective: To predict whether the customer applying for the loan will be a defaulter.

WoE using R :

Importing data

data<-read.csv("BANK LOAN WOE-IV.csv",header=T)
head(data)

##   SN AGE      TOWN EMPLOY ADDRESS DEBTINC CREDDEBT OTHDEBT DEFAULTER
## 1  1   3    Mumbai     17      12     9.3    11.36    5.01         1
## 2  2   1     Delhi     10       6    17.3     1.36    4.00         0
## 3  3   2 Bengaluru     15      14     5.5     0.86    2.17         0
## 4  4   3 Hyderabad     15      14     2.9     2.66    0.82         0
## 5  5   1 Ahmedabad      2       0    17.3     1.79    3.06         1
## 6  6   3   Chennai      5       5    10.2     0.39    2.16         0

str(data)

## 'data.frame':    700 obs. of  9 variables:
##  $ SN       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ AGE      : int  3 1 2 3 1 3 2 3 1 2 ...
##  $ TOWN     : Factor w/ 15 levels "Ahmedabad","Bengaluru",..: 12 4 2 5 1 3 10 15 14 7 ...
##  $ EMPLOY   : int  17 10 15 15 2 5 20 12 3 0 ...
##  $ ADDRESS  : int  12 6 14 14 0 5 9 11 4 13 ...
##  $ DEBTINC  : num  9.3 17.3 5.5 2.9 17.3 10.2 30.6 3.6 24.4 19.7 ...
##  $ CREDDEBT : num  11.36 1.36 0.86 2.66 1.79 ...
##  $ OTHDEBT  : num  5.01 4 2.17 0.82 3.06 ...
##  $ DEFAULTER: int  1 0 0 0 1 0 0 0 1 0 ...

Converting AGE to Factor

data$AGE<-as.factor(data$AGE)

Changing Binary Values for Defaulter

data$DEFAULTERNEW = 1-data$DEFAULTER

Calculating WoE and IV

library(Information)

## Warning: package 'Information' was built under R version 3.6.2

IV <- create_infotables(data=data, y="DEFAULTERNEW")

Getting WoE and IV values for ‘AGE’ variable

woe_age<-as.data.frame(IV$Tables$AGE) 
woe_age

##   AGE   N   Percent        WOE         IV
## 1   1 242 0.3457143 -0.4430480 0.07452269
## 2   2 284 0.4057143  0.2577412 0.09978166
## 3   3 174 0.2485714  0.3051780 0.12120615

Checking the type of key variable before merging

str(woe_age)

## 'data.frame':    3 obs. of  5 variables:
##  $ AGE    : chr  "1" "2" "3"
##  $ N      : num  242 284 174
##  $ Percent: num  0.346 0.406 0.249
##  $ WOE    : num  -0.443 0.258 0.305
##  $ IV     : num  0.0745 0.0998 0.1212

woe_age$AGE<-as.factor(woe_age$AGE)
str(woe_age)

## 'data.frame':    3 obs. of  5 variables:
##  $ AGE    : Factor w/ 3 levels "1","2","3": 1 2 3
##  $ N      : num  242 284 174
##  $ Percent: num  0.346 0.406 0.249
##  $ WOE    : num  -0.443 0.258 0.305
##  $ IV     : num  0.0745 0.0998 0.1212

Merging the datasets

leftjoin<-merge(data,woe_age,by="AGE", all.x = TRUE)
head(leftjoin)

##   AGE  SN      TOWN EMPLOY ADDRESS DEBTINC CREDDEBT OTHDEBT DEFAULTER
## 1   1 523     Kochi      4       7     4.1     0.29    0.49         0
## 2   1 376     Kochi      1       4     2.5     0.13    0.29         0
## 3   1  39    Jaipur      1       8    17.1     1.34    2.77         1
## 4   1 201 Ahmedabad      3       7     4.1     0.26    0.52         0
## 5   1 245   Kolkata      3       4    13.3     1.60    3.05         0
## 6   1  46    Kanpur      0       1     6.8     0.15    0.94         0
##   DEFAULTERNEW   N   Percent       WOE         IV
## 1            1 242 0.3457143 -0.443048 0.07452269
## 2            1 242 0.3457143 -0.443048 0.07452269
## 3            0 242 0.3457143 -0.443048 0.07452269
## 4            1 242 0.3457143 -0.443048 0.07452269
## 5            1 242 0.3457143 -0.443048 0.07452269
## 6            1 242 0.3457143 -0.443048 0.07452269

WoE and IV for variable ‘EMPLOY’

IV$Tables$EMPLOY

##     EMPLOY  N    Percent        WOE        IV
## 1    [0,0] 62 0.08857143 -1.1030952 0.1288816
## 2    [1,1] 49 0.07000000 -0.5817983 0.1555268
## 3    [2,3] 86 0.12285714 -0.9920367 0.2987787
## 4    [4,4] 47 0.06714286 -0.1811065 0.3010739
## 5    [5,6] 82 0.11714286  0.0277947 0.3011638
## 6    [7,8] 69 0.09857143  0.6239910 0.3336590
## 7   [9,10] 75 0.10714286  0.2663920 0.3407685
## 8  [11,13] 83 0.11857143  0.6449892 0.3822789
## 9  [14,17] 70 0.10000000  0.8750926 0.4424923
## 10 [18,31] 77 0.11000000  1.4323637 0.5922371

WoE and IV for variable ‘EMPLOY’ with 3 bins

IV <- create_infotables(data=data, y="DEFAULTERNEW", bins = 3)
IV$Tables$EMPLOY

##    EMPLOY   N   Percent        WOE        IV
## 1   [0,3] 197 0.2814286 -0.9267653 0.2845505
## 2   [4,9] 243 0.3471429  0.1441387 0.2915113
## 3 [10,31] 260 0.3714286  0.8898857 0.5217636

Extracting IV for all predictor variables

IV <- create_infotables(data=data, y="DEFAULTERNEW")

IV_Value = data.frame(IV$Summary)
IV_Value

##    Variable        IV
## 6   DEBTINC 0.7871927
## 4    EMPLOY 0.5922371
## 5   ADDRESS 0.3359295
## 7  CREDDEBT 0.2835522
## 8   OTHDEBT 0.1453887
## 2       AGE 0.1212061
## 3      TOWN 0.1085686
## 1        SN 0.0424855
## 9 DEFAULTER 0.0000000

WoE using Python :

Importing the data

import pandas as pd
import numpy as np
data = pd.read_csv("BANK LOAN WOE-IV.csv")
data.head()

##    SN  AGE       TOWN  EMPLOY  ADDRESS  DEBTINC  CREDDEBT  OTHDEBT  DEFAULTER
## 0   1    3     Mumbai      17       12      9.3     11.36     5.01          1
## 1   2    1      Delhi      10        6     17.3      1.36     4.00          0
## 2   3    2  Bengaluru      15       14      5.5      0.86     2.17          0
## 3   4    3  Hyderabad      15       14      2.9      2.66     0.82          0
## 4   5    1  Ahmedabad       2        0     17.3      1.79     3.06          1

data.info()

## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 700 entries, 0 to 699
## Data columns (total 9 columns):
## SN           700 non-null int64
## AGE          700 non-null int64
## TOWN         700 non-null object
## EMPLOY       700 non-null int64
## ADDRESS      700 non-null int64
## DEBTINC      700 non-null float64
## CREDDEBT     700 non-null float64
## OTHDEBT      700 non-null float64
## DEFAULTER    700 non-null int64
## dtypes: float64(3), int64(5), object(1)
## memory usage: 49.3+ KB

Convert ‘AGE’ and ‘TOWN’ to Categorical

data['AGE'] = pd.Categorical(data['AGE'])
data['TOWN'] = pd.Categorical(data['TOWN'])
data.info()

## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 700 entries, 0 to 699
## Data columns (total 9 columns):
## SN           700 non-null int64
## AGE          700 non-null category
## TOWN         700 non-null category
## EMPLOY       700 non-null int64
## ADDRESS      700 non-null int64
## DEBTINC      700 non-null float64
## CREDDEBT     700 non-null float64
## OTHDEBT      700 non-null float64
## DEFAULTER    700 non-null int64
## dtypes: category(2), float64(3), int64(4)
## memory usage: 40.6 KB

Define function to calculate WoE and IV

def calculate_woe_iv(dataset, feature, target):
    lst = []
    for i in range(dataset[feature].nunique()):
        val = list(dataset[feature].unique())[i]
        lst.append({
            'Value': val,
            'All': dataset[dataset[feature] == val].count()[feature],
            'Good': dataset[(dataset[feature] == val) & (dataset[target] == 0)].count()[feature],
            'Bad': dataset[(dataset[feature] == val) & (dataset[target] == 1)].count()[feature]
        }) 
    dset = pd.DataFrame(lst)
    dset['Distr_Good'] = dset['Good'] / dset['Good'].sum()
    dset['Distr_Bad'] = dset['Bad'] / dset['Bad'].sum()
    dset['WoE'] = np.log(dset['Distr_Good'] / dset['Distr_Bad'])
    dset = dset.replace({'WoE': {np.inf: 0, -np.inf: 0}})
    dset['IV'] = (dset['Distr_Good'] - dset['Distr_Bad']) * dset['WoE']
    iv = dset['IV'].sum()
    dset = dset.sort_values(by='WoE')
    return dset, iv

Convert numeric data into bins

data['DEBTINC_bin']= pd.qcut(data['DEBTINC'], 10)
data['CREDDEBT_bin']= pd.qcut(data['CREDDEBT'], 10)
data['ADDRESS_bin']= pd.qcut(data['ADDRESS'], 10)
data['SN_bin']= pd.qcut(data['SN'], 10)
data['OTHDEBT_bin']= pd.qcut(data['OTHDEBT'], 10)
data['EMPLOY_bin']= pd.qcut(data['EMPLOY'], 10)

Subset data for WoE and IV calculation

data2 = data[['AGE','TOWN','DEBTINC_bin','CREDDEBT_bin','ADDRESS_bin',
              'SN_bin','OTHDEBT_bin','EMPLOY_bin','DEFAULTER']]

Calculate WoE and IV

lst = []
IV_df = pd.DataFrame(columns=['Variable','IV'])
for col in data2.columns:
    if col == 'DEFAULTER': continue
    else:
        df, iv = calculate_woe_iv(data2, col, 'DEFAULTER')
               
    lst.append(df)
    IV_df = IV_df.append({
                "Variable" :col ,
                "IV" : iv,
                },ignore_index=True)

Get WoE and IV values for ‘AGE’ variable

data_AGE = lst[0]
data_AGE

##    Value  All  Good  Bad  Distr_Good  Distr_Bad       WoE        IV
## 1      1  242   156   86    0.301741   0.469945 -0.443048  0.074523
## 2      2  284   223   61    0.431335   0.333333  0.257741  0.025259
## 0      3  174   138   36    0.266925   0.196721  0.305178  0.021424

Check the type of key variable before merging

data_AGE = data_AGE.rename(columns={"Value":"AGE"})
result = pd.merge(data, data_AGE[['AGE','WoE','IV']], on='AGE')
result

##       SN  AGE       TOWN  ...    EMPLOY_bin       WoE        IV
## 0      1    3     Mumbai  ...  (14.0, 18.0]  0.305178  0.021424
## 1      4    3  Hyderabad  ...  (14.0, 18.0]  0.305178  0.021424
## 2      6    3    Chennai  ...    (4.0, 5.6]  0.305178  0.021424
## 3      8    3      Surat  ...  (11.0, 14.0]  0.305178  0.021424
## 4     13    3     Nagpur  ...  (18.0, 31.0]  0.305178  0.021424
## ..   ...  ...        ...  ...           ...       ...       ...
## 695  688    2      Kochi  ...   (9.0, 11.0]  0.257741  0.025259
## 696  689    2     Kanpur  ...  (11.0, 14.0]  0.257741  0.025259
## 697  696    2  Ahmedabad  ...    (5.6, 7.0]  0.257741  0.025259
## 698  698    2      Kochi  ...  (14.0, 18.0]  0.257741  0.025259
## 699  700    2    Kolkata  ...  (11.0, 14.0]  0.257741  0.025259
## 
## [700 rows x 17 columns]

WoE and IV for variable ‘TOWN’

data_TOWN = lst[1]
data_TOWN = data_TOWN.rename(columns={"Value":"TOWN"})
result2 = pd.merge(data, data_TOWN[['TOWN','WoE','IV']], on='TOWN')

Extracting IV for all predictor variables

IV_df

##        Variable        IV
## 0           AGE  0.121206
## 1          TOWN  0.108569
## 2   DEBTINC_bin  0.775783
## 3  CREDDEBT_bin  0.343527
## 4   ADDRESS_bin  0.237193
## 5        SN_bin  0.040126
## 6   OTHDEBT_bin  0.143252
## 7    EMPLOY_bin  0.622351

Weight of Evidence and Information Value - R vs Python

About Weight of Evidence :

WoE using R :

WoE using Python :