Weight of Evidence (WoE) estimates the predictive power of an independent variable in relation to the dependent variable.
WoE is originally used in credit risk analytics, as the method of separation of “good” and “bad” customers(Non-defaulters as Y=0 and Defaulters as Y=1).
WoE is defined as ln(Distribution of Good_i/Distribution of Bad_i).
Here, Distribution of Good is the proportion of good customers in a category to total good customers. Similarly, Distribution of Bad is the proportion of bad customers in a category to total bad customers.
Information Value (IV) is a highly useful tool for variable selection.
IV of an independent variable expresses the amount of diagnostic information of that variable for separating the Goods from the Bads.
IV is calculated as ∑(Distribution of Good_i−Distribution of Bad_i) × ln(Distribution of Good_i/Distribution of Bad_i)
IV helps in ranking variables based on their importance.
Data Description: The bank possesses demographic and transactional data of its loan customers. If the bank has a robust model to predict defaulters it can undertake better resource allocation.
Objective: To predict whether the customer applying for the loan will be a defaulter.
Importing data
data<-read.csv("BANK LOAN WOE-IV.csv",header=T)
head(data)
## SN AGE TOWN EMPLOY ADDRESS DEBTINC CREDDEBT OTHDEBT DEFAULTER
## 1 1 3 Mumbai 17 12 9.3 11.36 5.01 1
## 2 2 1 Delhi 10 6 17.3 1.36 4.00 0
## 3 3 2 Bengaluru 15 14 5.5 0.86 2.17 0
## 4 4 3 Hyderabad 15 14 2.9 2.66 0.82 0
## 5 5 1 Ahmedabad 2 0 17.3 1.79 3.06 1
## 6 6 3 Chennai 5 5 10.2 0.39 2.16 0
str(data)
## 'data.frame': 700 obs. of 9 variables:
## $ SN : int 1 2 3 4 5 6 7 8 9 10 ...
## $ AGE : int 3 1 2 3 1 3 2 3 1 2 ...
## $ TOWN : Factor w/ 15 levels "Ahmedabad","Bengaluru",..: 12 4 2 5 1 3 10 15 14 7 ...
## $ EMPLOY : int 17 10 15 15 2 5 20 12 3 0 ...
## $ ADDRESS : int 12 6 14 14 0 5 9 11 4 13 ...
## $ DEBTINC : num 9.3 17.3 5.5 2.9 17.3 10.2 30.6 3.6 24.4 19.7 ...
## $ CREDDEBT : num 11.36 1.36 0.86 2.66 1.79 ...
## $ OTHDEBT : num 5.01 4 2.17 0.82 3.06 ...
## $ DEFAULTER: int 1 0 0 0 1 0 0 0 1 0 ...
Converting AGE to Factor
data$AGE<-as.factor(data$AGE)
Changing Binary Values for Defaulter
data$DEFAULTERNEW = 1-data$DEFAULTER
Calculating WoE and IV
library(Information)
## Warning: package 'Information' was built under R version 3.6.2
IV <- create_infotables(data=data, y="DEFAULTERNEW")
Getting WoE and IV values for ‘AGE’ variable
woe_age<-as.data.frame(IV$Tables$AGE)
woe_age
## AGE N Percent WOE IV
## 1 1 242 0.3457143 -0.4430480 0.07452269
## 2 2 284 0.4057143 0.2577412 0.09978166
## 3 3 174 0.2485714 0.3051780 0.12120615
Checking the type of key variable before merging
str(woe_age)
## 'data.frame': 3 obs. of 5 variables:
## $ AGE : chr "1" "2" "3"
## $ N : num 242 284 174
## $ Percent: num 0.346 0.406 0.249
## $ WOE : num -0.443 0.258 0.305
## $ IV : num 0.0745 0.0998 0.1212
woe_age$AGE<-as.factor(woe_age$AGE)
str(woe_age)
## 'data.frame': 3 obs. of 5 variables:
## $ AGE : Factor w/ 3 levels "1","2","3": 1 2 3
## $ N : num 242 284 174
## $ Percent: num 0.346 0.406 0.249
## $ WOE : num -0.443 0.258 0.305
## $ IV : num 0.0745 0.0998 0.1212
Merging the datasets
leftjoin<-merge(data,woe_age,by="AGE", all.x = TRUE)
head(leftjoin)
## AGE SN TOWN EMPLOY ADDRESS DEBTINC CREDDEBT OTHDEBT DEFAULTER
## 1 1 523 Kochi 4 7 4.1 0.29 0.49 0
## 2 1 376 Kochi 1 4 2.5 0.13 0.29 0
## 3 1 39 Jaipur 1 8 17.1 1.34 2.77 1
## 4 1 201 Ahmedabad 3 7 4.1 0.26 0.52 0
## 5 1 245 Kolkata 3 4 13.3 1.60 3.05 0
## 6 1 46 Kanpur 0 1 6.8 0.15 0.94 0
## DEFAULTERNEW N Percent WOE IV
## 1 1 242 0.3457143 -0.443048 0.07452269
## 2 1 242 0.3457143 -0.443048 0.07452269
## 3 0 242 0.3457143 -0.443048 0.07452269
## 4 1 242 0.3457143 -0.443048 0.07452269
## 5 1 242 0.3457143 -0.443048 0.07452269
## 6 1 242 0.3457143 -0.443048 0.07452269
WoE and IV for variable ‘EMPLOY’
IV$Tables$EMPLOY
## EMPLOY N Percent WOE IV
## 1 [0,0] 62 0.08857143 -1.1030952 0.1288816
## 2 [1,1] 49 0.07000000 -0.5817983 0.1555268
## 3 [2,3] 86 0.12285714 -0.9920367 0.2987787
## 4 [4,4] 47 0.06714286 -0.1811065 0.3010739
## 5 [5,6] 82 0.11714286 0.0277947 0.3011638
## 6 [7,8] 69 0.09857143 0.6239910 0.3336590
## 7 [9,10] 75 0.10714286 0.2663920 0.3407685
## 8 [11,13] 83 0.11857143 0.6449892 0.3822789
## 9 [14,17] 70 0.10000000 0.8750926 0.4424923
## 10 [18,31] 77 0.11000000 1.4323637 0.5922371
WoE and IV for variable ‘EMPLOY’ with 3 bins
IV <- create_infotables(data=data, y="DEFAULTERNEW", bins = 3)
IV$Tables$EMPLOY
## EMPLOY N Percent WOE IV
## 1 [0,3] 197 0.2814286 -0.9267653 0.2845505
## 2 [4,9] 243 0.3471429 0.1441387 0.2915113
## 3 [10,31] 260 0.3714286 0.8898857 0.5217636
Extracting IV for all predictor variables
IV <- create_infotables(data=data, y="DEFAULTERNEW")
IV_Value = data.frame(IV$Summary)
IV_Value
## Variable IV
## 6 DEBTINC 0.7871927
## 4 EMPLOY 0.5922371
## 5 ADDRESS 0.3359295
## 7 CREDDEBT 0.2835522
## 8 OTHDEBT 0.1453887
## 2 AGE 0.1212061
## 3 TOWN 0.1085686
## 1 SN 0.0424855
## 9 DEFAULTER 0.0000000
Importing the data
import pandas as pd
import numpy as np
data = pd.read_csv("BANK LOAN WOE-IV.csv")
data.head()
## SN AGE TOWN EMPLOY ADDRESS DEBTINC CREDDEBT OTHDEBT DEFAULTER
## 0 1 3 Mumbai 17 12 9.3 11.36 5.01 1
## 1 2 1 Delhi 10 6 17.3 1.36 4.00 0
## 2 3 2 Bengaluru 15 14 5.5 0.86 2.17 0
## 3 4 3 Hyderabad 15 14 2.9 2.66 0.82 0
## 4 5 1 Ahmedabad 2 0 17.3 1.79 3.06 1
data.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 700 entries, 0 to 699
## Data columns (total 9 columns):
## SN 700 non-null int64
## AGE 700 non-null int64
## TOWN 700 non-null object
## EMPLOY 700 non-null int64
## ADDRESS 700 non-null int64
## DEBTINC 700 non-null float64
## CREDDEBT 700 non-null float64
## OTHDEBT 700 non-null float64
## DEFAULTER 700 non-null int64
## dtypes: float64(3), int64(5), object(1)
## memory usage: 49.3+ KB
Convert ‘AGE’ and ‘TOWN’ to Categorical
data['AGE'] = pd.Categorical(data['AGE'])
data['TOWN'] = pd.Categorical(data['TOWN'])
data.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 700 entries, 0 to 699
## Data columns (total 9 columns):
## SN 700 non-null int64
## AGE 700 non-null category
## TOWN 700 non-null category
## EMPLOY 700 non-null int64
## ADDRESS 700 non-null int64
## DEBTINC 700 non-null float64
## CREDDEBT 700 non-null float64
## OTHDEBT 700 non-null float64
## DEFAULTER 700 non-null int64
## dtypes: category(2), float64(3), int64(4)
## memory usage: 40.6 KB
Define function to calculate WoE and IV
def calculate_woe_iv(dataset, feature, target):
lst = []
for i in range(dataset[feature].nunique()):
val = list(dataset[feature].unique())[i]
lst.append({
'Value': val,
'All': dataset[dataset[feature] == val].count()[feature],
'Good': dataset[(dataset[feature] == val) & (dataset[target] == 0)].count()[feature],
'Bad': dataset[(dataset[feature] == val) & (dataset[target] == 1)].count()[feature]
})
dset = pd.DataFrame(lst)
dset['Distr_Good'] = dset['Good'] / dset['Good'].sum()
dset['Distr_Bad'] = dset['Bad'] / dset['Bad'].sum()
dset['WoE'] = np.log(dset['Distr_Good'] / dset['Distr_Bad'])
dset = dset.replace({'WoE': {np.inf: 0, -np.inf: 0}})
dset['IV'] = (dset['Distr_Good'] - dset['Distr_Bad']) * dset['WoE']
iv = dset['IV'].sum()
dset = dset.sort_values(by='WoE')
return dset, iv
Convert numeric data into bins
data['DEBTINC_bin']= pd.qcut(data['DEBTINC'], 10)
data['CREDDEBT_bin']= pd.qcut(data['CREDDEBT'], 10)
data['ADDRESS_bin']= pd.qcut(data['ADDRESS'], 10)
data['SN_bin']= pd.qcut(data['SN'], 10)
data['OTHDEBT_bin']= pd.qcut(data['OTHDEBT'], 10)
data['EMPLOY_bin']= pd.qcut(data['EMPLOY'], 10)
Subset data for WoE and IV calculation
data2 = data[['AGE','TOWN','DEBTINC_bin','CREDDEBT_bin','ADDRESS_bin',
'SN_bin','OTHDEBT_bin','EMPLOY_bin','DEFAULTER']]
Calculate WoE and IV
lst = []
IV_df = pd.DataFrame(columns=['Variable','IV'])
for col in data2.columns:
if col == 'DEFAULTER': continue
else:
df, iv = calculate_woe_iv(data2, col, 'DEFAULTER')
lst.append(df)
IV_df = IV_df.append({
"Variable" :col ,
"IV" : iv,
},ignore_index=True)
Get WoE and IV values for ‘AGE’ variable
data_AGE = lst[0]
data_AGE
## Value All Good Bad Distr_Good Distr_Bad WoE IV
## 1 1 242 156 86 0.301741 0.469945 -0.443048 0.074523
## 2 2 284 223 61 0.431335 0.333333 0.257741 0.025259
## 0 3 174 138 36 0.266925 0.196721 0.305178 0.021424
Check the type of key variable before merging
data_AGE = data_AGE.rename(columns={"Value":"AGE"})
result = pd.merge(data, data_AGE[['AGE','WoE','IV']], on='AGE')
result
## SN AGE TOWN ... EMPLOY_bin WoE IV
## 0 1 3 Mumbai ... (14.0, 18.0] 0.305178 0.021424
## 1 4 3 Hyderabad ... (14.0, 18.0] 0.305178 0.021424
## 2 6 3 Chennai ... (4.0, 5.6] 0.305178 0.021424
## 3 8 3 Surat ... (11.0, 14.0] 0.305178 0.021424
## 4 13 3 Nagpur ... (18.0, 31.0] 0.305178 0.021424
## .. ... ... ... ... ... ... ...
## 695 688 2 Kochi ... (9.0, 11.0] 0.257741 0.025259
## 696 689 2 Kanpur ... (11.0, 14.0] 0.257741 0.025259
## 697 696 2 Ahmedabad ... (5.6, 7.0] 0.257741 0.025259
## 698 698 2 Kochi ... (14.0, 18.0] 0.257741 0.025259
## 699 700 2 Kolkata ... (11.0, 14.0] 0.257741 0.025259
##
## [700 rows x 17 columns]
WoE and IV for variable ‘TOWN’
data_TOWN = lst[1]
data_TOWN = data_TOWN.rename(columns={"Value":"TOWN"})
result2 = pd.merge(data, data_TOWN[['TOWN','WoE','IV']], on='TOWN')
Extracting IV for all predictor variables
IV_df
## Variable IV
## 0 AGE 0.121206
## 1 TOWN 0.108569
## 2 DEBTINC_bin 0.775783
## 3 CREDDEBT_bin 0.343527
## 4 ADDRESS_bin 0.237193
## 5 SN_bin 0.040126
## 6 OTHDEBT_bin 0.143252
## 7 EMPLOY_bin 0.622351