The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems.
KNN stores all available cases and classifies a new case (or, in regression, predicts its value) based on a similarity measure.
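To make the idea concrete, here is a toy illustration (in Python, with made-up data) of classifying one new case by Euclidean distance and majority vote; the same logic underlies the R and Python implementations below.
import numpy as np
from collections import Counter

# Toy training data: two predictors and binary labels (hypothetical values)
Xtr = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.9]])
ytr = np.array([0, 0, 1, 1, 0])
xnew = np.array([1.4, 1.5])   # new case to classify
k = 3

# Euclidean distance from the new case to every stored case
dist = np.linalg.norm(Xtr - xnew, axis=1)

# Labels of the k nearest neighbours, then majority vote
nearest = np.argsort(dist)[:k]
print(Counter(ytr[nearest]).most_common(1)[0][0])
## 0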
Data Description: The bank possesses demographic and transactional data on its loan customers. If the bank has a robust model to predict defaulters, it can allocate its resources better.
Objective: To predict whether a customer applying for a loan will be a defaulter.
Importing data and removing unwanted variables
# Read the data; drop the serial number, age, and response variables,
# keeping only the continuous predictors
bankloan <- read.csv("BANK LOAN KNN.csv", header = TRUE)
bankloan2 <- subset(bankloan, select = c(-AGE, -SN, -DEFAULTER))
head(bankloan2)
## EMPLOY ADDRESS DEBTINC CREDDEBT OTHDEBT
## 1 17 12 9.3 11.36 5.01
## 2 2 0 17.3 1.79 3.06
## 3 12 11 3.6 0.13 1.24
## 4 3 4 24.4 1.36 3.28
## 5 24 14 10.0 3.93 2.47
## 6 6 9 16.3 1.72 3.01
Scaling variables
# Standardize each predictor to mean 0 and standard deviation 1, since
# KNN's distance calculations are sensitive to the scale of the variables
bankloan3 <- scale(bankloan2)
head(bankloan3)
## EMPLOY ADDRESS DEBTINC CREDDEBT OTHDEBT
## 1 1.5656796 0.6216799 -0.2881684 3.8774339687 0.51519694
## 2 -0.8239988 -1.1852951 0.7889154 0.0289356115 -0.02571385
## 3 0.7691201 0.4710987 -1.0555906 -0.6386200074 -0.53056393
## 4 -0.6646869 -0.5829701 1.7448273 -0.1439854223 0.03531198
## 5 2.6808628 0.9228424 -0.1939235 0.8895193612 -0.18937404
## 6 -0.1867512 0.1699362 0.6542799 0.0007856758 -0.03958336
Creating training and testing data sets
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# createDataPartition() returns row indices for a 70% training split;
# list = FALSE returns the indices as a matrix rather than a list
index <- createDataPartition(bankloan$SN, p = 0.7, list = FALSE)
head(index)
## Resample1
## [1,] 3
## [2,] 4
## [3,] 5
## [4,] 7
## [5,] 8
## [6,] 10
# Scaled predictors for the training and test rows
traindata <- bankloan3[index, ]
testdata  <- bankloan3[-index, ]
dim(traindata)
## [1] 273 5
dim(testdata)
## [1] 116 5
Creating class vectors
# Response labels corresponding to the training and test rows
Ytrain <- bankloan$DEFAULTER[index]
Ytest  <- bankloan$DEFAULTER[-index]
KNN classification (Continuous predictors)
knn() in the package "class" performs k-nearest neighbour classification of the test set using the training data. Distance is calculated by the Euclidean measure, and the classification is decided by majority vote among the k nearest neighbours, with ties broken at random.
library(class)
# Classify each test case by majority vote among its 20 nearest training cases
model <- knn(train = traindata, test = testdata, cl = Ytrain, k = 20)
The same BANK LOAN data is now used to fit a KNN classifier in Python.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, accuracy_score, roc_curve,
                             roc_auc_score)
Importing data and removing unwanted variables
# Read the data and drop the serial number and age; DEFAULTER is kept
# here because the predictors and response are separated below
bankloan = pd.read_csv("BANK LOAN KNN.csv")
bankloan1 = bankloan.drop(['SN', 'AGE'], axis=1)
bankloan1.head()
## EMPLOY ADDRESS DEBTINC CREDDEBT OTHDEBT DEFAULTER
## 0 17 12 9.3 11.36 5.01 1
## 1 2 0 17.3 1.79 3.06 1
## 2 12 11 3.6 0.13 1.24 0
## 3 3 4 24.4 1.36 3.28 1
## 4 24 14 10.0 3.93 2.47 0
Creating training and testing data sets
# Separate the predictors from the response
X = bankloan1.loc[:, bankloan1.columns != 'DEFAULTER']
y = bankloan1.loc[:, 'DEFAULTER']

# 70/30 train-test split; random_state fixed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.30,
                                                    random_state=999)
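The split above is not stratified on DEFAULTER, so it can be worth checking that both parts contain a similar share of defaulters; a quick sketch (the printed proportions depend on the data):
# Proportion of defaulters in the training and test responses
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))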
Preparing/Scaling variables
# Fit the scaler on the training data only, then apply it to both sets,
# so the test data is standardized with the training means and SDs
scaler = StandardScaler()
scaler.fit(X_train)
## StandardScaler(copy=True, with_mean=True, with_std=True)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
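As a quick sanity check, the scaled training columns should now have mean approximately 0 and standard deviation 1 (the test columns only approximately so, since they were scaled with the training parameters):
print(X_train.mean(axis=0).round(2))  # approximately all zeros
print(X_train.std(axis=0).round(2))   # approximately all ones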
Building the KNN Classifier (Continuous Predictors)
KNeighborsClassifier() from sklearn.neighbors performs k-nearest neighbour classification of the test set using the training data. Here the number of neighbours is set by a common rule of thumb: the square root of the number of observations.
# Rule-of-thumb choice of k: the square root of the sample size, rounded
KNNclassifier = KNeighborsClassifier(n_neighbors =
                                     int(np.sqrt(len(X)).round()))
KNNclassifier.fit(X_train, y_train)
## KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
## metric_params=None, n_jobs=None, n_neighbors=20, p=2,
## weights='uniform')
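The metrics imported earlier can now be used to assess the fitted classifier on the held-out test data; a minimal sketch (the printed values depend on the actual data):
# Predicted labels and defaulter probabilities for the test set
y_pred = KNNclassifier.predict(X_test)
y_prob = KNNclassifier.predict_proba(X_test)[:, 1]

# Confusion matrix and summary measures
print(confusion_matrix(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_prob))
In practice, k is often tuned, for example by cross-validation, rather than fixed by the square-root rule.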