Calculating the Optimal Cutoff point for Logistic Regression in R

 

Logistic regression is a statistical model used to predict the probability of a binary outcome: yes or no, true or false, success or failure. It is a type of regression analysis, a family of statistical techniques for modelling the relationship between a dependent variable and one or more independent variables.

A confusion matrix is a table that is used to evaluate the performance of a classification model. The confusion matrix shows the number of true positives, false positives, true negatives, and false negatives.

A cutoff value is a threshold that is used to classify instances into two classes. For example, if the cutoff value is 0.5, then any instance with a predicted probability of greater than or equal to 0.5 is classified as positive, and any instance with a predicted probability of less than 0.5 is classified as negative.
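As a quick illustration, applying a cutoff is just a comparison against the predicted probabilities. The probabilities below are made up for the example:

probs <- c(0.12, 0.48, 0.50, 0.73, 0.91)   # hypothetical predicted probabilities
ifelse(probs >= 0.5, 1, 0)                 # 0.5 cutoff: >= 0.5 is classified as positive
[1] 0 0 1 1 1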

The choice of cutoff value can have a significant impact on the performance of a classification model. A high cutoff value will result in fewer false positives, but it will also result in more false negatives. A low cutoff value will result in fewer false negatives, but it will also result in more false positives. The optimal cutoff value therefore depends on the specific application, and it is important to know the best cutoff value for your model. In this lab, we will see how to obtain that optimal cutoff value in R using two packages: Epi and pROC. A small made-up example of the trade-off is shown below.
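Here the same ten invented predicted probabilities are classified with a high and then a low cutoff; the false positives and false negatives move in opposite directions:

actual <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)                               # made-up true labels
probs  <- c(0.10, 0.35, 0.45, 0.55, 0.65, 0.40, 0.52, 0.60, 0.80, 0.95) # made-up probabilities

table(Predicted = ifelse(probs >= 0.7, 1, 0), Actual = actual)   # high cutoff: no false positives, 3 false negatives
         Actual
Predicted 0 1
        0 5 3
        1 0 2

table(Predicted = ifelse(probs >= 0.3, 1, 0), Actual = actual)   # low cutoff: no false negatives, 4 false positives
         Actual
Predicted 0 1
        0 1 0
        1 4 5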

 

Importing the dataset

This is a healthcare dataset containing medical measurements for a number of people, such as BMI, hypertension, average glucose level and smoking frequency. The target variable is stroke, which indicates whether or not the person had a stroke.

data<-read.csv("strokedata.csv")
head(data)
  X age hypertension heart_disease avg_glucose_level  bmi smoking_freq stroke
1 1   3            0             0             95.12 18.0            0      0
2 2  58            1             0             87.96 39.2            0      0
3 3   8            0             0            110.89 17.6            0      0
4 4  70            0             0             69.04 35.9            2      0
5 5  14            0             0            161.28 19.1            0      0
6 6  47            0             0            210.95 50.1            0      0
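
Before modelling, it is worth a quick check of how R has read the columns and how balanced the target is (a sketch; output omitted since it depends on the file):

str(data)             # column types; several categorical-looking columns are read in as integers
table(data$stroke)    # class balance of the target variable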

 

Creating train and test datasets using the createDataPartition() function from the caret library and fitting a logistic regression with glm()

library(caret)

index<-createDataPartition(data$stroke,p=0.7,list=FALSE)
data$stroke<-as.factor(data$stroke)
traindata<-data[index,]
testdata<-data[-index,]

model <- glm(stroke ~ ., data = traindata, family = "binomial")
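
Note that createDataPartition() samples at random, so the exact split (and therefore the numbers reported below) will change from run to run unless a seed is fixed with set.seed(). Once the model is fitted, summary() is a quick way to inspect it:

summary(model)        # coefficient estimates, standard errors and p-values
exp(coef(model))      # odds ratios are often easier to interpret than log-odds coefficients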

 

Confusion matrix for the default threshold (0.5)

testdata$predprob<-predict(model,testdata,type='response')
testdata$predY<-as.factor(ifelse(testdata$predprob>0.5,1,0))

confusionMatrix(testdata$predY,testdata$stroke,positive="1")
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 6377 5709
         1  259  236
                                          
               Accuracy : 0.5256          
                 95% CI : (0.5169, 0.5344)
    No Information Rate : 0.5275          
    P-Value [Acc > NIR] : 0.6627          
                                          
                  Kappa : 7e-04           
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.03970         
            Specificity : 0.96097         
         Pos Pred Value : 0.47677         
         Neg Pred Value : 0.52764         
             Prevalence : 0.47254         
         Detection Rate : 0.01876         
   Detection Prevalence : 0.03935         
      Balanced Accuracy : 0.50033         
                                          
       'Positive' Class : 1               
                                          

 

Let's see how to obtain the optimal cutoff value, i.e. the threshold that gives the best confusion matrix in terms of accuracy, sensitivity and specificity.

 

1. Using the Epi library

We first find the point where the combination of sensitivity and specificity is best, i.e. where sensitivity + specificity is maximal, and then extract the cutoff value corresponding to that point.

library(Epi)
rc <- ROC(form=stroke ~ ., data = traindata, plot="sp") 

opt <- which.max(rowSums(rc$res[, c("sens", "spec")]))
threshold1<-rc$res$lr.eta[opt]
threshold1
[1] 0.4673331
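
The res data frame inside the ROC object also stores the sensitivity and specificity at every candidate cutoff, so we can check the values at the row we just selected (the exact numbers will depend on the random split):

rc$res[opt, c("sens", "spec", "lr.eta")]   # sensitivity, specificity and cutoff at the optimum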

 

Confusion matrix for the threshold calculated above

testdata$predY<-as.factor(ifelse(testdata$predprob>threshold1,1,0))

confusionMatrix(testdata$predY,testdata$stroke,positive="1")
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 3400 3035
         1 3236 2910
                                          
               Accuracy : 0.5015          
                 95% CI : (0.4928, 0.5103)
    No Information Rate : 0.5275          
    P-Value [Acc > NIR] : 1.00000         
                                          
                  Kappa : 0.0018          
                                          
 Mcnemar's Test P-Value : 0.01155         
                                          
            Sensitivity : 0.4895          
            Specificity : 0.5124          
         Pos Pred Value : 0.4735          
         Neg Pred Value : 0.5284          
             Prevalence : 0.4725          
         Detection Rate : 0.2313          
   Detection Prevalence : 0.4885          
      Balanced Accuracy : 0.5009          
                                          
       'Positive' Class : 1               
                                          

We can see a clear difference between this confusion matrix and the first one. With the default 0.5 cutoff the model labelled almost everything as negative, giving a sensitivity of only about 0.04; with the optimal cutoff, sensitivity rises to about 0.49 at the cost of specificity falling from 0.96 to 0.51, a much more balanced trade-off between the two.

 

2. Using the pROC library

We first compute the ROC curve and then ask coords() for the "best" cutoff. By default this maximizes the Youden index (sensitivity + specificity - 1); pROC can alternatively choose the point closest to the ideal corner (0, 1) of the ROC curve, as sketched after the output below.

library(pROC)
probabilities <- predict(model, type = "response")
roc_data <- roc(traindata$stroke, probabilities)
threshold2 <- coords(roc_data, "best")$threshold
threshold2
[1] 0.4673331
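
The selection criterion and the returned columns can also be set explicitly; a short sketch of the coords() options:

coords(roc_data, "best", best.method = "youden",            # default: maximize sensitivity + specificity - 1
       ret = c("threshold", "sensitivity", "specificity"))

coords(roc_data, "best", best.method = "closest.topleft",   # alternative: point closest to (0, 1)
       ret = c("threshold", "sensitivity", "specificity"))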

 

Confusion matrix for the threshold calculated above

testdata$predY<-as.factor(ifelse(testdata$predprob>threshold2,1,0))

confusionMatrix(testdata$predY,testdata$stroke,positive="1")
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 3400 3035
         1 3236 2910
                                          
               Accuracy : 0.5015          
                 95% CI : (0.4928, 0.5103)
    No Information Rate : 0.5275          
    P-Value [Acc > NIR] : 1.00000         
                                          
                  Kappa : 0.0018          
                                          
 Mcnemar's Test P-Value : 0.01155         
                                          
            Sensitivity : 0.4895          
            Specificity : 0.5124          
         Pos Pred Value : 0.4735          
         Neg Pred Value : 0.5284          
             Prevalence : 0.4725          
         Detection Rate : 0.2313          
   Detection Prevalence : 0.4885          
      Balanced Accuracy : 0.5009          
                                          
       'Positive' Class : 1               
                                          

Both packages select the same cutoff here, so the confusion matrix is identical to the one obtained with the Epi threshold. Again, compared with the default 0.5 cutoff, sensitivity improves dramatically while some specificity is given up, leaving a much more balanced classifier.
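
Beyond a single criterion, it can also be instructive to sweep a grid of cutoffs on the test set and tabulate the resulting metrics. A sketch, reusing the predprob column computed earlier:

cutoffs <- seq(0.1, 0.9, by = 0.1)
metrics <- sapply(cutoffs, function(t) {
  # classify the test set at cutoff t and recompute the confusion matrix metrics
  pred <- factor(ifelse(testdata$predprob > t, 1, 0), levels = c(0, 1))
  cm <- confusionMatrix(pred, testdata$stroke, positive = "1")
  c(accuracy    = cm$overall[["Accuracy"]],
    sensitivity = cm$byClass[["Sensitivity"]],
    specificity = cm$byClass[["Specificity"]])
})
colnames(metrics) <- cutoffs
round(metrics, 3)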