Cost-Based Decision Thresholds

or: Yet Another Kaggle Dataset

Author

Gio Circo

Getting Started

This is a little example project I’ve been thinking about, based on some discussions regarding how to choose decision thresholds for classification models. It’s easy to fit a model that optimizes a metric like AUC or precision-recall. However, we often neglect to think about what various decision thresholds actually cost. To explore this, let’s start by loading the Kaggle credit card fraud dataset as a pandas dataframe and running some basic descriptives. We’ll also load a number of libraries we’ll need for this analysis.

Code
# Libraries set-up
import pandas as pd
import numpy as np
import lightgbm as lgb
import seaborn as sns
import matplotlib.pyplot as plt 
import os


# model fitting stuff
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from lightgbm import LGBMClassifier

# evaluation stuff
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import MinMaxScaler

# Load Data on local machine
os.chdir("C:/Users/gioc4/Dropbox/blogs/fraud/")
fraud = pd.read_csv("creditcard.csv")

np.random.seed(12435)

# colors from Paul Tol's Color Schemes
# blue #4477AA
# red #EE6677

Variable Summary

To start, we can scan through the variables using some basic descriptives. Looking at the data we see that we have a time index variable Time, a set of features labeled V1 to V28, a variable describing the dollar amount of each transaction Amount, and an indicator for whether or not the transaction was fraudulent, Class. Looking at the table below, we see that the average transaction amount for a fraudulent charge is about 34 dollars higher than for a non-fraudulent one. However, one thing we should be concerned about is that fraudulent charges are really quite rare - in this case the overall proportion of fraudulent cases is \(492/284807 = 0.00173\), which is well under one percent!

Code
# Some basic descriptives: variable names, class and amount
print(fraud.columns)

fraud.groupby('Class').agg({'Amount': 'mean', 'Class': 'count'})
Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')
            Amount   Class
Class
0        88.291022  284315
1       122.211321     492
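
Since that imbalance drives everything that follows, a quick one-liner like this gives the overall fraud rate directly:

Code
# overall proportion of fraudulent transactions (Class == 1)
print(fraud['Class'].mean())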

Descriptive Plots

While we don’t know what many of these variables actually mean, we can at least see how the conditional distributions vary between fraud and non-fraud transactions. The seaborn package is quite useful here and I do like it a lot compared to matplotlib (this is coming from a die-hard ggplot2 fan, no less). Below, we can see the distribution of a few variables - V1 to V4. A brief visual inspection gives us some indication that they can help discriminate between fraud and non-fraud charges.

Code
# Plot V1 to V4
for i in range(1, 5):

    sns.kdeplot(x = fraud[fraud['Class'] == 0].iloc[:,i], color = "#4477AA")
    sns.kdeplot(x = fraud[fraud['Class'] == 1].iloc[:,i], color = "#EE6677")
    plt.title('Fraud (Red) vs. Non-Fraud (Blue) ')

    plt.gcf().set_size_inches(5, 4)
    plt.show()

Density charts, fraud (in Red) vs non-fraud (in Blue)

Modelling

Let’s begin by dividing our data into a train and test set, then apply some light scaling to the Time and Amount variables. Since most of the features are just rescaled PCA scores, we’re not too worried about transformations. Also, we are probably going to end up relying on a tree-based model, which is insensitive to these issues. Since the positive class is very rare, we’re also going to use stratified sampling for the train and test sets to ensure we maintain the general proportions observed in the data.

Code
# apply min-max scaling to Time and Amount variables
fraud[['Time','Amount']] = MinMaxScaler().fit_transform(fraud[['Time','Amount']])

# train-test split using stratified sampling to retain the correct proportions
X, y = fraud.loc[:, fraud.columns != 'Class'], fraud['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= .2, stratify = fraud['Class'])
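
As a quick check, something like the following confirms that the stratified split kept the fraud rate essentially identical in both sets:

Code
# sanity check: the fraud rate should be roughly the same in the train and test splits
print(f'Train fraud rate: {y_train.mean():.5f}')
print(f'Test fraud rate:  {y_test.mean():.5f}')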

Model Fitting

Now we can fit a series of models. Personally, I think tree-based models like lightgbm tend to work fairly well on data like this, but I always like to benchmark the fancier models against simpler ones that are easier to understand. For this example we’ll try out a logistic regression, a linear discriminant analysis, and a boosted tree classifier using lightgbm. For the lightgbm model I will just set some very general parameters here and not worry too much about tuning (although that could certainly come later if we end up being satisfied with the model’s performance).

Code
# Set up dict of models
model_list = {}
model_list['logitL1'] = LogisticRegression(penalty='l1',solver='liblinear')
model_list['lda'] = LinearDiscriminantAnalysis()
model_list['lgbm'] = LGBMClassifier(n_estimators = 1000, learning_rate = 0.01)

# Fit Models
for nm, mod in model_list.items():
    print(f'Fitting Model {nm}')
    mod.fit(X_train, y_train)
Fitting Model logitL1
Fitting Model lda
Fitting Model lgbm

Model Evaluation

Now all we need to do is extract the predicted probabilities from each model and evaluate them against each other. Here we’ll use the F1 score to evaluate each model’s performance. In cases like this, where we have a high level of imbalance between classes, it makes more sense to use a metric that focuses on discriminating the positive cases. Accuracy makes no sense, because we could get about 99.8% accuracy by just predicting 0 for all cases! Below, I have a bit of code that creates a pandas dataframe and adds the predicted probabilities of each model. We can then run a quick for-loop to get the scores from each model. Looking at the results below, we see that out of a best possible score of 1 the lightgbm model performs best, followed by the discriminant analysis and then the logistic regression. The ‘optimal’ cutpoint, based on the F1 score, is a probability of roughly 0.12 for the lightgbm model.

Code
# Extract Probabilities and append to labels
pred_df = pd.DataFrame({'Class' : y_test})

for nm, mod in model_list.items():
    pred_df[nm] = mod.predict_proba(X_test)[:, 1]

# get PR curve and F1 from the predicted probabilities of each model
for nm in model_list:
    precision, recall, thresh = precision_recall_curve(pred_df['Class'], pred_df[nm])
    fscore = (2 * precision * recall) / (precision + recall)
    ix = np.argmax(fscore)
    print(f'{nm} Best Threshold={thresh[ix]:f}, F-Score={fscore[ix]:.3f}')
logitL1 Best Threshold=0.058669, F-Score=0.823
lda Best Threshold=0.696560, F-Score=0.847
lgbm Best Threshold=0.118516, F-Score=0.911
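
To see what that cutpoint means in practice, a small sketch like the one below applies the F1-optimal lightgbm threshold to the test set and prints the resulting confusion matrix (the cutoff value is just copied from the printout above):

Code
# apply the F1-optimal lightgbm cutpoint from the output above
f1_cut = 0.118516
y_pred_lgbm = (pred_df['lgbm'] >= f1_cut).astype(int)

# rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred_lgbm))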

Comparing Costs

One thing to consider is what the actual costs are. The F1 score is just the harmonic mean of precision and recall, which doesn’t take into account what a misclassified observation actually costs. In real use, a model that flags a transaction as fraudulent will likely incur some cost to investigate the fraud. This means we can assign a cost to a false negative (the cost of missing a fraudulent transaction) and a cost to inspecting a flagged transaction. Below, I’ll set up a small helper to classify observations at a given cutpoint, a function that takes a confusion matrix as input and returns the total cost, and a function that iterates these costs across a list of thresholds, given some fixed costs for auditing and losses.

Code
# function to classify observations based on a probability cutpoint
def classifier(p, cut):
    x = [0]*len(p)
    for i in range(0, len(p)):
        if p[i] >= cut:
            x[i] = 1
    return(x)

# function to evaluate costs
def cm_costs(cm, audit_cost, loss_cost):
    
    tn,fp,fn,tp = cm.ravel()
    
    fp_cost = fp * audit_cost
    fn_cost = fn * loss_cost
    tp_cost = tp * audit_cost
    tn_cost = 0 * audit_cost

    return(sum([fp_cost,fn_cost,tp_cost,tn_cost]))


# function to iterate costs and generate plots
def cost_thresh(ypred, ytest, c_audit, c_loss):
    # set up prob thresholds and cost list
    thresh = [x * 0.01 for x in range(1, 100)]
    cost_list = []

    # iterate over thresholds
    for i in thresh:
        y_pred = classifier(ypred, i)
        cm = confusion_matrix(ytest, y_pred)
        c = cm_costs(cm, audit_cost = c_audit, loss_cost=c_loss)
    
        cost_list.append(c)

    # find the optimal cutpoint (the threshold with the lowest total cost)
    opt = thresh[np.argmin(cost_list)]

    # plot
    sns.set_style("whitegrid")
    p = sns.relplot(x = thresh, y  = cost_list, kind = "line", color = '#4477AA')
    p.set_xlabels("Cutpoint Threshold: %.3f"  % opt, fontsize = 14)
    p.set_ylabels("Estimated Cost (in USD)", fontsize = 14)
    plt.axvline(opt, ls = '--', color = '#EE6677')

    return(p)
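
Before applying this to the model’s predictions, it’s worth spelling out what cm_costs computes: the total cost at a given cutpoint is audit_cost × (TP + FP) + loss_cost × FN, since every flagged transaction gets audited (whether or not it is actually fraud) and every missed fraud is written off as a loss. As a quick sanity check, here is a minimal sketch with a purely hypothetical confusion matrix (the counts are made up for illustration):

Code
# hypothetical example: 100 false positives, 10 false negatives, 40 true positives
example_cm = np.array([[56000, 100], [10, 40]])

# expected: 25*(40 + 100) + 122*10 = 3500 + 1220 = 4720
print(cm_costs(example_cm, audit_cost = 25, loss_cost = 122))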

Now let’s set up a hypothetical scenario. We can assume that it costs $25 to evaluate a claim, and that on average, missing a fraudulent transaction costs us $122. We can then iterate over probability thresholds for our model and see what the optimal cutpoint is, given these costs.

Code
"""
let's assume the following:
    it costs $25 to audit a fraud claim
    if the claim is not flagged as fraudulent, we lose $122 on average
    
What is the tradeoff in costs?
"""

plt1 = cost_thresh(ypred = pred_df['lgbm'].tolist(), ytest = y_test, c_audit = 25, c_loss = 122)
plt.gcf().set_size_inches(3.75, 3)
plt.show()

What happens if the cost of auditing a fraud claim is twice as expensive?

Code
plt2 = cost_thresh(ypred = pred_df['lgbm'].tolist(), ytest = y_test, c_audit = 50, c_loss = 122)
plt.gcf().set_size_inches(3.75, 3)
plt.show()

Or if the cost of fraud is twice as expensive?

Code
plt3 = cost_thresh(ypred = pred_df['lgbm'].tolist(), ytest = y_test, c_audit = 25, c_loss = 244)
plt.gcf().set_size_inches(3.75, 3)
plt.show()
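
To put the three scenarios side by side, a short sketch like the following reports just the cost-minimizing cutpoint and total cost under each set of assumed costs. The best_cutpoint helper here is a convenience wrapper around the classifier and cm_costs functions defined above, not something from the earlier code.

Code
# compare the cost-minimizing cutpoint under each pair of assumed costs
def best_cutpoint(ypred, ytest, c_audit, c_loss):
    thresh = [x * 0.01 for x in range(1, 100)]
    costs = [cm_costs(confusion_matrix(ytest, classifier(ypred, t)), c_audit, c_loss) for t in thresh]
    ix = np.argmin(costs)
    return thresh[ix], costs[ix]

for c_audit, c_loss in [(25, 122), (50, 122), (25, 244)]:
    cut, cost = best_cutpoint(pred_df['lgbm'].tolist(), y_test, c_audit, c_loss)
    print(f'audit=${c_audit}, loss=${c_loss}: cutpoint={cut:.2f}, total cost=${cost:,.0f}')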

Summary

Regardless, we can see that the “optimal” decision threshold changes based on our assumptions about the cost of doing (or not doing) something. Directly optimizing a metric like the F1 score is different from putting a model into use and dealing with the direct costs of its implementation.