Fraud Detection Using Machine Learning - Stanford University PDF Free Download

1y ago

19 Views

1 Downloads

677.75 KB

6 Pages

Report/dmca

Download PDF

Transcription

Fraud Detection using Machine LearningAditya Oza - aditya19@stanford.eduAbstract— Recent research has shown that machine learningtechniques have been applied very effectively to the problem ofpayments related fraud detection. Such ML based techniqueshave the potential to evolve and detect previously unseen patterns of fraud. In this paper, we apply multiple ML techniquesbased on Logistic regression and Support Vector Machine to theproblem of payments fraud detection using a labeled datasetcontaining payment transactions. We show that our proposedapproaches are able to detect fraud transactions with highaccuracy and reasonably low number of false positives.I. INTRODUCTIONWe are living in a world which is rapidly adopting digitalpayments systems. Credit card and payments companiesare experiencing a very rapid growth in their transactionvolume. In third quarter of 2018, PayPal Inc (a San Josebased payments company) processed 143 billion USD in totalpayment volume [4]. Along with this transformation, there isalso a rapid increase in financial fraud that happens in thesepayment systems.An effective fraud detection system should be able to detect fraudulent transactions with high accuracy and efficiency.While it is necessary to prevent bad actors from executingfraudulent transactions, it is also very critical to ensuregenuine users are not prevented from accessing the paymentssystem. A large number of false positives may translate intobad customer experience and may lead customers to taketheir business elsewhere.A major challenge in applying ML to fraud detection ispresence of highly imbalanced data sets. In many availabledatasets, majority of transactions are genuine with an extremely small percentage of fraudulent ones. Designing anaccurate and efficient fraud detection system that is low onfalse positives but detects fraudulent activity effectively is asignificant challenge for researchers.In our paper, we apply multiple binary classificationapproaches - Logistic regression , Linear SVM and SVMwith RBF kernel on a labeled dataset that consists of paymenttransactions.Our goal is to build binary classifiers which areable to separate fraud transactions from non-fraud transactions. We compare the effectiveness of these approaches indetecting fraud transactions.II. RELEVANT RESEARCHSeveral ML and non-ML based approaches have beenapplied to the problem of payments fraud detection. Thepaper [1] reviews and compares such multiple state of theart techniques, datasets and evaluation criteria applied tothis problem. It discusses both supervised and unsupervisedML based approaches involving ANN (Artificial NeuralNetworks), SVM (Support Vector machines) ,HMM (HiddenMarkov Models), clustering etc. The paper [5] proposes arule based technique applied to fraud detection problem. Thepaper [3] discusses the problem of imbalanced data that resultin a very high number of false positives and proposes techniques to alleviate this problem. In [2], the authors propose anSVM based technique to detect metamorphic malware. Thispaper also discusses the problem of imbalanced data sets fewer malware samples compared to benign files - and howto successfully detect them with high precision and accuracy.III. DATASET AND ANALYSISIn this project, we have used a Kaggle provided dataset[8] of simulated mobile based payment transactions. Weanalyze this data by categorizing it with respect to differenttypes of transactions it contains. We also perform PCA Principal Component Analysis - to visualize the variabilityof data in two dimensional space. The dataset contains fivecategories of transactions labeled as ’CASH IN’, ’CASHOUT’, ’DEBIT’, ’TRANSFER’ and ’PAYMENT’ - detailsare provided in table ITransaction TypeNon-fraud transactionsFraud transactionsTotalCASH INCASH 094143221514946362620TABLE I: Paysim dataset statisticsPaysim dataset consists of both numerical and categoricalfeatures like transaction type ,amount transferred, accountnumbers of sender and recipient accounts. In our experimentswe use the following features to train our models.1) Transaction type2) Transaction amount3) Sender account balance before transaction4) Sender account balance after transaction5) Recipient account balance before transaction6) Recipient account balance after transactionThe dataset consists of around 6 million transactions outof which 8312 transactions are labeled as fraud. It is highlyimbalanced with 0.13 percent fraud transactions. We displaythe result of performing two dimensional PCA on subsetsfor two transaction types that contain frauds - TRANSFERand CASH OUT transactions.

The PCA decomposition of TRANSFER transactionsshows a high variability across two principal componentsfor non-fraud and fraud transactions. This gave us confidencethat TRANSFER dataset can be linearly separable and ourchosen algorithms - Logistic regression and Linear SVM arelikely to perform very well on such a dataset. In the resultssection, we’ll see that this is indeed the case.interpreted as a probability of x as belonging to class 1. Thelogistic loss function with respect to parameters θ can begiven asJ(θ) m 1 Xlog 1 exp( y (i) θT x(i) )m i 1B. Support Vector MachineSupport vector machine creates a classification hyperplane in the space defined by input feature vectors. The training process aims to determine a hyper-plane that maximizesgeometric margin with respect to labeled input data. SVMsoptimization problem can be characterized bymX1minγ,w,b w 2 C i2i 1s.t. y (i) (wT x(i) b) 1 i , i 1, ., m i 0, i 1, ., m(a) TRANSFER transactionsIn this project we use two variants of SVM - linear SVMand SVM based on RBF kernel. An SVM based on RBFkernel can find a non-linear decision boundary in input space.The RBF kernel function on two vectors x and z in the inputspace can be defined as x z 2K(x, z) exp 2σ 2.C. Class weights based approach(b) CASH OUT transactionsFig. 1: PCA decomposition of Paysim dataIV. M ETHODOur goal is to separate fraud and non-fraud transactions byobtaining a decision boundary in the feature space definedby input transactions. Each transaction can be represented asvector of its feature values. We have built binary classifiersusing Logistic regression, linear SVM and SVM with RBFkernels for TRANSFER and CASH OUT sets respectively.A. Logistic RegressionLogistic regression is a technique used to find a lineardecision boundary for a binary classifier. For a given inputfeature vector x, a logistic regression model with parameter θclassifies the input x using the following hypothesis hθ (x) 1g(θT x) 1 e θT x where g is known as Sigmoid function.For a binary classification problem, the output hθ (x) can beWe assign different weights to samples belonging fraudand non-fraud classes for each of the three techniquesrespectively. Such an approach has been used to counter dataimbalance problem - with only 0.13 percent fraud transactions available to us. In a payments fraud detection system,it is more critical to catch potential fraud transactions thanto ensure all non-fraud transactions are executed smoothly.In our proposed approaches, we penalize mistakes made inmisclassifying fraud samples more than misclassifying nonfraud samples. We trained our models (for each technique)using higher class weights for fraud samples compared tonon-fraud samples.We fine tune our models by choosing class weights toobtain desired/optimal balance between precision and recallscores on our fraud class of samples. We have chosen classweights such that we do not have more than around 1 percentof false positives on CV set. This design trade off enables usto establish a balance between detecting fraud transactionswith high accuracy and preventing large number of falsepositives. A false positive, in our case, is when a non-fraudtransaction is misclassified as a fraudulent one.V. E XPERIMENTSIn this section,we describe our dataset split strategy andtraining ,validation and testing processes that we have implemented in this work. All software was developed usingScikit-learn [7] ML library.

A. Dataset split strategyWe divided our dataset based on different transaction typesdescribed in dataset section. In particular, we use TRANSFER and CASH OUT transactions for our experiments sincethey contain fraud transactions. For both types, we dividedrespective datasets into three splits - 70 percent for training,15 percent for CV and 15 percent for testing purposes.We use stratified sampling in creating train/CV/test splits.Stratified sampling allows us to maintain the same proportionof each class in a split as in the original dataset. Details ofthe splits are mentioned in tables II and III.SplitTRANSFERFraudNon 2528812Totalmodels by trying out multiple combinations of weights onour CV split.Overall, we observed that higher class weights gave ushigher recall at the cost of lower precision on our CV split.In figure 2, we show this observed behavior for CASH OUTtransactions.(a) Logistic Regression(b) Linear SVM3730367993679937532909TABLE II: Dataset split details(c) SVM with RBF kernelB. Model training and tuningWe employed class weight based approach as described inprevious section to train our each of our models. Each modelwas trained multiple times using increasing weights for fraudclass samples. At the end of each iteration, we evaluated ourtrained models by measuring their performance on CV split.For each model, we chose the class weights which gave ushighest recall on fraud class with not more than 1 percentfalse positives.Finally, we used the models trained using chosen set ofclass weights to make predictions on our test dataset split. Inthe next section, we elaborate on our choice of class weightsbased on their performance on CV set. We also discuss theirperformance on train and test sets.Fig. 2: CASH OUT - Precision,Recall, F1 trend for increasing fraud class weights(a) Logistic Regression(b) Linear SVMVI. R ESULTS AND DISCUSSIONIn this section, we discuss results obtained in training,validation and testing phases. We evaluated performance ofour models by computing metrics like recall, precision, f1score, area under precision-recall curve (AUPRC).(c) SVM with RBF kernelFig. 3: TRANSFER - Precision,Recall, F1 trend for increasing fraud class weightsA. Class weights selectionIn our experiments, we used increasing weights for fraudsamples. We initially considered making class weights equalto imbalance ratio in our dataset. This approach seemed togive good recall but also resulted in very high number offalse positives - 1 percent - especially for CASH OUT.Hence, we did not use this approach and instead tuned ourSplitCASH OUTFraudNon 50082233384Total15662503356253356252237500TABLE III: Dataset split detailsFor TRANSFER dataset, the effect of increasing weightsis less prominent, in particular for Logistic Regression andLinear SVM algorithms. That is, equal class weights forfraud and non-fraud samples give us high recall and precisionscores. Based on these results, we still chose higher weightsfor fraud samples to avoid over-fitting on CV set. Figure4 shows precision-recall curves obtained on CV set usingchosen class weights for all three algorithms. Table IVsummarizes these results via precision,recall,f1-measure andAUPRC scores. We chose to plot precision/recall curves(PRC) over ROC as PRCs are more sensitive to misclassifications when dealing with highly imbalanced datasets likeours. The final values of selected class weights are mentionedin table V.

(a) TRANSFER(a) TRANSFER(b) CASH OUT(b) CASH OUTFig. 4: CV set - Precision-Recall curveFig. 5: Train set - Precision-Recall curveTRANSFERRecallPrecisionAlgorithmLogistic RegressionLinear SVMSVM with RBF kernel0.99830.99830.9934Logistic RegressionLinear SVMSVM with RBF 0.73810.92480.91610.9855Logistic RegressionLinear SVMSVM with RBF 72350.67270.7598Logistic RegressionLinear SVMSVM with RBF kernel0.44160.44320.5871CASH .1315TABLE IV: Results on CV setTRANSFERAlgorithmnon-fraudLogistic regression1Linear SVM1SVM with RBF kernel1CASH OUTAlgorithmnon-fraudLogistic regressionLinear SVMSVM with RBF 0.44520.44310.6035CASH 0.7631TABLE VI: Results on Train setfraud703916fraud145132128TABLE V: Class weightsB. Results on train and test setsIn this section, we discuss results on train and test setsusing chosen class weights. Figure 5 and table VI summarizethe results on train set.Similarly, figure 6 and table VII summarize the resultson test set. We get very high recall and AUPRC scoresfor TRANSFER transactions with 0.99 recall score forall three algorithms. In particular, SVM with RBF kernelgives us the best AUPRC value because it has much higherprecision compared to the other two algorithms. Table VIIIdisplays corresponding confusion matrices obtained on testset of TRANSFER. We are able to detect more than 600fraud transactions for all three algorithms with less than 1percent false positives. TRANSFER transactions had shown ahigh variability across their two principal components whenwe performed PCA on it. This set of transactions seemedto be linearly separable - with all three of our proposedalgorithms expected to perform well on it. We can see thisis indeed the case.For CASH OUT transactions, we obtain less promisingresults compared to TRANSFER for both train and test sets.

Logistic regression and linear SVM have similar performance(and hence similar linear decision boundaries and PR curves).SVM with RBF gives a higher recall but with lower precisionon average for this set of transactions. A possible reasonfor this outcome could be non-linear decision boundarycomputed using RBF kernel function. However, for all threealgorithms, we can obtain high recall scores if we are moretolerant to false positives. In the real world, this is purelya design/business decision and depends on how many falsepositives is a payments company willing to tolerate.TABLE VIII: Confusion matrices(a) Logistic Regression(b) Linear SVMPredTrue 785573 765612True Pred 78579 7433612(c) SVM with RBF kernelTrue Pred 78886 4367608VII. C ONCLUSION AND FUTURE WORK(b) CASH OUTIn fraud detection, we often deal with highly imbalanceddatasets. For the chosen dataset (Paysim), we show that ourproposed approaches are able to detect fraud transactionswith very high accuracy and low false positives - especiallyfor TRANSFER transactions. Fraud detection often involvesa trade off between correctly detecting fraudulent samplesand not misclassifying many non-fraud samples. This isoften a design choice/business decision which every digitalpayments company needs to make. We’ve dealt with thisproblem by proposing our class weight based approach.We can further improve our techniques by using algorithms like Decision trees to leverage categorical featuresassociated with accounts/users in Paysim dataset. Paysimdataset can also be interpreted as time series. We can leveragethis property to build time series based models using algorithms like CNN. Our current approach deals with entire setof transactions as a whole to train our models. We can createuser specific models - which are based on user’s previoustransactional behavior - and use them to further improve ourdecision making process. All of these, we believe, can bevery effective in improving our classification quality on thisdataset.Fig. 6: Test set - Precision-Recall curveAPPENDIX(a) TRANSFERGithub code link - hmLogistic RegressionLinear SVMSVM with RBF kernelAlgorithmLogistic RegressionLinear SVMSVM with RBF 4440.45160.5823CASH 1321f1-measureAUPRCVIII. 3f1-measureAUPRCI would like to thank Professor Ng and entire teachingstaff for a very well organized and taught class. In particular,I would like to thank my project mentor, Fantine, for hervaluable insights and guidance during the course of BLE VII: Results on Test setOverall, we observe that all our proposed approachesseem to detect fraud transactions with high accuracy andlow false positives - especially for TRANSFER transactions.With more tolerance to false positives, we can see that it canperform well on CASH OUT transactions as well.R EFERENCES[1] A Survey of Credit Card Fraud Detection Techniques: Data andTechnique Oriented Perspective - Samaneh Sorournejad, Zojah, Ataniet.al - November 2016[2] Support Vector machines and malware detection - T.Singh,F.Di Troia,C.Vissagio , Mark Stamp - San Jose State University - October 2015[3] Solving the False positives problem in fraud prediction using automated feature engineering - Wedge, Canter, Rubio et.al - October 2017[4] PayPal Inc. Quarterly results ird-quarter-2018-results[5] A Model for Rule Based Fraud Detection in Telecommunications Rajani, Padmavathamma - IJERT - 2012

[6] HTTP Attack detection using n gram analysis - A. Oza, R.Low,M.Stamp - Computers and Security Journal - September 2014[7] Scikit learn - machine learning library http://scikit-learn.org[8] Paysim - Synthetic Financial Datasets For Fraud im1