Ijpam.eu A Survey On Fraud A Nalytics Using Predictive M .

Transcription

International Journal of Pure and Applied MathematicsVolume 114 No. 7 2017, 755-767ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)url: http://www.ijpam.euSpecial Issueijpam.euA Survey on Fraud Analytics Using PredictiveModel in Insurance Claims11K. Ulaga Priya and 2S. PushpaDept of Computer Science and Engineering,St.Peters University.ulagapriya@gmail.com2Dept of Computer Science and Engineering,St.Peters ce Industry is a rapidly growing fast industry in terms of largeamount of data. The most critical issue in insurance industry is fraudulentclaims. Fraud is nothing but wrongful or criminal trick planned to result infinancial or personal gains. As the size of data increases, the traditionalapproach will not work and it will be tedious job to identify the fraudulentclaims. Moreover, new types of claim will emerge and hence it will bedifficult to predict the fraudulent claims. This paper depicts an overview ofFraud analytics, prediction, and Data Science algorithms based predictionsin insurance industry.755

International Journal of Pure and Applied MathematicsSpecial Issue1. IntroductionFraud analytics is a type of data analytics where data analysis is done on thefraudulent behaviour. There are several domains where fraud may happen likeCredit card fraud, telecommunication fraud, Insurance fraud, Healthcare fraud,tax evasion etc. Credit card fraud is one of the fraud types which is surveyedwidely in the domain of fraud detection.[34],[35,[36].Due to the popular modeof payment transaction, both online and offline, the fraud associated with it isalso increasing. There are multiple techniques to detect credit card fraud likeNeural Network [10-11],Group Method of Data Handling [4-5], Bagging[6].Some other popular models of credit card fraud detection are Hidden MarkovModel [2-3], Bayesian learning [7-9], K-means Clustering [1].The credit cardfraud was categorised [35] as two categories namely behavioural frauds andApplication frauds. Application frauds happen whenever fraudsters[33] acquirenew cards by providing false data to issuing companies[33]. Behavioural fraudsinclude four types: mail theft, fake cards, stolen/lost cards. Several algorithms[43] in credit card fraud prediction were compared and derived that Baggingensemble classifier is the best method.Telecommunication fraud is rapidly increasing due to the growth of recenttechnology and global communication which results in considerable losses inbusiness. There are two categories of telecommunication fraud: subscriptionfraud and super imposed fraud. Subscription fraud is nothing but claiming falseidentity for getting service and elude payment. Superimposed fraud happenswhenever the service is used without having relevant rights and is usuallydetected by the appearance of 'phantom' call on a bill. Various techniques usedin telecommunication fraud detection[12] are Neural Networks, VisualizationMethods and Rule-based Approach.Insurance fraud is defined [37] as fraud in the insurance industry as perceptivelycreating a fabricated claim, bloating a claim or adding further items to a claim,or being in any way deceitful with the intention of getting more than legitimateprivilege. The insurance fraud types include exaggerated claims, fabricatedmedical history, post-dated policies, faked damage etc. [30] This emphasize thedifferent types of fraud in health insurance sector. There are different techniquesfor health insurance fraud detection[22]. This paper concentrates on InsuranceFraud and its data analytics. The National Healthcare Anti-Fraud Association(NHCAA) evaluated the health care claims and announced that 10 percent ofhealth care claims contain some element of fraud [38][39]. Insurance protectsthe customer from monetary loss. Insurance Policy is a legal agreement betweenthe Policy holder and insurance [23] company which specifies the claim amountwhich the Policy Holder needs to pay. Insurance claim is nothing but, the policyholder request the claim amount from the insurance company based on theinsurance policy. Insurance domain can be categorised as (i)Health Insurance.(ii) Travel Insurance (iii) Auto insurance (iv) Life insurance756

International Journal of Pure and Applied MathematicsSpecial IssueThe section2 describes about Bigdata analytics in identifying fraudulent claims.Section3 describes one of the Bigdata Analytics type which is predictiveAnalytics. This section also discusses the various types of Machine learningalgorithms, Section4 discusses the merits and demerits of the algorithm, it alsoexplains the Fraud analytics process model. Section5 discusses the performancebenchmark of different types of fraud. Section6 depicts the conclusion andSection7 holds the references.2. Big Data AnalyticsFraud detection in insurance is a potential area in insurance where big dataplays a major role. However, many insurers remain unknown about the powerof data analytics. According to the survey conducted almost 80% of insurer isunaware about the power of Big Data Analytics. Let us examine a few dataanalytic models that can help insurers strengthen their fraud detectioncapabilities.i) Descriptive – Analysing the data on what was already happened. Generally,reports were generated with past data and analysis is done on that data. Forexample, to identify the sales distribution that has happened in previous year.ii) Diagnostic – Based on the previous data, data analysis will be done on why itis has happened. Identifying and analysing the reason for poor sales in theprevious year is an example of diagnostic data analysis.iii) Predictive–This type of analysis will suggest[27] what will happen in thefuture. It predicts the futuristic scenario based on past historical data. Forexample, identifying the area that is likely to perform better sales in the currentyear based on past data.iv)Prescriptive – This type of analysis will suggest what action should be taken.Basically, how we can make it happen. It gives recommendation on what needsto be done. For example, how to achieve the best outcome in sales, and strategyto retain key customers.3. Predictive AnalyticsThis paper discusses on predictive analytics and the techniques used forprediction. Supervised learning and Unsupervised learning are the [28]techniques used for predictive analytics. Supervised Learning will have a target757

International Journal of Pure and Applied MathematicsSpecial Issuevariable. Target variable is the output that is predicted using other relevantfeatures. Unsupervised Learning does not have a target variable. Followingsupervised learning techniques are used for predicting fraudulent claims since ithas a target variable. Decision tree Random Forest Support Vector Machines Neural Network XGBoostThese techniques are used in solving data analytics problems.Decision TreeDecision tree gives a visualisation view in the form of graph. The sample set isdivided into subset of trees which represent choices and their results. Each nodeof a tree represents a choice and the edges represent the decision. The sampledataset is categorised into training dataset and test dataset. A model is createdwith training dataset which gives the prediction accuracy. This model is appliedon the test dataset and the accuracy of prediction is validated. For each predictorvariable, this model can be used to decide on the category(Yes/No, Spam/notspam-) of the data.Decision tree can deal with continuous data through variousmethod of decision tree like ID3 method and C4.5.Decision tree is used in various fraud detection and prediction applications.Some of the fraudulent problem areas where decision tree is used are credit cardfraud, Energy fraud etc. Credit card fraud detection [40] uses a cost sensitivedecision tree approach. Decision tree is also effectively used in Energy Frauddetection. This technique is widely used for classification and regression. M5PDecision tree is used for energy fraud detection which is a modified version ofQuinlan’s [12] M5 algorithm. Following is the general algorithm.Input: Training datasetOutput :To create a decision Tree.Step 1: Identify the best attribute of the dataset which need to be placed at theroot of the tree.Step 2: Divide the training set into subsets. Each subset should contain datawith the same value such that each subset is created for an attribute.Step 3: Till you find leaf nodes step1 and step2 is to be repeated on eachsubset in all the branches of the tree.758

International Journal of Pure and Applied MathematicsSpecial IssueRandom ForestIn the random forest technique, multiple decision trees are created. A randomsubset of the training data is used to create a single decision tree. [16] Thecommon result of each random subset is taken as the final tree output. A newstudy is fed into all the trees and majority vote for each classification was takenin this model. Missing values and outliers are taken care in random forestmodel.The predictive algorithm which uses this technique will try to imitate therelationship between input and output variable. This algorithm providesexcellent accuracy and it runs very effectively on large datasets. Thisalgorithm[14] is widely used for large number of input. Moreover, it hasmethods for maintaining balance for the unbalanced datasetsIt is identified that for the aggregated model random forest gives better resultsthan Naïve Bayes. Where as in the personalized models Naïve Bayes givesbetter results. In online shopping [15] when large number of discounts areannounced, it paves way for unusual activities in purchasing products andservices. This paper uses random forest algorithm to detect faults using Rlanguage. Prediction can be done using Random Forest technique to identifycustomer’s preference regarding the choice of insurance policy options. [12].Following is the algorithm:Input: Training datasetOutput: To create “n” of TreesStep 1: Randomly select “k” features from total “m” features Where k mStep 2: Among the “k” features, calculate the node “d” using the best split point.Step3: Split the node into daughter nodes using the best split.Step4: Repeat 1 to 3 steps until “l” number of nodes has been reached.Step 5: Build forest by repeating steps 1 to 4 for “n” number times to create “n” number oftrees.Neural NetworksThe fundamental element of computation in neural network is the neuron whichis also called as node or unit. The input from other nodes is computed andproduces an output. Basically, it converts the input from multiple sources tooutput. Whereas in human brains has a distinct feature of creating transientstates through neurons in between sensory organ and brain which is the decision759

International Journal of Pure and Applied MathematicsSpecial Issuetaking unit.To detect and predict the risk of fraudulent financial reporting, a MultilayerPerceptron (MLP) [17] Artificial Neural Network model was proposed.Weatherford suggests, artificial immune systems, recurrent neural networks,back propagation neural networks for fraud detection. A neural networkapproach is identified to detect management fraud. The management fraud isdetected [18] using the Adaptive Logic Network and generalized adaptiveneural network.[42] A three-layer was used with feed-forward Radial BasisFunction (RBF) neural network which will produce in every two hours for newcredit card transactions. This also propose fuzzy neural networks on parallelmachines which rises the rule production for customer-specific credit card frauddetection. Neural Network gave better results for prediction when compared toLogistic Regression ad Decision Tree[18]. A case study was done with 5strategies to audit the auto insurance claims[9].Input: Training datasetOutput: To create data model.Step 1:Assign random weights to all the linkages to start the algorithmStep 2: Using the inputs and the (input-hidden node) linkages find the activation rateof hidden nodesStep 3: Using the activation rate of hidden nodes and linkages to output, find theactivation rate of output nodesStep 4: Find the error rate at the output node and recalibrate all linkages betweenhidden nodes and output nodesStep 5: Using the weights and error found at the output node, cascade down the errortohidden nodesStep 6:Recalibrate the weights between hidden node and the input nodes repeat theprocess till the convergence criterion is metStep 7: Using the final linkage weights score the activation rate of the output nodesXGBoostXGBoost is a short form for Extreme Gradient Boosting. Boosting is asequential process. Multiple trees are created and the information of the firsttree is fed as input to the second tree so that it improves the prediction insubsequent iterations. Basically it is a additive tree model where it add newtrees that complement the already built ones. XGBoost handles missing valuesand it works only for numeric data.Support Vector MachineSupport Vector Machines (SVM) is also a supervised learning algorithm usedfor regression and classification problems. In general, it creates a hyper plane in760

International Journal of Pure and Applied MathematicsSpecial Issuen dimensional space to classify the data based on target class. The SVMseparates into different classes through a hyperplane or multiple hyperplane.The hyperplane separates the data points and sometimes it is difficult to separatethe data point through a single hyperplane. The distance between the data pointand hyperplane represents a margin.This enables to perform classification or regression also. Since it has manyfeatures SVM becomes a promising technique in prediction. [25]Basically,SVM works on the principle that data points are segregated throughhyperplanes. This subsequently maximizes the distance between data points,and the hyper plane is constructed with the help of support vectors. A Turkishinsurance company database [19] was taken for research. SVM technique wasapplied to this data. SVM is basically a classification technique that identifieseach record as anomalous or normal record. Subsequently every record ischecked with margin and based on that the record is treated as normal oranomalous. SVM is a kernel based [19] algorithm where kernel transmutes theinput data points to a high-dimensional space so that the problem is solved. [25]There are different applications which detect fraud through SVM. The topmanagement fraud is detected using SVM, to create the Fraudulent Financialstatement. [20]4. DiscussionA comparative study is done on the Supervised Technique. Each technique hasits own merits and demerits. Based on the application area and data techniquecan be chosen and analytics can be done on that. The merits and demerits arediscussed below as follows:Fraud Analytics Process ModelAs a first step the business problem must be clearly identified. Next step is toidentify the data source which is a very important task in data analysis model.[29] Then subsequently all the data is gathered in one single area which couldbe a data mart or data warehouse. Then the data is cleaned up re, inconsistent,761

International Journal of Pure and Applied MathematicsSpecial Issuemissing and duplicate values are removed. Additional data transformation isdone like data type conversion etc. In analytics phase, data model is built anddata is analysed with the newly created model. Once data analytics is done, thiswill be examined by functional experts.[30] During the analytics phase, therequirement of additional data may be identified. This triggers the need foranother round of data cleaning and transformation. The Pre-processing phase ismost time consuming[31].5. Performance Benchmark for Different Typesof FraudThe following Scatter plot shows [39] unique fraud types which were discussedand published in various fraud detection papers These were some of thecommon fraud types highlighted in the Scatter Plot.Different Types of FraudThe following table provides references with performance metric of differentFraud Types. For better comparison of different types of fraud the area underReceiver operating characteristic(ROC) curve are only included,ReferenceFraud TypeDataset Size Used8,819PERCENTAGEOFClass Distribution5%PerformanceMetricArea under the curveAUC 74%Ortega Figueroraet al,(2006)Subelj Furlan etal.(2011)Battacharyya Jhaet al(2011)Medical InsuranceAutomobile InsurancefraudCredit card fraud3.4511.3%AUC 71%-92%0.005%AUC 90.8% -95.3%Credit Card Fraud50 million transactions on about1 million credit cards from asingle country33,000 -36,000 activity recordsWhitrow , Handet al.(2009)Van )Van VlasselaerMeskens et al(2013)0.1%Credit Card Fraud3.3 million transactions 1% Gini 85%( AUC 92,5%)AUC 98.6%TelecommunicationFraudSocial Security Fraud809,395 calls fromaccounts2000 observations0.024AUC 99.5%1%AUC 80%-85%7621,067

International Journal of Pure and Applied MathematicsSpecial Issue6. ConclusionLike Insurance fraud detection, several fraudulent behaviours are available likeIntrusion detection fraud, credit card fraud, telecommunication fraud etc. It isprominent that health insurance[21] fraud is viable since it brings heavy lossoverall. By integrating big data technology these claims can be predicted forlarge volume of data as well as different variety of data .References[1]Srivastava A., Kundu A., Sural S., Majumdar A., Credit CardFraud Detection Using Hidden Markov Model, IEEE TransactionsOn Dependable And Secure Computing 5(1) (2008), 37-48.[2]Bhusari V., Patil S., Study of Hidden Markov Model in CreditCard Fraudulent Detection, International Journal of ComputerApplications 20(5) (2011).[3]Ivakhnenko A.G., The group method of data handling inprediction problems, Sov Autom Control 9(6) (1976), 21–30.[4]Mueller J.A., Lemke F., Self-organising data mining: an intelligentapproach to extract knowledge from data, Script SoftwareInternational, Berlin (2009).[5]Singh S.P., Shukla S.S.P., Rakesh N., Tyagi V., ProblemReduction In Online Payment System Using Hybrid Model,International Journal of Managing Information Technology 3(3)(2011).[6]Zreapoor M., Shamsolmoali P., Application of Credit Card FraudDetection: Based on Bagging Ensemble Classifier, InternationalConference on Computer, Communication and Convergence(2015).[7]Benson Edwin Raj S., Annie Portia A., Analysis on Credit CardFraud Detection Methods, International Conference onComputer, Communication and Electrical Technology (2011).[8]Panigrahi S., Kundu A., Sural S., Majumdar A.K., Credit cardfraud detection: A fusion approach using Dempster-Shafer theoryand Bayesian learning, Special Issue on Information Fusion inComputer Security 10(4) (2009), 354-363[9]Chang R.I., Lai L.B., Su W.D., Wang J.C., Kouh, J.S., IntrusionDetection by Backpropagation Neural Networks with SampleQuery and Attribute-Query, Research India Publications (2006).[10]Patidar R., Sharma L., Credit Card Fraud Detection Using NeuralNetwork, International Journal of Soft Computing andEngineering 1 (2011).763

International Journal of Pure and Applied MathematicsSpecial Issue[11]Guo T., Li G.Y., Neural Data Mining For Credit Card Frauddetection, Proceedings of the Seventh International Conferenceon Machine Learning and Cybernetics (2006).[12]Lata L.N., Koushika I.A., Hasan S.S., A Comprehensive Surveyof Fraud Detection Techniques, International Journal of AppliedInformation Systems 10(2) (2015).[13]Quinlan J., Learning with continuous classes, 5th Australian jointconference on artificial intelligence 92 (1992).[14]Alshamsi A.S., Predicting car insurance policies using randomforest, 10th International Conference on Innovations inInformation Technology (2014), 128-132.[15]Viaenea S., Auto claim fraud detection using Bayesian learningneural networks, Elsevier (2005).[16]Eesha Goel, Abhilasha, Ankit Agarwal, Fraud Detection UsingRandom Forest Algorithm, International Journal of ComputerScience Engineering 5(05) (2016).[17]Salama A.S., Omar A.A., A Back Propagation Artificial NeuralNetwork based Model for Detecting and Predicting FraudulentFinancial Reporting, International Journal of ComputerApplications 106(2) (2014).[18]Fanning K., Cogger K.O., Srivastava R., Detection ofmanagement fraud: A neural network approach. IntelligentSystems in Accounting, Finance and Management 4(2) (1995),113-126.[19]Kirlidog M., Asuk C., A fraud detection approach with data miningin health insurance, Procedia-Social and Behavioral Sciences 62(2012), 989-994.[20]Pai P.F., A support vector machine-based model for detectingtop management fraud, Knowledge-Based Systems 24 (2011),314–321.[21]Rawte V., Anuradha G., Fraud Detection in Health Insuranceusing Data Mining Techniques, Communication, Information &Computing Technology (2015).[22]Peng Y., Kou G., Sabatka A., Chen Z., Khazanchi D., Shi Y.,Application of clustering methods to health insurance frauddetection, International Conference on Service Systems andService Management 1 (2006), 116-120.[23]Thornton D., Mueller R.M., Schoutsen P., van Hillegersberg J.,Predicting healthcare fraud in medicaid: a multidimensional datamodel and analysis techniques for fraud detection, Procediatechnology 9 (2013), 1252-1264.764

International Journal of Pure and Applied MathematicsSpecial Issue[24]Lin F., Yeh C.C., Lee M.Y., The use of hybrid manifold learningand support vector machines in the prediction of business failure,Knowl. based Syst. (2010), 95–101.[25]Tang X., Zhuang L., Cai J., Li C., Multi-fault classification basedon support vector machine trained by chaos particle swarmoptimization, Knowl. based Syst. 23(5) (2010), 486–490.[26]Wan S., Lei, T.C., A knowledge-based decision support systemto analyze the debris-flow problems at Chen-Yu-Lan River,Taiwan, Knowledge-Based Systems 22(8) (2009), 580-588.[27]Hafiz K.T., Aghili S., Zavarsky P., The use of predictive analyticstechnology to detect credit card fraud in Canada, 11th IberianConference on Information Systems and Technologies (2016),1-6.[28]Alfred R., The rise of machine learning for big data analytics, 2ndInternational Conference on Science in Information Technology(2016).[29]Banarescu A., Detecting and Preventing Fraud with DataAnalytics, Elsevier (2015).[30]Thornton D., Brinkhuis M., Amrit C., Aly R., Categorizing andDescribing the Types of Fraud in Healthcare, Procedia ComputerScience 64 (2015), 713-720.[31]Lata L.N., Koushika I.A., Hasan S.S., A Comprehensive Surveyof Fraud Detection Techniques, International Journal of AppliedInformation Systems (2015).[32]Dal Pozzolo A., Caelen O., Le Borgne Y.A., Waterschoot S.,Bontempi G., Learned lessons in credit card fraud detection froma practitioner perspective, Expert systems with applications41(10) (2014), 4915-4928.[33]Mahmoudi N., Duman E., Detecting credit card fraud by ModifiedFisher Discriminant Analysis, Expert Systems with Applications42(5) (2014), 2510-2516.[34]Chan P.K., Fan W., Prodromidis A.L., Stolfo S.J., Distributeddata mining in credit card fraud detection, IEEE IntelligentSystems and Their Applications 14(6) (1999), 67-74.[35]Bolton R., Hand D., Unsupervised Profiling Methods for FraudDetection, Credit Scoring and Credit Control VII (2001).[36]Brause R.W., Langsdorf T.S., Hepp H.M., Credit card frauddetection by adaptive neural data mining, Internal Report 7/99 (J.W. Goethe-University, Computer Science Department, Frankfurt,Germany) (1999).765

International Journal of Pure and Applied MathematicsSpecial Issue[37]Gill K.M., Woolley, K.A., Gill M., Insurance fraud: The businessas a victim. In M. Gill (Ed.), Crime at work, Leicester: PerpetuityPress (1994).[38]Frieden J., Fraud Squads Target Suspect Claims, Business &Health 9(4) (1991), 21-33.[39]Guzzi R., Furious About Fraud, Best's Review-Life/HealthInsurance Edition (1989).[40]Sahin Y., Bulkan S., Duman E., A cost-sensitive decision treeapproach for fraud detection, Expert Systems with Applications40(15) (2013), 5916-5923.[41]Vapnik V.N., Estimation of Dependences Based on EmpiricalData, Addendum 1, New York: Springer-Verlag (1982).[42]Reilly D.L., Cooper L.N., Elbaum C., A neural model for categorylearning, Biological Cybernetics 45(1) (1982), 35-41.[43]Zareapoor M., Application of Credit Card Fraud Detection: Basedon Bagging classifier, Elsevier (2015).766

767

768

A Survey on Fraud A nalytics Using Predictive M odel in Insurance Claims 1K. Ulaga Priya and 2S. Pushpa 1Dept of C omputer S cience and E ngineering , St.Peters University . ulagapriya@gmail.com 2Dept of