Automating Interpretable Machine Learning Scorecards


ANALYSIS
4 September 2020

Prepared by Olga Loiseau-Aslanidi (Director), Natchie Subramaniam Thiagarajah (Senior Economist), and Vera Tolstova (Economist, Vera.Tolstova@moodys.com)

BY OLGA LOISEAU-ASLANIDI, NATCHIE SUBRAMANIAM THIAGARAJAH, AND VERA TOLSTOVA

Scorecard quality depends not only on model performance but also on its interpretability. In this paper, we use our toolbox to build and compare the performance of four scorecard models. The benchmark model leverages a modified logistic regression with constraints imposed via supervised binning and variable selection. Three challenger models are built using decision tree, random forest and gradient boosting methods. We demonstrate that the interpretable benchmark model sacrifices little predictive power compared with the unconstrained challenger models. Meanwhile, the constraints are frequently violated by the challenger models, causing counterintuitive results for scorecards where interpretation is critical.

Introduction

Interpretation is a key requirement for robust and tractable scorecard models used in risk management, regulatory compliance, strategy-setting and product-marketing. Each characteristic included in the model must not only be a strong predictor that makes operational sense but also comply with a priori expectations or constraints.

Such constraints represent desirable patterns and relationships between the predictors and the score, based on business experience, industry trends and regulatory requirements. For instance, everything else being equal, higher-income customers are expected to have a lower default probability; the default probability is typically higher amongst unemployed individuals; and lower credit quality is associated with a higher frequency of late payments. Characteristics such as age, gender and country of origin need to comply with the fair-treatment principle, and monotonicity constraints can be used to achieve the desired pattern.

Nevertheless, the inclusion of various types of constraints is not a readily available option in rapidly evolving scorecard-building setups that leverage various machine learning models. Without flexible and customizable constraints, counterintuitive or unexplainable results may appear, and some groups of customers may be disadvantaged when their credit risk is determined.

In this paper, we use our toolbox, which features an ML algorithm leveraging modified logistic regression with predefined constraints imposed via automated supervised binning and variable selection. We use this algorithm to build a benchmark scorecard model that is interpretable by design. We then compare this model, in terms of model performance and interpretability, with three challenger models built using other classifiers: decision tree, random forest and gradient boosting. To assess the challenger models' interpretability, we use the toolbox to identify customer characteristics that do not yield the desired patterns and hence violate the constraints.

Our results demonstrate that there is no significant difference in performance between the interpretable benchmark model and the challenger classifier models. Using datasets of different sizes for personal loans and credit cards, we find that the benchmark model's performance is overall only slightly inferior to the challenger models' in model fit and discriminatory power. We show that while the gradient boosting and random forest models can provide a superior fit to the benchmark model, they do so at the cost of violating many constraints.

The remainder of this paper is organized as follows. In section two, we survey the main applications of machine learning methods in scorecard-building.
We then describe the methodology behind our algorithmic toolbox, which includes binning and variable selection for the benchmark model, and constraint-violation assessment for the challenger models. In section three, we assess the benchmark and challenger models using our toolbox in terms of their performance and interpretability.

Challenges of scorecard-building

Scorecard models are designed to rank-order customers by condensing a variety of variables into a single score. Scorecard types differ by the target variable: application, behavioural and collection scorecards, for example. The models typically use data on customer and product characteristics, while alternative information, such as transactional, telecommunication, rental and utilities data, can add further insight and predictive value.

Such datasets vary by size and structure.

Industry embraces ever-expanding and improving ML techniques at various stages of scorecard development, from data preparation and variable selection to model building, optimization and monitoring. Since computation power has increased dramatically over the last decade, advanced machine learning methods such as gradient boosting, neural networks and random forests have made their way into credit-scoring models. These methods have demonstrated their superiority in the speed of the scorecard-building process and often in predictive power compared with the traditional logistic regression approach; see Efron & Hastie (2016) and Alpaydin (2020), among others. More complex classifier methods have proven especially superior on large unstructured datasets with many predictors. Chart 1 summarises applications of machine learning in scorecard-building.

[Chart 1: ML in 5 Stages of Scorecard-Building. The chart maps ML techniques to five stages: (1) dataset and target variable (internal and external datasets; explanatory variables; split into training, testing and validation sets; target variable chosen based on the scorecard type); (2) variable pre-processing (missing and null values replaced with the mean or most frequent value, or imputed via k-nearest neighbors or random forest; numerical outliers; supervised discretization for binning; normalization or standardization); (3) variable selection (single- and multi-factor analysis; decision trees; feature importance; recursive feature elimination; elastic net regularization; best subset); (4) machine learning model and optimisation (linear models and multivariate adaptive regression splines; tree models; random forest; neural networks; support vector machines; meta ensembles averaging predictions from various models or a meta machine learning model; hyper-parameter tuning via cross validation to minimize cross-validation error; assessment via ROC-AUC, MSE, MAE and the K-S score); and (5) model deployment and monitoring (dashboards to detect anomalies; tracking model performance in the short to long term). Source: Moody's Analytics]

However, many advanced machine learning methods suffer from a lack of tractability and interpretability of model structure and predictions. The "interpretable" trend in ML model development has been gaining more attention; see, for example, Gilpin et al (2018) and Rudin (2019). Although some methods enable us to peek into the black box, there is still no consensus on how to assess interpretation quality.¹ As interpretability translates into a set of constraints, algorithmic solutions for more complex ML methodologies are challenging. These issues have been recognized by industry and regulators worldwide, who have called for the responsible use of ML to ensure the principles of fairness, ethics, accountability and transparency when assessing customers' credit risk.

¹ For example, Lundberg and Lee (2017) developed Shapley Additive Explanations to interpret the output of machine learning models, while Carvalho et al (2019) provide a review of machine learning models' interpretability.

Recognizing the need for a scorecard to be interpretable, transparent, and able to withstand regulatory scrutiny, we have designed an automated algorithmic toolbox. The scorecard toolbox includes tools for data analysis, model development and assessment, model validation, model refinement, scoring, model-monitoring, and strategy-setting. At the core of this toolbox are the binning and variable selection algorithms, which solve optimization problems subject to the user-specified constraints required in credit applications. The toolbox allows us to use traditional and enhanced regression methodologies as well as alternative ML methods. Such an approach allows us to focus on key decisions, while relegating the tedious, complex tasks of binning, variable selection, and constraint-violation assessment to automated algorithms.

Binning algorithm

Binning is the first step in scorecard model development. It transforms the values of various types of potential predictors into several groups, known as bins, according to specified criteria. Binning is applied to numerical data such as customer age or income, categorical data such as loan purpose or property type, and ordinal data with a defined ordering, such as customer education or employment status. The result of binning is a set of "binned" variables for the next step in model development, the variable selection.

The key advantages of binning include:

» Simplicity and business tractability. Binning simplifies the model predictors by creating groups that have expected patterns and relationships with the target variable. For instance, low-income customers are expected to have higher default rates than higher-income customers, so it makes sense to split the numerical income values into several bins. Including binned variables allows us to evaluate only a few logical conditions to calculate the score, instead of calculating the score for each possible combination of predictor values.

» Flexibility to incorporate constraints. Binning can be formulated with various types of constraints. These include binning-size constraints, logical patterns, business expectations, and compliance with equal-credit-opportunity legal acts for customer age or gender. For example, a monotone or quadratic relationship between binned predictors and the default rate can be incorporated as a constraint when splitting variable values into bins.

» Capture of non-linear relationships. Binning allows us to capture non-linearities in a data-driven way, without making restrictive parametric assumptions. For example, account age may have a non-linear relationship with the default rate.

» Model accuracy through the handling of outliers and missing values. Binning mitigates the impact of outliers and missing values by grouping observations. Grouping similar attributes with similar predictive strengths increases the model's accuracy. For example, the procedure extracts the information from such observations into a separate bin and uses it to predict the target variable.

In practice, binning procedures vary depending on data and model characteristics. Binning can be based on expert opinion, utilize unsupervised or supervised algorithms with quantitative optimization techniques, or use a combination of these. Typically, many manual interventions and visual assessments of the binning solution's quality are required.
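To make the result of this step concrete, the following minimal sketch pre-bins a synthetic numeric characteristic into equal-frequency groups and computes each bin's default rate and weight of evidence (WOE), the quantity a WOE logistic regression later consumes. The data and all names are illustrative; this is a sketch of the general idea, not the toolbox's interface.

```python
# Minimal pre-binning sketch on synthetic data (illustrative only).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.0, sigma=0.5, size=1_000)
# Synthetic target: default probability falls as income rises.
pd_true = np.clip(0.5 - 0.04 * np.log(income), 0.02, 0.90)
loans = pd.DataFrame({"income": income, "default": rng.binomial(1, pd_true)})

# Equal-frequency pre-bins; a supervised step would then adjust these cut-offs.
loans["income_bin"] = pd.qcut(loans["income"], q=5, duplicates="drop")

grouped = loans.groupby("income_bin", observed=True)["default"]
stats = grouped.agg(count="count", default_rate="mean")

# Weight of evidence per bin: log of the bin's share of goods over its share of bads.
bads = grouped.sum()
goods = grouped.count() - bads
stats["woe"] = np.log((goods / goods.sum()) / (bads / bads.sum()))
print(stats)
```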

[Chart 2: Constraints for Ordinal Data (Assumption): the user-specified ordering of the savings categories, from "<100" to "≥1,000". Chart 3: Constraints for Ordinal Data (Solution): default rates across the resulting bins. Source: Moody's Analytics]

The main idea of supervised binning is to find optimal cut-off points that define the bins subject to various types of constraints. Some constraints are required to ensure that each bin strikes a balance between being "wide enough" and "narrow enough" by having distinctly different risk characteristics with minimum information loss. Examples include controlling the number of bins, the number of observations in each bin, and non-overlapping confidence intervals for the default rates of the bins. In addition, a critical aspect of binning is the enforcement of constraints representing requirements that certain patterns must emerge when calculating the scores.

Our toolbox automates the tedious aspects of supervised binning, allowing the analyst to specify options and preferences. The supervised binning algorithm solves an optimization problem with user-defined constraints, while controlling the number and discriminatory power of the resulting bins. The procedure significantly reduces the time cost of generating predictive characteristics. The key ingredients of our supervised binning algorithm include:

» Maximize the binned variable's predictive power. User-defined performance metrics, such as information value, Gini and chi-square statistics, are used to assess a variable's predictive power.

» Comply with constraints. User-defined constraints and label-ordering for ordinal variables are incorporated into the optimization algorithm. These constraints represent expected trends in the default rates across bins. The toolbox implements both monotone (such as decreasing or increasing) and non-monotone (such as u-shaped or hump-shaped) types of relationship between the binned variable and the target variable. Moreover, the toolbox allows for the incorporation of constraints for ordinal variables based on the user-provided order of labels. Charts 2 and 3 illustrate an example of implementing constraints for ordinal data on available savings. Chart 2 shows the preferred order of the categories specified by the user: the default rates should follow a non-increasing trend across the categories, as lower savings are associated with worse credit quality. Chart 3 demonstrates the solution obtained by the supervised binning algorithm in line with this imposed assumption.
The categories "100 to 500", "500 to 1,000" and "≥1,000" are merged, but the desired ordering is preserved, as the trend in default rates is indeed non-increasing.

» Control the number, size and discriminatory power of the selected bins. Users can control the number and size of the bins as well as specify the confidence level used to assess the discriminatory power of the selected bins. To improve the quality of the binning, the algorithm considers not only the point estimate but also a confidence interval for the default rate of each bin. For example, when the confidence intervals of bins overlap, the chosen bins may not have enough power to discriminate between defaulted and non-defaulted observations. As illustrated in Charts 4 and 5, the algorithm combines such bins to achieve an optimal solution. Chart 4 shows an example binning solution with overlapping confidence intervals for the 3rd and 4th bins. Chart 5 demonstrates the toolbox solution that satisfies the constraint on non-overlapping intervals; in this example, this is achieved by merging the intervals "10,847 to 14,803" and "≥14,803" into "≥10,847". A sketch of these two merging rules follows this list.

[Chart 4: Confidence Intervals Merging: an example binning solution with overlapping confidence intervals around each bin's default rate. Source: Moody's Analytics]
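The following minimal sketch, on made-up counts, applies the two rules in turn: it merges adjacent bins until the default-rate trend is non-increasing, then merges adjacent bins whose 95% Wilson confidence intervals overlap. The toolbox's actual optimizer, which also controls bin counts and predictive power, is more general.

```python
# Hedged sketch: constraint-driven merging of ordered pre-bins.
import math

def wilson_interval(k, n, z=1.96):
    """95% Wilson confidence interval for a bin's default rate (k defaults, n obs)."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

def merge(bins, i):
    """Merge bin i with bin i + 1."""
    (k1, n1), (k2, n2) = bins[i], bins[i + 1]
    return bins[:i] + [(k1 + k2, n1 + n2)] + bins[i + 2:]

def enforce_non_increasing(bins):
    """Merge adjacent bins until default rates are non-increasing."""
    i = 0
    while i < len(bins) - 1:
        if bins[i][0] / bins[i][1] < bins[i + 1][0] / bins[i + 1][1]:
            bins = merge(bins, i)
            i = max(i - 1, 0)   # a merge can create a new violation upstream
        else:
            i += 1
    return bins

def enforce_separated_intervals(bins):
    """Merge adjacent bins whose default-rate confidence intervals overlap."""
    i = 0
    while i < len(bins) - 1:
        lo1, hi1 = wilson_interval(*bins[i])
        lo2, hi2 = wilson_interval(*bins[i + 1])
        if hi2 > lo1 and hi1 > lo2:   # the two intervals overlap
            bins = merge(bins, i)
            i = max(i - 1, 0)
        else:
            i += 1
    return bins

# Made-up (defaults, observations) pre-bins, ordered by the variable's values:
bins = [(60, 200), (45, 200), (50, 200), (20, 200), (18, 200)]
bins = enforce_non_increasing(bins)       # non-increasing default-rate trend
bins = enforce_separated_intervals(bins)  # distinctly different risk per bin
print([(round(k / n, 3), (k, n)) for k, n in bins])
```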

[Chart 5: Confidence Intervals Merging: the toolbox solution after merging, with non-overlapping confidence intervals. Chart 6: A Simple Decision Tree Example: a shallow tree that splits on duration in months and credit amount, with leaf default probabilities between 0.25 and 0.46. Source: Moody's Analytics]

Variable selection algorithm

As the next step of the model-building process, the variable selection algorithm is used to identify the characteristics to include in the regression model. The binned candidate variables obtained at the previous binning stage can be used as inputs into the variable selection procedure to define the model specification.

In practice, various methodologies are used for variable selection in various types of risk models. Brute-force algorithms exhaustively evaluate all possible combinations of candidate predictors to find the best subset. These algorithms can be modified to incorporate constraints. Dynamic credit risk models with linkages to macroeconomic as well as portfolio characteristics are good examples of where such procedures work well; see Licari, Loiseau-Aslanidi and Vikhrov (2017).

Alternative variable selection procedures are preferred when the number of candidate characteristics and the number of observations are so large that they make the exhaustive search's computational cost prohibitively high. Stepwise algorithms such as forward stepwise have the advantage of relatively high execution speed, as they rely on a sequence of nested models rather than a brute-force exhaustive search. Not surprisingly, forward stepwise regression is a workhorse variable selection procedure in credit risk scoring.

Nevertheless, classical stepwise methods require enhancements to improve their efficiency and applicability for scorecards. First, the stepwise algorithm is not robust to variable ordering. The order in which variables enter the model has a significant impact on the final model: it may result in overfitting or dependence on the training-sample selection, or it may eliminate variables that would provide additional information and improve model performance; see Altman & Anderson (1989) and Audrino & Knaus (2016), among others. Second, stepwise regression does not consider possible correlations between the variables. Nor does the algorithm consider constraints such as logical patterns based on business experience, industry trends, or legally required relationships.

In our toolbox, we enhance the stepwise algorithm by offering capabilities to specify user preferences on dependencies. Such constraints may include the expected relationships between characteristics and the target variable, the statistical significance of the variables, and a maximum allowed value of pairwise correlation. These constraints may be imposed either on the signs of the coefficient estimates or on their order. After the model is built, validation is performed, and an iterative model-refinement algorithm sequentially excludes variables that do not comply with the constraints from the list of initial potential drivers.
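These enhancements can be sketched as a forward stepwise search that accepts a candidate only when all coefficient signs match the expected relationships and the candidate's pairwise correlation with already-selected drivers stays below a cap. The function and argument names below are hypothetical, and AIC is used as the selection criterion merely for concreteness.

```python
# Hedged sketch: sign- and correlation-constrained forward stepwise selection.
import numpy as np
import statsmodels.api as sm

def constrained_forward_stepwise(X, y, expected_sign, max_corr=0.6):
    """X: DataFrame of binned/WOE candidate drivers; y: 0/1 default flag;
    expected_sign: dict column -> +1 or -1 constraint on its coefficient."""
    selected, remaining = [], list(X.columns)
    best_aic = np.inf
    improved = True
    while improved and remaining:
        improved = False
        trials = []
        for col in remaining:
            cols = selected + [col]
            # Correlation constraint against the already-selected drivers.
            if selected and X[cols].corr()[col].drop(col).abs().max() > max_corr:
                continue
            model = sm.Logit(y, sm.add_constant(X[cols])).fit(disp=0)
            # Sign constraints are checked on every coefficient, not just the new one.
            if all(np.sign(model.params[c]) == expected_sign[c] for c in cols):
                if model.aic < best_aic:
                    trials.append((model.aic, col))
        if trials:
            best_aic, chosen = min(trials)
            selected.append(chosen)
            remaining.remove(chosen)
            improved = True
    return selected
```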
Our toolboxprovides several options to identify theconstraints violations for these machinelearning models.In the case of decision tree, it is feasibleto extract all tree nodes. We calculate theaverage probability of default for each variable interval determined by tree cut points.If the variable appears at different treebranches several times, the average defaultprobability is calculated based on all internaland leaf nodes, taking into consideration interaction terms. This procedure is analogousto evaluating the type of the relationshipin the case of binning and is, therefore,straightforward to use for evaluation of theconstraint’s violation. A simple tree example illustrating the procedure is shown inChart 6.In the case of random forest and gradientboosting, the extraction of all tree nodes andsplits is not the best solution because of theircomplicated model structure. We rely onthe Shapley values approach used to assessthe marginal contributions of various driversinto predicted probability of default values.Because of the potentially very large numberof cut points for continuous variables, theevaluation of trend monotonicity based onthe average default rates for each bin maynot be applicable.To facilitate comparability of the resultswith a logistic regression model, we focus onevaluation of the constraints only for ordinal variables and calculate average Shapleyvalues for each category. Additionally, thetoolbox provides a standard heat map ofShapley values for various realizations ofeach driver.5

Models Assessment

Data and methodology

To assess and compare the performance of several ML models, we conduct an empirical study using two datasets that differ by product type, geography and size. Both datasets cover consumer credit portfolios from the UCI Machine Learning Repository², which is frequently used in studies on the performance evaluation of machine learning and data mining algorithms. The first dataset covers a German fixed-term portfolio of personal loans, while the second covers a credit card portfolio in Taiwan (see Table 1).

Table 1: Summary of Datasets
Country                     Germany           Taiwan
Product type                Personal loans    Credit cards
Number of observations      1,000             30,000
Number of characteristics   24                23
Number of defaults          300               6,636
Sources: UCI Machine Learning Repository, Moody's Analytics

² This is a real-life credit scoring dataset publicly available at the UCI repository at http://kdd.ics.uci.edu/.

We begin by splitting each dataset into development (train) and holdout (test) samples in the standard 70:30 proportion. To mitigate sample-dependency bias in the model performance measures, especially for the German dataset of only 1,000 observations, we generate 100 realizations of the train and test subsets without replacement.

For each realization of the train dataset, we build four alternative models. The first is a benchmark built using the algorithmic supervised binning and a modified weight-of-evidence logistic regression with the example constraints presented in Table 2. Next, we use a decision tree, a random forest and gradient boosting of decision trees with a least-squares loss function as the three challenger models.

For the latter models, it is crucial to tune the hyper-parameters properly to prevent overfitting. We tune them through stratified k-fold cross validation with an exhaustive grid search over parameter combinations, maximizing the average accuracy ratio on the validation subsamples. For the RF model, the procedure optimizes the maximum depth of the trees and the number of estimators. For the GB model, the set of optimized parameters is broader: along with the maximum depth and number of trees, it includes the learning rate and the subsample size used in the estimation of each tree.
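A minimal sketch of this tuning setup using scikit-learn is shown below. The parameter grids are illustrative rather than those used in the study, and scikit-learn's GradientBoostingClassifier optimizes log-loss rather than the least-squares loss mentioned above, so the sketch is only an approximation of the GB challenger.

```python
# Hedged sketch: stratified k-fold grid search for the RF and GB challengers.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Since the accuracy ratio (Gini) is 2*AUC - 1, maximizing AUC selects the
# same parameters as maximizing the accuracy ratio.
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [3, 5, 7], "n_estimators": [100, 300, 500]},
    scoring="roc_auc",
    cv=cv,
)

gb_search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid={
        "max_depth": [2, 3, 4],
        "n_estimators": [100, 300],
        "learning_rate": [0.03, 0.1],
        "subsample": [0.6, 0.8, 1.0],  # share of the train sample per tree
    },
    scoring="roc_auc",
    cv=cv,
)
# Usage (X_train, y_train assumed): rf_search.fit(X_train, y_train)
```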
Table 2: Selected Example Constraints for Model Interpretation Evaluation
Variable | Product type | Variable type | Trend | Order of labels for ordinal variables
Duration in mo | Personal loans | Numerical | Monotonic |
Savings account amount | Personal loans | Ordinal | Negative | <100, 100 to 500, 500 to 1,000, ≥1,000
Credit amount | Personal loans | Numerical | Positive |
Present employment since (employment longevity) | Personal loans | Ordinal | Negative | Unemployed, <1 yr, 1 to 4, 4 to 7, ≥7 yrs
Installment rate | Personal loans | Numerical | … |
Other debtors | Personal loans | Ordinal | … | None, co-applicant, guarantor
Present residence since | Personal loans | Numerical | … |
Housing type | Personal loans | Ordinal | … | For free, rent, own
Number of existing credits | Personal loans | Numerical | Positive |
Job | Personal loans | Ordinal | Negative | Unemployed, unskilled, unskilled resident, skilled employee, official, management, self-employed, highly qualified employee, officer
Telephone | Personal loans | Ordinal | … | Yes (registered under the customer's name), none
Number of people being liable | Personal loans | Numerical | Positive |
Amount of given credit | Credit cards | Numerical | Positive |
Education | Credit cards | Ordinal | … | NA, high school, others, university, graduate school
Past payment status in Apr-Sep 2005 | Credit cards | Ordinal | … |
Amount of bill statement (balance) in Apr-Sep 2005 | Credit cards | Numerical | Monotonic |
Amount of previous payment in Apr-Sep 2005 | Credit cards | Numerical | Monotonic |
Entries marked "…" are illegible in the source. Source: Moody's Analytics

Model performance and interpretability assessment

We use standard measures to evaluate the models' performance, and we assess the models' interpretability by examining constraint violations. The accuracy of the model fit is measured by the Brier score, while the discriminatory power is assessed through the Gini coefficient, equivalently the area under the ROC curve (see Table 3).
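Both measures can be computed on the holdout sample as in the following minimal sketch (taking Gini as 2*AUC - 1, the usual convention):

```python
# Hedged sketch: holdout performance measures for a scorecard model.
from sklearn.metrics import brier_score_loss, roc_auc_score

def scorecard_performance(y_test, pd_hat):
    """y_test: realized 0/1 defaults; pd_hat: predicted default probabilities."""
    auc = roc_auc_score(y_test, pd_hat)
    return {
        "gini": 2 * auc - 1,                        # discriminatory power
        "brier": brier_score_loss(y_test, pd_hat),  # accuracy of fit
    }
```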

Table 3: Summary of Model Performance
                                                        Personal loans        Credit cards
                                                        Gini    Brier score   Gini    Brier score
WOE logistic with supervised binning with constraints   …       …             0.525   0.139
Decision tree                                           …       …             0.410   0.140
Random forest                                           …       …             0.567   0.132
Gradient boosting                                       …       …             0.562   0.133
Figures marked "…" are illegible in the source. Source: Moody's Analytics

We observe that enforcing constraints on the benchmark model has little impact on its performance, and for brevity we do not report the results of the benchmark model without constraints. Moreover, the benchmark regression model with supervised binning and constraints demonstrates performance similar to the challenger ML models.³ The GB model performs slightly better for personal loans, while for credit cards the RF model outperforms the others. The DT model's performance is inferior for both product types.

³ We compare performance based on the accuracy of predictions on the test dataset. Model performance on the train dataset cannot be used for an objective evaluation, since that dataset was used for model development.

To evaluate the models' interpretability, constraint violations are assessed for the challenger DT, RF and GB models (see Table 4); the benchmark model satisfies all the constraints by design. In the case of personal loans, the constraints are violated for six of the 10 variables that appear in the DT model. For example, the credit amount and the number of existing credits do not have the expected positive relationship with the default rate. Similarly, the orderings for employment longevity and job description are counterintuitive, with unemployed customers having lower default rates than those with years of employment. For the RF and GB models, all the constraints are violated, with the employment-related variables having the highest percentage of violation cases.

In the case of credit cards, the selected variables represent the most recent (September and August) past payment status and previous payments. The remaining lags are not selected by the models because of the absorbing properties of the most recent observations. As expected, the constraints are not violated in the DT model because of the relatively shallow trees produced by the algorithm, while they are violated for the RF and GB models.

The constraint violations for employment status are illustrated in Charts 7-10. In contrast to the benchmark model with supervised binning, where lower default rates are associated with longer employment duration, the models using the DT, RF and GB methods show counterintuitive relationships. The DT model predicts a lower default rate for the unemployed and short-employment categories than for the categories with longer employment duration. Both the RF and GB models predict a lower default rate for the "Unemployed" category than for the one-year employment-duration category.
Additionally, the RF model predicts a somewhatlower default rate for four to seven years thanfor seven years or more.Table 4: Frequency of the Example Constraints ViolationConstraint variablesDataDuration in moPersonal loansCredit amountPersonal loansPresent employment sincePersonal loansOther debtorsPersonal loansSavings account amountPersonal loansInstallment ratePersonal loansPresent residence sincePersonal loansHousing typePersonal loansJobPersonal loansTelephonePersonal loansNumber of existing creditPersonal loansNumber of people being liablePersonal loansEducationCredit cardsPast payment status in Sep 2005Credit cardsPast payment status in Aug 2005Credit cardsPast payment status in Jul 2005Credit cardsPast payment status in Jun 2005Credit cardsPast payment status in May 2005Credit cardsPast payment status in Apr 2005Credit cardsAmount of previous payment in Sep 2005Credit cardsAmount of previous payment in Aug 2005Credit cardsAmount of previous payment in Jul 2005Credit cardsAmount of previous payment in Apr-Jun 2005Credit cardsAmount of bill statement (balance) in Apr-Sep 2005 Credit cardsVariable type % of appearance, umerical1Numerical0Numerical0% of violated % of violated % of violatedcases, DTcases, RFcases, -Source: Moody’s AnalyticsMOODY’S ANALYTICSAutomating Interpretable Machine Learning Scorecards7

[Chart 7: Default Rates vs. Employment (default rate, constrained logit). Chart 8: Default Rates vs. Employment (default rate, decision tree). Chart 9: Shapley Values vs. Employment (average Shapley value, random forest). Chart 10: Shapley Values vs. Employment (average Shapley value, gradient boosting). Source: Moody's Analytics]

Conclusion

Using our automated toolbox, we designed and compared several models in terms of their performance and interpretability. The considered models include the benchmark model leveraging modified logistic regression with constraints imposed via supervised binning and variable selection, as well as challenger models built using decision tree, random forest and gradient boosting methods.
