IBM SPSS Modeler Server Performance And Optimization


Improve performance and scalability in high-volume environments

Contents
- Introduction
- Performance and scalability
- Optimizing performance
- Advanced performance optimization
- Scoping and sizing SPSS Modeler Server
- Conclusion

Introduction

Predictive analytics offers organizations the ability to add predictive intelligence to the process of making decisions. Predictive intelligence improves the decisions made by individuals, groups, systems and organizations in multiple business areas such as customer analytics, operational analytics and proactive risk and fraud mitigation. Data mining is at the core of predictive analytics because it helps organizations understand the patterns in their data. As a result, organizations can make the smart decisions that drive superior outcomes.

IBM SPSS Modeler is a data mining workbench that enables improved decision-making with quick development of predictive models and quick deployment of these models into business operations. SPSS Modeler:

- Works in a variety of operating environments
- Can scale from a single desktop to an enterprise-wide deployment
- Supports virtually any data source (including Hadoop when used with IBM SPSS Analytics Server)
- Provides the ability to incorporate structured and unstructured data

It is available in three editions:

- IBM SPSS Modeler Professional uncovers hidden patterns in structured data with advanced algorithms, data manipulation and automated modeling and preparation techniques.
- IBM SPSS Modeler Premium adds the ability to use natural language processing and sentiment analysis on text data as part of a predictive analytics project. Entity analytics disambiguates identities, and social network analysis identifies influencers in social networks.
- IBM SPSS Modeler Gold includes the full range of predictive capabilities for structured and unstructured data. Users can combine, optimize and deploy predictive models and business rules to an organization's processes and operational systems to provide recommended actions at the point of impact. As a result, people and systems can make the right decision every time.

All editions of SPSS Modeler use a client/server architecture. The client provides the visual workbench for predictive analytics. The server adds increased performance and efficiency, along with features that support additional scale. IBM SPSS Modeler Server is designed to improve performance by minimizing the need to move data in the client environment and by pushing memory-intensive operations such as scoring and data preparation to the server. SPSS Modeler Server also provides support for SQL pushback and in-database modeling capabilities so users can take better advantage of their existing infrastructure and data warehouse and further improve overall performance.

This paper highlights the capabilities and possibilities of SPSS Modeler Server, and it serves as a guide to understanding and maximizing SPSS Modeler Server performance. Initial sections provide performance benchmarking results for IBM SPSS Modeler Professional and IBM SPSS Modeler Premium rather than the performance of models post-deployment. Subsequent sections describe performance optimization and sizing recommendations.

Many of the results provided in this document address SPSS Modeler Server performance as it relates to issues of scalability. By using options only available with the server (such as SQL pushback/generation, in-database mining, scoring adapters and more), users can fully exploit the client/server architecture to improve performance and deliver a quicker return on IT investment.

Performance and scalability

SPSS Modeler Server has been designed and developed to provide high performance and scalability for all data mining tasks. For example, SQL generation and parallel processing are automatic. As a result, SPSS Modeler users do not need to make any changes to the way they work to get consistently high performance.

To benchmark performance, IBM measured the ability of SPSS Modeler Server to carry out the common tasks of data preparation, model building and model scoring. IBM used a variety of operating environments and altered the size of the data files. Data mining involves more than simply model building and model scoring; data preparation is a major component of the process. So IBM's tests also evaluated the performance of common steps such as reading, sorting, aggregating, merging and writing data.

Reading and writing data

Times were recorded for reading the data sets into SPSS Modeler with the stream shown in Figure 1. The Sample node in these test streams means that IBM was able to record only the time taken to read the data in (no data write time was added).

Figure 1: Test stream to read the data into SPSS Modeler.

Figure 2: Test stream to write from SPSS Modeler to the file formats tested during benchmarking.

To obtain the results for writing to the various formats, the stream in Figure 2 was used. To improve performance, SPSS Modeler executes the reading and writing operations at the same time; the data writing operation starts before all the data has been read in. Therefore, when measuring how fast SPSS Modeler writes data, the time includes both reading and writing. For the most frequently used data formats (CSV, database table and SPSS Statistics files [.sav]), a million records are read in less than 30 seconds and written back to the source in less than a minute.

Benchmarking results show that, as the number of records in a dataset increases, so does the processing time for reading and writing (Figure 3). Overall, performance is slower for XML and Excel when compared to the other formats tested (CSV, database table and SPSS .sav). For CSV, database and SPSS .sav, read time doubles when the number of records is increased from 100,000 to 1 million, but the processing time for those formats remains below 25 seconds at 1 million records.

Figure 3: Execution times for data read, and for data read and written, are shown for various data formats. Results for CSV, SPSS .sav and DB2 tables are plotted in seconds on the left axis, and those for XML and Excel in minutes on the right axis, to better highlight the differences and similarities in performance between data sources.

Results also show that read and write performance is consistent across the various operating environments for all the file formats tested.
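A closing aside on database sources: when a stream both reads from and writes to tables in the same database, SQL pushback can keep the copy entirely inside the database, so that no rows travel to SPSS Modeler Server at all. The sketch below is illustrative only; the table names are invented, and whether SPSS Modeler generates such a statement depends on the stream and the database.

```sql
-- Illustrative sketch: a read-then-write stream collapsed into one
-- in-database statement, so no data crosses the network.
-- SALES and SALES_COPY are hypothetical table names.
INSERT INTO SALES_COPY
SELECT *
FROM SALES;
```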

Sorting data

The sorting test involved sorting the data sets by a single column with the stream shown in Figure 4. For a more realistic reflection of a customer scenario, the times include both how long it took to read the data and the time taken to sort.

Figure 4: Test stream for sorting data with SPSS Modeler.

Figure 5 shows that the sorting performance of SPSS Modeler Server is linear as the number of records sorted increases, across increasingly powerful operating environments. The test results also show that the use of SQL pushback functionality provides a significant increase in performance for the sorting operation. By enabling SQL pushback in a stream, the SQL instructions are pushed back and executed in the database itself. This means that performance depends on the database environment rather than on SPSS Modeler.

Figure 5: Sorting data with SPSS Modeler. SQL pushback improves performance when a database is used. The CSV file was stored locally on the server system, and the IBM DB2 database was running on a remote system.
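To make the pushback behavior concrete, a Sort node that is pushed back reduces to an ORDER BY executed by the database, so only the ordered result set is returned to SPSS Modeler Server. A minimal sketch with invented table and column names follows; the actual SQL that SPSS Modeler generates depends on the stream and the database.

```sql
-- Illustrative sketch of a pushed-back Sort node.
-- SALES and CUSTOMER_ID are hypothetical names, not from the benchmark.
SELECT *
FROM SALES
ORDER BY CUSTOMER_ID;
```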

Data aggregation

A 5 million record data set was used to test aggregating with SPSS Modeler. The number of unique values that appeared in the designated field was scaled with the stream in Figure 6. For the test results to reflect how aggregation would be used in an operating environment, the times measured for the SQL pushback also include the time it takes to read in the data.

Figure 6: Test stream to aggregate data with SPSS Modeler.

The test results show that the aggregation operation scales well as the number of unique categories to be aggregated increases. Figure 7 highlights the performance times. Note the dramatic improvement from the SQL pushback functionality when a database table, rather than a CSV file, is used as the source data for aggregation: the operation completed in almost half the time taken with the CSV file. This result was consistent in all the operating environments used for testing.

Figure 7: Aggregating data with SPSS Modeler. Using SQL pushback within DB2 was almost twice as fast as using a CSV file.
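A pushed-back Aggregate node similarly reduces to a GROUP BY, so the database returns one row per category rather than 5 million raw records. Again, this is a hedged sketch with invented names, not the SQL the benchmark itself generated.

```sql
-- Illustrative sketch of a pushed-back Aggregate node.
-- TRANSACTIONS, REGION and AMOUNT are hypothetical names.
SELECT REGION,
       COUNT(*)    AS RECORD_COUNT,
       SUM(AMOUNT) AS AMOUNT_SUM,
       AVG(AMOUNT) AS AMOUNT_MEAN
FROM TRANSACTIONS
GROUP BY REGION;
```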

Merging data

To check the performance of the merge operation, the stream in Figure 8 was used. Times were recorded as the size of the data sets increased. An inner join used a unique "ID" column, which means that the merge was one-to-one: every record in the first data set had exactly one match in the second data set.

Figure 8: Test stream to merge data with SPSS Modeler.

The test results show that the merge operation scales relatively linearly in relation to the number of records being merged (Figure 9). Yet again, the improvement in time with the use of a database and SQL pushback is evident. With SQL pushback, the merge has already taken place before the data is read out of the database and brought into SPSS Modeler. The increase in performance is most notable at scale. The time recorded in these results includes the time taken for the data to be read into SPSS Modeler.

Figure 9: Merging data with SPSS Modeler. With SQL pushback, the merge has already taken place before the data is read out of the database into SPSS Modeler.
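The Merge node in this test corresponds to an inner join on the unique key, which pushback hands to the database's join engine so that only the joined result crosses the network. A minimal, hedged sketch with hypothetical table names:

```sql
-- Illustrative sketch of a pushed-back Merge node (one-to-one inner join).
-- CUSTOMERS and ORDERS are hypothetical tables sharing a unique ID key.
SELECT c.*,
       o.*
FROM CUSTOMERS c
INNER JOIN ORDERS o
        ON c.ID = o.ID;
```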

Text analytics

The ability to structure text is an important capability of IBM SPSS Modeler Premium. Including concepts derived from text increases modeling accuracy. For example, when predicting customer purchase propensity for a product, customer attitudes and preferences are often derived from surveys, call center notes and social media to augment behavioral and demographic data. Building a text model provides a way to apply a structure to new text based on the analysis done on historical or existing text. The text analytics capabilities of IBM SPSS Modeler can use a variety of data sources. For testing purposes, however, IBM used email. The testing for text analytics performance used the following input data:

- Approximately 500,000 emails
- Average number of words per email: 446
- Average number of characters per email: 2720

Text analytics model building

Tests were run to measure the performance of building a non-interactive (automatically derived) Text Mining concept model from Basic Resources and Opinions with the stream in Figure 10.

Figure 10: Test stream for text analytics model building.

The tests showed that, after initial training (which uses more overhead), performance accelerates (Figure 11).

Figure 11: Initially, training time uses more overhead. After the training is complete, performance accelerates.

Text analytics model scoring

After concepts are extracted, SPSS Modeler creates a text model that can be used in predictive streams. Scoring against the text model means that new text is categorized with the patterns established during the model building process. IBM's tests assessed the speed of scoring new records against an existing model with the stream in Figure 12.

Figure 12: Test stream for model scoring.

The test results showed that scoring performance is linear in relation to the number of records (Figure 13).

Figure 13: Scoring performance is linear in relation to the number of records.

TM1 integration

SPSS Modeler supports data imports from and data exports to IBM Cognos TM1. These operations are controlled by Cognos TM1 process scripts in the Cognos TM1 server. When an SPSS Modeler TM1 import or export operation is executed, SPSS Modeler runs these process scripts first (alongside any native SPSS Modeler processing that is required).

The cube complexity levels defined in the tests are based on test cubes created by the Cognos TM1 team. The objective was to best represent the different levels of complexity that a Cognos TM1 user might have in a cube. The following table shows how the complexity levels are defined.

TM1 tests: cube complexity levels (level 1 is simple, level 5 is complex)

Cube      Fields when data is viewed in SPSS Modeler    Dimensions    Measures
Cube 1    14                                            5             10
Cube 2    17                                            8             10
Cube 3    20                                            11            10
Cube 4    23                                            14            10
Cube 5    26                                            17            10

TM1 import

TM1 import works by passing a view from Cognos TM1 to SPSS Modeler for additional analysis (Figure 14). To achieve the best performance, users are encouraged to define Cognos TM1 views that are as specific as possible to reduce the overhead of moving large data files between the Cognos TM1 and SPSS Modeler Server systems.

Figure 14: Cognos TM1 import stream, which indicates that no records are output.

To represent a real user scenario, the cubes for the Cognos TM1 import test contained more information in the view (by a factor of 10 in relation to the number of records) than would be required based on the settings of the Cognos TM1 import node. For example, when you import the simple cube 1 and read 1,000 records into SPSS Modeler, the size of the dataset passed over the network from Cognos TM1 to SPSS Modeler Server is actually 10,000 records. SPSS Modeler Server processes the full set and then filters this 10,000 record data set down to the 1,000 records required for display. The largest dataset tested has 1 million records, and the size of the dataset passed over the network from Cognos TM1 to SPSS Modeler (before filtering) is 10 million records.

Tests were run by importing data from Cognos TM1, scaling by both cube complexity and cube size (number of records). The graphs in Figure 15 show how the SPSS Modeler and Cognos TM1 integration scales linearly in relation to both aspects.

Figure 15: Cognos TM1 import test results. The integration scales linearly in relation to both aspects.

TM1 export

Tests were run by exporting data from Cognos TM1 (Figure 16) and scaling by both cube size (number of records) and cube complexity.

Figure 16: Export to Cognos TM1 stream.

The graphs in Figure 17 and Figure 18 show that the SPSS Modeler and Cognos TM1 integration scales linearly for both scaling aspects.

Figure 17: Cognos TM1 export test results for cube size (number of records).

Figure 18: Cognos TM1 export test results for cube complexity.

Model building

Figure 19 shows a stream that was used to test model building execution in SPSS Modeler. The test used datasets with 100,000 records, 500,000 records and 1 million records. Results show that performance is linear in relation to the size of the data. Results are shown by model type to ease analysis and are grouped as follows:

- Classification models
- Segmentation models
- Association models
- Automated models

Figure 19: Test stream used for model building. In this case, it was a C5.0 model.

Classification models use the values of one or more input fields to predict the value of one or more output or target fields (for example, logistic regression or a decision tree). Neural Net had the slowest performance because of the sophistication and learning that the technique requires. However, all of the techniques built models in less than 3 minutes for a set of 1 million records (Figure 20).

Figure 20: Model building times for classification models. Most of the models completed in less than 30 seconds for 250,000 records and within 1 minute for 500,000 records, and all completed in less than 3 minutes for 1 million records.

Segmentation models divide the data into segments or clusters of records that have similar patterns or characteristics, such as K-Means clustering. They can also identify patterns that are unusual (anomaly detection). The KNN technique is included with this set, although it is typically used for classification. KNN classifies cases based on similarity to other cases nearby, which mirrors the computations done by classical segmentation models. Because more computation is used, its performance lags behind that of the other techniques shown. Anomaly, K-Means and TwoStep were the quickest, with TwoStep completing within 1 minute for 1 million records. Figure 21 shows the results.

Figure 21: Model building times for segmentation models in SPSS Modeler. TwoStep, K-Means and Anomaly were the quickest of the five models.

Association models are used to find patterns in data where one or more entities (such as events, purchases or attributes) are associated with one or more other entities. The models construct rule sets that define these relationships. For example, these techniques are used for market basket analysis, which models the next likely purchase for a customer based on previous purchases and identifies products that are typically bought together or in a certain sequence. Figure 22 shows that both the Carma and Apriori models were built in less than 30 seconds for a dataset with 1 million records.

Figure 22: Model building times for association models in SPSS Modeler. At 1,000,000 records, Apriori completed in 24 seconds and Carma completed in less than 17 seconds.

The automated models (Auto Classifier, Auto Cluster and Auto Numeric) estimate, compare and combine multiple modeling techniques in a single run. Automated models eliminate the need for users to sequentially test multiple techniques individually. They are designed to make modeling easier for those users unfamiliar with all of the underlying algorithms that IBM SPSS Modeler supports. Although ALM (Automated Linear Modeling) does not use multiple algorithms to build a model, it does have an automated data preparation step that transforms the target and predictor variables automatically to maximize the predictive power of the model it creates. Figure 23 shows that the performance of the automated techniques is directly proportional to the size of the dataset. All complete within 8 minutes for 500,000 records and within 15 minutes for 1 million records. ALM completes within 2 minutes for 1 million records, which reflects the speed of the automatic data preparation.

Figure 23: Model building times for the automated models in SPSS Modeler.

Model scoring

Scoring is defined as applying a created model to new data. This process generates new data, which is typically a prediction (score). Multiple fields are typically calculated and appended to the records. Scoring can occur in batch or in real time. Batch scoring is done as an event. For example, you can score customers each month whose contract is up for renewal against a model that calculates whether and how likely they are to cancel. An example of scoring in real time is calculating and providing a likelihood-of-fraud score to an agent recording an insurance claim as the agent gathers data. Scoring in real time is provided in SPSS Modeler Gold and is used by organizations that are integrating predictive intelligence into operational systems.

IBM's test recorded the results for batch scoring that used a data set with 10,000 rows and 20 columns. The resulting model was then used in a stream (Figure 24), and files of various sizes were then scored.

Figure 24: Test stream used for model scoring. In this case, it was a C5.0 model.

The results showed that, as the number of records being scored increased, the performance of many models increased to a point and then remained constant. This pattern reflects an initial fixed overhead in the scoring process. This overhead is not related to the number of rows scored; rather, it is a one-off cost. Therefore, the one-off cost becomes less important as the number of rows to be scored increases.
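This behavior fits a simple cost model: a one-off overhead plus a constant per-record cost. The formula below is an illustration inferred from the description above, not one published with the benchmark. Writing $T_0$ for the fixed overhead, $c$ for the per-record scoring cost and $n$ for the number of records:

$$T(n) = T_0 + c\,n, \qquad \frac{T(n)}{n} = c + \frac{T_0}{n}$$

As $n$ grows, the $T_0/n$ term vanishes, so the measured scoring rate climbs toward a constant $1/c$ records per unit time, which matches the "increases to a point and then remains constant" pattern in the results.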

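For batch scoring in particular, the SQL pushback and scoring adapter capabilities mentioned earlier allow this work to happen inside the database as well. As a purely illustrative, hedged sketch (the table, columns and rules are invented, and the real generated SQL depends entirely on the model), a small decision tree scored in-database might reduce to a CASE expression:

```sql
-- Hypothetical sketch: a tiny churn decision tree scored in-database.
-- CUSTOMERS, TENURE_MONTHS and MONTHLY_SPEND are invented names.
SELECT CUSTOMER_ID,
       CASE
           WHEN TENURE_MONTHS < 6 AND MONTHLY_SPEND > 80 THEN 'CHURN'
           WHEN TENURE_MONTHS < 12                       THEN 'AT_RISK'
           ELSE 'RETAIN'
       END AS PREDICTED_CHURN
FROM CUSTOMERS;
```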
