RapidMiner


RapidMiner Operator Reference Manual

© 2014 by RapidMiner. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without prior written permission of RapidMiner.

Preface

Welcome to the RapidMiner Operator Reference, the final result of a long working process. When we first started to plan this reference, we had an extensive discussion about the purpose of this book. Would anybody want to read the whole book? Starting with Ada Boost and reading the description of every single operator up to X-Validation? Or would it only serve for looking up particular operators, although that is also possible in the program interface itself?

We decided on the latter, and as the book progressed we realized how futile this entire discussion had been. It was not long until the book reached the 600-page limit, and now we have nearly hit 1000 pages, which is far more than anybody would like to read in its entirety. Even if there were a great structure, explaining the usage of groups of operators as guiding transitions between the explanations of single operators, nobody could take all of that in. The reader would have long forgotten about the Loop Clusters operator by the time he gets to know about cross validation. So we did not put any effort into that, and hence the book has become a pure reference.

This is not a suitable document for getting to know RapidMiner itself. For that, we recommend reading the manual as a starting point. There are other documents available for particular scenarios, such as using RapidMiner as a researcher or extending its functionality. Please take a look at our website rapidminer.com to get an overview of which documentation is available.

From that fact, we can draw some suggestions about how to read this book: Whenever you want to know about a particular operator, just open the index at the end of this book and jump directly to the operator. The order of the operators in this book is determined by the group structure in the operator tree, as you will immediately see when taking a look at the contents. As operators for similar tasks are grouped together in RapidMiner, these operators are also near each other in this book. So if you are interested in broadening your perspective of RapidMiner beyond an already known operator, you can continue reading a few pages before and after the operator you picked from the index.

Once you have read the description of an operator, you can jump to the tutorial process, which explains a possible use case. Often the functionality of an operator is easier to understand in the context of a complete process. All these processes are also available in RapidMiner. You simply need to open the description of the operator in the help view and scroll down. After clicking the respective link, the process will be opened and you can inspect the details, execute it and analyse the results at breakpoints. Apart from that, the explanation of the parameters will give you good insight into what the operator is capable of and what it can be configured for.

I think there's nothing left to say except wishing you a lot of illustrative encounters with the various operators. And if you really read it from start to end, please tell us, as we have bets running on that. Of course we will verify that by checking if you found all the easter eggs...

Sebastian Land

Contents

1 Process Control
    Remember
    Recall
    Multiply
    Join Paths
    Handle Exception
    Throw Exception
  1.1 Parameter
      Set Parameters
      Optimize Parameters (Grid)
      Optimize Parameters (Evolutionary)
  1.2 Loop
      Loop
      Loop Attributes
      Loop Values
      Loop Examples
      Loop Clusters
      Loop Data Sets
      Loop and Average
      Loop Parameters
      Loop Files
      X-Prediction
  1.3 Branch
      Branch
      Select Subprocess
  1.4 Collections
      Collect
      Select
      Loop Collection

2 Utility
    Subprocess
  2.1 Macros
      Set Macro
      Set Macros
      Generate Macro
      Extract Macro
  2.2 Logging
      Log
      Provide Macro as Log Value
      Log to Data
  2.3 Execution
      Execute Process
      Execute Script
      Execute SQL
      Execute Program
  2.4 Files
      Write as Text
      Copy File
      Rename File
      Delete File
      Move File
      Create Directory
  2.5 Data Generation
      Generate Data
      Generate Nominal Data
      Generate Direct Mailing Data
      Generate Sales Data
      Add Noise
  2.6 Miscellaneous
      Materialize Data
      Free Memory

3 Repository Access
    Retrieve
    Store

4 Import
  4.1 Data
      Read csv
      Read Excel
      Read SAS
      Read Access
      Read AML
      Read ARFF
      Read Database
      Stream Database
      Read SPSS
  4.2 Models
      Read Model
  4.3 Attributes
      Read Weights

5 Export
    Write
  5.1 Data
      Write AML
      Write Arff
      Write Database
      Update Database
      Write Special Format
  5.2 Models
      Write Model
      Write Clustering
  5.3 Attributes
      Write Weights
      Write Constructions
  5.4 Results
      Write Performance
  5.5 Other
      Write Parameters
      Write Threshold

6 Data Transformation
  6.1 Name and Role Modification
      Rename
      Rename by Replacing
      Set Role
      Exchange Roles
  6.2 Type Conversion
      Numerical to Binominal
      Numerical to Polynominal
      Numerical to Real
      Real to Integer
      Nominal to Binominal
      Nominal to Text
      Nominal to Numerical
      Nominal to Date
      Text to Nominal
      Date to Numerical
      Date to Nominal
      Parse Numbers
      Format Numbers
      Guess Types
    6.2.1 Discretization
        Discretize by Size
        Discretize by Binning
        Discretize by Frequency
        Discretize by User Specification
        Discretize by Entropy
  6.3 Attribute Set Reduction and Transformation
    6.3.1 Generation
        Generate ID
        Generate Empty Attribute
        Generate Copy
        Generate Attributes
        Generate Concatenation
        Generate Aggregation
        Generate Function Set
        Optimization
          Optimize by Generation (YAGGA)
    6.3.2 Transformation
        Principal Component Analysis
        Principal Component Analysis (Kernel)
        Independent Component Analysis
        Generalized Hebbian Algorithm
        Singular Value Decomposition
        Self-Organizing Map
    6.3.3 Selection
        Select Attributes
        Select by Weights
        Remove Attribute Range
        Remove Useless Attributes
        Remove Correlated Attributes
        Work on Subset
        Optimization
          Forward Selection
          Backward Elimination
          Optimize Selection
          Optimize Selection (Evolutionary)
  6.4 Value Modification
      Set Data
      Declare Missing Value
    6.4.1 Numerical Value Modification
        Normalize
        Scale by Weights
    6.4.2 Nominal Value Modification
        Map
        Replace
        Cut
        Split
        Merge
        Remap Binominals
  6.5 Data Cleansing
      Replace Missing Values
      Fill Data Gaps
    6.5.1 Outlier Detection
        Detect Outlier (Distances)
        Detect Outlier (Densities)
        Detect Outlier (LOF)
        Detect Outlier (COF)
  6.6 Filtering
      Filter Examples
      Remove Duplicates
      Filter Example Range
    6.6.1 Sampling
        Sample
        Sample (Stratified)
        Sample (Bootstrapping)
        Split Data
  6.7 Sorting
      Sort
  6.8 Rotation
      Pivot
      De-Pivot
      Transpose
  6.9 Aggregation
      Aggregate
  6.10 Set Operations
      Append
      Join
      Set Minus
      Union
      Superset

7 Modeling
  7.1 Classification and Regression
    7.1.1 Lazy Modeling
        Default Model
        K-NN
    7.1.2 Bayesian Modeling
        Naive Bayes
        Naive Bayes (Kernel)
    7.1.3 Tree Induction
        Decision Tree
        ID3
        CHAID
        Decision Stump
        Random Tree
        Random Forest
    7.1.4 Rule Induction
        Rule Induction
        Subgroup Discovery
        Tree to Rules
    7.1.5 Neural Net Training
        Perceptron
        Neural Net
    7.1.6 Function Fitting
        Linear Regression
        Polynomial Regression
    7.1.7 Logistic Regression
        Logistic Regression
    7.1.8 Support Vector Modeling
        Support Vector Machine
        Support Vector Machine (LibSVM)
        Support Vector Machine (Evolutionary)
        Support Vector Machine (PSO)
    7.1.9 Discriminant Analysis
        Linear Discriminant Analysis
    7.1.10 Meta Modeling
        Vote
        Polynomial by Binomial Classification
        Classification by Regression
        Bayesian Boosting
        AdaBoost
        Bagging
        Stacking
        MetaCost
  7.2 Attribute Weighting
      Weight by Information Gain
      Weight by Information Gain Ratio
      Weight by Correlation
      Weight by Chi Squared Statistic
      Weight by Relief
      Weight by SVM
      Weight by PCA
      Weight by Component Model
      Data to Weights
      Weights to Data
    7.2.1 Optimization
        Optimize Weights (Evolutionary)
  7.3 Clustering and Segmentation
      K-Means
      K-Means (Kernel)
      K-Medoids
      DBSCAN
      Expectation Maximization Clustering
      Support Vector Clustering
      Agglomerative Clustering
      Top Down Clustering
      Extract Cluster Prototypes
  7.4 Association and Item Set Mining
      FP-Growth
      Create Association Rules
      Generalized Sequential Patterns
  7.5 Correlation and Dependency Computation
      Correlation Matrix
  7.6 Similarity Computation
      Data to Similarity
      Cross Distances
  7.7 Model Application
      Apply Model
      Group Models
    7.7.1 Thresholds
        Find Threshold
        Create Threshold
        Apply Threshold
    7.7.2 Confidences
        Drop Uncertain Predictions

8 Evaluation
  8.1 Validation
      Split Validation
      X-Validation
      Bootstrapping
  8.2 Performance Measurement
      Performance
      Extract Performance
      Combine Performances
    8.2.1 Classification and Regression
        Performance (Classification)
        Performance (Binominal Classification)
        Performance (Regression)
        Performance (Costs)
    8.2.2 Clustering
        Cluster Distance Performance
        Cluster Density Performance
        Item Distribution Performance
  8.3 Significance
      T-Test
      ANOVA
  8.4 Visual Evaluation
      Create Lift Chart
      Compare ROCs

1 Process Control

Remember

This operator stores the given object in the object store of the process. The stored object can be retrieved from the store by using the Recall operator.

Description

The Remember operator stores the input object in the object store of the process under the specified name. The name of the object is specified through the name parameter. The io object parameter specifies the class of the object. The stored object can later be restored by the Recall operator by using the same name and class (i.e. the name and class that were used to store it with the Remember operator). There is no scoping mechanism in RapidMiner processes; therefore objects can be stored (using the Remember operator) and retrieved (using the Recall operator) at any nesting level. However, care should be taken that the execution order of operators is such that the Remember operator for an object always executes before the Recall operator for that object. The combination of these two operators can be used to build complex processes where an input object is used in completely different parts or loops of the process.

Differentiation

Recall
The Remember operator is always used in combination with the Recall operator. The Remember operator stores the required object in the object store and the Recall operator retrieves the stored object when required. See page 4 for details.

Input Ports

store (sto) Any object can be provided here. This object will be stored in the object store of the process. Make sure that the class of this object is selected in the io object parameter.

Output Ports

stored (sto) The object that was given as input is passed unchanged to the output through this port. It is not compulsory to attach this port to any other port; the object will be stored even if this port is left without connections.

Parameters

name (string) The name under which the input object is stored is specified through this parameter. The same name will be used for retrieving this object through the Recall operator.

io object (selection) The class of the input object is selected through this parameter.
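The storage behavior described above can be pictured as a dictionary keyed by name and class. The following is a minimal Python sketch, not RapidMiner code: the ObjectStore class and its remember and recall methods are hypothetical names chosen for illustration, and Python types stand in for the classes selected in the io object parameter.

```python
class ObjectStore:
    """Hypothetical stand-in for a process-wide object store (no scoping)."""

    def __init__(self):
        self._objects = {}

    def remember(self, name, io_class, obj):
        # The object must match the class selected in the io object parameter.
        if not isinstance(obj, io_class):
            raise TypeError(f"expected {io_class.__name__}, got {type(obj).__name__}")
        self._objects[(name, io_class)] = obj
        return obj  # passed through unchanged, like the 'stored' output port

    def recall(self, name, io_class):
        # Fails if nothing was remembered under this name and class beforehand.
        try:
            return self._objects[(name, io_class)]
        except KeyError:
            raise LookupError(f"no {io_class.__name__} stored under {name!r}") from None


store = ObjectStore()
test_set = [{"Outlook": "sunny", "Play": "no"}]  # illustrative data only


def testing_subprocess():
    # Nested level: store the object; the pass-through result may be ignored.
    store.remember("Testset", list, test_set)


testing_subprocess()
recalled = store.recall("Testset", list)  # retrieved at the top level
```

Because the sketch uses a single store for the whole run, the nesting level of the caller does not matter; only the order of the two calls does.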

Related Documents

Recall (4)

Tutorial Processes

Introduction to Remember and Recall operators

This process uses the combination of the Remember and Recall operators to display the testing data set of the Split Validation operator. The testing data set is present in the testing subprocess of the Split Validation operator but it is not available outside the Split Validation operator.

The 'Golf' data set is loaded using the Retrieve operator. The Split Validation operator is applied on it. The test set size parameter is set to 5 and the training set size parameter is set to -1. Thus the test set in the testing subprocess will be composed of 5 examples. The Default Model operator is used in the training subprocess to train a model. The testing data set is available at the tes port of the testing subprocess. The Remember operator is used to store the testing data set in the object store of the process. The name and io object parameters are set to 'Testset' and 'ExampleSet' respectively. The Apply Model and Performance operators are applied later in the testing subprocess. In the main process, the Recall operator is used to retrieve the testing data set. The name and io object parameters of the Recall operator are set to 'Testset' and 'ExampleSet' respectively to retrieve the object that was stored by the Remember operator. The output of the Recall operator is connected to the result port of the process. Therefore the testing data set can be seen in the Results Workspace.
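The split sizes used in this tutorial process can be illustrated with a small Python sketch. It assumes that a training set size of -1 means "use every example not assigned to the test set" and that the split is taken in order without shuffling; both are assumptions of the sketch rather than statements about the Split Validation operator. The function name split_by_size and the stand-in data are hypothetical.

```python
def split_by_size(examples, training_set_size, test_set_size):
    """Hypothetical helper: split a list into (training, test) by absolute sizes."""
    if test_set_size < 0 or test_set_size > len(examples):
        raise ValueError("test set size out of range")
    if training_set_size == -1:
        # Assumption: -1 means every example not used for testing.
        training_set_size = len(examples) - test_set_size
    if training_set_size + test_set_size > len(examples):
        raise ValueError("not enough examples for the requested sizes")
    training = examples[:training_set_size]
    test = examples[training_set_size:training_set_size + test_set_size]
    return training, test


examples = list(range(14))  # stand-in data, not the actual 'Golf' examples
training, test = split_by_size(examples, training_set_size=-1, test_set_size=5)
```

With these inputs the test part holds 5 items and the training part holds the rest.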

Recall

This operator retrieves the specified object from the object store of the process. Objects can be stored in the object store by using the Remember operator.

Description

The Recall operator retrieves the specified object from the object store of the process. The name of the object is specified through the name parameter. The io object parameter specifies the class of the required object. The Recall operator is always used in combination with operators like the Remember operator. For the Recall operator to retrieve an object, the object must first be stored in the object store by an operator like the Remember operator. The name and class of the object are specified when the object is stored using the Remember operator. The same name (in the name parameter) and class (in the io object parameter) should be specified in the Recall operator to retrieve that object. The same stored object can be retrieved multiple times if the remove from store parameter of the Recall operator is not set to true. There is no scoping mechanism in RapidMiner processes; therefore objects can be stored (using the Remember operator) and retrieved (using the Recall operator) at any nesting level. However, care should be taken that the execution order of operators is such that the Remember operator for an object always executes before the Recall operator for that object. The combination of these two operators can be used to build complex processes where an input object is used in completely different parts or loops of the process.

Differentiation

Remember
The Recall operator is always used in combination with the Remember operator. The Remember operator stores the required object in the object store and the Recall operator retrieves the stored object when required. See page 1 for details.

Output Ports

result (res) The specified object is retrieved from the object store of the process and is delivered through this output port.

Parameters

name (string) The name of the required object is specified through this parameter. This should be the same name that was used when storing the object in an earlier part of the process.

io object (selection) The class of the required object is selected through this parameter. This should be the same class that was used when storing the object in an earlier part of the process.

remove from store (boolean) If this parameter is set to true, the specified object is removed from the object store after it has been retrieved. In that case the object can be retrieved just once. If this parameter is set to false, the object remains in the object store even after retrieval. Thus the object can be retrieved multiple times (by using the Recall operator multiple times).
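The effect of the remove from store parameter can be sketched in a few lines of Python. This is an illustration under stated assumptions, not RapidMiner code: a plain dictionary keyed by (name, class) stands in for the object store, and the recall function and its remove_from_store argument are hypothetical names that mirror the parameters above.

```python
def recall(store, name, io_class, remove_from_store=False):
    """Hypothetical recall: look up by name and class, optionally removing the entry."""
    key = (name, io_class)
    if remove_from_store:
        return store.pop(key)  # entry is gone afterwards; another recall raises KeyError
    return store[key]          # entry stays and can be recalled again


store = {("Testset", list): [1, 2, 3]}  # stand-in for an object stored earlier

first = recall(store, "Testset", list)                          # remove from store = false
second = recall(store, "Testset", list)                         # still available
third = recall(store, "Testset", list, remove_from_store=True)  # retrieved one last time
still_stored = ("Testset", list) in store
```

All three calls return the same object; after the third call the store no longer contains it.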

Related Documents

Remember (1)

Tutorial Processes

Introduction to Remember and Recall operators

This process uses the combination of the Remember and Recall operators to display the testing data set of the Split Validation operator. The testing data set is present in the testing subprocess of the Split Validation operator but it is not available outside the Split Validation operator.

The 'Golf' data set is loaded using the Retrieve operator. The Split Validation operator is applied on it. The test set size parameter is set to 5 and the training set size parameter is set to -1. Thus the test set in the testing subprocess will be composed of 5 examples. The Default Model operator is used in the training subprocess to train a model. The testing data set is available at the tes port of the testing subprocess. The Remember operator is used to store the testing data set in the object store of the process. The Apply Model and Performance operators are applied later in the testing subprocess. In the main process, the Recall operator is used to retrieve the testing data set. The name and io object parameters of the Recall operator are set to 'Testset' and 'ExampleSet' respectively to retrieve the object that was stored by the Remember operator. The output of the Recall operator is connected to the result port of the process. Therefore the testing data set can be seen in the Results Workspace.

Multiply

This operator copies its input object to all connected output ports. It does not modify the input object.

Description

The Multiply operator copies the object at its input port to its output ports. As more ports are connected, more copies are generated. The input object is copied by reference; hence the underlying data of the ExampleSet is never copied (unless the Materialize Data operator is used). As copy-by-reference is usually lighter than copy-by-value, copying objects through this operator is cheap. When copying ExampleSets, only the references to attributes are copied. It is very important to note that when attributes are changed or added in one copy of the ExampleSet, this change has no effect on the other copies. However, if data is modified in one copy, it is also modified in the other copies generated by the Multiply operator.
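The difference between attribute changes and data changes can be made concrete with a short Python sketch. It is a simplified model, not RapidMiner's implementation: the ExampleSetView class, the multiply function, and the idea of a per-copy attribute list over one shared table are assumptions introduced here for illustration.

```python
class ExampleSetView:
    """Hypothetical view: its own attribute list on top of a shared data table."""

    def __init__(self, attributes, table):
        self.attributes = list(attributes)  # per-copy list of attribute names
        self.table = table                  # shared reference, never duplicated

    def add_attribute(self, name, default=None):
        # Assumption: the column lives in the shared table but is listed only here.
        self.attributes.append(name)
        for row in self.table:
            row[name] = default

    def rows(self):
        # Each view exposes only the attributes it knows about.
        return [{a: row[a] for a in self.attributes} for row in self.table]


def multiply(example_set, copies):
    # Copy by reference: new attribute lists, same underlying table.
    return [ExampleSetView(example_set.attributes, example_set.table)
            for _ in range(copies)]


original = ExampleSetView(["Temperature"], [{"Temperature": 85}, {"Temperature": 64}])
copy_a, copy_b = multiply(original, 2)

copy_a.add_attribute("Humidity", default=70)  # attribute change: seen by copy_a only
copy_b.table[0]["Temperature"] = 99           # data change: seen by every copy
```

In this model, copy_b still lists only Temperature after the attribute is added to copy_a, while the changed Temperature value shows up in all views because they read from the same table.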

Input Ports

input (inp) This port accepts various kinds of objects as input, e.g. an ExampleSet or even a model.

Output Ports

output (out) There can be many output ports. As soon as one output port is connected, another output port is created for further connections. All ports deliver unchanged copies of the input object.

Tutorial Processes

Multiplying data sets

In this Example Process the Retrieve operator is used to load the Labor-Negotiations data set. A breakpoint is inserted after this operator so that the data can be viewed before applying the Multiply operator. You can see that this data set has many missing values. Press the green Run button to continue the process.

4 copies of the data set are generated using the Multiply operator. The Replace Missing Values
