Oracle Data Mining

Oracle Data Mining Concepts
12c Release 1 (12.1)
E17692-19
May 2017

Oracle Data Mining Concepts, 12c Release 1 (12.1)
E17692-19
Copyright 2005, 2017, Oracle and/or its affiliates. All rights reserved.
Primary Author: Sarika Surampudi

This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited.

The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing.

If this is software or related documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, then the following notice is applicable:

U.S. GOVERNMENT END USERS: Oracle programs, including any operating system, integrated software, any programs installed on the hardware, and/or documentation, delivered to U.S. Government end users are "commercial computer software" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, use, duplication, disclosure, modification, and adaptation of the programs, including any operating system, integrated software, any programs installed on the hardware, and/or documentation, shall be subject to license terms and license restrictions applicable to the programs. No other rights are granted to the U.S. Government.

This software or hardware is developed for general use in a variety of information management applications. It is not developed or intended for use in any inherently dangerous applications, including applications that may create a risk of personal injury. If you use this software or hardware in dangerous applications, then you shall be responsible to take all appropriate fail-safe, backup, redundancy, and other measures to ensure its safe use. Oracle Corporation and its affiliates disclaim any liability for any damages caused by use of this software or hardware in dangerous applications.

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. AMD, Opteron, the AMD logo, and the AMD Opteron logo are trademarks or registered trademarks of Advanced Micro Devices. UNIX is a registered trademark of The Open Group.

This software or hardware and documentation may provide access to or information about content, products, and services from third parties. Oracle Corporation and its affiliates are not responsible for and expressly disclaim all warranties of any kind with respect to third-party content, products, and services unless otherwise set forth in an applicable agreement between you and Oracle. Oracle Corporation and its affiliates will not be responsible for any loss, costs, or damages incurred due to your access to or use of third-party content, products, or services, except as set forth in an applicable agreement between you and Oracle.

Contents

Preface
  Audience
  Related Documentation
  Oracle Data Mining Resources on the Oracle Technology Network
  Application Development and Database Administration Documentation
  Documentation Accessibility
  Conventions

Changes in This Release for Oracle Data Mining Concepts Guide
  Changes in Oracle Data Mining 12c Release 1 (12.1)
  New Features
  Desupported Features
  Other Changes

Part I Introductions

1 What Is Data Mining?
  What Is Data Mining?
  Automatic Discovery
  Prediction
  Grouping
  Actionable Information
  Data Mining and Statistics
  Data Mining and OLAP
  Data Mining and Data Warehousing
  What Can Data Mining Do and Not Do?
  Asking the Right Questions
  Understanding Your Data
  The Data Mining Process
  Problem Definition
  Data Gathering, Preparation, and Feature Engineering
  Model Building and Evaluation
  Knowledge Deployment

2 Introduction to Oracle Data Mining
  About Oracle Data Mining
  Data Mining in the Database Kernel
  Data Mining in Oracle Exadata
  Interfaces to Oracle Data Mining
  PL/SQL API
  SQL Functions
  Oracle Data Miner
  Predictive Analytics
  Overview of Database Analytics

3 Oracle Data Mining Basics
  Mining Functions
  Supervised Data Mining
  Unsupervised Data Mining
  Algorithms
  Oracle Data Mining Supervised Algorithms
  Oracle Data Mining Unsupervised Algorithms
  Data Preparation
  Oracle Data Mining Simplifies Data Preparation
  Case Data
  Text Data
  In-Database Scoring
  Parallel Execution and Ease of Administration
  SQL Functions for Model Apply and Dynamic Scoring

Part II Mining Functions

4 Regression
  About Regression
  How Does Regression Work?
  Testing a Regression Model
  Regression Statistics
  Regression Algorithms

5 Classification
  About Classification
  Testing a Classification Model
  Confusion Matrix
  Lift
  Receiver Operating Characteristic (ROC)
  Biasing a Classification Model
  Costs
  Priors and Class Weights
  Classification Algorithms

6 Anomaly Detection
  About Anomaly Detection
  One-Class Classification
  Anomaly Detection for Single-Class Data
  Anomaly Detection for Finding Outliers
  Anomaly Detection Algorithm

7 Clustering
  About Clustering
  How are Clusters Computed?
  Scoring New Data
  Hierarchical Clustering
  Evaluating a Clustering Model
  Clustering Algorithms

8 Association
  About Association
  Association Rules
  Market-Basket Analysis
  Association Rules and eCommerce
  Transactional Data
  Association Algorithm

9 Feature Selection and Extraction
  Finding the Best Attributes
  About Feature Selection and Attribute Importance
  Attribute Importance and Scoring
  About Feature Extraction
  Feature Extraction and Scoring
  Algorithms for Attribute Importance and Feature Extraction

Part III Algorithms

10 Apriori
  About Apriori
  Association Rules and Frequent Itemsets
  Antecedent and Consequent
  Confidence
  Data Preparation for Apriori
  Native Transactional Data and Star Schemas
  Items and Collections
  Sparse Data
  Calculating Association Rules
  Itemsets
  Frequent Itemsets
  Example: Calculating Rules from Frequent Itemsets
  Evaluating Association Rules
  Support
  Confidence
  Lift

11 Decision Tree
  About Decision Tree
  Decision Tree Rules
  Advantages of Decision Trees
  XML for Decision Tree Models
  Growing a Decision Tree
  Splitting
  Cost Matrix
  Preventing Over-Fitting
  Tuning the Decision Tree Algorithm
  Data Preparation for Decision Tree

12 Expectation Maximization
  About Expectation Maximization
  Expectation Step and Maximization Step
  Probability Density Estimation
  Algorithm Enhancements
  Scalability
  High Dimensionality
  Number of Components
  Parameter Initialization
  From Components to Clusters
  Configuring the Algorithm
  Data Preparation for Expectation Maximization

13 Generalized Linear Models
  About Generalized Linear Models
  GLM in Oracle Data Mining
  Interpretability and Transparency
  Wide Data
  Confidence Bounds
  Ridge Regression
  Scalable Feature Selection
  Feature Selection
  Feature Generation
  Tuning and Diagnostics for GLM
  Build Settings
  Diagnostics
  Data Preparation for GLM
  Data Preparation for Linear Regression
  Data Preparation for Logistic Regression
  Missing Values
  Linear Regression
  Coefficient Statistics for Linear Regression
  Global Model Statistics for Linear Regression
  Row Diagnostics for Linear Regression
  Logistic Regression
  Reference Class
  Class Weights
  Coefficient Statistics for Logistic Regression
  Global Model Statistics for Logistic Regression
  Row Diagnostics for Logistic Regression

14 k-Means
  About k-Means
  Oracle Data Mining Enhanced k-Means
  Centroid
  Scoring
  k-Means Algorithm Configuration
  Data Preparation for k-Means

15 Minimum Description Length
  About MDL
  Compression and Entropy
  Model Size
  Model Selection
  The MDL Metric
  Data Preparation for MDL

16 Naive Bayes
  About Naive Bayes
  Advantages of Naive Bayes
  Tuning a Naive Bayes Model
  Data Preparation for Naive Bayes

17 Non-Negative Matrix Factorization
  About NMF
  Matrix Factorization
  Scoring with NMF
  Text Mining with NMF
  NMF for Text Mining
  Tuning the NMF Algorithm
  Data Preparation for NMF

18 O-Cluster
  About O-Cluster
  Partitioning Strategy
  Active Sampling
  Process Flow
  Scoring
  Tuning the O-Cluster Algorithm
  Data Preparation for O-Cluster
  User-Specified Data Preparation for O-Cluster

19 Singular Value Decomposition
  About Singular Value Decomposition
  Matrix Manipulation

Oracle Data Mining Concepts is intended for anyone who wants to learn about Oracle Data Mining.

Related Documentation

Oracle Data Mining, a component of Oracle Advanced Analytics, is documented on the Data Warehousing and Business Intelligence page of the Oracle Database online documentation library: