Rattle: R For Data Mining - Australian National University

Transcription

Rattle: R for Data Mining
Experiences in Government and Industry
Graham Williams
Senior Director and Principal Data Miner, Australian Taxation Office
Adjunct Professor, University of Canberra and ANU
Fellow, Institute of Analytics Professionals of ng.togaware.com
Copyright © 2008 Graham.Williams@togaware.com

Overview
Setting the Context
Background
Australian Taxation Office
Tooling up for Data Mining
Technologies
Commodity and Open Source
Delivering Outcomes

Data is Fundamental
Sherlock Holmes: “It is a capital mistake to theorize before one has data. Insensibly, one begins to twist facts to suit theories, instead of theories to suit facts.”
A Scandal in Bohemia (1891), Arthur Conan Doyle
Data Mining is fundamentally about delivering novel and actionable knowledge from mountains of data.

An Australian Journey
Data Mining Research - CSIRO 1995
Data Mining Practice - Health Insurance Commission 1995
A Taste of Data Mining:
  Esanda Finance
  NRMA
  Mount Stromlo
  Health Insurance Commission
  Commonwealth Bank
  Department of Health
  Australian Taxation Office
  Australian Customs Service
  Department of Veterans' Affairs

Digital Footprints
We leave behind us, every day, a growing digital footprint.
  Store Purchase - loyalty cards and credit cards
  Building Access
  Computer Login
  eToll Records
  Mobile Phone
  Cameras with sophisticated image recognition
We need due diligence in collection and analysis of data, for the betterment of society and in the service of society - privacy protocols.

Australian Taxation Office - Case Study
Employs 22,000 staff Australia wide
Revenue Collection and Refund Management
Compliance and Risk Modelling
12M Individuals, 450B Income, 100B Tax
2M Companies, 1800B Income, 40B Tax
PAYG 100B, GST 40B, Excise 20B
Taxpayers' charter: Fair but firm; Protect privacy; Assume honest
Service standards - turn around refunds
Whilst protecting integrity of revenue collection

ATO Analytics - Deploying Data Mining
Established as a national capability in 2003
Team has been built up to 16 data mining specialists
Support 120 analysts throughout the organisation
Spread new technology throughout the whole organisation through a central R&D capability
Provide an over-arching framework for Risk Management
How: Analytics Community of Practice and roll out of Training Course

Technologies
Originally tooled up with commercial, expensive data mining tools (SAS/EM, Teradata Warehouse Miner) and hardware (Big Iron MS/Windows 32bit).
But data mining needs skilled people, not off-the-shelf solutions (yet).
Also data mining technology is rapidly developing, and commercial vendors have difficulty keeping up.

New Approaches - Ensembles
Commercial software is lagging behind advances in Data Mining
Current best off-the-shelf technology includes random forests, boosting and support vector machines - SAS/EM?
Open source solutions allow investment in people, not software.

Hardware Platform - AnalyticsNet
Build a network of Data Mining Nodes:
  1 CPU (2 Cores), AMD64, 16GB RAM, 300GB Disk
  4 CPU (8 Cores), AMD64, 32GB RAM, 1TB Disk (Optimal)
  8 CPU (16 Cores), AMD64, 128GB RAM, 10TB Disk (Near Term)
Best of class open source operating system (Debian GNU/Linux)
Open Source data mining tools R, Rattle, Weka, AlphaMiner
Open Source does deliver quality software
Data Warehouse (Netezza/SQLite) as the workhorse data server

Rattle
Invest in expertise - tools follow.
Free software for data mining based on R
Weka, AlphaMiner, KNIME, RapidMiner, ...
Exploratory Data Analysis Mining: R is second to none
Importance of effectively communicating results.

Business Intelligence and Data Mining
Press Release 2 Jun 2008 from Information Builders (BI Tool - WebFOCUS)
Announced partnership to incorporate open source Rattle (as RStat) into WebFOCUS.

Analytics in Action
High Risk Refunds (HRR) identified prior to issuing of refunds.
Current rules identify too many “high risk” re

Rattle: R for Data Mining - Experiences in Government and Industry
Author: Graham Williams
Subject: Data Mining, Linux, Open Source
Created Date: 6/26/2008 9:03:14 PM