CS 478 - Tools For Machine Learning And Data Mining

Transcription

CS 478 - Tools for Machine Learning and Data Mining
Introduction to Metalearning
November 11, 2013

The Shoemaker’s Children Syndrome

- Everyone is using Machine Learning!
- Everyone, that is... except ML researchers!
- Applied machine learning is guided mostly by hunches, anecdotal evidence, and individual experience
- If that is sub-optimal for our “customers,” is it not also sub-optimal for us?
- Shouldn’t we look to the data our applications generate to gain better insight into how to do machine learning?
- If we are not quack doctors, but truly believe in our medicine, then the answer should be a resounding YES!

A Working Definition of Metalearning

- We call metadata the type of data that may be viewed as being generated through the application of machine learning
- We call metalearning the use of machine learning techniques to build models from metadata
- Hence, metalearning is concerned with accumulating experience on the performance of multiple applications of a learning system
- Here, we will be particularly interested in the important problem of metalearning for algorithm selection

Theoretical and Practical Considerations

- No Free Lunch (NFL) theorem / Law of Conservation for Generalization Performance (LCG)
- Large number of learning algorithms, with comparatively little insight gained into their individual applicability
- Users are faced with a plethora of algorithms, and without some kind of assistance, algorithm selection can turn into a serious roadblock for those who wish to access the technology more directly and cost-effectively
- End-users often lack not only the expertise necessary to select a suitable algorithm, but also the availability of many algorithms to proceed on a trial-and-error basis
- And even then, trying all possible options is impractical, and choosing the option that “appears” most promising is likely to yield a sub-optimal solution

DM Packages

- Commercial DM packages consist of collections of algorithms wrapped in a user-friendly graphical interface
- They facilitate access to algorithms, but generally offer no real decision support to non-expert end-users
- Need an informed search process to reduce the amount of experimentation while avoiding the pitfalls of local optima
- Informed search requires metaknowledge
- Metalearning offers a robust mechanism to build metaknowledge about algorithm selection in classification
- In a very practical way, metalearning contributes to the successful use of Data Mining tools outside the research arena, in industry, commerce, and government

Rice’s Framework

- A problem x in problem space P is mapped via some feature extraction process to f(x) in some feature space F, and the selection algorithm S maps f(x) to some algorithm a in algorithm space A, so that some selected performance measure (e.g., accuracy), p, of a on x is optimal
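A minimal sketch of these mappings may help fix the notation; the function names and the calling convention are illustrative assumptions, not part of Rice’s formulation:

```python
# Rice's framework as code: x in P  --f-->  f(x) in F  --S-->  a in A.
# f, S, and A are stand-ins supplied by the caller; this is not a
# canonical implementation.

def select_algorithm(x, f, S, A):
    """Characterize problem x, then let the selector S pick from A."""
    features = f(x)            # feature extraction: P -> F
    algorithm = S(features)    # selection mapping:  F -> A
    assert algorithm in A      # S must choose from the algorithm space
    return algorithm
```

Ideally, the algorithm returned by S(f(x)) coincides with the one that maximizes the performance measure p(a, x) over A.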

Framework Issues

- The following issues have to be addressed:
  1. The choice of f,
  2. The choice of S, and
  3. The choice of p.
- A is a set of base-level learning algorithms and S is itself also a learning algorithm
- Making S a learning algorithm, i.e., using metalearning, has further important practical implications about:
  1. The construction of the training metadata set, i.e., problems in P that feed into F through the characterization function f,
  2. The content of A,
  3. The computational cost of f and S, and
  4. The form of the output of S

Choosing Base-level Learners

- No learner is universal
- Each learner has its own area of expertise, i.e., the set of learning tasks on which it performs well
- Select base learners with complementary areas of expertise
- The more varied the biases, the greater the coverage
- Seek the smallest set of learners that is most likely to ensure a reasonable coverage (one possible portfolio is sketched below)
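One plausible small portfolio with deliberately varied biases, assuming scikit-learn is available; the particular choice of four learners is an illustrative assumption, not a recommendation from the slides:

```python
# Four learners with deliberately different biases: axis-parallel splits,
# a linear decision boundary, local instance-based prediction, and a
# conditional-independence assumption.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

BASE_LEARNERS = {
    "tree":   DecisionTreeClassifier(),
    "linear": LogisticRegression(max_iter=1000),
    "knn":    KNeighborsClassifier(),
    "nb":     GaussianNB(),
}
```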

Nature of Training Metadata

- Challenge:
  - Training data at the metalevel are data about base-level learning problems or tasks
  - The number of accessible, documented, real-world classification tasks is small
- Two alternatives:
  - Augmenting the training set through systematic generation of synthetic base-level tasks (see the sketch below)
  - Viewing the algorithm selection task as inherently incremental and treating it as such
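A hedged sketch of the first alternative, using scikit-learn’s synthetic data generator; the parameter grid swept here is an arbitrary assumption:

```python
# Systematically generate synthetic base-level tasks by sweeping a few
# generator parameters. The grid below is illustrative, not prescriptive.
from sklearn.datasets import make_classification

synthetic_tasks = []
for n_features in (5, 20, 50):
    for class_sep in (0.5, 1.0, 2.0):
        X, y = make_classification(
            n_samples=500,
            n_features=n_features,
            n_informative=min(n_features, 10),
            n_redundant=0,
            class_sep=class_sep,
            random_state=0,
        )
        synthetic_tasks.append((X, y))
```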

Meta-examples

- Meta-examples are of the form (f(x), t(x)), where t(x) represents some target value for x
- By definition, t(x) is predicated upon p, and the choice of the form of the output of S
- Focusing on the case of selection of 1 of n: t(x) = argmax_{a ∈ A} p(a, x)
- Metalearning takes {(f(x), t(x)) : x ∈ P′ ⊆ P} as a training set and induces a metamodel that, for each new problem, predicts the algorithm from A that will perform best
- Constructing meta-examples is computationally intensive (a construction sketch follows)
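A sketch of how t(x) might be computed under these definitions, taking cross-validated accuracy as the performance measure p; compute_metafeatures is a hypothetical stand-in for f (one possible version appears on the statistical characterization slide below):

```python
# t(x) = argmax_{a in A} p(a, x), with 5-fold CV accuracy as p.
# A is a dict mapping algorithm names to estimators (e.g., BASE_LEARNERS
# above); compute_metafeatures and base_level_tasks are hypothetical.
from sklearn.model_selection import cross_val_score

def target_algorithm(X, y, A):
    """Name of the algorithm in A with the best cross-validated accuracy."""
    scores = {name: cross_val_score(alg, X, y, cv=5).mean()
              for name, alg in A.items()}
    return max(scores, key=scores.get)

# One meta-example per base-level task (X, y):
# meta_train = [(compute_metafeatures(X, y), target_algorithm(X, y, A))
#               for X, y in base_level_tasks]
```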

Choosing f

- As in any learning task, the characterization of the examples plays a crucial role in enabling learning
- Features must have some predictive power
- Main classes of characterization (detailed in the following slides):
  - Statistical and information-theoretic
  - Model-based
  - Landmarking

Statistical and Information-theoretic Characterization

- Extract a number of statistical and information-theoretic measures from the labeled base-level training set
- Typical measures include number of features, number of classes, ratio of examples to features, degree of correlation between features and target, class-conditional entropy, skewness, kurtosis, and signal-to-noise ratio
- Assumption: learning algorithms are sensitive to the underlying structure of the data on which they operate, so that one may hope that it may be possible to map structures to algorithms
- Empirical results do seem to confirm this intuition
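A minimal sketch of such a characterization function, computing a handful of the measures named above; the exact set of measures and their names are illustrative assumptions:

```python
# A small f(x): a few statistical and information-theoretic measures of a
# labeled dataset. Illustrative, not an official meta-feature set.
import numpy as np
from scipy.stats import skew, kurtosis

def compute_metafeatures(X, y):
    n, d = X.shape
    _, counts = np.unique(y, return_counts=True)
    priors = counts / n
    return {
        "n_features": d,
        "n_classes": len(counts),
        "examples_per_feature": n / d,
        "class_entropy": float(-np.sum(priors * np.log2(priors))),
        "mean_skewness": float(np.mean(skew(X, axis=0))),
        "mean_kurtosis": float(np.mean(kurtosis(X, axis=0))),
    }
```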

Model-based Characterization

- Exploit properties of a hypothesis induced on problem x as an indirect form of characterization of x
- Advantages:
  1. The dataset is summarized into a data structure that can embed the complexity and performance of the induced hypothesis, and thus is not limited to the example distribution
  2. The resulting representation can serve as a basis to explain the reasons behind the performance of the learning algorithm
- To date, only decision trees have been considered, where f(x) consists of either the tree itself, if the metalearning algorithm can manipulate it directly, or properties extracted from the tree, such as nodes per feature, maximum tree depth, shape, and tree imbalance
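A sketch of the "properties extracted from the tree" variant, assuming scikit-learn; attribute names follow its fitted tree_ structure, and the chosen properties are a subset of those listed above:

```python
# Model-based f(x): induce a decision tree on the task, then describe the
# task by structural properties of that tree.
from sklearn.tree import DecisionTreeClassifier

def tree_metafeatures(X, y):
    tree = DecisionTreeClassifier(random_state=0).fit(X, y).tree_
    n_leaves = int((tree.children_left == -1).sum())  # -1 marks a leaf
    return {
        "max_depth": tree.max_depth,
        "n_nodes": tree.node_count,
        "n_leaves": n_leaves,
        "nodes_per_feature": tree.node_count / X.shape[1],
    }
```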

Landmarking (I)

- Each learner has an area of expertise, i.e., a class of tasks on which it performs particularly well, under a reasonable measure of performance
- Basic idea of the landmarking approach:
  - The performance of a learner on a task uncovers information about the nature of the task
  - A task can be described by the collection of areas of expertise to which it belongs
- A landmark learner, or simply a landmarker, is a learning mechanism whose performance is used to describe a task
- Landmarking is the use of these learners to locate the task in the expertise space, the space of all areas of expertise

Landmarking (II)

- The prima facie advantage of landmarking resides in its simplicity: learners are used to signpost learners
- Need efficient landmarkers
  - Use naive learning algorithms (e.g., OneR, Naive Bayes) or “scaled-down” versions of more complex algorithms (e.g., DecisionStump)
- Results with landmarking have been promising
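A sketch of landmarking in this style: the cross-validated accuracy of a few cheap learners becomes the task description. The exact landmarker set is an assumption; the slides mention OneR and DecisionStump, approximated here with scikit-learn equivalents:

```python
# Landmarking f(x): describe a task by the CV accuracy of cheap learners.
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

LANDMARKERS = {
    "naive_bayes": GaussianNB(),
    "decision_stump": DecisionTreeClassifier(max_depth=1),
    "1nn": KNeighborsClassifier(n_neighbors=1),
}

def landmark_metafeatures(X, y):
    return {name: cross_val_score(lm, X, y, cv=5).mean()
            for name, lm in LANDMARKERS.items()}
```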

Computational Cost

- A necessary price to pay to be able to perform algorithm selection learning at the metalevel
- To be justifiable, the cost of computing f(x) should be significantly lower than the cost of computing t(x)
- The larger the set A and the more computationally intensive the algorithms in A, the more likely it is that the above condition holds
- In all implementations of the aforementioned characterization approaches, that condition has been satisfied
- Cost of induction vs. cost of prediction (batch vs. incremental)
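A rough way to check that condition empirically, reusing the hypothetical helpers sketched on the earlier slides:

```python
# Time the cheap characterization f(x) against the expensive target
# computation t(x). compute_metafeatures, target_algorithm, and
# BASE_LEARNERS are the hypothetical helpers sketched above.
import time

def timed(fn, *args):
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

# cost_f = timed(compute_metafeatures, X, y)
# cost_t = timed(target_algorithm, X, y, BASE_LEARNERS)
# The justifiability condition expects cost_f << cost_t.
```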

Selecting on Accuracy

- Predictive accuracy has become the de facto criterion, or performance measure
- Bias largely justified by:
  - NFL theorem: good performance on a given set of problems cannot be taken as a guarantee of good performance on applications outside of that set
  - Impossibility of forecasting: one cannot know how accurate a hypothesis will be until that hypothesis has been induced by the selected learning model and tested on unseen data
  - Quantifiability: not subjective, induces a total order on the set of all hypotheses, and it is straightforward, through experimentation, to find which of a number of available models produces the most accurate hypothesis

Selecting on Other Criteria

- Other performance measures:
  - Computational complexity
  - Comprehensibility
  - Etc.
- These could be handled in isolation or in combination to build multi-criteria performance measures (one such combination is sketched below)
- To the best of our knowledge, only computational complexity, as measured by training time, has been considered in tandem with predictive accuracy
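One hedged illustration of a multi-criteria measure in this spirit, trading predictive accuracy off against training time; the functional form and weight are assumptions, not a published measure:

```python
# A toy multi-criteria score: accuracy discounted by (log) training time.
# The trade-off weight is arbitrary and would need tuning in practice.
import math

def combined_score(accuracy, train_seconds, time_weight=0.1):
    """Higher is better; penalizes slow learners only logarithmically."""
    return accuracy - time_weight * math.log1p(train_seconds)
```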

Selection vs. Ranking

- Standard: a single algorithm is selected among n algorithms
  - For every new problem, the metamodel returns one learning algorithm that it predicts will perform best on that problem
- Alternative: a ranking of the n algorithms
  - For every new problem, the metamodel returns a set A_r ⊆ A of algorithms ranked by decreasing performance

Advantages of Ranking

- Ranking reduces brittleness
- Assume that the algorithm predicted best for some new classification problem results in what appears to be a poor performance
  - In the single-model prediction approach, the user has no further information as to what other model to try
  - In the ranking approach, the user may try the second best, third best, and so on, in an attempt to improve performance (see the sketch below)
- Empirical evidence suggests that the best algorithm is generally within the top three in the rankings
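A sketch of the ranking output and of the fallback behavior it enables; predicted_perf, evaluate, and threshold are hypothetical names standing in for the metamodel’s output and the user’s acceptance criterion:

```python
# Rank algorithms by predicted performance, then fall back down the list
# until one meets a performance threshold.

def rank_algorithms(predicted_perf):
    """Return algorithm names sorted by decreasing predicted performance."""
    return sorted(predicted_perf, key=predicted_perf.get, reverse=True)

def first_acceptable(ranking, evaluate, threshold):
    """Try ranked algorithms in order until one performs well enough."""
    for name in ranking:
        if evaluate(name) >= threshold:
            return name
    return ranking[0]  # nothing met the threshold; keep the top prediction
```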

Metalearning-inspired Systems

- Although a valid intellectual challenge in its own right, metalearning finds its real raison d’être in the practical support it offers Data Mining practitioners
- Some promising implementations:
  - MiningMart
  - Data Mining Advisor
  - METALA
  - Intelligent Discovery Assistant
- Mostly prototypes, work in progress: characterization, multi-criteria performance measures, incremental systems
- ExperimentDB
