Session S1J Teaching Data Mining By Coalescing Theory And Applications


Nitesh V. Chawla
Department of Computer Science and Engineering, University of Notre Dame, IN 46556
nchawla@cse.nd.edu

Abstract - We report on our experience with the first departmental offering of a data mining course, in Spring 2005. The course was cross-listed so that both upper-level undergraduates and graduate students could attend; the majority of the registered students were undergraduates. Data mining, being a confluence of multiple fields, offers an interesting addition to the computer science curriculum. The main objective of the course was to provide grounding in both the theoretical and practical aspects of data mining and machine learning. In addition, the course used concepts learned in various courses throughout the undergraduate degree. The course utilized a machine learning toolkit, Weka, from the University of Waikato, New Zealand. In this paper, we present the various components of the course, its structure, innovative assignments and discussions, and the project life cycle.

Index Terms – Data Mining, Innovative Curriculum, Undergraduate Curriculum

INTRODUCTION

Data mining, being a confluence of multiple fields, offers an interesting addition to the computer science curriculum. It is defined as the “non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” [1]. Data mining draws upon various concepts and courses learned during an undergraduate curriculum, including but not limited to statistics, probability theory, linear algebra, databases, algorithms, and data structures, as shown in Figure 1. This course on data mining was offered for the first time in Spring 2005 at the Department of Computer Science and Engineering at the University of Notre Dame [2]. The main objectives of the course were to:
• Motivate and provide an introduction to the key principles and techniques of data mining;
• Familiarize the students with the complete life cycle of the knowledge discovery process (as typically used in the industry);
• Focus on and encourage analytical and inductive thinking [4, 5];
• Present and discuss some of the key issues in applying data mining to the real world;
• Apply some of the principles and/or techniques learned to different data sets and domains; and
• Discuss the social aspects of data mining, such as the Total Information Awareness [3] project.

FIGURE 1: CONFLUENCE OF DATA MINING

The course structure coalesced the theoretical and applied aspects of the field, borrowing motivation from the “real world” usage of data mining and delving into aspects of the different courses the students have taken during their undergraduate studies. We report on our experience with this course offering, the various components of the course, and the students’ feedback.

Teaching data mining has not received significant attention in the FIE conference series. We are familiar with one paper at FIE, by Banks et al. [6], who describe their experience in offering data mining at two universities, Arizona State University and Wayne State University. Their project on teaching data mining in undergraduate engineering was sponsored by the National Science Foundation. We agree with them that data mining is an exciting addition to the curriculum at the senior/graduate level, as it offers not only a revision of some older concepts but also a stream of computer science education that is applied across many domains.

0-7803-9077-6/05/$20.00 © 2005 IEEE. October 19 – 22, 2005, Indianapolis, IN. 35th ASEE/IEEE Frontiers in Education Conference.

COURSE STRUCTURE

The course followed the key components of the knowledge discovery process. Figure 2 shows the complete knowledge discovery process as per the CRISP data mining standard [7]. Given the applied nature of the material in the course, we believe it is important to pace the course along a process cycle. This encouraged the students to think about and learn the material in terms of the different stages of a data mining or knowledge discovery project. Establishing this at the beginning of the course also helped the students plan their class projects. Figure 3 shows the break-up of the class project.

FIGURE 2: CRISP-DM PROCESS CYCLE

FIGURE 3: PROJECT BREAK-UP AND CORRESPONDING GRADE WEIGHTAGE

The following are the key topics discussed in the course:
1. Introduction to machine learning and data mining
   a. Introduction to data mining (why, how and the inter-disciplinary fertilization)
   b. Brief introduction to machine learning
2. Understanding the data (instances, features, class)
3. Data preparation and preprocessing (missing values, feature selection, noise, etc.)
4. Classification and regression methods
   a. Nearest neighbor
   b. Decision trees
   c. Rule learners
   d. Bayesian learners
   e. Neural networks
   f. Logistic regression
   g. Linear regression and regression trees
5. Unsupervised and semi-supervised learning
6. Evaluation
   a. Concept of validation and testing data
      i. Bootstrapping, 10-fold CV, leave-one-out
   b. Performance metrics (error, ROC, lift, loss functions)
   c. Comparing and benchmarking techniques
7. Ensemble techniques in classification
8. Real-world applications and challenges; emerging and related areas (a discussion)
9. Discussion of role of data mining in society

As is evident from the topics covered, the course drew on various fields and is a compelling addition to the upper-level (junior or senior) undergraduate electives. Class instruction was done primarily with PowerPoint slides, which allowed integration of multimedia in the course discourse.
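Among the evaluation topics listed above, 10-fold cross-validation is easy to state in code. The following is a minimal sketch in Python rather than in Weka, which the course actually used. The function names, the toy data, and the choice of a 1-nearest-neighbor classifier are illustrative assumptions, not the course's materials.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nearest_neighbor_predict(train, query):
    """Return the label of the training row closest to `query` (1-NN)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(train, key=lambda row: sq_dist(row[0], query))
    return label


def cross_validated_error(data, k=10, seed=0):
    """Average the held-out error rate over k folds (assumes len(data) >= k)."""
    fold_errors = []
    for held_out in k_fold_indices(len(data), k, seed):
        held = set(held_out)
        # Train on everything outside the current fold.
        train = [row for i, row in enumerate(data) if i not in held]
        wrong = sum(
            nearest_neighbor_predict(train, data[i][0]) != data[i][1]
            for i in held_out
        )
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / len(fold_errors)


# Hypothetical toy data: two well-separated clusters on a line.
data = [((float(i), 0.0), "a") for i in range(10)]
data += [((float(i) + 100.0, 0.0), "b") for i in range(10)]
error = cross_validated_error(data, k=10)
```

Each instance is held out exactly once, so the averaged error uses every labeled example for testing without ever testing on an instance the model was trained on.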
We used the Data Mining textbook by Witten and Frank [8] as the required textbook for the course. A primary reason for adopting that textbook was the availability of the free and comprehensive tool called Weka [8]. Weka is becoming a popular tool of choice for research and academic discourse. Weka allowed the students to experiment with and analyze the different techniques discussed in class. The students responded favorably to the use of this toolkit. We provide the students’ feedback in the subsequent Sections. Moreover, the class projects gave the students freedom to program in their language of choice or to add more components to the Weka source code to suit their purpose.

We developed the following course components to teach the various topics:
• Class lectures: The class was primarily conducted using PowerPoint slides, which were also provided online as lecture notes. Given the inter-disciplinary foci of the field, no single textbook was sufficient. So we prepared all the lecture slides/notes with relevant references and class handouts.
• Quizzes and Exams: There were four quizzes and one midterm, which evaluated the students on all the material taught in class. The quizzes and midterm evaluated the theoretical understanding of various concepts.
• In-class Assignments: Given the confluence of data mining as shown in Figure 1, some of the mathematical and statistical concepts were not immediately clear to the students. So, we developed various in-class assignments, wherein the students worked together to apply an equation or mathematical formulation to a “toy” problem. The concept was explained at length first, and then the students were given a sheet to work out the toy problem. If a student succeeded, he or she was asked to do it on the blackboard. We then provided the solution to the students, and explained how the concept worked when applied. This really helped the students understand, and develop problem-solving and analytical skills. Some of the in-class assignments included information gain, paired t-tests, Bayesian formulations, etc. This exercise prompted the students to work together, apply the concepts immediately, and even encouraged attendance as they were graded on this.
• Assignments: The assignments were divided among theoretical concepts, Weka, and a Matlab assignment. Weka assignments immediately followed lectures on different algorithms and techniques. Thus, practical inferences followed the theoretical justifications in class. The assignments incorporated experimental design, comparison of different methods, statistical tests of significance, and different types of data. The one Matlab assignment provided an opportunity to learn about some of the statistical and visualization methods in Matlab for analyzing data. These included box plots, discretization methods using percentile-based binning, histograms, and computing sufficient statistics such as mean and variance.
• Presentation on a real-world application: We provided various applications of data mining and machine learning to the students’ listserv. As part of the in-class assignment and participation, each student was asked to prepare any one application for a 10-minute presentation. Each student was then required to comment on each presentation and answer questions such as: Why was data mining needed? How did it help?
This assignment allowed the students to learn more about an application that deeply interests them, and also become aware of various other applications. In addition, it offered an opportunity for practicing their presentation skills. Moreover, we delivered lectures on various applications, including one invited lecture on Bioinformatics. We also prepared a lecture on credit scoring, based on our experience, that became one of the most popular lectures.
• Class challenge assignment: We incorporated this assignment fairly late in the semester. The purpose of the class challenge was to expose the students to the complete process of data mining, including identifying the optimal methods for the dataset at hand. We realized that introducing a competitive aspect, and making it require teamwork, might provide the right motivation for the students to give their best. They were simply given a classification and a regression dataset. They were then asked to identify their choice of methods on a validation set, and submit predictions on a testing set. This mimicked a real-world deployment of a data mining system, requiring noise cleaning, feature selection, benchmarking various methods on a validation set, and then picking the best one for the actual testing set predictions. The students were grouped in pairs for this assignment. Based on our experience and the students’ feedback, we are going to make it a permanent fixture of our course, and we highly recommend the same for other Universities. This challenge forced the students to think about various aspects of the data, methods, and concepts learned in the class.
• Data mining and society: The course also encouraged in-class discussions of the role of data mining in society by providing relevant material from academic literature and popular media. Recent years have included a focus on privacy-preserving data mining due to a public concern about data mining being too invasive.
We discussed those topics, including the Total Information Awareness initiative of the Government [3]. We also provided some of the material from Congressional hearings on data mining technology [12]. We believe it is important to include a discussion on ethics and social impacts in computing [13]. In fact, one of the students is continuing for a degree in Law, and thought this lecture provided an interesting avenue for further exploration for his graduate degree.
• Class project: The class project made up 35% of the class grade, as shown in Figure 3. It was subdivided into a proposal, milestone assessment, final deliverable/term paper, peer-review process, and a presentation. The students were required to follow the ACM Knowledge Discovery and Data Mining Conference paper format [9] for writing all the documents during the course of the project. This ensured a standard among all the students, and familiarized them with the format of a key data mining conference.

The students were required to submit a 2-page proposal describing what they expected to accomplish in the project. They also included the potential milestones in their proposals. This ensured personal benchmarks, and allowed the students to pace themselves as the semester progressed. Moreover, any road-blocks were identified early on, and alternative paths were considered in advance. For instance, a couple of projects required slight revisions in their goals based on these milestones, and the students were appreciative of this aspect. This is also typical of a real-world project, which relies on timelines and milestones. The term paper followed a peer-review process among the students. The goal of this exercise was to acquaint the students with the peer-review process. The students conducted a single-blind peer-review process. We provide students’ feedback on the peer-review process in the next Section. In addition, the final project presentations were converted into a half-day Workshop. We believe the project structure better prepared the students for various components of academia and industry. We encourage the incorporation of a similar project cycle not only for this course but also for other courses with an applied flavor. Figure 3 shows the distribution of the grade for the class project.

We encouraged the students to consider class projects that aligned well with their career or research interests. We individually defined and discussed each topic of interest, identified the work entailed, and assessed whether the undertaking was doable in a semester. This initiative encouraged a variety of class projects, with each student being very enthusiastic about his or her undertaking. In fact, a couple of juniors are continuing with their research projects in data mining over the Summer. A big achievement from the class projects was the acceptance of two project papers (marked by *) at Conferences [14, 15]. To name some class project topics:
1) Spam detection.
2) Intrusion detection.
3) Activity mining in open source software. *
4) Personalized music recommender system.
5) Data mining stock indices.
6) Meta-learning to select classifiers from ensembles. *

Class composition

Table 1 gives the class composition in terms of the number of juniors, seniors, and graduate students. In addition, the major of each student is also indicated (CS: Computer Science; CE: Computer Engineering; CSE: Computer Science and Engineering).
There were a total of 11 students in the class (9 for credit and 2 for audit).

TABLE 1: CLASS COMPOSITION
Juniors    Seniors             Graduate Students
2 (CS)     3 (CS)              4 (2 for credit and 2 for audit)
           1 (dual CS MBA)
           1 (dual CS Film)

Both of the juniors, motivated by the course, decided to pursue research in data mining (and have published papers from their class projects). One of the graduate students utilized the class to propel his research in intrusion detection, and is continuing the research during the Summer. In addition, both of the dual-major students noted the class to be very useful for their career objectives in consulting firms. Thus, the students found the course very useful for their varied interests in either academia or industry.

COURSE EVALUATION

In this Section, we present the course evaluation as per the inputs from the students. We will evaluate the following:
• Weka
• Discussion of applications
• In-class examples/assignments
• Role of applications’ discussions in class
• Project life cycle

Weka is a freely available machine learning toolkit from the University of Waikato [8]. It is a Java-based toolkit that offers both a command line interface and a graphical user interface. Weka was easy to install and learn to use. Most of the students installed Weka on their personal PCs, thus saving them trips to the departmental labs for finishing their assignments.

Weka made it easier to conduct homework assignments on different algorithms and techniques, including the class challenge. Moreover, more components can be added to Weka, and it can be customized for a particular requirement. We posed the following question to the students for the evaluation of Weka. Note that we only provide a sampling of responses for space reasons.

Do you think Weka is a useful tool for this course? Would you rather program different methodologies?
• “Weka is good; I think its better for us to spend our time gaining a general knowledge of many algorithms than to spend hours becoming intimately familiar with one.”
• “I think Weka is a great toolkit, because it allows students to begin exploring without getting in over their heads.”
• “Weka is a very useful tool as long as the theories/algorithms are understood.”
• “Weka is very useful. I’d rather not do programming assignments -- I’d rather master Weka so that I can apply these skills to applications.”
• “Weka is very useful, doing programming forces you to work more time programming and debugging rather than learning and applying new knowledge.”
• “Its good to use a toolkit like Weka; it might be interesting to do some programming, but it would be a lot on top of our research assignment.”

Summarizing the students’ responses, it is clear that the students prefer using a toolkit that allows them to experiment and analyze different techniques for assignments. They wanted to develop an intuition behind different techniques and how they might be used for their jobs and/or research. From an instructor’s viewpoint, Weka indeed helps as some of the class discussions can be easily translated into homework assignments not requiring a significant amount of time. However, the students were free to implement the class project in the programming language or environment of their choice.

That gave them an opportunity to independently develop their ideas and programs.

Our next point of evaluation is the usefulness of application discussions in class. We brought in our real-world experience of applying data mining to various problems. The students found these discussions very useful, and were attentive and keen to understand how the problems were solved. We also provided weekly or biweekly nuggets of some cool applications of machine learning and data mining to the students. We again provide the (undergraduate) responses from the students’ survey:

Do you feel the discussions of the real-world perspective and applications are useful?
• “Very much so. I would be a little lost w/o some grounding in real-world applications.”
• “Definitely, they motivate learning and many people (myself included) learn best from analyzing applications.”
• “Very, Professor has good real-world knowledge.”
• “Yes – It provides a good understanding of how the concepts can be used.”
• “Yes, the professor’s real-world experience allows him to point out the practical aspect of data mining along with common pitfalls.”
• “Yes – it is difficult to be enthusiastic about a subject with little real-world application.”

Summarizing the students’ responses, the applications played an important role in their understanding of the concepts and piqued their interest in the class. We also found that the students were very receptive to the applications, which allowed them to keep pace with the evolving nature of the machine learning and data mining field. Applications have been a cornerstone in machine learning, as they have identified some of the key challenges and research issues in the area [10].

The next point of course evaluation is the usefulness of in-class assignments. Various data mining and machine learning concepts can be mathematically involved, and can confuse the students.
We prepared in-class assignments, whenever possible and relevant, for the students to apply the equations immediately after their presentation in class. The students worked through them together or with hints from the Professor. A complete solution was provided at the end or, if one of the students solved the problem completely, he or she was asked to write it on the blackboard for all the students. We provide feedback from the students on that:

Do you find the in-class assignments/examples useful?
• “In-class assignments help very much in understanding the material.”
• “Very useful, help nailing down concepts.”
• “The examples are very helpful in illustrating the concepts. Applications discussions are very helpful, especially the Professor’s real-world experience.”
• “Yes, they are great”.
• “Yes, I’ve done most of the assignments by reworking the in-class examples.”

From the instructor’s perspective, we also found that solving an example in class that breaks down the equation into pieces is a useful exercise in explaining a concept. It resulted in more time preparing the class material but less time in class explaining a concept. However, it might not always be possible to do that. We noticed that the students were more wary (for some of the techniques) whenever we went “under the hood” and provided a bunch of equations. Another key point noticed by the students was that the course required them to synthesize various concepts that were learned over the undergraduate years. The occasionally extensive statistical discussions were also somewhat difficult for the students [11].

We also asked the students to evaluate the project breakdown into proposal, milestones, etc. as stated before.

Do you find the project structure as in this Course useful?
• “Yes, I think it will provide good insight into applying DM to an area that we have personal interest in.”
• “Yes, even more so than the tests or quizzes.”
• “Yeah, it’s the same kind of breakdown our projects in real life are going to have.”
• “Yes, CS is a major in which understanding the process of a project helps prepare for the future in many ways.”
• “Yes, lets us put our education to use with a real world application.”
• “Absolutely, it mirrors professional life so we should be doing more of that.”
• “Yes, the deadlines will keep me honest and on track.”

Thus, the students were indeed appreciative of this format. Moreover, given the upper-class composition of the class, the research and real-world exposure may help prepare the students for their respective career paths. The process of writing proposals, meeting timelines, making presentations, writing papers, and doing peer review will more than likely be present in any career path one chooses. The students were greatly appreciative of the review process as well (4 of them responded that it was very helpful; 3 said it was moderately helpful; and only 1 said that it was of little help).

We believe that integrating the various components in the project, and spreading them across the entire semester, prepares the students for both academia and industry. We then asked the students whether the course was useful in their career plan. Most of them responded with an interest in continuing research and/or a job in this area. A few of them also said that this course provided them with something in their CS degree that they could easily apply to the business world.
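Information gain was one of the in-class "toy" exercises named earlier. As a hedged illustration of what such an exercise computes, the sketch below evaluates the gain of splitting on an attribute as the entropy of the class labels minus the size-weighted entropy of each subset. The four-row dataset and its attribute names are hypothetical and are not taken from the course handouts.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical toy problem: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": True, "play": "yes"},
]
gain_outlook = information_gain(rows, "outlook", "play")
gain_windy = information_gain(rows, "windy", "play")
```

A decision tree learner would prefer the attribute with the larger gain, which in this toy problem is `outlook`.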

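The paired t-test, another of the in-class exercises, is commonly used to compare two learning methods evaluated on the same cross-validation folds. The sketch below computes the test statistic from per-fold error rates. The error values are invented for illustration, and the function name is an assumption rather than part of Weka or of the course material.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(errors_a, errors_b):
    """Return (t, degrees_of_freedom) for paired per-fold error rates.

    Assumes both lists come from the same folds, in the same order,
    and that the differences are not all identical (nonzero spread).
    """
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two equally long lists with at least 2 folds")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical per-fold error rates for two classifiers on the same 5 folds.
errors_a = [0.20, 0.22, 0.19, 0.25, 0.21]
errors_b = [0.18, 0.19, 0.18, 0.21, 0.20]
t_value, dof = paired_t_statistic(errors_a, errors_b)
```

The resulting statistic is compared against a t distribution with `dof` degrees of freedom; pairing by fold removes the fold-to-fold variation that both methods share.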
SUMMARY

We developed a Data Mining course that successfully coalesced theory and applications. We believe our real-world experience helped in understanding some of the problems that arise when applying data mining to real-world tasks, and in incorporating those in lecture discourse and notes. The course invoked analytical and creative thinking among the students. The course material included both relevant research papers provided as handouts in class and various applications’ email nuggets. The students found the use of the applications along with the theoretical framework very relevant. We believe the course can become a very important addition as an undergraduate elective, particularly because of its applied and inter-disciplinary foci. The course synthesized various concepts that students have learned over their undergraduate studies, as indicated in Figure 1. As one of the students pointed out, the course made him go back and brush up on all the probability and statistics concepts that he never thought he’d use again. Another student pointed out that this course was one of the best he took, as it helped him easily visualize where and how something can be applied. Given the ubiquity of applications of data mining, it helps the students to be exposed to various methodologies before proceeding to graduate school and/or jobs.

We also asked the students which lectures they preferred most. Most of the students enjoyed ensembles because of being able to combine multiple learning algorithms and improve the overall performance. Most of the students used ensemble-based methods for the challenge assignment. The lecture on credit scoring was one of their favorites. They found the neural networks lectures to be the most difficult due to the various derivations.
In addition, the incorporation of real-world applications, examples, the project, and the class challenge made the class exciting, although time consuming.

FUTURE WORK

We recommend the usage of Weka, as it allows the students an easy evaluation of various techniques. We also recommend including the discussion of various applications, as it helps the students attach a tangible element to the various theoretical concepts discussed in class. We also believe that the break-up of the class project into various components, as described in this paper, prepares the students for future work. We are going to use this framework of projects with other course offerings in subsequent semesters. We believe the students should be ready for the writing and peer-review process that is essential for any research. We hope other Universities include data mining as part of their undergraduate offerings to upper-classmen. We hope our experience with the course is useful as they plan the course.

We plan to continue usage of this baseline model for our subsequent offerings of data mining, and to gather students’ responses over different semesters. We will improve the course as it progresses over semesters. One thing that we would like to do is use blackboard-based teaching when engaging the students through various mathematical derivations and proofs, such as in neural networks, regression, etc. This will help in pacing the class. We believe a mix of PowerPoint and blackboard teaching is important for a specialized course like this one.

In our subsequent offering, depending on the size of enrollment, we plan to offer another small project as part of the class wherein students will be randomly paired up. Hopefully, this will imitate the real-world setting of adapting to and working with people we have never worked with before. It should also improve the overall chemistry of the class.
We would also like to conduct the course offering simultaneously with our peers at different Universities and jointly monitor the progress of the course. We will share our experience with the community again at that point.

ACKNOWLEDGMENT

We would like to thank the students of the Data Mining Class CSE 498C/598C, Spring 2005, at the University of Notre Dame for their enthusiasm and very useful feedback. We are grateful to Gregory Piatetsky-Shapiro for providing some of the material and for his very useful KDnuggets website. We thank the anonymous reviewers for their very useful feedback and comments. We would also like to thank Kevin Bowyer and Larry Hall for their comments on this paper and the course syllabus. We would also like to thank Joaquin Candela for providing the classification dataset that was used in the Challenge.

REFERENCES

[1] Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., “From Data Mining to Knowledge Discovery in Databases,” AI Magazine 17(3): Fall 1996, 37-54.
[2] Data Mining, Spring 2005, University of Notre Dame.
[3] Total/Terrorism Information Awareness, ations/iao pdf/slides/PoindexterIAO.pdf
[4] Ford, N. “The growth of understanding in Information Science: Towards a Developmental Model”, Journal of the American Society for Information Science, Vol. 50, No. 12, 1999, pp. 1141-1152.
[5] Trochim, W. M. K., “Deductive and Inductive Thinking”, Deduction & Induction, http://trochim.human.cornell.edu/kb/dedind.htm, 2002.
[6] Banks, D. L., Dong, G., Liu, H., and Mandvikar, A. “Teaching Undergraduates Data Mining Engineering Programs,” 34th ASEE/IEEE Frontiers in Education Conference, 2004.
[7] CRoss Industry Standard Process for Data Mining, http://www.crisp-dm.org/
[8] Witten, I., and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 1999.
[9] ACM SIG Proceedings plate.html
[10] Provost, F. and Kohavi, R. “On Applied Research in Machine Learning.” Guest editorial in Machine Learning 30 (2/3) 1998.
[11] Shield, M. “Statistical Literacy and Mathematical Thinking”, 2000 Presentation at the ICME-9, Tokyo.

[12] Committee on Government Reform, “Data Mining: Current Applications and Future Possibilities”, March 2003, http://www.house.gov/reform.
[13] Bowyer, K. W., Ethics and Computing: Living Responsibly in the Computer World, IEEE Press/Wiley Press, 2001.
[14] Mack, D., Chawla, N. V., Madey, G., “Activity Mining in Open Source Software,” Accepted in NAACSOS, 2005.
[15] Sylvester, J., and Chawla, N. V. “Evolutionary Ensembles: Combining learning agents using genetic algorithms,” Accepted in AAAI Workshop on Multi-Agent Learning, 2005.
