Rosanne.liu@northwestern.edu

Transcription

EECS 510: Social Media Mining, Spring 2015
Data Mining Essentials 2: Data Mining in Practice, with Python
Rosanne Liu
rosanne.liu@northwestern.edu

Outline
Why Python?
Intro to Python
Intro to Scikit-Learn
Unsupervised Learning – Demo on PCA, K-Means
Supervised Learning – Demo on Linear Regression, Logistic Regression


Why Python?
What programming language do you use for data mining?
Source: http://www.kdnuggets.com/polls/index.html

How much is your salary as an analytics, data mining, or data science professional?
Source: http://www.kdnuggets.com/polls/index.html

Should data scientists / data miners be responsible for their predictions?
Source: http://www.kdnuggets.com/polls/index.html

Why Python?
Think about the scientist's needs:
§ Get data (simulation, experiment control)
§ Manipulate and process data.
§ Visualize results [...] to understand what we are doing!
§ [...]ications, write presentations.

Why Python?
– Easy: Easy to learn, easily readable. Scientists first, programmers second.
– Efficient: Managing memory is easy, if you just don't care.
– A single language for everything: Avoid learning new software for each new problem.

More to Take Away
Free distribution from http://www.python.org
[...]ily for you, this community can write
Two popular versions, 2.7 or 3.x
A single-click installer: Enthought Canopy
Prepare yourself for code indentation heaven

[...]-like features: http://ipython.org
Scikit-Learn, ML resource and library: http://scikit-learn.org/dev/index.html
[...]earn2/
More: mlpy, PyBrain, Orange, Scrapy, [...]


The Use of Python: Simple demos
0 – PythonIntro.ipynb
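As a rough idea of what "simple demos" of Python can look like, here is a minimal sketch. It is not taken from PythonIntro.ipynb; the function `word_counts` and the sample sentence are made up for illustration. It shows indentation-delimited blocks, a dictionary, and a list comprehension.

```python
# Blocks are delimited by indentation, not braces.
def word_counts(words):
    """Count how often each word occurs in a list of strings."""
    counts = {}
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts


# Hypothetical sample text, split into tokens on whitespace.
tokens = "data mining in practice with python data".split()
counts = word_counts(tokens)

# List comprehension: keep only the words that occur more than once.
repeated = [w for w, n in counts.items() if n > 1]

print(counts["data"], repeated)
```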


What is Scikit-learn
A Python Machine Learning Library
Focused on modeling data
[...]ject in 2007.
[...] 2010.
[...]re Foundation.
[...]fore you can use scikit-learn.


The use of Scikit-Learn: unsupervised learning demos

PCA Summary
[...] good first insight into dataset
Identify important variables in projection matrix W:

1–PCA.ipynb
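A minimal sketch of PCA with scikit-learn, not the contents of the notebook. The synthetic data is an assumption made for the demo, and treating `pca.components_` as the slide's projection matrix W is also an assumption (in scikit-learn each row of `components_` is one principal direction).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Synthetic data (assumption): 200 points in 3-D that vary mostly along
# the direction (1, 2, 0.5), plus a little isotropic noise.
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t, 0.5 * t]) + 0.05 * rng.normal(size=(200, 3))

pca = PCA(n_components=2)
Z = pca.fit_transform(X)   # projected data, shape (200, 2)
W = pca.components_        # principal directions as rows, shape (2, 3)

# Fraction of the total variance captured by each component.
print(pca.explained_variance_ratio_)

# Large absolute entries in a row of W point to the variables that
# contribute most to that component.
print(np.abs(W[0]))
```

Here the second variable has the largest weight in the first component, which is one way to read "important variables" off W.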

K-Means Algorithm

2–kmeans.ipynb
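A minimal sketch of K-Means clustering with scikit-learn, not the contents of the notebook. The two synthetic blobs and the choice of `n_clusters=2` are assumptions made for the demo.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)

# Synthetic data (assumption): two well-separated 2-D blobs of 50 points each.
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# Fit K-Means with k = 2; n_init restarts guard against a poor initialization.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)      # cluster index for every point
centers = kmeans.cluster_centers_   # one centroid per cluster, shape (2, 2)

print(centers)
```

The cluster indices are arbitrary: which blob gets label 0 and which gets label 1 depends on the initialization.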

Outline
Why Python?
Intro to Python
Intro to Scikit-Learn
Unsupervised Learning – Demo on PCA, K-Means
Supervised Learning – Demo on Linear Regression, Logistic Regression, kNN

The use of Scikit-Learn: supervised learning demos

Linear Regression
1D
To find w and b, minimize the error:
2D
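A minimal sketch of the 1D case. The error formula did not survive in this transcript, so the sketch assumes the usual sum of squared residuals between y and w*x + b. The synthetic data (true w = 3, b = 2) is also an assumption made for the demo.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# Synthetic 1D data (assumption): y = 3x + 2 plus small Gaussian noise.
x = rng.uniform(0.0, 10.0, size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=100)

# scikit-learn expects a 2-D feature array, hence the reshape.
model = LinearRegression()
model.fit(x.reshape(-1, 1), y)
w, b = model.coef_[0], model.intercept_

# Closed-form least-squares solution for one feature, for comparison:
# w = cov(x, y) / var(x),  b = mean(y) - w * mean(x)
w_closed = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b_closed = y.mean() - w_closed * x.mean()

print(w, b)
```

Both routes minimize the same squared error, so they should agree up to floating-point rounding.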

pynb

Logistic Regression

4–LogisticRegression.ipynb
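A minimal sketch of binary classification with scikit-learn's logistic regression, not the contents of the notebook. The two Gaussian classes and the query point are assumptions made for the demo.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Synthetic data (assumption): two 2-D Gaussian classes, 100 points each.
X0 = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

clf = LogisticRegression()
clf.fit(X, y)

# Mean accuracy on the training data (a real evaluation would use held-out data).
accuracy = clf.score(X, y)

# Class probabilities for a new point; columns follow clf.classes_, i.e. [0, 1].
proba = clf.predict_proba([[3.0, 3.0]])[0]

print(accuracy, proba)
```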

Nonlinear Problems
[...]able, but [...]

K Nearest Neighbors
Classification: same setup as logistic regression.
Very simple but powerful idea: Do as your neighbors do.
[...] three nearest, [...]) point(s) in the training data for a label.
Usual distance measure: Euclidean distance

Simple Algorithm
Pick a k, for example k = 3.
Want to classify new example x.
Compute $d_i = d(x_i, x)$, i.e. $d(x_i, x) = \|x_i - x\|$.
[...]rs most often among $y_{i_0}, y_{i_1}, y_{i_2}$.
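The steps above can be sketched in plain Python. This is an illustration under assumptions, not the code from the notebook: the function name `knn_classify` and the toy training set are made up, and the missing step is taken to be a majority vote among the labels of the k nearest training points.

```python
import math
from collections import Counter


def knn_classify(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # d_i = ||x_i - x||: Euclidean distance from every training point to x.
    distances = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, x)))
        for xi in train_X
    ]
    # Indices of the k smallest distances (i0, i1, i2 for k = 3).
    nearest = sorted(range(len(train_X)), key=lambda i: distances[i])[:k]
    # Label that occurs most often among the k nearest neighbors.
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy training set: two small groups of labeled 2-D points.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]

label_near_origin = knn_classify(train_X, train_y, (0.5, 0.5), k=3)
label_far_corner = knn_classify(train_X, train_y, (5.5, 5.5), k=3)
print(label_near_origin, label_far_corner)
```

With two classes, an odd k avoids tied votes. scikit-learn offers the same idea as `sklearn.neighbors.KNeighborsClassifier`.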

5–kNN.ipynb
