SOFTWARE ENGINEERING AND MACHINE LEARNING

Transcription

SOFTWARE ENGINEERING AND MACHINE LEARNING
Gerasimos (gerasimos@pitt.edu)
University of Pittsburgh
Instructor: Dr. S. K. Chang
October 27, 2020

RECOMMENDATION

FROM RANKING TO DL

SIMPLE RECOMMENDER: SORT THEM FIRST

Simple recommenders are basic systems that recommend the top items based on a certain metric or score. We will build a simplified clone of the IMDB Top 250 Movies using metadata collected from IMDB.

The following are the steps involved:
– Decide on the metric or score to rate movies on.
– Calculate the score for every movie.
– Sort the movies based on the score and output the top results.
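The three steps can be sketched in a few lines of pandas. This is a minimal illustration on hypothetical toy data (the real dataset and scoring metric come later in the lecture):

```python
import pandas as pd

# Hypothetical toy chart: a 'score' column stands in for whatever
# metric we decide on in step 1; assume step 2 already computed it.
movies = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'score': [7.1, 8.9, 6.4, 8.2],
})

# Step 3: sort by the score and output the top results.
top = movies.sort_values('score', ascending=False).head(2)
print(top['title'].tolist())
```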

DATASET: FIRST REVIEW

The dataset file
- contains metadata for all 45,000 movies listed in the Full MovieLens Dataset;
- consists of movies released on or before July 2017;
- captures feature points like cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts, and vote averages.

RECOMMENDER 1: SIMPLE STATISTICAL METRIC

These feature points could potentially be used to train your machine learning models for content-based and collaborative filtering.

USING RATING AS A METRIC. Why should we or shouldn't we?

NAIVE OUTPUT

       title                                            vote_average
21642  Ice Age Columbus: Who Were the First Americans?  10.0
15710  If God Is Willing and da Creek Don't Rise        10.0
22396  Meat the Truth                                   10.0
22395  Marvin Hamlisch: What He Did For Love            10.0

RECOMMENDER 1: SIMPLE STATISTICAL METRIC

The metric used is the Weighted Rating (WR):

    WR = (v / (v + m)) · R + (m / (v + m)) · C    (1)

– v is the number of votes for the movie (ALREADY HAVE IT);
– m is the minimum number of votes required to be listed (ALREADY HAVE IT);
– R is the average rating of the movie;
– C is the mean vote across the whole report.

Determining an appropriate value for m is a hyperparameter choice: there is no single right value, so you can choose accordingly. You can consider it a preliminary negative filter. The selectivity of your filter is up to your discretion.
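Equation (1) can be checked directly on hypothetical numbers. A movie with many votes stays close to its own average R, while a movie with very few votes is pulled toward the global mean C, which is exactly why the 10.0-rated obscure titles in the naive output get demoted:

```python
def weighted_rating(v, m, R, C):
    """IMDB-style weighted rating, equation (1)."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values: m = 500 minimum votes, C = 6.0 global mean.
popular = weighted_rating(v=1000, m=500, R=8.0, C=6.0)   # many votes
obscure = weighted_rating(v=10, m=500, R=10.0, C=6.0)    # few votes
print(popular, obscure)  # popular stays near 8.0, obscure drops near C
```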

DATASET: MORE ON THIS

The dataset columns look like this:

COLUMNS: adult, belongs_to_collection, budget, genres, homepage, id, imdb_id, original_language, original_title, overview, ..., release_date, revenue, runtime, spoken_languages, status, tagline, title, video, vote_average, vote_count

VALUES (row 0): False, {'id': 10194, 'name': 'Toy Story Collection', ...}, 30000000, [{'id': 16, 'name': 'Animation'}, {'id': 35, '...], http://toystory.disney.com/toy-story, 862, tt0114709, en, Toy Story, "Led by Woody, Andy's toys live happily in his ...", 1995-10-30, 373554033.0, 81.0, [{'iso_639_1': 'en', 'name': 'English'}], Released, NaN, Toy Story, False, 7.7, 5415.0

CODE I

import pandas as pd

# Load the movie metadata
metadata = pd.read_csv('movies_metadata.csv', low_memory=False)
metadata.head(3)

# C: the mean vote across the whole report
C = metadata['vote_average'].mean()

# m: the minimum vote count required to be listed (90th percentile)
m = metadata['vote_count'].quantile(0.90)

def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    return (v / (v + m)) * R + (m / (v + m)) * C

# Keep only the movies with at least m votes
q_movies = metadata.copy().loc[metadata['vote_count'] >= m]
print(q_movies.shape)
metadata.shape

CODE II

# Compute the score and sort the qualified movies by it
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)
q_movies = q_movies.sort_values('score', ascending=False)
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(20)

metadata['overview'].head()

OUTPUT: RECOMMENDATIONS ACCORDING TO SCORE

       title
314    The Shawshank Redemption
834    The Godfather
292    Pulp Fiction
522    Schindler's List
10309  Dilwale Dulhania Le Jayenge
12481  The Dark Knight
2843   Fight Club

(each row is accompanied by its vote_count, vote_average, and score)

RECOMMENDER 2: NLP AS THE MACHINE LEARNING METHOD

Aside from voting, which other type of recommendation is there? Learn how to build a system that recommends movies that are SIMILAR to a particular movie. How to achieve this?
– The plot description is available to you as the overview feature in your metadata dataset.
– Compute pairwise cosine similarity scores for all movies based on their plot descriptions.
– Recommend movies based on that similarity score threshold.

RECOMMENDER 2: INPUT TO NLP SYSTEM

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

RECOMMENDER 2: PRE-PROCESSING FOR THE NLP ALGORITHM

Which score to use?
- Import the TfIdf module from scikit-learn (scikit-learn gives you a built-in TfidfVectorizer class that produces the TF-IDF matrix);
- Remove stop words like 'the', 'an', etc., since they do not give any useful information about the topic;
- Replace not-a-number values with a blank string;
- Finally, construct the TF-IDF matrix on the data.

CODE I

# Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a TF-IDF Vectorizer object. Remove all English
# stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

# Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')

# Construct the required TF-IDF matrix by fitting
# and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['overview'])

CODE II

# Output the shape of tfidf_matrix
tfidf_matrix.shape

# Array mapping from feature integer indices to feature names
tfidf.get_feature_names()[5000:5010]

# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

NAIVE OUTPUT

get_recommendations('The Dark Knight Rises')

12481    The Dark Knight
150      Batman Forever
1328     Batman Returns
15511    Batman: Under the Red Hood
585      Batman
21194    Batman Unmasked: The Psychology of the Dark Kn...
9230     Batman Beyond: Return of the Joker
18035    Batman: Year One
19792    Batman: The Dark Knight Returns, Part 1
3095     Batman: Mask of the Phantasm

– GOAL ACHIEVED: a decent job was done of finding movies with SIMILAR plot descriptions.

NAIVE OUTPUT

get_recommendations('The Dark Knight Rises')

12481    The Dark Knight
150      Batman Forever
1328     Batman Returns
15511    Batman: Under the Red Hood
585      Batman
21194    Batman Unmasked: The Psychology of the Dark Kn...
9230     Batman Beyond: Return of the Joker
18035    Batman: Year One
19792    Batman: The Dark Knight Returns, Part 1
3095     Batman: Mask of the Phantasm

– However, the quality of the recommendations is not great: "The Dark Knight Rises" returns all Batman movies, while it is more likely that people who liked that movie are more inclined to enjoy other Christopher Nolan movies.

RECOMMENDER 2: NLP

So what did we just use?
– As the name suggests, word vectors are vectorized representations of words in a document. The vectors carry semantic meaning with them. For example, man & king will have vector representations close to each other, while man & woman would have representations far from each other.
– Compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each document. This will give you a matrix where each row represents a movie and each column represents a word in the overview vocabulary (all the words that appear in at least one document).
– The TF-IDF score is the frequency of a word occurring in a document, down-weighted by the number of documents in which it occurs.
– This reduces the importance of words that frequently occur in plot overviews and, therefore, their significance in computing the final similarity score.

RECOMMENDER 2: NLP ALGORITHM COMPLETED

– Get the index of the movie given its title.
– Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list of tuples where the first element is its position and the second is the similarity score.
– Sort the aforementioned list of tuples based on the similarity scores, that is, the second element.
– Get the top 10 elements of this list. Ignore the first element, as it refers to self (the movie most similar to a particular movie is the movie itself).
– Return the titles corresponding to the indices of the top elements.
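The steps above can be sketched as a small helper. This is an illustration on a hypothetical three-movie similarity matrix, not the lecture's exact code:

```python
import numpy as np
import pandas as pd

def get_recommendations(title, cosine_sim, indices, metadata, top_n=10):
    # Step 1: index of the movie given its title
    idx = indices[title]
    # Step 2: (position, similarity) tuples with all movies
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Step 3: sort by the similarity score (the second element)
    sim_scores = sorted(sim_scores, key=lambda s: s[1], reverse=True)
    # Step 4: take the top elements, skipping the first (the movie itself)
    sim_scores = sim_scores[1:top_n + 1]
    # Step 5: map positions back to titles
    movie_indices = [i for i, _ in sim_scores]
    return metadata['title'].iloc[movie_indices]

# Hypothetical data: three movies, precomputed cosine similarities
metadata = pd.DataFrame({'title': ['A', 'B', 'C']})
indices = pd.Series(metadata.index, index=metadata['title'])
cosine_sim = np.array([[1.0, 0.9, 0.1],
                       [0.9, 1.0, 0.2],
                       [0.1, 0.2, 1.0]])
print(get_recommendations('A', cosine_sim, indices, metadata, top_n=2).tolist())
```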

RECOMMENDER 2: NLP ALGORITHM IMPROVED

credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')

# Remove rows with bad IDs.
metadata = metadata.drop([19730, 29503, 35587])

# Convert IDs to int. Required for merging
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
metadata['id'] = metadata['id'].astype('int')

RECOMMENDER 2: NLP ALGORITHM OUTPUT

get_recommendations('The Dark Knight Rises', cosine_sim2)

12589    The Dark Knight
10210    Batman Begins
9311     Shiner
9874     Amongst Friends
7772     Mitchell
516      Romeo Is Bleeding
11463    The Prestige
24090    Quicksand
25038    Deadfall
41063    Sara
Name: title, dtype: object

RECOMMENDER 3: FM AS THE DEEP LEARNING METHOD

[Figure: FM]

RECOMMENDER 3: FM AS THE DEEP LEARNING METHOD

– Item-based (filtering): similar to the content recommendation engine that we built. These systems identify similar items based on how people have rated them in the past. E.g., if Alice, Bob, & Eve have given 5 stars to The Lord of the Rings & The Hobbit, the system identifies the items as similar. Therefore, if someone buys The Lord of the Rings, the system also recommends The Hobbit to him or her.

Which other type of recommendation is there?
– User-based (filtering): recommends products that similar users have liked.
– E.g., Alice & Bob have a similar interest in books (i.e., they largely like & dislike the same books). Now, a new book has been launched into the market & Alice has read & loved it. It is, thus, likely that Bob will like it too. So the system recommends it.
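The Alice-and-Bob example can be sketched in a few lines of numpy: measure how similarly two users rate, then recommend the similar user's favorite unseen item. The ratings and the 0.8 threshold are hypothetical choices for illustration:

```python
import numpy as np

# Hypothetical user-item rating matrix: rows are users (Alice, Bob),
# columns are books; 0.0 means "not yet read".
ratings = np.array([
    [5.0, 5.0, 0.0],   # Alice has not read book 2
    [5.0, 5.0, 5.0],   # Bob loved all three, including the new book 2
])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim = cosine(ratings[0], ratings[1])

# If Alice and Bob rate similarly, recommend Bob's highest-rated
# item among those Alice has not read yet.
unseen = np.where(ratings[0] == 0)[0]
recommendation = int(unseen[np.argmax(ratings[1, unseen])])
print(sim, recommendation)
```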

FM

The architecture of DeepFM is illustrated below.

[Figure: FM]

DEEP LEARNING RECOMMENDER 3: FM AS A ML METHOD

– Learning effective feature combinations is critical to the success of the click-through rate prediction task. Factorization machines model feature interactions in a linear paradigm (e.g., bilinear interactions).
– This is often insufficient for real-world data, where inherent feature-crossing structures are usually very complex and nonlinear.
– What's worse, only second-order feature interactions are generally used in factorization machines in practice.
– Modeling higher degrees of feature combinations with factorization machines is theoretically possible, but it is usually not adopted due to numerical instability and high computational complexity. One effective solution is using deep neural networks.
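The second-order term the slides refer to is sum over all feature pairs of <v_i, v_j> x_i x_j. A quick numeric check (on hypothetical random data) confirms it equals the efficient O(kn) reformulation that FM implementations actually compute:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 4                      # n features, k latent factors
x = rng.normal(size=n)           # hypothetical feature vector
V = rng.normal(size=(n, k))      # hypothetical latent factor matrix

# Naive second-order FM term: sum over all pairs <v_i, v_j> x_i x_j
naive = sum((V[i] @ V[j]) * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# Equivalent O(kn) reformulation:
# 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
fast = 0.5 * float(np.sum((V.T @ x) ** 2 - (V ** 2).T @ x ** 2))

print(naive, fast)  # the two agree up to floating-point error
```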

ANALYSIS OF FM AS A DL SYSTEM

– Deep neural networks are powerful in feature representation learning and have the potential to learn sophisticated feature interactions.
– As such, it is natural to integrate deep neural networks into factorization machines. Adding nonlinear transformation layers to factorization machines gives them the capability to model both low-order and high-order feature combinations.
– Moreover, nonlinear inherent structures in the inputs can also be captured with deep neural networks. In this section, we will introduce a representative model named deep factorization machines (DeepFM) [Guo et al., 2017], which combines FM and deep neural networks.

CODE I

from d2l import mxnet as d2l
from mxnet import init, gluon, np, npx
from mxnet.gluon import nn
import os
import sys
npx.set_np()

class DeepFM(nn.Block):
    def __init__(self, field_dims, num_factors, mlp_dims, drop_rate=0.1):
        super(DeepFM, self).__init__()
        num_inputs = int(sum(field_dims))
        self.embedding = nn.Embedding(num_inputs, num_factors)
        self.fc = nn.Embedding(num_inputs, 1)
        self.linear_layer = nn.Dense(1, use_bias=True)
        input_dim = self.embed_output_dim = len(field_dims) * num_factors
        self.mlp = nn.Sequential()
        for dim in mlp_dims:
            self.mlp.add(nn.Dense(dim, 'relu', True, in_units=input_dim))
            self.mlp.add(nn.Dropout(rate=drop_rate))
            input_dim = dim
        self.mlp.add(nn.Dense(in_units=input_dim, units=1))

CODE II

batch_size = 2048
data_dir = d2l.download_extract('ctr')
train_data = d2l.CTRDataset(os.path.join(data_dir, 'train.csv'))
test_data = d2l.CTRDataset(os.path.join(data_dir, 'test.csv'),
                           feat_mapper=train_data.feat_mapper,
                           defaults=train_data.defaults)
field_dims = train_data.field_dims
train_iter = gluon.data.DataLoader(
    train_data, shuffle=True, last_batch='rollover',
    batch_size=batch_size,
    num_workers=d2l.get_dataloader_workers())
test_iter = gluon.data.DataLoader(
    test_data, shuffle=False, last_batch='rollover',
    batch_size=batch_size,
    num_workers=d2l.get_dataloader_workers())
devices = d2l.try_all_gpus()
net = DeepFM(field_dims, num_factors=10, mlp_dims=[30, 20, 10])
net.initialize(init.Xavier(), ctx=devices)
lr, num_epochs, optimizer = 0.01, 30, 'adam'

OUTPUT

Try to implement it on your own dataset and compare it to the previous systems.

DEEP LEARNING RECOMMENDER 3: FM

The implementation of DeepFM is similar to that of FM. We keep the FM part unchanged and use an MLP block with ReLU as the activation function. Dropout is also used to regularize the model. The number of neurons in the MLP can be adjusted with the mlp_dims hyperparameter.

The data loading process is the same as that of FM. We set the MLP component of DeepFM to a three-layered dense network with a pyramid structure (30-20-10). All other hyperparameters remain the same as FM.

DEEP LEARNING RECOMMENDER 4: CF

IT IS FUN AND REQUIRES CODING. TRY IT AT HOME!

WHY RECOMMENDER SYSTEMS?

– They are among the most popular applications of data science today.
– They are used to predict the "rating" or "preference" that a user would give to an item.
– Almost every major tech company has applied them: Amazon uses them to suggest products to customers, and Facebook uses them to recommend pages to like and people to follow.
– For some companies, like Netflix, Amazon Prime, Hulu, and Hotstar, the business model and its success revolve around the potency of their recommendations.
– They are popular in domains like restaurants, movies, and online dating.
– They have also been developed to explore research articles & experts, collaborators, & financial services.
– YouTube uses them at large scale to suggest videos based on your history. E.g., if you watch a lot of anime videos, it would suggest similar ones.

AWARDS

Netflix even offered a million dollars in 2009 to anyone who could improve its system by 10%.
