Come Hack With OpeNER! Workshop Programme


9:00 – 9:20 Introduction by Workshop Chair
9:20 – 10:30 Tutorial: OpeNER technology
10:30 – 11:00 Coffee break
11:00 – 11:45 Demo/Posters
  Carlo Aliprandi, Sara Pupi and Giulia di Pietro, Ent-it-UP: a Sentiment Analysis system based on OpeNER cloud services
  Jordi Atserias, Marieke van Erp, Isa Maks, German Rigau and J. Fernando Sánchez-Rada, EuroLoveMap: Confronting feelings from News
  Estela Saquete and Sonia Vázquez, Improving reading comprehension for hearing impaired students using Natural Language Processing
  Aitor García Pablos, Montse Cuadros, Seán Gaines and German Rigau, OpeNER demo: Open Polarity Enhanced Named Entity Recognition
  Andoni Azpeitia, Alexandra Balahur, Montse Cuadros, Antske Fokkens and Ruben Izquierdo Bevia, The Snowball effect: following opinions on controversial topics
  Stefano Cresci, Andrea D'Errico, Davide Gazzé, Angelica Lo Duca, Andrea Marchetti and Maurizio Tesconi, Tour-pedia: a Web Application for Sentiment Visualization in Tourism Domain
12:00 – 13:00 Lunch break
12:00 – 16:00 Hackathon
16:00 – 16:30 Coffee break
16:30 – 17:30 Results presentation

Editors
Seán Gaines, Vicomtech-IK4
Montse Cuadros, Vicomtech-IK4

Workshop Organizers/Organizing Committee
Rodrigo Agerri
Montse Cuadros
Francesca Frontini
Seán Gaines
Ruben Izquierdo
Wilco van …
Affiliations: … VUA, Olery

Workshop Programme Committee
Carlo Aliprandi
Andoni Azpeitia
Aitor Garcia-Pablos
Angelica Lo Duca
Isa Maks
Andrea Marchetti
Monica Monachini
German Rigau
Piek …
Affiliations: … CNR-IIT, CNR-ILC, EHU/UPV, VUA

Table of contents

Ent-it-UP: a Sentiment Analysis system based on OpeNER cloud services, Carlo Aliprandi, Sara Pupi and Giulia di Pietro
EuroLoveMap: Confronting feelings from News, Jordi Atserias, Marieke van Erp, Isa Maks, German Rigau and J. Fernando Sánchez-Rada
Improving reading comprehension for hearing impaired students using Natural Language Processing, Estela Saquete and Sonia Vázquez
OpeNER demo: Open Polarity Enhanced Named Entity Recognition, Aitor García Pablos, Montse Cuadros, Seán Gaines and German Rigau
The Snowball effect: following opinions on controversial topics, Andoni Azpeitia, Alexandra Balahur, Montse Cuadros, Antske Fokkens and Ruben Izquierdo Bevia
Tour-pedia: a Web Application for Sentiment Visualization in Tourism Domain, Stefano Cresci, Andrea D'Errico, Davide Gazzé, Angelica Lo Duca, Andrea Marchetti and Maurizio Tesconi


Preface/Introduction

The OpeNER team is delighted to present a Tutorial and a Hackathon together in a one-day workshop on multilingual Sentiment Analysis and Named Entity Resolution using the OpeNER NLP pipelines as web services in the Cloud.

OpeNER hopes to repeat the success of the July 2013 Amsterdam hackathon (/opener-hackathon-in-amsterdam/), in which a broad spectrum of real end-user SMEs, micro-SMEs, freelancers and even a few participants from technology giants built creative applications using the OpeNER web services. For examples of the applications built, follow the URL provided above.

The workshop will briefly present the project and all the technology (http://openerproject.github.io/): the multilingual NLP tools and resources created within the project. Additionally, it will include a slot for presentations of demos created before the Hackathon and submitted in response to the call for papers.

The workshop will be complemented by a half-day Hackathon. The Hackathon will encourage participants to form ad hoc multidisciplinary teams, brainstorm an idea, implement it and present a demo, from which a winner will be picked by popular vote. Most of the "core developers" of the OpeNER pipeline technology will be available to help participants get started.

All participants will be given access beforehand to the collateral needed, such as NLP tools and resources in six languages, from publicly deployed web services. At the time of writing, the initial versions of the services are publicly available at http://opener.olery.com. In order to present a demo or paper at the workshop, the only thing that needs to be added is imagination.

Ent-it-UP
A Sentiment Analysis system based on OpeNER Cloud Services

Sara Pupi, Giulia Di Pietro, Carlo Aliprandi
Synthema Srl
Via Malasoma 24, 56121 Ospedaletto (Pisa) - Italy
{sara.pupi, giulia.dipietro, carlo.aliprandi}@synthema.it

Abstract
In this paper we present a web application that exploits OpeNER Cloud Services. Ent-it-UP monitors Social Media and traditional Mass Media content, performing multilingual Named Entity Recognition and Sentiment Analysis. Since consumers tend to trust the opinion of other consumers, reviews and ratings on the internet are increasingly important. Given the huge amount of data flowing through the web, it has become necessary to adopt an automatic data analysis strategy in order to understand what people think about a certain product, brand or topic. The goal of Ent-it-UP is to compute statistics about retrieved entities and display the results in a communicative, intuitive and user-friendly interface. In this way the end user can easily get a sense of people's opinions without spending too much time analyzing the huge amount of User-Generated Content.

Keywords: Reference Application, OpeNER, Named Entity Recognition and Classification, Sentiment Analysis, Social Media, User-Generated Content.

1. Introduction

Customer reviews and ratings on the internet are increasingly important in the evaluation of products and services by potential customers. In certain sectors, they are even becoming a fundamental variable in the purchase decision. Consumers tend to trust the opinion of other consumers, especially those with prior experience of a product or service, rather than company marketing opinions, which are usually business oriented. Given the huge amount of data flowing through the web, it has become necessary to adopt an automatic data analysis strategy. Such a strategy makes it possible to understand what people think about a certain product, brand or topic without spending too much time exploring User-Generated Contents.

On the other hand, traditional Mass Media still play an important role in the way people get information. Opinion Mining in Media is a fairly new, but already consolidated, field of research. People operating in this sector aim to know who is speaking, about what, when and in what sense. Named Entity Recognition and Classification (NERC) is important in determining roles (who, what and when), while Sentiment Analysis (SA) is necessary to determine the attitude of a writer with respect to the overall contextual polarity of the text (in what sense).

OpeNER has created base technologies for Crosslingual NERC and Sentiment Analysis that enable industry users both to implement and to contribute to a basic set of core technologies that all require, and allow them to focus their efforts on providing tailored and innovative solutions at the rules and analysis levels. OpeNER aims to provide enterprise and society with online services for Crosslingual Named Entity Recognition and Classification and Sentiment Analysis.

In this paper we present a new multimedia web application, Ent-it-UP, built on OpeNER Cloud Services (http://opener.olery.com/). This application is a … User-Generated Contents (UGCs) and video contents.

2. Ent-it-UP design

Ent-it-UP is an application accessible from the Web that provides users with a clear and effective visualization of the knowledge extracted from two different sources: User-Generated Contents and the transcriptions of videos. In the following sections we describe the steps that lead from the collected data to their communicative and intuitive visualization through the Ent-it-UP interface.

2.1 Data harvesting

The first step is to collect the data and store them in a database. The data are taken from two different sources, so that the same subject can be looked at from two different points of view. The first source is Social Media (such as blogs, forums, Online Travelling Agencies and so on), which shows what people think. The second source is international news programs, which show what the news say.

The first dataset needs to be pre-processed in order to remove noise and obtain clean text. The news programs, on the other hand, need to be processed by the SAVAS Speech Recognition Engine (http://voiceinteraction.pt) in order to obtain transcriptions of the recorded videos. The system returns both an XML file and a plain text file. The XML file contains the timestamp of each word and is used to link the transcribed text to the video itself. The raw text is taken as input by the OpeNER tools. The same happens to the previously cleaned UGC text.

All the data retrieved so far are stored in a MongoDB database.

2.2 Data Annotation

The raw text files obtained are processed by the OpeNER Cloud Services, which consist of a series of NLP tools, listed below.

Language Identifier
Tokenizer
Tree Tagger
Part-of-Speech Tagger
Polarity Tagger
Property Tagger
Constituent Parser
Kaf-Naf Parser
Named Entity Recognition
Scorer
Named Entity Detection
Opinion Detector

It is possible to use only some of the NLP tools or all of them. Some basic analysis is required in any case to support Named Entity Recognition and Sentiment Analysis. This basic analysis can be performed by only two NLP tools: the Tokenizer (provided the language of the text is known; otherwise the Language Identifier is required too) and the Part-of-Speech Tagger. Thus, in order to implement the Ent-it-UP functionalities, these are the four NLP tools that have been used:

Tokenizer
Part-of-Speech Tagger
Named Entity Recognition
Polarity Tagger

The result is a KAF (Knowledge Annotation Framework) [1][2] file, which has an XML-like structure. It consists of several linguistic layers (Figure 1).

Figure 1. KAF Layers.

The annotated levels of the KAF that are taken into account by the Ent-it-UP system are the terms level (from which it gets the word polarity) and the named entities level. These data are also added to the MongoDB database.

2.3 Data Processing

Once the raw texts have been transformed into KAF, they can be processed further. Some PHP scripts perform queries on the MongoDB collections and return quantitative results such as entity frequency, entity occurrences and other metrics.

2.4 Data Visualization

The results mentioned above now have to be shown to the user. Some of the functionalities offered by Ent-it-UP are the following.

The user can explore a general interactive tagcloud of the most frequent entities (Figure 2).

Figure 2. Ent-it-UP tagcloud

The user can also explore an entity-focused report, which can be obtained by searching for a specific entity or by choosing one of those shown in the tagcloud. The report includes the occurrences of the entity in the videos and its frequency over time (Figure 3).

Figure 3. Ent-it-UP timeline

If the user decides to focus the search on transcriptions of videos, a video-focused report is also available for any specific video in the collection. The report includes statistics about the composition of the entities (the percentage of entities recognized in the video transcription that have been identified as people, the percentage identified as organizations, and so on). This report also provides a tagcloud of the entities found in the video. The user can further narrow down the tagcloud by selecting only the category of entities of interest (Figure 4).

Figure 4. Ent-it-UP statistics

The user can also explore the transcription, in which entities are marked with different colors according to their type (e.g. entities identified as people are colored in orange, entities identified as locations are colored in green, and so on). Terms with polarity (positive, negative or neutral) are also highlighted (in green, red and grey respectively). The user can choose to highlight only entities (all types or just some), only sentiment, or both. Next to the transcription there is a video player. To listen to the point at which a certain word is spoken, the user can simply click that word in the text and the video will jump to that point (Figure 5).

Figure 5. Ent-it-UP transcription

3. Usage case

This section presents a usage case in which both data sources are exploited. In this case, UGC and video contents are both useful to the user, who can look at the same subject from two different points of view.

Suppose that the end user is interested in investigating what people think about a certain city, for example how Paris is perceived. The user could be interested in knowing which areas or features are mentioned most and whether people love them or not. On the other hand, the user could be interested in an overall view of the news events concerning the city.
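The kind of question raised in this usage case (which entities are mentioned together with a city, and with what polarity, per data source) can be illustrated with a minimal Python sketch. This is not the Ent-it-UP implementation, which according to the paper uses PHP scripts over MongoDB collections; the record layout, the field names (`source`, `entities`, `polarities`) and the sample data below are all hypothetical.

```python
from collections import Counter

# Hypothetical annotated records: one per document, holding the named
# entities and the term polarities produced by the annotation step.
DOCS = [
    {"source": "ugc", "entities": ["Paris", "Louvre"],
     "polarities": ["positive", "positive", "negative"]},
    {"source": "ugc", "entities": ["Paris", "Montmartre", "Louvre"],
     "polarities": ["positive"]},
    {"source": "video", "entities": ["Paris", "Elysee"],
     "polarities": ["negative", "neutral"]},
    {"source": "video", "entities": ["Berlin"],
     "polarities": ["positive"]},
]


def entity_report(docs, entity, source):
    """Summarise the documents from one source that mention `entity`."""
    hits = [d for d in docs
            if d["source"] == source and entity in d["entities"]]
    # Entities mentioned in the same documents as the searched entity.
    co_mentions = Counter(
        e for d in hits for e in d["entities"] if e != entity
    )
    # Polarity labels of all terms in the matching documents.
    polarity = Counter(p for d in hits for p in d["polarities"])
    return {
        "documents": len(hits),
        "co_mentions": co_mentions.most_common(),
        "polarity": dict(polarity),
    }


ugc = entity_report(DOCS, "Paris", "ugc")
video = entity_report(DOCS, "Paris", "video")
print(ugc)
print(video)
```

Running the same report against each source gives the two complementary views described here: one over user opinions and one over news content.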

The user can access Ent-it-UP and search for the keyword Paris using one or the other dataset. In this way, the user can get two different kinds of information about Paris. Choosing the UGC source would probably return everyday-life information about Paris (what people think). Choosing the video source, on the other hand, would probably return information about the events happening in Paris (what the news say).

4. Conclusions

This paper has presented Ent-it-UP as a reference application of the OpeNER project. We have shown how Ent-it-UP monitors Media contents, performing multilingual Named Entity Recognition and Sentiment Analysis on User-Generated Content and video transcriptions. After a short introduction we have described the Ent-it-UP design, identifying the main steps that lead from raw texts to extracted knowledge. We have reported a usage case in which Ent-it-UP is used to obtain an overall view of a place. However, it could also be used to discover information about a certain brand, person, organization and so on.

Ent-it-UP allows the user to focus on other activities rather than spend time analyzing the raw language resources.

Acknowledgments

This work is part of the OpeNER project, which is funded by the European Commission 7th Framework Programme (FP7), grant agreement no 296451.

5. References

[1] Maurizio Tesconi, Francesco Ronzano, Salvatore Minutoli, Carlo Aliprandi and Andrea Marchetti: "KAFnotator: a multilingual semantic text annotation tool", Proceedings of the 5th Joint ISO-ACL/SIGSEM Workshop on Interoperable Semantic Annotation, in conjunction with the Second International Conference on Global Interoperability for Language Resources (ICGL 2010), Hong Kong, January 15-17, 2010.

[2] Wauter Bosma, Piek Vossen, Aitor Soroa, German Rigau, Maurizio Tesconi, Andrea Marchetti, Monica Monachini and Carlo Aliprandi: "KAF: a generic semantic annotation format", Proceedings of the 5th International Conference on Generative Approaches to the Lexicon (GL 2009), Pisa, Italy, September 17-19, 2009.

[3] Carlo Aliprandi, Cristina Scudellari, Isabella Gallucci, Nicola Piccinini, Matteo Raffaelli, Arantza del Pozo, Aitor Álvarez, Haritz Arzelus, Renato Cassaca, Tiago Luis, Joao Neto, Carlos Mendes, Sérgio Paulo, Marcio Viveiros: "Automatic Live Subtitling: state of the art, expectations and current trends", NAB Broadcasting Conference, Las Vegas, Nevada, United States, April 2014 (forthcoming).
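As an illustration of the annotation output that Ent-it-UP relies on (the terms level with word polarity, and the named entities level), the following Python sketch reads a small KAF-like document. The sample XML is a simplified approximation written for this example: its element and attribute names (`term`, `sentiment`, `entity`, `target`, `tid`, `polarity`) are assumptions about the general shape of a KAF file and should be checked against the actual KAF output before being relied on.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Simplified KAF-like document. Element and attribute names are an
# approximation for illustration, not a copy of the KAF specification.
KAF_SAMPLE = """
<KAF xml:lang="en">
  <terms>
    <term tid="t1" lemma="Paris" pos="R"/>
    <term tid="t2" lemma="be" pos="V"/>
    <term tid="t3" lemma="beautiful" pos="G">
      <sentiment polarity="positive"/>
    </term>
    <term tid="t4" lemma="crowded" pos="G">
      <sentiment polarity="negative"/>
    </term>
  </terms>
  <entities>
    <entity eid="e1" type="location">
      <references><span><target id="t1"/></span></references>
    </entity>
  </entities>
</KAF>
"""


def summarise_kaf(kaf_xml):
    """Collect term polarities and named entities from a KAF-like string."""
    root = ET.fromstring(kaf_xml.strip())
    # Map term identifiers to lemmas so entity spans can be resolved.
    lemma_by_tid = {t.get("tid"): t.get("lemma") for t in root.iter("term")}
    # Count polarity labels attached to terms (the terms level).
    polarity = Counter(s.get("polarity") for s in root.iter("sentiment"))
    # Resolve each entity to its surface lemmas and type (entities level).
    entities = []
    for entity in root.iter("entity"):
        lemmas = [lemma_by_tid[t.get("id")] for t in entity.iter("target")]
        entities.append((" ".join(lemmas), entity.get("type")))
    return {"polarity": dict(polarity), "entities": entities}


summary = summarise_kaf(KAF_SAMPLE)
print(summary)
```

A summary of this form, one per document, is the sort of record that could then be stored and queried for entity frequency and polarity statistics.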

EuroLoveMap: Confronting feelings from News

Jordi Atserias (1), Marieke van Erp (2), Isa Maks (2), German Rigau (3), J. Fernando Sánchez-Rada (4)
(1) Yahoo Labs Barcelona, (2) VU University Amsterdam, (3) The University of Basque Country, (4) Universidad Politécnica de Madrid
jordi@yahoo-inc.com, {marieke.van.erp,e.maks}@vu.nl, german.rigau@ehu.es, jfernando@gsi.dit.upm.es

Abstract
Opinion mining is a natural language analysis task aimed at obtaining the overall sentiment regarding a particular topic. This paper presents a prototype that shows the overall sentiment on a topic based on the geographical distribution of the sources covering that topic. The prototype was developed in a single day during the hackathon organised by the OpeNER project in Amsterdam last year. The OpeNER infrastructure was used to process a large set of news articles in four different languages. Using these tools, an overall sentiment analysis was obtained for a set of topics mentioned in the news articles and presented on an interactive world map.

Keywords: Opinion Mining, Visualisation, Hackathon

1. Introduction

Different topics are often presented in news from different perspectives. These perspectives may differ between countries and cultures, and are brought to the fore through different communication outlets. We aim to detect these opinions in news articles in different languages in order to compare the polarity profiles of different countries with respect to a particular topic. Within NLP research, there is a fair body of work on opinion and sentiment analysis (Pang and Lee, 2008; Liu, 2012). Several toolkits have been developed for the detection of polarity in text, but full multilingual opinion detection, which includes the holder of the opinion and the target, is still lagging. The OpeNER project plans to deliver an opinion detection tool that is trained on an annotated corpus of political news and aims at sentence-based detection of opinion expressions with their holders and targets. For this demo, however, we use the rule-based opinion tagger that was available in June 2013.

This paper presents a prototype developed in a single day during the June 2013 hackathon organised by the OpeNER project (Agerri et al., 2013) in Amsterdam. OpeNER aims to detect and disambiguate entity mentions and to perform sentiment analysis and opinion detection on texts in six different languages (Maks et al., 2014). Team NAPOLEON used the OpeNER infrastructure and web services to obtain sentiment analyses for news articles in four different languages, which were then aggregated into topics per country and presented visually on a map.

In the remainder of this contribution, we detail our system in Section 2 and present some examples in Section 3. We conclude with future work in Section 4.

2. Mining feelings from news using OpeNER

During the hackathon, we processed around 22,000 news articles in four different languages obtained from the RSS service of the European Media Monitor. The content as well as some metadata of the newspaper articles was obtained before the hackathon. For this prototype, we decided to focus on English, Spanish, Italian and Dutch. For instance, the topic gay marriage was manually translated into the four languages, and news articles relevant to this topic were collected and processed. An overall sentiment score was also obtained per language for each topic. Finally, the aggregated score for every topic-language pair was used for colouring a world map.

During the hackathon, we developed some software modules to process each news article through the OpeNER web services. In the remainder of this section, we detail the different steps in the workflow.

The OpeNER architecture consists of several Natural Language Processing (NLP) components. Each component is configured to take the information it requires to perform a specific analysis. KAF (Bosma et al., 2009) is used as linguistic representation. Each of the NLP processing pipelines is deployed as a Cloud Computing service using Amazon Elastic Computing Cloud (Amazon EC2, http://aws.amazon.com/ec2). Figure 1 presents an overview of the OpeNER components deployed as web services.

Figure 1: Overview of the components of the OpeNER pipeline

At the end of the different natural language processing pipelines, the extracted information is combined to obtain polarity clusters for the different topics selected.

Language Identifier: This component is responsible for detecting the language of an input news article and delivering it to the correct pipeline.

Tokenizer: This component is responsible for tokenising the text on two levels: 1) sentence level and 2) word level. This component is crucial for the rest of the NLP components and is the first component in each language processing pipeline.

Part of Speech Tagger: This component is responsible for assigning to each token its morphological label; it also performs the lemmatisation of words. Combining the lemma and the morphological label, later modules consult a sentiment lexicon in order to assign polarity values to the words appearing in the news being processed.

Named Entity Recognition: This module provides Named Entity Recognition (NER) for the six languages covered by OpeNER and tries to recognize four types of named entities: persons, locations, organisations and names of miscellaneous entities that do not belong to the previous three groups.

Named Entity Linking: Once the named entities are recognised, they can be identified or disambiguated with respect to an existing catalogue. This is required because the "surface form" of a Named Entity can actually refer to several different things in the world. Wikipedia has become the de facto standard as a named entity catalogue. In OpeNER the NED component is based on DBpedia Spotlight, which uses DBpedia as the resource for disambiguating entities.

Sentiment Analysis: The Opinion tagger we used is a rule and dictionary based tagger. It detects positive and negative polarity words (such as 'nice' and 'awful'), as well as intensifiers or weakeners (such as 'very' and 'hardly') and polarity shifters (such as 'not'). In addition, the module includes some simple rules that detect the holders and targets of the opinions related to the positive and negative polarity words.

Finally, the processed news articles in KAF format are stored and indexed using Solr (…ucene.apache.org/solr/) so that the news articles about a selected topic can easily be queried and retrieved. A web service was deployed to obtain JSON results grouping the detected scores by topic and language. The JSON results are then presented to the user on a world map.

3. Topics on EuroLoveMap

In order to test the prototype we manually selected a small number of topics in English, which were manually translated into Spanish, Italian and Dutch. (To scope the prototype, we decided to focus on only four of the six project languages.) Table 1 presents the English topics and the corresponding translations in Spanish, Italian and Dutch used in the prototype. The resulting demo can be found at …

Figure 2 presents a screenshot of the EuroLoveMap demo showing the extracted opinions on "gay marriage".

4. Future Work

As this is only a very first prototype built in a few hours during the previous OpeNER hackathon, there are several different avenues of research as well as engineering issues that spring from it.

To make the prototype more informative and useful for users interested in analysing trending opinions, possible extensions could be a trend line or the option to look at different snapshots of the EuroLoveMap. This could provide insights into how the opinions on the different topics evolve in different countries.

For selecting the news sources, we currently use language identification, but it would be preferable to use the publisher information, as there may be news sources aimed at expats in languages different from the country's main language. This would not only be more precise, but would also give us access to a host of background information about these sources that can be mined in order to obtain more fine-grained information. Different publishers can, for example, be classified as more left or right leaning. Having this information would enable us to present a more fine-grained analysis of the different perspectives within a country. Information about the publisher or authors of the articles could be further mined to create authority and trust profiles using PROV-O (Moreau et al., 2012). Being able to bring up the actual text of the mined articles would make the EuroLoveMap a useful tool for, for example, communication scientists or anthropologists.

For this prototype, we manually selected the topics and translated them. Ideally, a system would pick up on trending topics, for example by plugging into the European Media Monitor or Twitter trends and detecting which topics would be interesting to analyse. To translate these topics automatically, one could imagine using DBpedia or a similar resource.

As processing the articles via the NLP pipelines is a time-consuming process, we are currently working with a static dump of processed articles. Research, for example in the NewsReader (…wsreader-project.eu) architecture, is underway to optimise NLP pipelines further, but until then the most viable option for updating the demo would be daily batches that are processed overnight.
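The rule and dictionary based tagging described under Sentiment Analysis (polarity words, intensifiers or weakeners, and polarity shifters) can be sketched in Python as follows. This is a minimal illustration and not the OpeNER opinion tagger: the tiny lexicons, the numeric weights and the rule that a modifier applies to the next polarity word are all assumptions made for this example.

```python
# Tiny illustrative lexicons; real sentiment lexicons are far larger.
POLARITY = {"nice": 1.0, "good": 1.0, "awful": -1.0, "bad": -1.0}
INTENSIFIERS = {"very": 1.5}   # assumed weight: strengthens the next word
WEAKENERS = {"hardly": 0.5}    # assumed weight: weakens the next word
SHIFTERS = {"not", "never"}    # flip the sign of the next polarity word


def score_sentence(tokens):
    """Sum word polarities, applying pending modifiers to each polarity word."""
    total = 0.0
    scale = 1.0      # pending intensifier/weakener factor
    negate = False   # pending polarity shift
    for token in tokens:
        word = token.lower()
        if word in INTENSIFIERS:
            scale *= INTENSIFIERS[word]
        elif word in WEAKENERS:
            scale *= WEAKENERS[word]
        elif word in SHIFTERS:
            negate = not negate
        elif word in POLARITY:
            value = POLARITY[word] * scale
            total += -value if negate else value
            # Modifiers are consumed once they have been applied.
            scale, negate = 1.0, False
    return total


print(score_sentence("the hotel was very nice".split()))
print(score_sentence("the food was not nice".split()))
print(score_sentence("the room was hardly nice".split()))
```

Holder and target detection, which the paper mentions as additional simple rules, is left out of this sketch.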

English | Spanish | Italian | Dutch
Berlusconi | Berlusconi | Berlusconi | …
Boston | Boston | Boston | …
North Korea | Corea del Norte | Corea del Nord | …
Obama | Obama | Obama | …
Putin | Putin | Putin | …
CIA | CIA | CIA | …
Snowden | Snowden | Snowden | …nowden
Spain | España | Spagna | Spanje
United States, US | Estados Unidos, E.E.U.U. | Stati Uniti | Verenigde Staten van Amerika, VS
Netherlands | Holanda | Olanda | Nederland, Holland
Italy | Italia | Italia | Italië
Germany | Alemania | Germania | Duitsland
Gay marriage, homosexual marriage | matrimonio homosexual, matrimonio gay | matrimonio … | homohuwelijk

Table 1: Topics and translations

Figure 2: Screenshot of the EuroLoveMap demo showing the extracted opinions on "gay marriage"

Acknowledgements

References

Maks, Isa, Izquierdo, Ruben, Frontini, Francesca, Azpeitia, Andoni, Agerri, Rodrigo, and Vossen, Piek. (2014). Generating polarity lexicons with wordnet propagation in 5 languages. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik, Iceland, May.

Moreau, Luc, Missier, Paolo, Belhajjame, Khalid, B'Far, Reza, Cheney, James, Coppens, Sam, Cresswell, Stephen, Gil, Yolanda, Groth, Paul, Klyne, Graham, Lebo, Timothy, McCusker, Jim, Miles, Simon, Myers, James, Sahoo, Satya, and Tilmes, Curt. (2012)
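The step in which an aggregated score for every topic-language pair is used for colouring the world map can be sketched in Python as follows. The paper does not state how the scores were aggregated or mapped to colours, so the use of a mean, the threshold value, the colour names and the sample scores are all assumptions made for this illustration.

```python
from collections import defaultdict

# Hypothetical per-article results: (topic, language, sentiment score).
ARTICLE_SCORES = [
    ("gay marriage", "en", 2.0),
    ("gay marriage", "en", -1.0),
    ("gay marriage", "es", -3.0),
    ("gay marriage", "nl", 0.0),
    ("Obama", "it", 1.0),
]


def aggregate(scores):
    """Return the mean sentiment score for each (topic, language) pair."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for topic, lang, score in scores:
        sums[(topic, lang)] += score
        counts[(topic, lang)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def colour(mean_score, threshold=0.25):
    """Map a mean score to a map colour (assumed three-way scheme)."""
    if mean_score > threshold:
        return "green"
    if mean_score < -threshold:
        return "red"
    return "grey"


means = aggregate(ARTICLE_SCORES)
map_colours = {key: colour(value) for key, value in means.items()}
print(map_colours)
```

A dictionary of this form, serialised as JSON, is the sort of payload a map front end could consume to colour each language region per topic.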
