A Digital Archive Of The Book Of Disquiet Search And . - ULisboa

Transcription

1A Digital Archive of the Book of DisquietSearch and RecommendationsAndré Filipe Braz dos Santosandrefbsantos@tecnico.ulisboa.ptInstituto Superior TécnicoAbstract—The Collaborative Digital Archive of the Bookof Disquiet is platform to show textual fragments of theBook of Disquiet, which aims to provide an archive to Fernando Pessoa’s work as a representation of the fragmentsand their interpretations. It also aims to be a platformto create new editions of the the Book of Disquiet in asocial and collaborative context. In this paper we introducethe development of search and recommendations, basedon a set of characteristics, which aims to provide a toolto help users explore the archive and deliver contentof the user’s interest. The search functionality providesa discovering tool allowing users to explore the archivebased on dynamic searches, defined by options selected andconfigured by them. The recommendation functionalityprovides a way of discovering the archive according tothe user’s taste, which means that recommendations aretailored to the users’ preferences. The recommendationfunctionality helps users navigate through similar fragments, based on their taste. Additionally it provides a toolto sort editions based on similarity, creating a semanticedition where fragments’ position express their degreeof similarity, based on criterias configured by the user.The recommendation follows a content based approach,supported by vector space model, where we translateuser’s preferences and fragment’s properties into vectorsand find the degree of interest with cosine similarity.Our implementation of these functionalities follows anapproach which not only implements the current searchoptions and recommendations criterias but also allows itsextension to support new options and new criterias withoutchanging the overall strategy implemented for search andrecommendations.Index Terms—Search, Recommendations, Fragments,Collaborative Digital Archive, Book of Disquiet, FernandoPessoaI. I NTRODUCTIONThe Book of Disquiet is an unpublished bookcomposed by a set of fragmentary text written, byBernardo Soares, one of Fernando Pessoa’s heteronyms, from the early 1910 to the year of itsdeath, in 1935.Until our days, four Book of Disquiet were published as a critical vision of the Book of Disquietfrom their editors, Jacinto Prado Coelho in 1982,Teresa Sobral Cunha in 1990-91, Richard Zenithin 1998 Jerónimo Pizarro em 2010 [1], [2]. Sinceit was a critical appreciation, they diverge in theamount of fragments, in their order of fragments,following their own thematic and chronological approach, their orthography, Coelho and Pizarro follow Pessoa’s orthography while Cunha and Zenithupdate Pessoa’s orthography to their period’s orthography. Even the authorship of the fragments isdisputed, Cunha presents some of the fragments asbeen written by Vicente Guedes, another of Pessoa’sheteronyms. The Book of Disquiet is a book withouta consensus about what was Pessoa’s vision for thisbook.The Collaborative Digital Archive of the Bookof Disquiet is an web application which aims towork as an archive for the fragments written forPessoa’s book and as a social platform to build newedition of the Book of Disquiet [3]. As an archiveintends to show the multiple faces of the fragments,from digitalization of the original texts, with itsmultiple transcriptions, written in the context of thecritical editions, codified in TEI [4]. As a socialplatform, aims to produce new editions of the Bookof Disquiet built by groups of users, allowing themto express their vision of what the book should be.The archive was envisioned with the entitiesshown in figure 1, here we introduce both entitiesand relations between them [5]. Where we cansee how fragments relate to interpretations andsources, heteronym and taxonomies are assigned tofragments and what comprises and edition. Thoseproperties will translate into search options, whereusers configure an option to find a specific propertyor a set of specific properties by composing different

2Fig. 1: Class diagram of domain.options.II. P ROBLEMThe team developing the Collaborative DigitalArchive of the Book of Disquiet envision a set offunctionalities which would help users exploring thearchive. Users should be able to explore the archiveby searching for fragments, as well as be able todiscover new content through recommendations thatthe user might like.The criterias presented to users to configure howsuggested fragments are found while they navigatethrough the archive are also based on the domainmodel. Those criterias are used to sort an edition,based on the similarity of each fragment to theselected fragment, building a semantic edition wherethe position of each fragment expresses his degreeof similarity to the other fragment. Sorting editionsis refined to not only sort editions but also create A. Searchsections which holds fragments with the same deIn the field of search, it was envisioned a simplegree of similarity based on a set of recommendationtextual search and and advanced search for keycriteria. Sections may also hold sections, expressingproperties of the fragments.a structure of the edition.Simple search handles the most usual and simpleOur content based recommendation system sup- way of searching, therefore it will provide a tool toported by vector space model [6], that models user’s search for fragments with a set of keywords.preferences as vectors. The degree of interest ofThe second level, also known as advanced searchthe user for an item is now expressed by cosine was envisioned as a way of combining several difsimilarity between the user’s preference vector and ferent criterias and build a custom search to providefragment’s properties vector.fragments capable of justifying the chosen criteria.The implementation of both system follows a set Users will arrange and configure several options toof software engineering good practices in terms of find fragments with those properties.modularity and flexibility, making the implementation open for extension, allowing future imple- B. Recommendationsmentations of both search options and recommenOn the other hand, recommendations intends todations’ criterias without changing the search or deliver fragments and allow users to discover therecommendation algorithms.archive according to ther taste.In section II we define the problem and the goalsThe archive allows users to navigate to otherof our work. In section III we present a solution fragments while consulting a fragment. The currentto address the previously defined problem, which navigation method is based on the position of thewill guide the implementation and integration on fragment in the context of editions where it wasthe archive related in section IV. In the last section, include, allow users to travel to the previous andwe reflect upon the work realized and how it can next fragment in those editions. We also intend toimplement an alternative form of navigation whereevolved.

3the next fragment is the most similar fragmentin the context of that edition instead of the fragment following the current fragment in the overallorder of the edition. Users should also be ableto express their set of criterias which guide therecommendation to find the most similar fragments.This features intends to allow users to explore theyarchive following a path build specially for them.It also intends to provide a tool to sort virtualedition by the degree of similarity of fragments toa reference fragment, building semantic edition bya set of chosen criterias. The sort feature shouldbe expanded to not only sort, but also divide avirtual edition in sections, where each section holdsfragments with the same degree of similarity tothe reference fragment. The last feature will beextended to not only sort, but also cluster fragmentswith the same value of similarity. All fragmentsinside a cluster express the same degree of similarityto the reference fragment. It will also be possibleto apply clustering measures to a cluster turning avirtual editions in sections with several, simulatinga chapter, section, subsection according to a set ofcriterias chosen by the user to apply to a certaincluster.C. Navigation through recommendationThe purpose of navigation through recommendation intends to suggest a fragment similar to theone currently being visualise, therefore, a set ofsuggestions will be extracted from the user’s virtualeditions. For each virtual edition which includesthe current fragment, the most similar fragmentfrom each virtual edition is found and shown tothe user. When the user accepts a recommendation,by travelling to the suggested fragment, he starts anavigation by recommendation, While in this mode,the user will build a recommended edition withthe fragments the user has visited while navigatingin this mode. While he travels through the recommended edition, the recommended fragment isalways a fragment from the virtual edition whichthe user hasn’t seen.D. Sort editionsWhen users visualise their virtual editions, theywill be able to sort their editions by similaritybetween fragments of that particular virtual. He willconfigure a set of recommendation criteria according to their taste and select a fragment to be startingpoint of the sorted edition. The sorted edition willorder fragments according to the similarity to theIII. S OLUTIONchosen fragment, following the user’s taste, showingthe most similarity fragments at the beginning ofA. Simple searchthe edition and the most different at the end. AfterThis feature will present a tool to insert a set of sorting a virtual edition, they will be able to eitherkeywords which are sent to the server. The servers save the virtual edition or create a new virtualfinds fragments with interpretation with the set of edition with the new order.words. The fragments are then presented to the user.E. Iterative sort editionsThe features introduce with sort editions areB. Advanced searchextended to split an edition in semantic sections ofUsers will build a genetic search through the similarity between fragment. The sectioning of ediselection and configuration of multiple options. The tions will be achieved with introduction of sectionsallowed options will be heteronym, date, inclusion to the current domain.on critical edition, existence of sources, presenceof keywords, inclusion on virtual editions and taxonomies. They will also be allowed to configure F. Recommendation Systemthe search mode. The conjunctive mode, where aThe recommendations features presented in thefragment has to satisfy all options. The disjunctive last three points are supported by a content basedmode, where a fragment only needs to satisfy one recommendation system. It will use a memoryoption. User selects the options and configure them, based strategy supported by vector space modelthe search is sent to the server and returns fragment where the similarity is calculated by cosine simwith interpretations capable of satisfying the options ilarity. This approach translate user’s preferencesand search mode.and items’ properties into vectors. The similarity is

4found through cosine similarity between the user’spreference vector and the items’ properties vector,funding the likelihood of the user being interestedin the item.1) User’s preferences: This recommendationsystem will use a explicit approach to gather user’spreferences. Users will select a fragment and calibrate his degree of interest in the fragments’ properties, such as heteronym, date, edition, sources,text and taxonomies. Users’ preferences vector willbe built from the fragment and will reflect thedegree of interest of the user in the fragmentsproperties, according to the calibrate relevance ofeach property.2) Properties vectors: The fragment’s propertiesvectors deals with the transformation of fragmentsproperties into vectors. This is achieve through theexplicit attribution of a position of the vector toa concrete property. The properties will be builtthrough the concatenation of smaller vectors produced with a proper semantic, where each positionexpress a property inside that semantic. Thereforeeach one of these small vector will present semantic properties referring to heteronym, date, edition,sources, text and taxonomies.The properties vector of a fragment’s heteronymwill express the heteronym associated with theirinterpretations. The vector has a position for eachheteronym in the application and they will containthe value one when there is at least one attributionassociated with the heteronym or zero when thereisn’t a single attribution to that particular heteronym.Figure 2 shows 2 vectors, v1 is a from a fragment with interpretations only assigned to BernardoSoares, while v2 has only interpretations assigned toVicente Guedes or without assignment.The date found in the interpretations of a fragment will be expressed in a vector where eachposition represents a year where the fragment waswritten. Therefore each position that represents ayear found in the fragment’s interpretations willappear with the value 1. If we choose to place 0 inthe other position, we wouldn’t be able to expresssimilarity between fragments written in differentyears, thus we populate the rest of the vector withvalues capable of expressing this relation. Starting at a position with a 1, we populate the nextand previous positions with 1 minus a predefineddecline, the second previous and the second nextwith 1 minus two times the predefined decline, eachposition is filled with a value until the cumulative decline reaches 0. From there the rest of thepositions are filled with 0. With this strategy wecan express similarity between fragments written indifferent dates. In figure 3 we have a fragment withinterpretation written in 1925 and 1917. We cansee those positions marked with 1 while the otherpositions display other values based on the decay.The inclusion on critical editions will representthe expert interpretations of the fragment. This vector has a position to represent each critical editionin the archive. The position which represents thecritical edition where the interpretation belongs,followed by a set o positions to express heteronymand date like described in the previous paragraphs.This edition vector of a fragment will have severalcopies of this vector and each copy will representthe inclusion in one critical edition. Figure 4 displaya vector with edition properties.The source vector will have three sections torepresent each type of source. The first section willhave a set o position where the first position appearswith 1 when there are manuscript sources or 0 whenthere isn’t. The second position appears with thevalue 1 when the source has the LdoD mark. Thefollowing position express the source informationfor heteronym and date like previously described.The second section is similar to the previous one,but it marks the existence of typescript source instead of manuscript sources. The last section refersto the printed sources. This section has one positionmarking the existence of printed sources, whichappears with one when there are printed sourcesor 0 when there isn’t. This position is followed bypositions to represent heteronym and date like it waspreviously described. Figure 5 display a vector withsource properties.The theme of the text will be used to expresssimilarity based on the most relevant words of thefragment’s interpretations. To calculate the similarity between two fragments, the tf-idf is calculatedfor each term of the fragment interpretation andthe 100 most relevant terms are chosen from eachfragment. These terms are the combined into a set.Each position of the vector will represent the termint he same position in the combined set and thevalue is the tf-idf for that term in the fragment. Infigure 6 shows 2 vectors with the correct mappingbetween term and value, as well was the tf-idf forthe term in that fragment.

5Bernardo Soares Vicente Guedes Sem Atribuição100011v1v2Fig. 2: Heteronym vector.19251F119260.91927119280.919290.8Fig. 3: Date vector.1 . . . . . 1 . . . . . 1 . . . . . 0 . . . . . {z} {z }Heteronym Date{z{z} {z }Heteronym} {zCoelho Date{z} {z }Heteronym} {zCunha Date{z} {z }Heteronym} Date{zZenith}PizarroFig. 4: Edition vector.1110 {z0 . . .} Heteronym {z{z1}Date0100 . . . {z} {zHeteronym} Fonte Manuscrita0}Date{z00 {z0 . . .} Heteronym}Fonte Datilografada {z}Date{z}Fonte PublicadaFig. 5: Source vector.Taxonomies will be represented in the vector toexpress the different categories of a taxonomy foundin the fragments. The lack of a category will beexpressed with a 0 in the vector. When a categoryis present, we can either chose an binary approach,setting it to 1 or a take advantage of the establisheddomain where each category found in a fragmenthas a value, which indicates the relevance of thecategory in the fragment.IV. I NTEGRATIONThe first step to achieve textual search was theinclusion of Lucene to overall scheme of the project.Lucene intends to index the textual content ofthe fragment’s interpretations, when the fragmentis uploaded to the application. Therefore Lucenejoins the persistence layer, with MySQL to host thedomain of the Fenix Framework and the file systemto store the facsimiles, like it is displayed in figure 7.The more detailed view can be seen in figure 8.Here we introduce the the interaction between theFig. 7: Overall layer diagramdifferent layers of the application. Client-side presentation deals with the interaction with users at thebrowser level. This layer handles the construction ofa search. The dynamic search was achieve with amodel view presenter strategy, where users interactwith the interface, which the presenter handles andpropagates the user’s action into the model. The presenter also updates the interface to reflect the currentof domain, forming a cycle where the user action ishandle the presenter, which manipulates the domain.The domain warns the presenter, which updatesthe interface to express the domain state. Serverside presentation introduces the parameterization

7Fig. 6: Text vector.Fig. 9: Class diagram of search.Fig. 8: Detailed layer diagramof views from templates built in JSP. Controllershandles all the calls from the interface.The search engine works with several SearchOptions. The classes of this can be seen in figure 9,where the strategy to determine if a fragmentsrespects the search criterias is supported by a visitorwhich is also displayed. SearchOption offers aninterface to build concrete search options.The classes of the recommendation system canbe seen in figure 10. Here we introduce the Recommender hierarchy which intends to offer a common interface for all Recommnders. VSMRecommender is an abstract implementation of a Recommender through VSM and supported by Propertyto extract properties vectors. Both VSMFRagIn-terRecommender and VSMFragmentRecommenderare concrete recommenders intended to recommendinterpretations and fragments. Subclasses of Property were created to implement the transformationof different characteristics into property vectors.VSMFgragInterRecommdner is the one being usedto produce recommendations for navigation andestablish the similarity sort intended to sort a virtualedition as well as perform their sectioning.We can also a observe that SearchOptions andProperties interact with the domain the identifycharacteristics of domain entities, such as interpretaions and fragments, while it also interacts withIndexer to perform textual search. The interface withLucene is handled through the Indexer.The domain was also extended to support theweights users define to express their taste, used inthe recommendation system. This can be seen infigure 11 with the introduction of RecommendationWeight, to store heteronym, date, edition, sourcesand text weight, while TaxonomyWeight stores thewights of all the taxonomies in the users virtualeditions. In this figure, we also present how a virtualedition is now composed by sections which maycontain more sections or virtual interpretations.V. R ELATED W ORKThe field of recommendation systems is highlyresearched and its use is highly desirable in the web,where it aims to provide an information capableof catching the users attention [7], [8], [9]. Onlinestores use recommendation systems to extrapolate

7Fig. 10: Class diagram of recommendation.Fig. 11: Class diagram of domain with recommendation weight and sections.the degree of interest of users in products andrecommend items with high chance of being sold.Recommendation can follow a content based approach, where items are found based on its properties and users interest in these properties [10].A common approach is the one suggested by vector space model, where vectors are created to express items properties and users’ preferences, whilecomputing their similarity between cosine similarity [11].On the other hand, a collaborative filtering approach [12], intends to discover items based onprevious interactions of items and users. To findthe user’s interest in an item, this approach findthe users most similar users, i. e, users which haveexpressed the same degree of interest in a set ofitems [13]. From the the group of users, it predictsthe user’s degree of interest based on the interestexpress by this group of users.To produce a prediction, interactions betweenusers and items needs to expressed in ratings [13].These ratings might be explicit, where users expresstheir interest in items, by classifying them in ascale, e.g. from 1 to 5 or just by placing a likeon an item. This approach is highly depended ofthe users will to provide ratings, but produces moreaccurate predictions. It can also be implicit, whereitems are classified based on the nature of theuser’s interaction, e.g. visualisation of items. Whilethis one doesn’t depend on the users willingness

8to provide ratings, but their predictions are lessaccurate.VI. C ONCLUSION & F UTURE WORKThe goals of the LdoD team were met through thedevelopment of a search and recommendation system, as it was envisioned in the book function of thedigital archive. The development of the advancedsearch system realised the search functions whichwhere required to actively explore the archive. Onthe other had, the recommendation system findscontent according to the user’s taste and enricheshis interaction with the archive by showing contentusers will most likely find interesting.The implementation goal was not only to createa search and recommendation system with severalconfigurable options, but also provide an architecture which could be extended, allowing the implementation of new options. This implementation goalwas also met.The development of new search options and recommendation criterias to deal with more metadataintroduced into the digital archive.The application will grow in its virtual dimensionwith more virtual editions, therefore a recommendation system for virtual editions might fit into overalllifespan of the archive. If the archive’s growthachieves a high volume of users, a collaborativefiltering recommendation system will most likely bedeveloped to deal with the recommendation of thisgrowing domain.R EFERENCES[1] A. R. Silva and M. Portela, “Tei4ldod: Textual encoding andsocial editing in web 2.0 environments,” Journal of the TextEncoding Initiative [Online], vol. 8, 2015.[2] M. Portela, “Nenhum problema tem solução: Um arquivodigital do Livro do Desassossego,” Estranhar Pessoa com asMaterialidades da Literatura, 2013.[3] A. R. Silva and M. Portela, “Social edition 4 the book ofdisquiet: The disquiet of experts with common users,” Journalof the Text Encoding Initiative, vol. 8, 2014.[4] “TEI: Text Encoding Initiative,” http://www.tei-c.org/index.xml,2007.[5] M. Portela and A. R. Silva, “A model for a virtual ldod,”Literary and Linguistic Computing Advance Access, 2014.[6] G. Salton, A. Wong, and C. Yang, “A vector space modelfor automatic indexing,” Communications of the ACM, vol. 18,no. 11, pp. 613–620, 1975.[7] A. Rajaraman and J. D. Ullman, Mining of Massive Datasets.New York, NY, USA: Cambridge University Press, 2011.[8] G. Adomavicius and A. Tuzhilin, “Toward the next generationof recommender systems: A survey of the state-of-the-art andpossible extensions,” IEEE Trans. on Knowl. and Data Eng.,vol. 17, no. 6, pp. 734–749, Jun. 2005. [Online]. ] P. Lops, M. de Gemmis, and G. Semeraro, “Contentbased recommender systems: State of the art and trends.”in Recommender Systems Handbook, F. Ricci, L. Rokach,B. Shapira, and P. B. Kantor, Eds. Springer, 2011, pp. 73–105. [Online]. Available: html#LopsGS11[10] M. Balabanović and Y. Shoham, “Fab: Content-based,collaborative recommendation,” Commun. ACM, vol. 40,no. 3, pp. 66–72, Mar. 1997. [Online]. Available: http://doi.acm.org/10.1145/245108.245124[11] R. V. Meteren and M. V. Someren, “Using content-basedfiltering for recommendation.”[12] M. D. Ekstrand, J. T. Riedl, and J. A. Konstan, “Collaborativefiltering recommender systems,” Found. Trends Hum.-Comput.Interact., vol. 4, no. 2, pp. 81–173, Feb. 2011. [Online].Available: http://dx.doi.org/10.1561/1100000009[13] A. M. Rashid, I. Albert, D. Cosley, S. K. Lam, S. M. McNee,J. A. Konstan, and J. Riedl, “Getting to know you: Learning newuser preferences in recommender systems,” in Proceedings ofthe 7th International Conference on Intelligent User Interfaces,ser. IUI ’02. New York, NY, USA: ACM, 2002, pp. 127–134.[Online]. Available: http://doi.acm.org/10.1145/502716.502737

Collaborative Digital Archive, Book of Disquiet, Fernando Pessoa I. INTRODUCTION The Book of Disquiet is an unpublished book composed by a set of fragmentary text written, by Bernardo Soares, one of Fernando Pessoa's het-eronyms, from the early 1910 to the year of its death, in 1935. Until our days, four Book of Disquiet were pub-