A Bookmaker’s Workbench

Veronica Liesaputra and Ian H. Witten
Department of Computer Science
University of Waikato
Hamilton, New Zealand
{vl6, ihw}@cs.waikato.ac.nz

ABSTRACT
We have been developing electronic Realistic Books that combine the natural advantages of electronic documents (full-text search, hyperlinks, animation, multimedia) with those of conventional books (the ambient information provided by the physical object, analog page turning, random-access navigation, bookmarks, highlighting and annotation). Although simple Realistic Books can easily be created from PDF or HTML files using a shell script or web service, it is not so easy for book designers to take advantage of advanced features that are not normally represented in the input files. This paper describes the Bookmaker’s Workbench, an interactive system intended to help book designers produce Realistic Books. It incorporates many features, including a text mining option that automatically identifies significant key terms and marks them visually in the text, the ability to incorporate synonyms automatically into the full-text search capability, and the inclusion of an automatically generated back-of-the-book index. A user evaluation is reported that demonstrates the system’s usability and learnability.

Author Keywords
Electronic book, Book editor, Flash application.

ACM Classification Keywords
H.3.7 [Information Storage and Retrieval]: Digital libraries, User issues; H.5.2 [Information Interfaces and Presentation]: User interfaces, Evaluation/methodology, Graphical user interfaces, Interaction style

General Terms
Design, Experimentation, Human Factors, Performance

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CHINZ 2011, July 4–5, 2011, Hamilton, New Zealand.
Copyright 2011 ACM 978-1-4503-0676-8/11/07. $10.00.

INTRODUCTION
Publishing a printed book is a major undertaking. Each of many components (size, format, binding, page layout, placement of chapters and sections, tables of contents and figures, subject index, and so on) must be carefully designed to ensure an attractive product whose content can easily be assimilated and whose readers can easily find information and navigate around.

We have been developing electronic Realistic Books that combine the natural advantages of electronic documents (searching, hyperlinks, animation, multimedia) with those of conventional books (the ambient information provided by the physical object, analog page turning, random-access navigation, bookmarks, highlighting, and annotation) [6]. The features of conventional books have developed and improved over centuries; our aim is to replicate and enhance them. The Realistic Book system uses Adobe Flash as the presentation device. Studies of reader performance while searching and browsing show that participants perform tasks significantly more quickly using Realistic Books than with printed books, HTML and PDF formats, without any loss in accuracy [5, 6].

Simple Realistic Books can easily be created from PDF or HTML files using a shell script, or a web service that we provide. However, to take advantage of advanced features that are not normally present in such files, book producers must edit a configuration file manually or work directly in the Flash application, and both require specialist knowledge. To rectify this we have designed, implemented, and evaluated an interactive Bookmaker’s Workbench to facilitate the process of making Realistic Books.

The conventional publication process for printed books has three stages: editorial, design and production [9]. In the editorial stage, copy editors help finalize the book’s content. In the design stage, the structure and visual aspects of the book are determined. The book designer chooses details like page size and format, cover design and binding method, margin size, typeface, font size, colors, and style of illustrations and tables: anything that affects how the page looks. In the production stage, the table of contents, subject index and bibliography are inserted and the manuscript is formatted according to the designer’s specification. Proofreaders read each page carefully and examine its layout in a cycle of checking and correction; then the proofs are handed off to the printer. Self-publishing authors often perform one or more of these tasks themselves.

The process of producing high-quality electronic books involves much the same steps. Furthermore, for online reading to be generally acceptable, electronic book

applications must provide features that increase reader performance by transcending the affordances of paper books [1]. Of course, such applications will only find wide usage if it is easy to convert conventional computer-readable documents into the special form they require [2]. This is the role of the Bookmaker’s Workbench.

Users interact with the Workbench to transform electronic documents into Realistic Books. Implemented within the Adobe Flash environment, it communicates with a PHP application server to save the books and any changes made to them. It addresses the design and production stages, but not the editorial processes. This paper uses the term book designer to denote the various players involved during the last two phases of the publishing operation, such as typesetters, indexers and proofreaders.

A Realistic Book is constructed from one or more electronic files that the designer uploads. Its logical structure, layout, content, physical properties, and reader services are all defined in a template file. Sections 3 and 4 explain how a template file for the book is automatically translated into the book itself. Section 5 presents the evaluation study procedure and results.

Figure 1. Bookmaker’s Workbench interface

FEATURES OF REALISTIC BOOKS
The reader’s view of a Realistic Book shows a double-page spread flanked by a stack of page edges to the left and right that may include bookmark tabs [6]. Pages can be turned by “grabbing” them with the mouse, or simply clicking them; clicking a bookmark goes directly to the target page. A toolbar appears beneath the double-page spread that gives the reader access to forward and back arrows, selection by page number, animation controls for pages with overlays, a magnification device, annotation tools, and a search box. The book designer can suppress individual tools or the entire toolbar; readers can suppress the latter. To the left of the double-page spread appears a preview area that gives a thumbnail view of the book; again, this can be suppressed by the reader.

To visualize all this, see the Bookmaker’s Workbench in Figure 1, of which the lower 80% gives a preview of the book that is being constructed in a format that is identical to the reader’s view described above. Above this is the repository area, which holds the raw material from which the book will be created: all the video, audio, text and image files that the designer has uploaded. Beneath the repository area is the designer’s toolbar, from which the

book designer can customize visual aspects of the book such as its dimensions, fonts and page layout, and select suitable reading services that it will incorporate.

Realistic Books can have many features, and the designer must specify which ones are to be incorporated into the actual book that is being produced. Features include
• cover image and color
• page content: PDF files, HTML files, image files, video and sound files
• different versions or “overlays” of each page, selected under reader control (like animated PowerPoint)
• pagination of textual matter
• page numbering, including separately numbered front matter
• running headers
• automatically assigned section numbers and figure numbers
• hyperlinked Table of Contents and Table of Figures
• bookmarks for sections and figures
• user-defined bookmarks
• tools for annotation, including textual and freeform notes, and highlighting
• an incremental full-text search feature that records the location of search hits by creating temporary bookmarks
• content enrichment via automatically-generated hyperlinks
• hyperlinked Subject Index
• a synonym resource that is used during full-text search

In addition, there is a substantial amount of metadata that controls the appearance of the book, as shown in Table 1. Most of these features are associated with the book itself, but some are stored on a reader-by-reader basis (e.g., user-defined bookmarks and annotations).

Parameter        Description
Pages:
  PageWidth      width of a page
  PageHeight     height of a page
  PageType       type of paper used for the book’s pages
  PageColour     base color of the book’s pages
  LeftMargin     left margin of the page
  RightMargin    right margin of the page
  TopMargin      top margin of the page
  BottomMargin   bottom margin of the page
  HeaderHeight   font size of the running header
  PageNumHeight  font size of the page number
Covers:
  CoverType      type of paper used for the book’s covers
  CoverColour    base color of the book’s cover
  BookMargin     distance of the page from the cover
Page edges:
  xOffset        horizontal offset between each paper
  yOffset        vertical offset between each paper
Bookmarks:
  BaseColour     default color of a bookmark
  TabStyle       styles of the bookmark
  SnapTo         bookmark targets
Page flipping:
  FrontPause     pause time before turning the front cover
  BackPause      pause time before turning the back cover
  TurnPause      pause time before turning the book’s pages
  TurnSpeed      speed of turning a page
General:
  ZoomStyle      magnification level of a book
  ShadowLevel    darkness level of the shadow

Table 1. Modifiable parameters for visualizing a book

The last three features in the above list are particularly noteworthy. People often read documents extensively, first determining whether a book is worth reading and then focusing on portions of the text in order to locate the desired information [3]. Readers readily notice headings, illustrations, charts and tables, all of which stand out visually. They also scan the text for key words and phrases. To facilitate this, the Bookmaker’s Workbench is able to automatically identify significant key terms in the text and mark them visually. Wikipedia is used as a comprehensive knowledge base for this operation, and the visual mark takes the form of a hyperlink to the relevant Wikipedia article. Furthermore, when readers move the mouse over a term, a short description of it pops up.

For example, Figure 2 shows part of a book page containing a news story about the day President Obama took his oath of office, which has been automatically augmented with links to relevant Wikipedia articles: Barack Obama, civil rights, Martin Luther King and African Americans, amongst others. In this book hyperlinks are underlined and colored blue. The mouse is hovering over the phrase 44th President of the United States of America, which brings up a popup showing an excerpt from the relevant Wikipedia article.

All this information is generated using the text mining technology described in [8]. This automatic identification of key terms also allows a hyperlinked Subject Index to be created automatically by collating them together with hyperlinked references to the appropriate page (and page number).

Previous research has shown that one reason why people fail to locate the information they seek is that they describe it in terms that differ from the terminology used in the document [4]. One solution is to increase the number of ways in which each piece of information can be denoted. The Workbench generates a synonym table automatically as

a byproduct of the above-mentioned text mining operation. This table is consulted automatically during full-text search to increase the chance of getting a hit, even when the reader uses different terminology to that adopted in the book.

Figure 2. A news story excerpt with links to Wikipedia articles

In Figure 2, President Obama, President Barack Hussein Obama and 44th President of the United States of America are all linked to the article Barack Obama. Thus these four terms are considered to be synonyms. Although the phrase Barack Obama does not appear in the text, it is inserted into the synonym table because Wikipedia uses it as the term for the topic (along with any synonyms, or “redirects”, that are defined in Wikipedia). When a reader enters any of these terms into the find box, all locations of the term and all of its synonyms are returned, creating semantic links that connect related passages that contain information expressed in different terms. The search algorithm operates incrementally, so that hits appear as soon as the first character is typed and are refined with each successive letter.

CREATING REALISTIC BOOKS
Realistic Books are generated under the control of a template file written in XHTML. This file defines the sequence of pages that constitutes the book and the services to be provided to readers, along with any associated metadata such as page size, margins, title and where the main text starts (Table 1). It may also contain text, with HTML markup, that will go into the book.

There are three phases: acquisition, automated document processing, and customization. The buttons in the designer toolbar are shown in Table 2. The book itself does not exist (and therefore does not appear on the display of Figure 1) until it has been generated during the Automated document processing operation described below. A common way of proceeding is to first make an initial version of the book and then customize it by adding new content to existing pages, and adding new pages.

Button name (icons not reproduced)
  upload
  add bookmark
  add section
  bookmark list
  section list
  hyperlink
  options
  document
  tools
  layout
  repaginate
  page break
  text format
  save

Table 2. Buttons in the designer toolbar

Acquiring source material
Designers use the upload button (Table 2) to add to the repository area every file that contains source material for the book. When a PDF file is uploaded, each page is saved as a page image in Adobe Flash format (.swf). Similarly, audio or video files that are recognized by Adobe Flash are wrapped into a media player and stored as swf files. All these page images are displayed in the repository area shown at the top of Figure 1, with names such as page 28.swf and page 29.swf that are assigned by the system.

The template file
As noted earlier, the complete structure of a book is specified in a template file. It is marked with a red star in the repository area. Any text document couched in a standard markup language such as XML or HTML (or even a flat ASCII file) can serve as the book’s template. Designers run a conversion script on HTML or ASCII files to convert them to XHTML format. If no template is specified, one is automatically generated. By default, it will create a hardcover book and use all the files in the repository area as the contents. This is a common way of generating an initial version of a book, which will then be customized.

If a book designer chooses to work with an existing book, all files specified in its template file are automatically uploaded to the repository area.

Automated document processing
The system paginates any text in the template file, or in any other text documents that have been uploaded, using the algorithm described in Section 4. Link information is extracted and saved for later use.

Then the structure of the book is determined. Sections marked by header tags are assigned a number and level (e.g., 1., 1.2., 7.1.9); level-1 sections receive special treatment (they are called “chapters” in the present article). Similarly, images are numbered according to their enclosing section identifier and order of appearance (e.g., Figure 1.1, Figure 4.1.3). A Table of Contents and List of Figures are generated from the titles of the different sections and figures in the book respectively. Bookmark tabs are added to the beginning of sections and figures. Running headers and page numbers are also created automatically.

Predefined design styles are applied to particular elements of the book. For example, section titles are given a boldface style with a large font size that depends on the level of the section. Unwanted tools in the reader tools area are disabled. Each occurrence of every word in the document is processed and stored in a full-text index.

As mentioned in the previous section, text mining techniques are applied to determine keywords and key phrases in the running text. This results in the addition of Wikipedia hyperlinks and explanatory pop-ups within the text, an automatically-generated Subject Index, and a synonym resource for full-text search.

Customization
Once the document has been processed as described above, a Realistic Book is generated. The appearance of each component in the book space, reader tools areas and preview area are updated accordingly.

Designers can further customize the resulting book. The designer toolbar contains a set of functions for modifying the book’s appearance and the reading services associated with it, for instance redefining its logical structure, associating different styles with section headers, and changing the page margins. After these changes, the template file is updated accordingly and the document is reprocessed according to the new specification. The cycle is repeated until the designer is satisfied with the appearance of each component of the Realistic Book.

PAGINATION
Pagination is a vital step in the conversion of electronic text formats such as HTML or plain text into a book, and is performed automatically by the Bookmaker’s Workbench. It is defined as the process of laying out parts of a document into pages [10]. Of course, text cannot be divided into pages willy-nilly: for satisfactory results there are many typographical constraints that must be respected [7].

Finding an optimal pagination for a document that satisfies a formal specification of constraints for the criteria described in [7] is computationally expensive, although heuristic methods help greatly. However, implementing pagination algorithms is not the focus of this research, so the Bookmaker’s Workbench applies just a few basic rules during the pagination process. If necessary, book designers can make further modifications manually.

First, a book’s template file is parsed and the text is formatted according to the style described in the template; for example, section titles are given an appropriate typeface and bookmark tabs are attached to the relevant pages. Link source and destination are stored in the hyperlinks table, both for hyperlinks that appeared in the original document and ones added by the text mining processes. Line and page breaks are created according to the guidelines explained below. At the end of the pagination stage, the Table of Contents, List of Figures, and back-of-the-book Subject Index are automatically generated, each entry being expressed as an internal hyperlink, and inserted into the appropriate position in the document.

Margins frame the content of a page. The objective of margin design is to enhance the book’s utility and ensure that every pair of facing pages produces a pleasing aesthetic effect when the bound book lies open. Margins normally occupy up to 40% of the page, and the organization of the text is often adjusted to prevent unacceptable situations arising during pagination. The ContentWidth and ContentHeight of a page are equal to the size of the page minus the size of the page margin and the size of the page number and running header. The body text will never expand outside this region.

A page break is introduced before the beginning of each chapter, and every chapter begins on a right-hand (odd-numbered) page. If the end of a chapter falls on an odd-numbered page, a blank page is inserted to ensure that the next chapter begins on an odd-numbered page.

When a paragraph does not fit on to a page, orphan and widow rules are applied. Section titles are always located on the same page as the section’s first paragraph. A paragraph is moved to a new page when fewer than OrphanCount lines remain on the current one or when fewer than WidowCount lines would flow on to the next page.

All pictures are automatically scaled so that their width and height do not exceed ContentWidth and ContentHeight respectively. A zoom facility allows readers to view images at their original size, to solve the problem of poor legibility at reduced sizes. If insufficient space remains on the current page for a figure, it is placed on the next one. Figure captions appear on the same page as the illustration.

In Adobe Flash, the same piece of text may be rendered on different occasions with slightly varying space and letter size (for unknown reasons). This means that the location of line breaks can change when the document is reloaded, altering the paragraph’s height and requiring the entire book to be repaginated to ensure that the text does not overflow its allocated space. This is highly undesirable, because the pagination process is expensive; also, it gives books a very unstable appearance. Consequently the Bookmaker’s

Workbench inserts explicit line breaks into the text to guarantee the same paragraph size whenever the document is reloaded.

Extra space within words can degrade the quality of text presentation and distract the reader’s attention. In the Workbench, words are not hyphenated across line or page breaks. Successive words in a paragraph are inserted into a line until the text width exceeds the content’s width. When a word does not fit, the line is broken and that word begins the next line. Because an explicit line break is inserted after each line, the text is unjustified. This has the side effect of avoiding rivers of space, because the space between words cannot exceed the space between lines.

During the line breaking process, full-text indexing is performed. All words in the book, and all their synonyms, are indexed and stored in the index table. Index terms are case-folded; however, stemming is not performed. We have found that the “redirects” that have been recorded manually in Wikipedia provide a far more reliable basis for full-text searching than any automatic stemming algorithm, and as noted in Section 2 these are incorporated into the synonym table and used during full-text search.

During pagination, any hyperlinks that occur in the manuscript are interpreted and recorded. Designers can create a link to a particular page in the book or to an external source. Figures, chapters, paragraphs, and sections can all be referred to by unique link identifiers rather than by page number, to cater for the dynamically paginated nature of Realistic Books.

Whenever the book is repaginated, the page numbers of link destinations may change but the link identifiers remain the same. During the pagination process, every link identifier and its associated page number are stored in a table. Whenever a reader clicks an internal link, the hyperlink table is consulted to determine the appropriate page number and the book is opened at that page. Whenever the book is repaginated, the table is updated. Designers do not need to manually update link destinations unless they refer explicitly to page numbers.

EVALUATION
The final outcome of the book creation process is a file called book.htm that contains all the relevant metadata and defines the book’s content in terms of the file names of each page image. An alternative, more primitive, way of creating books is to use an HTML text editor to construct this file directly, and shell scripts exist to convert other documents such as PDF files to page images for the book.

In order to gain some experience with how users perceive the Bookmaker’s Workbench, a small evaluation study was performed. The study compared user experiences with the Bookmaker’s Workbench with the more primitive method. Adobe Dreamweaver was used as the HTML editor. It supports HTML tag auto-completion and provides a visual editing environment. We assessed the usability of each method based on learnability, efficiency, effectiveness and user experience.

Setting an interactive GUI against an HTML editor is something of a straw-man comparison, although the former might be frustratingly slow because it frequently repaginates in order to show a faithful preview of the book. Also, the learnability of each system is only informally assessed, because the experimenter gave a verbal explanation and demonstration of the two systems. Consequently we are more interested in qualitative comparisons and user comments on the methods than in quantitative results.

Participants
We recruited 16 high school and university students aged 15–40 from a variety of disciplines to create two books using each method. Participants had used computers extensively for several years and had, as readers, encountered Realistic Books at most twice before. None had previously used the Workbench, but all knew how to create a web page using Dreamweaver.

Procedure
Participants came individually to the study and were asked to complete a profiling questionnaire that recorded age range, gender, language proficiency, and their experience with computers.

Each participant was first trained to use both methods to create a book. After gaining confidence, they performed two tasks, which they undertook in the same order. Each task asked participants to fill in missing information in a 50-page book (e.g., insert a bookmark to the page that mentions a specified term, add an image that best describes this text) and to modify its appearance (e.g., change the book’s size to make it more readable). This involved a total of 15 subtasks.

Participants used the Bookmaker’s Workbench for one task and the HTML editor for the other, but in different orders: Task 1 using the Workbench and Task 2 with the text editor, and vice versa. The 16 possible orderings were allocated evenly between participants, so that eight created each book using the same tool.

The functions participants used during each task and the time they took to learn each tool and to finish each task were recorded. Participants were encouraged to “think out loud” and make comments. Having finished both tasks, participants completed a questionnaire to capture their opinions. They had to choose which method was useful for them, easier to use, more pleasant and engaging to use, and finally which they preferred overall. They were asked to explain their choices and state which features they liked and disliked, as well as suggest improvements.

Results
We evaluated performance based on the outcome and process measures defined above.

Learnability
Quantitatively, the learnability of a system was measured by the time it took subjects to build a book, and the number of questions asked during the learning process. It is assumed that a system is easier to learn and understand if subjects spend less time and ask fewer questions during the training process. Subjects were also asked at the end of the study whether they found the system easy to learn.

The time spent creating a book using the HTML editor ranged from 18 to 45 minutes, with an average of 37 minutes. The average number of questions asked was 15. With the Workbench, the time taken ranged from 10 to 40 minutes, with an average of 28 minutes. On average, 9 questions were asked during the training process. A t-test analysis between the two systems on both criteria showed that the difference between the systems was statistically significant at the five percent level.

Regardless of their computer experience, participants found the Workbench to be natural and needed less explanation about its usage than the HTML editor. They were able to quickly discover and learn the tools required to complete a task by themselves without consulting the user manual or the researcher. When asked, all participants commented that the icons chosen to represent the tools were clear and they could guess the function of each button in the toolbar. With the HTML editor, participants relied heavily on the user manual to know what tags were appropriate for completing a task.

Efficiency
After just one training session, lasting less than 50 minutes on each system, most subjects were able to create a book with minimal requirements (i.e., front and back covers, table of contents and [...] opening tags in the correct place in the document. In fact, every time users made any kind of change to the book template file, however small, they would straight away try to visualize the resulting Realistic Book to check that the book template file was still well-formed.

Subjects did not want to run the conversion script to automatically ensure that the template file is valid, because the script took quite some time to finish. They only used it when they had made extensive changes.

The Workbench gives an instant preview of the book every time designers make any changes, and ensures that the template file is well-formed. Participants only needed to concentrate on completing the task. Some spent longer creating a book using the Workbench than the HTML editor, but only because they took extra steps, not required by any sub-task, to heighten the aesthetics of the finished book with a variety of information (i.e., graphics, audio and video).

Effectiveness
With the Workbench, participants could clearly see the correct section level and position in the page. However, a few forgot to press the add section button in the designer tools area when they were adding a new section. They typed the title straight into the page’s main text area, so the section was not added to the automatically generated table of contents and the page was not bookmarked as a section.

Because the section number was not visible when the book template was viewed with the HTML editor, when participants were asked to insert a new section at the specified location, most either specified an incorrect section level or inserted the section tag at the wrong place.

All participants made fewer errors when creating books with the Workbench than with the HTML editor. The average number of sub-tasks that were completed successfully was 67% using the HTML editor and 87% using the Workbench. A t-test analysis between the two systems showed that the difference was significant at the five percent level.
