A Pipeline For Creative Visual Storytelling


A Pipeline for Creative Visual StorytellingStephanie M. Lukin, Reginald Hobbs, Clare R. VossU.S. Army Research LaboratoryAdelphi, MD, USAstephanie.m.lukin.civ@mail.milAbstractof high-quality datasets (Everingham et al., 2010;Plummer et al., 2015; Lin et al., 2014; Ordonezet al., 2011), however, there is little coverage oflow-resourced domains with low-quality imagesor atypical camera perspectives that might appearin a sequence of pictures taken from blind persons,a child learning to use a camera, or a robot surveying a site. For this work, we studied an environment with odd surroundings taken from a cameramounted on a ground robot.Narrative goals guide the selection of what objects or inferences in the image are relevant or uncharacteristic. The result is a narrative tailoredto different goals such as a general “describe thescene”, or a more focused “look for suspicious activity”. The most salient narrative may shift asnew information, in the form of images, is presented, offering different possible interpretationsof the scene. This work posed a forensic task withthe narrative goal to describe what may have occurred within a scene, assuming some temporalconsistency across images. This open-endednessevoked creativity in the resulting narratives.The telling of the narrative will also differ basedupon the target audience. A concise narrative ismore appropriate if the audience is expecting tohear news or information, while a verbose and humorous narrative is suited for entertainment. Audiences may differ in how they would best experience the narrative: immersed in the first person orthrough an omniscient narrator. The audience inthis work was unspecified, thus the audience wasthe same as the storyteller defining the narrative.To build a computational creative visual storyteller that customizes a narrative along these threeaspects, we propose a creative visual storytellingpipeline requiring separate task-modules for Object Identification, Single-Image Inferencing, andMulti-Image Narration. We have conducted an exploratory pilot experiment following this pipelineComputational visual storytelling produces atextual description of events and interpretations depicted in a sequence of images. Thesetexts are made possible by advances and crossdisciplinary approaches in natural languageprocessing, generation, and computer vision.We define a computational creative visual storytelling as one with the ability to alter thetelling of a story along three aspects: to speakabout different environments, to produce variations based on narrative goals, and to adaptthe narrative to the audience. These aspects ofcreative storytelling and their effect on the narrative have yet to be explored in visual storytelling. This paper presents a pipeline of taskmodules, Object Identification, Single-ImageInferencing, and Multi-Image Narration, thatserve as a preliminary design for building acreative visual storyteller. We have piloted thisdesign for a sequence of images in an annotation task. We present and analyze the collectedcorpus and describe plans towards automation.1IntroductionTelling stories from multiple images is a creativechallenge that involves visually analyzing the images, drawing connections between them, and producing language to convey the message of thenarrative. To computationally model this creative phenomena, a visual storyteller must takeinto consideration several aspects that will influence the narrative: the environment and presentation of imagery (Madden, 2006), the narrativegoals which affect the desired response of thereader or listener (Bohanek et al., 2006; Thorneand McLean, 2003), and the audience, who mayprefer to read or hear different narrative styles(Thorne, 1987).The environment is the content of the imagery,but also its interpretability (e.g., image quality).Canonical images are available from a number20Proceedings of the First Workshop on Storytelling, pages 20–32New Orleans, Louisiana, June 5, 2018. c 2018 Association for Computational Linguistics

Figure 1: Creative Visual Storytelling Pipeline: T1 (Object Identification), T2 (Single Image Inferencing),T3 (Multi-Image Narration)specify the audience, leaving the annotator free towrite in a style that appeals to them. The data andanalysis of the pilot are presented in Section 4,as well as observations for extending to crowdsourcing a larger corpus and how to use these creative insights to build computational models thatfollow this pipeline. In Section 5 we compareour approach to recent works in other storytellingmethodologies, then conclude and describe futuredirections of this work in Section 6.to collect data from each task-module to train thecomputational storyteller. The collected data provides instances of creative storytelling from whichwe have analyzed what people see and pay attention to, what they interpret, and how they weavetogether a story across a series of images.Creative visual storytelling requires an understanding of the creative processes. We arguethat existing systems cannot achieve these creative aspects of visual storytelling. Current objectidentification algorithms may perform poorly onlow-resourced environments with minimal training data. Computer vision algorithms may overidentify objects, that is, describe more objects thanare ultimately needed for the goal of a coherentnarrative. Algorithms that generate captions of animage often produce generic language, rather thanlanguage tailored to a specific audience. Our pilotexperiment is an attempt to reveal the creative processes involved when humans perform this task,and then to computationally model the phenomenafrom the observed data.Our pipeline is introduced in Section 2, wherewe also discuss computational considerations andthe application of this pipeline to our pilot experiment. In Section 3 we describe the exploratory pilot experiment, in which we presented images of alow-quality and atypical environment and have annotators answer “what may have happened here?”This open-ended narrative goal has the potential toelicit diverse and creative narratives. We did not2Creative Visual Storytelling PipelineThe pipeline and interaction of task-modules wehave designed to perform creative visual storytelling over multiple images are depicted in Figure 1. Each task-module answers a question critical to creative visual storytelling: “what is here?”(T1: Object Identification), “what happens here?”(T2: Single-Image Inferencing), and “what hashappened so far?” (T3: Multi-Image Narration).We discuss the purpose, expected inputs and outputs of each module, and explore computationalimplementations of the pipeline.2.1PipelineThis section describes the task-modules we designed that provide answers to our questions forcreative visual storytelling.Task-Module 1: Object Identification (T1).Objects in an image are the building blocks for storytelling that answer the question, literally, “what21

2.2is here?” This question is asked of every image in a sequence for the purposes of object curation. From a single image, the expected outputsare objects and their descriptors. We anticipatethat two categories of object descriptors will be informative for interfacing with the subsequent taskmodules: spatial descriptors, consisting of objectco-locations and orientation, and observational attribute descriptors, including color, shape, or texture of the object. Confidence level will provideinformation about the expectedness of the objectand its descriptors, or if the object is difficult oruncertain to decipher given the environment.From Pipeline Design to PilotOur first step towards building this automatedpipeline is to pilot it. We will use the dataset collected and the results from the exploratory studyto to build an informed computational, creative visual storyteller. When piloting, we refer to thispipeline a sequence of annotation tasks.T1 is based on computer vision technology. Ofparticular interest are our collected annotations onthe low-quality and atypical environments that traditionally do not have readily available object annotations. Commonsense reasoning and knowledge bases drive the technology behind deriving T2 inferences. T3 narratives consist of twosub-task-modules: narrative planning and natural language generation. Each technology can bematched to our pipeline, and be built up separately,leveraging existing works, but tuned to this task.Our annotators are required to write in naturallanguage (though we do not specify full sentences)the answers to the questions posed in each taskmodule. While this natural language intermediate representation of T1 and T2 is appropriate fora pilot study, a semantic representation of thesetask-modules might be more feasible for computation until the final rendering of the narrative text.For example, drawing inferences in T2 with theobjects identified in T1 might be better achievedwith an ontological representation of objects andattributes, such as WordNet (Fellbaum, 1998), andinferences mined from a knowledge base.In our annotation, the sub-task-modules of narrative planning and natural language generationare implicitly intertwined. The annotator does notnote in the exercise intermediary narrative planning before writing the final text. In computation, T3 may generate the final narrative text wordby-word (combining narrative planning and natural language generation). Another approach mightfirst perform narrative planning, followed by generation from a semantic or syntactic representation that is compatible with intermediate representations from T1 and T2.Task-Module 2: Single-Image Inferencing(T2). Dependent upon T1, the Single-Image Inferencing task-module is a literal interpretation derived from the objects previously identified in thecontext of the current image. After the curation ofobjects in T1, a second round of content selectioncommences in the form of inference determination and selection. Using the selected objects, descriptors, and expectations about the objects, thistask-module answers the question “what happenshere?” For example, the function of “kitchen”might be extrapolated from the co-location of a cereal box, pan, and crockpot.Separating T2 from T1 creates a modular system where each task-module can make the best decision given the information available. However,these task-modules are also interdependent: as theinferences in T2 depend upon T1 for object selection, so too does the object selection depend uponthe inferences drawn so far.Task-Module 3: Multi-Image Narration(T3). A narrative can indeed be constructed froma single image, however, we designed our pipelineto consider when additional context, in the formof additional images, is provided. The MultiImage Narration task-module draws from T1 andT2 to construct the larger narrative. All images,objects, and inferences are taken into consideration when determining “what has happened sofar?” and “what has happened from one image tothe next?” This task-module performs narrativeplanning by referencing the inferences and objectsfrom the previous images. It then produces a natural language output in the form of a narrative text.Plausible narrative interpretations are formed fromglobal knowledge about how the addition of newimages confirm or disprove prior hypotheses andexpectations.3Pilot ExperimentA paper-based pilot experiment implementing thispipeline was conducted. Ten annotators (A1 A10 )1 participated in the annotation of the three1A5 , an author of this paper, designed the experiment andexamples. All annotators had varying degrees of familiaritywith the environment in the images.22

Figure 2: image1 , image2 , and image3 in Pilot Experiment Sceneimages in Figure 2 (image1 - image3 ). Theseimages were taken from a camera mounted on aground robot while it navigated an unfamiliar environment. The environment was static, thus, presenting these images in temporal order was not ascritical as it would have been if the images werestill-frames taken from a video or if the imagescontained a progression of actions or events.Annotators first addressed the questions posedin the Object Identification (T1) and Single-ImageInference (T2) task-modules for image1 . They repeated the process for image2 and image3 , and authored a Multi-Image Narrative (T3). The annotator work flow mimicked the pipeline presentedin Figure 1. For each subsequent image, the timeallotted increased from five, to eight, to elevenminutes to allow more time for the narrative tobe constructed after annotators processed the additional images. An example image sequence withanswers was provided prior to the experiment. A5gave a brief, oral, open-ended explanation of theexperiment as not to bias annotators to what theyshould focus on in the scene or what kind of language they should use. The goal of this data collection is to gather data that models the creativestorytelling processes, not to track these processesin real-time. A future web-based interface willallow us to track the timing of annotation, whatinformation is added when, and how each taskmodule influences the other task-modules for eachimage.Object Identification did not require annotatorsto define a bounding box for labeled objects, norwere annotators required to provide objective descriptors2 . Annotators authored natural languagelabels, phrases, or sentences to describe objects,attributes, and spatial relations while indicatingconfidence levels if appropriate.During Single Image Inferencing, annotatorswere shown their response from T1 as they authored a natural language description of activityor functions of the image, as well as a natural language explanation of inferences for that determination, citing supporting evidence from T1 output. For a single image, annotators may answerthe questions posed by T1 and T2 in any order tobuild the most informed narrative.Annotators authored a Multi-Image Narrative toexplain what has happened in the sequence of images presented so far. For each image seen

Creative visual storytelling requires an under-standing of the creative processes. We argue that existing systems cannot achieve these cre-ative aspects of visual storytelling. Current object identication algorithms may perform poorly on low-resourced environments with minimal train-ing data. Computer vision algorithms may over-Cited by: 10Publish Year: 2018Author: Stephanie M. Lukin, Reginald L. Hobbs, Clare R. Voss