Design Patterns for Voice Interaction in Games

Transcription

Fraser Allison¹, Marcus Carter², Martin Gibbs¹, Wally Smith¹
¹Interaction Design Lab, School of Computing and Information Systems, The University of Melbourne, Parkville, VIC, Australia
²Department of Media and Communications, The University of Sydney, Camperdown, NSW, Australia
²marcus.carter@sydney.edu.au

ABSTRACT
Voice interaction is increasingly common in digital games, but it remains a notoriously difficult modality to design a satisfying experience for. This is partly due to limitations of speech recognition technology, and partly due to the inherent awkwardness we feel when performing some voice actions. We present a pattern language for voice interaction elements in games, to help game makers explore and describe common approaches to this design challenge. We define 25 design patterns, based on a survey of 449 videogames and 22 audiogames that use the player's voice as an input to affect the game state. The patterns express how games frame and structure voice input, and how voice input is used for selection, navigation, control and performance actions. Finally, we argue that academic research has been overly concentrated on a single one of these design patterns, due to an instrumental research focus and a lack of interest in the fictive dimension of videogames.

CCS CONCEPTS
• Applied computing → Computer games • Computing methodologies → Speech recognition

Author Keywords
Design patterns; pattern language; interaction design; game design; voice interaction; voice control; speech recognition.

INTRODUCTION
Many visions of future technology, and future videogames, have us talking to virtual characters. In the film Her (2013), the main character falls in love with a Siri-like talking virtual assistant and spends his downtime playing a videogame with an alien character who communicates through foul-mouthed natural language conversation. In 2001: A Space Odyssey (1968), an astronaut plays voice command chess against his spaceship's talking computer. These scenarios are characterised by the fluency and ease of communication between humans and computer agents, which supports engaging and enjoyable gameplay. But this remains a hallmark of fiction, as the current reality of voice interaction gameplay is not nearly as gratifying. Feedback on voice interaction games often criticises the experience as feeling "unnatural" or "forced" [13:268], highlighting that game designers have found it challenging to design a satisfying player experience of voice interaction.

To help designers work in this space, and to crystallise some of the common approaches that have been taken in the past, we propose a pattern language for voice interaction game design. A pattern language is a collection of design patterns, each one of which describes the core of a common solution to a recurring design problem [1]. It is a practical vocabulary for classifying a design space, which formalises a body of knowledge that is otherwise implicit in the commonalities between finished works. Originating in the field of architecture, design patterns gained widespread popularity in computer science as a way of sharing reliable, repeatable approaches to common design goals [31]. Björk et al. [9] introduced pattern languages to game studies as a tool to help understand the form of games and to identify design choices that can affect the player experience in predictable ways.

In this paper, we present a pattern language for the use of voice interaction in games, based on a comprehensive survey of academic and commercial games that have used any form of voice interaction from 1973 to 2018. It catalogues the major game mechanics, dialogue structures and diegetic framing devices that have been employed, to give designers and scholars a library of options for addressing the challenges of designing in this notoriously difficult modality. We identify clusters of patterns that are often used together, and patterns that have so far resisted being used in combination with any other. We find that human-computer interaction (HCI) research to date has focused on only a narrow selection of design patterns, concentrated around pronunciation exercises, and paid little attention to the fictive and experiential elements that provide players with a sense of presence in a gameworld.

RELATED WORK

Voice Interaction in Digital Games
Research on voice control of digital games has been undertaken since at least the 1970s, more than a decade before voice control first appeared in commercial videogames [2]. Academic interest in the topic has predominantly come in the form of instrumental research [14], using game design to create an engaging format for language learning practice [39,49] and speech rehabilitation exercises [54,69]. Alongside this, a substantial body of research has explored the implementation of voice input as an alternative game control scheme, to enable access for players with motor impairments or other disabilities that prevent them from using physical controls [33,68]. Several research prototypes have explored options for controlling games through non-speech qualities of voice, such as the volume [76] and pitch [36,59,68] of vocal input; these approaches lend themselves to relatively simple arcade-style game mechanics such as one-dimensional movement.
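To make this concrete, here is a minimal sketch of how such non-speech qualities can drive a one-dimensional mechanic. This is our own illustration rather than code from the cited prototypes; the sample rate, frame format and thresholds are assumptions.

```python
# Mapping non-speech voice qualities (volume and pitch) to a
# one-dimensional position, e.g. the height of a paddle or character.
import numpy as np

SAMPLE_RATE = 16000  # assumed mono microphone input


def volume_position(frame: np.ndarray) -> float:
    """Louder voice -> higher position. Returns a value in [0, 1]."""
    rms = float(np.sqrt(np.mean(frame ** 2)))
    return min(rms / 0.3, 1.0)  # 0.3 RMS treated as maximum effort


def pitch_position(frame: np.ndarray) -> float:
    """Higher pitch -> higher position, via a crude autocorrelation
    estimate over an assumed 80-400 Hz vocal range."""
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = SAMPLE_RATE // 400, SAMPLE_RATE // 80  # lag range for 80-400 Hz
    lag = lo + int(np.argmax(corr[lo:hi]))
    pitch_hz = SAMPLE_RATE / lag
    return (min(max(pitch_hz, 80.0), 400.0) - 80.0) / 320.0
```

In practice the frames would arrive from whatever microphone capture library the game engine provides; only the mapping functions are sketched here.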
Three-dimensional gameworlds introduce a particular challenge for voice interaction. Wadley and Ducheneaut [79] found that players struggle to communicate spatial deixis (referential concepts such as "there" and "on your left") when speaking to other human players in a virtual world. This difficulty is exacerbated with AI characters, which require a situation model of the game space before they can interpret commonplace spatially deictic phrases such as "come over here and pick this up" [32:140]. Several of the patterns that we identify (notably Waypoints, Absolute Directions, Relative Directions, Select by Name and Select by Type) are aimed at resolving such referential ambiguity without requiring complex situation models.
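To illustrate how such patterns sidestep the situation-model problem, the sketch below (our own, not from the cited work) resolves Absolute Directions and Relative Directions commands using nothing but the unit's position-independent facing; the command vocabulary is invented.

```python
# Resolving direction commands without a situation model: absolute
# directions are fixed compass vectors; relative directions need only
# the unit's current facing, not a model of the scene.
import numpy as np

ABSOLUTE = {
    "north": np.array([0.0, 1.0]),
    "south": np.array([0.0, -1.0]),
    "east": np.array([1.0, 0.0]),
    "west": np.array([-1.0, 0.0]),
}


def rotate(v: np.ndarray, degrees: float) -> np.ndarray:
    """Rotate a 2-D vector counter-clockwise by the given angle."""
    r = np.radians(degrees)
    m = np.array([[np.cos(r), -np.sin(r)], [np.sin(r), np.cos(r)]])
    return m @ v


def move_vector(command: str, facing: np.ndarray) -> np.ndarray:
    """Turn a spoken direction word into a movement vector."""
    if command in ABSOLUTE:          # Absolute Directions pattern
        return ABSOLUTE[command]
    if command == "forward":         # Relative Directions pattern
        return facing
    if command == "left":
        return rotate(facing, 90.0)
    if command == "right":
        return rotate(facing, -90.0)
    raise ValueError(f"unknown direction: {command}")


print(move_vector("west", facing=np.array([0.0, 1.0])))  # [-1.  0.]
print(move_vector("left", facing=np.array([0.0, 1.0])))  # [-1.  0.]
```

Neither branch needs to know where "there" is, which is exactly the ambiguity these patterns avoid.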

Studies on the player experience of voice interaction games have tended to emphasise its social character. Fletcher and Light describe the karaoke game as a "glue technology that assists in crafting and strengthening social linkages amongst players" [29:1]. Conversely, Dow et al. highlight the "uncomfortable" and "awkward" [26:9] feeling that some players experienced when talking to a computer. A study by Carter et al. found that players' feelings of embarrassment and awkwardness were most pronounced in games that required them to say things that were inconsistent with their in-game persona as the player character – an effect they termed "identity dissonance" [13]. Altogether, these studies suggest that social framing is an important element in voice interactions in games. Consequently, we have captured the social framing of voice actions in our analysis of game design patterns.

In a previous study [2], we traced the history of voice interaction games from the 1970s to 2016, and noted characteristic differences between Japanese and Western voice interaction games. We identified the beginnings of a boom in small-scale games that focus on voice interaction as their central modality, and predicted that this boom would continue thanks to the increasing availability of microphone-equipped game platforms and effective speech recognition systems. The current study shows that this has eventuated, with 106 new voice interaction videogames released in 2017 alone, more than twice as many as in any previous year. That excludes audio-only games, such as those on Amazon Alexa, of which there have been hundreds released since 2015. However, most of these recent games employ only a small set of simple and repetitive game mechanics. This context makes it even more timely to consider a pattern language that can illuminate the wider design space for voice interaction gameplay.

Design Patterns
Design patterns are a way of identifying and describing common solutions to design problems. The term was first used by Christopher Alexander in relation to architecture and urban planning [1], and has since been applied to interaction design fields such as human-robot interaction [41] and game design [8]. A design pattern consists of a formalised explanation of a repeatable approach to a specific design scenario. The elements of a pattern vary from Alexander's original formulation, but generally include an idiomatic title, an explanation of the problem or goal addressed, a description of the solution, one or more illustrative examples, and a note on connections with other patterns. Importantly, design patterns are not meant to be detailed blueprints, and do not include specific rules for how they should be implemented. Rather, Alexander conceptualises patterns as abstractions of a design solution:

"Each pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use the solution a million times over, without ever doing it the same way twice." [1:x]

Design patterns were introduced to the games research literature by Björk et al. [9]. They avoided framing patterns as solutions to problems, as this "creates a risk of viewing patterns as a method for only removing unwanted effects of a design" [9:185], rather than as a tool for more creative design work. Instead, they identify their value as tools for inspiring game ideas, understanding how existing games are put together, and helping to make choices about how a new game will work and what elements it should include [8]. Of course, patterns like Power Ups and High Score Lists have been known in game design for decades, but collecting and formalising these patterns creates a sort of reference library in which the concepts can be detailed, annotated and cross-referenced for the benefit of designers who may be less familiar with their use. It also establishes a common language for discussing voice interaction, with clear points of reference to ensure mutual understanding [28].

Björk et al. [9] define a five-part format for game design patterns, comprising a short, specific and idiomatic name; a concise description, with examples; a discussion of consequences of applying the pattern; common choices involved in using the pattern; and a list of relations between patterns. Variations on this format have been used to collect patterns in game dialogue [11], character design [44,45,50,62], level design [35,52,67] and psychological factors in game design [24,46,51,80]. Pattern languages have often been used to map relatively under-documented design spaces, including mobile games [22], pervasive games [10], AI-based games [17,71], public games [7], disaster response games [70] and games controlled by the player's gaze [75].

Most pertinent for the current study is Alves and Roque's pattern language for sound design in games [3,4], which covers a wide variety of game sound patterns such as overheard conversations, contextual music and sound effects that signal player failure or success. While their collection is primarily concerned with sound output, it includes one pattern (Sound Input) that covers player-produced sound, including both sounds introduced to the game through a microphone and sounds made by the game in response to player button presses. This can be seen as an umbrella category for our study, which only concerns design choices related to sounds made by the player, and considers sound output only insofar as it relates to the uses of sound input.

Despite the enthusiasm for pattern languages, some cautions have been sounded about how they are applied. Erickson warns against the "common misconception" [28:361] that a pattern language is a prescriptive tool that describes universal templates for designers to apply directly from the page. The prescriptive approach is articulated by Dormans, who argues that "a design pattern library should depart from a clearly formulated notion of quality in games" [25:4] and present its patterns as prescribed solutions to common problems, on the pragmatic basis that this would be more attractive to game developers than a pattern language that does not pre-judge the quality of a design choice. Erickson counters this view with the argument that the purpose of a pattern language is generative: it provides a common meta-language to bootstrap the creation of a project-specific vocabulary in which all participants in a design process can effectively participate. Rather than being rigid prescriptions, "both the language and individual patterns are malleable and are used to generate site-specific pattern languages for particular projects" [28:361]. This is echoed in an interview study with HCI researchers [58], which found that overly formalised or prescriptive pattern languages tend to be less usable, as they fail to account for the dynamic and contextual nature of design and technology.
The study identified the extensive effort involved in collecting patterns and developing them into a coherent language as a barrier to their wider use, but once formulated, a pattern language was considered to be a particularly useful tool for capturing interactions, documenting conventions, sharing knowledge and encouraging design thinking.

A case study by Segerståhl and Jokela [65] noted shortcomings in the comprehensiveness, naming and organisation of some existing pattern languages which reduced their usability for designers. To improve the organisation of pattern languages, they advise that patterns should be grouped into task-related areas, such as Searching and Navigation, and that pattern names should be illustrative of the solution they represent (as in Bread Crumbs) rather than generic or ambiguous (as in Main Navigation). They emphasise the importance of examples to provide an illustration of the pattern in practice. Finally, they recommend integrating pattern collections into larger libraries with a consistent format. We have followed these suggestions as much as possible in the formation of the pattern language in this study.

NOTE ON TERMINOLOGY
Discussions about voice interaction and game characters share a common problem: their key terms are often used interchangeably, as they appear superficially similar, despite referring to distinct concepts [5,15]. To avoid ambiguity, in this section we define how we use these terms in this paper.

Voice interaction encompasses any way that voice can be used as an input modality for a computer system to process and render a response. Notably, it excludes player-to-player communication, as in online voice chat. Voice control refers to the intentional use of voice to directly control a system, which may be verbal or non-verbal, whereas voice command is specifically the use of spoken words as direct instructions to the system. A drone that moves in response to whistling would therefore be an example of voice control, but not voice command. Voice command relies on speech recognition, the recognition of spoken words by a computer system.

Virtual character encompasses any kind of character with an identity in a gameworld, whether controlled by a player or by AI, and whether or not it appears on screen. Player character is the specific virtual character that represents the player's identity in the gameworld, often with its own separately defined identity in the game fiction. It may appear in the form of an avatar: a virtual body over which the player exerts consistent control in the gameworld and which serves as their locus of manipulation, "a tool that extends the player's ability to realise affordances within the game world" [5:2]. All game characters that are not representative of or controlled by the player are referred to as non-player characters (NPCs); these are usually proximally controlled by an AI system rather than by the player's inputs. Finally, unit is a generic term for AI-driven entities that includes those that are not strictly "characters", such as a group of NPCs that functions as one unified company, or a tank that is interchangeable with all other tanks. The term is associated with strategy games, in which characters often operate more like tokens on a game board than fully realised characters. Diegetic elements are part of the fiction of the gameworld; they are "real" from the perspective of the characters in the game. Non-diegetic elements are external to the gameworld, such as game menus or loading screens.
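To make the input-modality distinctions concrete, the sketch below encodes them as a small classifier. It is our own illustration, not drawn from the paper; the event fields are assumptions.

```python
# Hypothetical sketch of the terminology above: voice command is a
# subset of voice control, which is a subset of voice interaction.
from dataclasses import dataclass


@dataclass
class VoiceEvent:
    intentional: bool        # did the player mean to control the system?
    recognised_words: bool   # did speech recognition find spoken words?


def classify(event: VoiceEvent) -> str:
    """Classify a voice input using the paper's terminology."""
    if event.intentional and event.recognised_words:
        return "voice command"    # spoken words as direct instructions
    if event.intentional:
        return "voice control"    # e.g. whistling to steer a drone
    return "voice interaction"    # any voice input the system responds to


print(classify(VoiceEvent(intentional=True, recognised_words=False)))
# -> voice control
```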

METHOD FOR COLLECTING DESIGN PATTERNS
To begin our design pattern collection, we first set out to identify every screen-based digital game that has used voice interaction up to early 2018. We compiled this list by conducting keyword searches on terms related to voice interaction and speech recognition in the online game libraries Steam, Giant Bomb, itch.io, Newgrounds, Kongregate and the Video Game Console Library, the review aggregator site Metacritic, the discussion website Reddit, and general web searches through Google Search. Next, we applied our search terms to Google Scholar and the ACM Digital Library to identify academic works that have described a unique voice interaction game or game control system. We looked at the prior works cited in these papers and, through their Google Scholar listings, works that had cited them since their publication to find more examples. We repeated this process with new keywords and venues, informed by our initial results, until our list converged on a set of 40 academic research games and 409 commercial and independent games published between 1973 and 2018.¹ English was by far the most common language these games used, but a substantial number were alternatively or exclusively available in other languages, primarily Japanese and a variety of European languages. Where our searches turned up non-English sources that described these games, we reviewed them using Google Translate. However, this study relies in the main on English-language sources.

For each game in our list, the first author recorded the origin, platform and game genre; how central voice interaction was to the gameplay; whether verbal or non-verbal voice input was recognised; what types of speech acts were possible; whether voice actions were diegetically framed as coming from the player or their avatar; and whether voice actions were directed towards the game system or towards objects within the gameworld. The last two categories were prompted by studies of voice interaction [13] and voice communication [78] that have found that voice can disrupt players' sense of presence and identity when it conflicts with the diegetic framing of the gameworld. The first author then provided a descriptive summary of each game's voice interaction mechanics, and identified potential patterns through an open-coding process. Initial codes described the specific combination of a voice input and a response in the game (such as Blow air to move sailing ship). As shared themes emerged across the data set, the codes were iteratively revised and abstracted to describe higher-level patterns (such as Breath physics). We applied additional codes to describe how voice actions were framed by the game's presentation (see Patterns for Diegetic Framing). We ceased coding once we arrived at a stable list of patterns that encapsulated all the main variations we observed in our list of games.

After this process was complete, we conducted a smaller survey of 22 audio-only games on Amazon Alexa, to test whether different design patterns were apparent. The patterns we found were consistent with those in the larger survey. Our findings for audiogames are discussed at the end of the paper.

¹ A complete list of the games surveyed for this study has been submitted as a separate spreadsheet.
VOICE INTERACTION GAME DESIGN PATTERNS
Our pattern language comprises 25 patterns for voice interaction (see Table 1). These patterns are descriptions of emergent features of this design landscape. They are not intended to be exhaustive or mutually exclusive; this is a language, not a logical model for the categorisation of voice interaction games. Like a language, it resists neat categorisations and ontological consistency. However, to aid understanding we have arranged the patterns into six categories, which we will now explain.

Table 1. Game design patterns for voice interaction.
1 Diegetic framing: Speak as a character; Speak as the player; Situated commander; Floating commander
2 Dialogue structure: Choice of options; Question and answer; Who-what-where commands; Scripted conversation; Unscripted conversation
3 Selection: Select by name; Select by type
4 Navigation: Waypoints; Absolute directions; Relative directions
5 Control: Name an action; Name a behaviour; Shout to shoot; Volume control; Pitch control
6 Performance: Pronunciation; Karaoke; Breath physics; Exclamation; Spellcasting; Overheard noise

The first category, Diegetic Framing, consists of the patterns for how the game fiction frames the player's voice input. The patterns in this category are relatively abstract, as they describe the collective implications of multiple game elements, from narrative text to game mechanics. They address the role-playing dimension that has been identified as an important factor in players' experience of voice interaction [2,13].

The second category, Dialogue Structure, contains the patterns that describe how dialogue is configured between a player and an NPC. This category looks one level above the individual utterances, to the arrangement of utterances into a particular form of dialogue, such as conversation.

The remaining patterns are categorised into four broad task areas, in accordance with Segerståhl and Jokela's [65] guidelines for usable pattern languages. The categories are Selection, Navigation, Control and Performance. The patterns therein describe individual utterances and the responses they engender from the game.

We define each pattern with an illustrative Name, a short Description, an Example that exemplifies the pattern, a brief discussion of Consequences for the player experience, and a compact list of Relations to other patterns, such as conflicts and parent-child relationships. This is a simplified version of Björk et al.'s [9] model, foregoing their section on Using the Pattern for brevity, to allow for a more comprehensive list of the design patterns we have observed.
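To show how this five-part format lends itself to a machine-readable pattern library, here is a hypothetical sketch of the format as a record type. This is our own tooling idea, not the authors'; the field names mirror the format described above, and the sample entry paraphrases pattern 1.3.

```python
# The five-part pattern format as a record type, so a pattern library
# can be stored, cross-referenced and queried.
from dataclasses import dataclass, field


@dataclass
class VoicePattern:
    name: str                     # short, idiomatic pattern name
    description: str              # what the pattern does
    example: str                  # an illustrative game
    consequences: str             # effects on the player experience
    instantiates: list[str] = field(default_factory=list)
    instantiated_by: list[str] = field(default_factory=list)
    conflicts_with: list[str] = field(default_factory=list)


# One entry, paraphrased from the paper's pattern 1.3:
situated_commander = VoicePattern(
    name="Situated Commander",
    description="The player gives voice commands to NPCs in the role "
                "of an on-screen avatar who they control directly.",
    example="Rainbow Six: Vegas",
    consequences="Embodied persona, but divided attention under pressure.",
    instantiates=["Speak as a Character"],
    instantiated_by=["Who-What-Where Commands"],
    conflicts_with=["Speak as the Player", "Floating Commander"],
)
```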

We distinguish patterns by how they are presented to the player, rather than by their underlying mechanism in the game system. To illustrate this, consider a game that asks the player to extinguish a candle by blowing into the microphone in one scene, and to interrupt a wedding by shouting an objection in another scene. The trigger for both actions may be identical in the game code, with either one being activated by any sufficiently loud sound. But in our pattern language, the former would be described as Breath Physics and the latter as Exclamation, in accordance with their diegetic framing.
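As a minimal sketch of this point (our illustration, not code from any surveyed game), the two scenes below share one loudness trigger; only the diegetic framing differs. The threshold value and scene responses are assumptions.

```python
# One "sufficiently loud sound" trigger serving two patterns that differ
# only in diegetic framing: Breath Physics versus Exclamation.
import numpy as np

LOUDNESS_THRESHOLD = 0.2  # assumed RMS level counted as "loud"


def is_loud(frame: np.ndarray) -> bool:
    """True if a mono audio frame's RMS energy exceeds the threshold."""
    return float(np.sqrt(np.mean(frame ** 2))) > LOUDNESS_THRESHOLD


def candle_scene(frame: np.ndarray) -> None:
    if is_loud(frame):
        print("The candle flickers out.")         # framed as Breath Physics


def wedding_scene(frame: np.ndarray) -> None:
    if is_loud(frame):
        print('"I object!" The ceremony stops.')  # framed as Exclamation
```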
1. Patterns for Diegetic Framing
Patterns in this category describe how the game fiction frames and positions the player's voice actions in relation to the game's imaginary. These patterns do not describe specific voice actions, but how the configuration of voice actions casts the player in a speaking role. The patterns are Speak as a Character, Speak as the Player, Situated Commander and Floating Commander.

1.1 SPEAK AS A CHARACTER
The player's utterances correspond to the utterances of a character in the gameworld. The game responds to the player's voice as though it is the diegetic voice of the player-character.
Example: Guitar Hero: World Tour [55] represents players as a rock band on stage, performing the song that the players are performing. The virtual crowd cheers or heckles the virtual singer based on how well the player sings.
Consequences: Maintains consistency between the player's in-game persona and the persona implied by their voice actions. Player commentary suggests that this supports an experience of immersion and flow [13].
Relations: Instantiated by Situated Commander and Floating Commander. Instantiated by patterns that represent embodied diegetic actions, namely Scripted Conversation, Unscripted Conversation, Spellcasting, Breath Physics, Overheard Noise and often Exclamation. Conflicts with Speak as the Player.

1.2 SPEAK AS THE PLAYER
The player's utterances do not correspond to the utterances of a character in the gameworld. The game treats the player's voice actions as non-diegetic inputs.
Example: In Tomb Raider: Definitive Edition [20], the player can say the name of a weapon to make the player-character hold that weapon. This would not make sense as a vocalisation by the player-character, as she would be giving verbal instructions to herself.
Consequences: Carter et al. [13] noted that players sometimes experienced "identity dissonance" when speaking to (rather than as) the player-character, which suggests that Speak as the Player can disrupt the player's sense of presence in the gameworld if not applied carefully.
Relations: Conflicts with Speak as a Character and the patterns that instantiate it: Situated Commander, Floating Commander, Scripted Conversation, Unscripted Conversation, Spellcasting, Breath Physics, Overheard Noise and often Exclamation.

1.3 SITUATED COMMANDER
The player gives voice commands to NPCs in the role of an on-screen avatar who they control directly.
Example: This pattern is common in tactical shooter games, such as Rainbow Six: Vegas [73], in which the player controls the leader of a military squad and can give the other squad members instructions such as "check your fire" and "regroup on me".
Consequences: Provides the experience of an embodied persona as the imagined source of the player's voice, but also requires the player to divide their attention between managing the avatar's actions and using voice. This can be challenging, especially when the avatar is under threat or time pressure, as speech formulation draws on some of the same cognitive resources that are needed for strategic problem solving, to a greater extent than hand-eye coordination does [66].
Relations: Instantiates Speak as a Character. Can be instantiated by Who-What-Where Commands. Conflicts with Speak as the Player and Floating Commander.

1.4 FLOATING COMMANDER
The player gives voice commands to NPCs from a free-floating perspective above the gameworld, in the role of an unseen player-character.
Example: In Tom Clancy's EndWar [74], the player directs a battalion of infantry, tank and helicopter squads from above the battlefield using Who-What-Where Commands.
Consequences: Allows the player to dedicate their full attention to issuing commands without the distraction of managing an avatar. Combined with the wide field of view granted by the bird's-eye perspective, this makes Floating Commander well suited to strategy games and games with complex Who-What-Where Commands.
Relations: Instantiates Speak as a Character. Can be instantiated by Who-What-Where Commands. Conflicts with Speak as the Player and Situated Commander.

2. Patterns for Dialogue Structure
Patterns in this category describe a systematic arrangement of speech between the player and the game. They are concerned with how utterances are configured into a certain form of dialogue. The patterns are Choice of Options, Question and Answer, Who-What-Where Commands, Scripted Conversation and Unscripted Conversation.

2.1 CHOICE OF OPTIONS
The game presents a list of options that the player can select from by saying an associated phrase. The phrase may be the full text of the option or a corresponding keyword.
Example: In Thayer's Quest [61], each scene ends with a numbered list of locations. The player selects a destination by saying the number next to the place they want to visit.
Consequences: Reduces the possible inputs, which allows the speech recognition task within the game system to be simplified to improve reliability. It also simplifies the task for the player, minimising uncertainty and the need for creative cognition to formulate an answer.
Relations: Instantiated by Scripted Conversation and often Question and Answer.
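As an illustration of how this pattern constrains the recognition task, the sketch below matches a transcript against a fixed option list, in the spirit of the Thayer's Quest example. It is our own sketch, assuming a recogniser that returns a plain transcript string; the destinations are invented.

```python
# Choice of Options: the recogniser only has to distinguish a handful
# of known phrases, rather than open-ended speech.
from typing import Optional

OPTIONS = {
    "one": "The Forest",     # invented destinations for illustration
    "two": "The Castle",
    "three": "The Docks",
}


def match_option(transcript: str) -> Optional[str]:
    """Return the destination whose keyword or full name was spoken."""
    said = transcript.strip().lower()
    for keyword, destination in OPTIONS.items():
        if said in (keyword, destination.lower()):
            return destination
    return None


print(match_option("Two"))         # -> The Castle
print(match_option("the docks"))   # -> The Docks
print(match_option("somewhere"))   # -> None
```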

Session: Paper PresentationCHI PLAY 2018, October 28–31, 2018, Melbourne, VIC, Australiafor the player, minimising uncertainty and the need forcreative cognition to formulate an answer.Relations: Instantiated by Scripted Conversation and oftenQuestion and Answer.Behaviour, Waypoints, Relative Directions and AbsoluteDirections.2.4 SCRIPTED CONVERSATIONAn in-character conversation in which the playerparticipates by reading pre-scripted lines of dialogue thatappear on screen, effectively “performing” the voice of theplayer-character.Example: During conversation scenes in Mass Effect 3 [6],each time the player-character has a turn to talk, the gameshows a list of several options for what they can say. Theplayer can choose an utterance from this list by reading italoud. The player-character then gives a longer version ofthis utterance in its own voice, and the NPC responds.Consequences: Allows the player to participate in a fullyexpressive conversation without the need for a sophisticatedlanguage understanding system, as the game system onlyneeds to match their speech to a small number of pre-setutterances. Pre-scripted
