Sequential Pattern Mining Of Multimodal Data Streams In Dyadic Interactions

Transcription

Sequential Pattern Mining of Multimodal Data Streams in Dyadic Interactions

Damian Fricker, Chen Yu, Hui Zhang
Indiana University, Bloomington, IN 47408, U.S.A.
{dfricker, huizhang, chenyu}@indiana.edu

Abstract - In this paper we propose a sequential pattern mining method to analyze multimodal data streams using a quantitative temporal approach. While the existing algorithms can only find sequential orders of temporal events, this paper presents a new temporal data mining method focusing on extracting exact timings and durations of sequential patterns extracted from multiple temporal event streams. We present our method with its application to the detection and extraction of human sequential behavioral patterns over multiple multimodal data streams in human-robot interactions. Experimental results confirmed the feasibility and quality of our proposed pattern mining algorithm, and suggested a quantitative data-driven way to ground social interactions in a manner that has never been achieved before.

Index Terms - Cognitive Science, Human Robot Interaction, Sequential Pattern Mining.

I. INTRODUCTION

The real-time behavioral dynamics of collaborating social partners in human-human interactions are composed of sequences of coordinated bodily cues -- rapid shifts of eye gaze, head turns, and hand gestures -- that happen on a time scale of fractions of a second. It is those coordinated behaviors that make the whole interaction appear to be so smooth and effortless. However, a closer look at those behaviors [1, 2] has revealed that such interactions are indeed highly complex on two grounds: 1) such interactions are multimodal, as we interact with each other through multiple communication channels (e.g., eye gaze, speech, and pointing); and 2) such behaviors are adaptive, not pre-defined, as people actively perceive information from their social partners and dynamically adjust their own actions to form real-time perception-action loops within one agent and across two agents. For example, in natural circumstances, the eye, head, and hand within an agent are in continual motion in the context of ongoing behavior [14], wherein eye, head, and hand all need to act with respect to a common coordinate system and remain synchronized in time across multiple actions. For another example, in face-to-face interactions, the speaker always looks at the listener's face at the end of a spoken utterance with an expectation that the listener will produce a back-channel cue (e.g., nodding or mutual gaze) to confirm the reception of the speech [15]. In addition, this confirmation action needs to happen at the right moment in time; otherwise the same action generated with the wrong timing, such as with an unexpected delay, may disrupt natural communication. Understanding how those behaviors are coordinated at critical timing moments, and the sequential patterns they exhibit, is critical to advancing our understanding of human-human communication as well as human-robot interactions. New sensing and computational technologies -- eye-tracking systems, and advanced techniques for storing and automatically annotating massive data streams -- promise new insights into a micro-level understanding of multimodal behaviors that aligns with contemporary explanations of cognitive and neural processes.
In light of this, the goal of our research is to focus on data-mining and studying the sequential patterns exhibited in multimodal perception-action loops during real-time interactions within an agent and between two agents, i.e., how what one looks at or says relates to what the same person is going to look at or say next, as well as how the behavior of one agent influences what the other agent will look at and do next.

One particular challenge here is to discover reliable sequential patterns and the precise timing of free-flow human behaviors when people interact with another human or a robot in naturalistic contexts. Distinct from a robot executing a pre-defined action, human behaviors are much more adaptive and flexible. At a micro-behavioral level, it is rarely the case that two instances of the same human action are executed in exactly the same way in both time and space, even though they may share underlying patterns (e.g., the trajectories of two instances of the same reaching action). Toward our research goal, we develop a sequential pattern mining algorithm to analyze multiple event-based data streams with the focus of discovering quantitative timings between sequential events from multiple temporal streams. We first describe the algorithm and then apply it to multimodal data collected from a real-time human-robot interaction platform that we built. We demonstrate the potential utilities of this method by providing several concrete examples of how it is used to discover various kinds of micro-level behavioral patterns in human-robot interactions.

Figure-1: Process flow diagram of the algorithm. (a) An example subset of the event data set, displaying two example sequences, each with three event types. This is what is used as input to the algorithm. (b) The first step of the algorithm casts each example of the event data set into points in 2D space (event start time x event duration) and clusters the points. (c) Each cluster is used as an initial candidate pattern by an iterative Apriori-like algorithm. (d) Example of one estimated pattern detected from the Apriori-like algorithm. (e) A final adjustment to refine the temporal relations by removing some potential noise from the data. Whether or not this step is used is based on top-down knowledge of the data set, i.e., whether or not you would expect events to be completely synchronous.

II. METHOD

With huge amounts of multimodal data collected from various human-human and human-robot studies, there have already been previous research efforts focused on issues such as gesture spotting and recognition [3], temporal pattern analysis [4], and predictions of human behavior patterns [5]. Most existing work employed temporal pattern mining methods developed for point-based events, which date back to Agrawal and Srikant in 1995 [6]. Although those data-driven approaches provided useful means to understand multimodal interactions in human-human and human-robot interactions, an intrinsic limitation was that multimodal data streams in both human-human and human-robot interactions are indeed interval-based events instead of point-based events. Human behavioral patterns are composed of sequences of coordinated social cues on time scales of fractions of seconds; therefore point-based events are not able to represent events with this time scale and such complexity.

A. Related Work

Interval-based event mining was first proposed by Kam and Fu as an algorithm to discover temporal patterns for interval-based events (see [7]). Similar to point-based event mining algorithms, interval-based event mining algorithms use an Apriori-like algorithm to find longer frequent patterns from examples. As Allen discussed the difference between time points and time intervals (see [8]), interval-based event mining is essentially different from point-based event mining in that interval-based events are time-stamped with continuous starting and ending timestamps instead of the discrete time points by which point-based events are represented.

However, most interval-based event mining algorithms developed since then only focus on finding temporal orders between events [10-13] (e.g., event A happening before/after event B). Quantitative timing information, such as at what moment a temporal event happens with what timing, has been a missing component in various data-mining and data-analysis scenarios. To the best of our knowledge, [9] is the only work to identify continuous interval-based events using a quantitative temporal mining algorithm.
That approach exploited the Apriori algorithm to generate candidate sequences as the first step; then an MCMC sampling method and an expectation-maximization (EM) algorithm were applied to select candidate patterns. This approach, however, has several issues. First, its time complexity is significantly high due to the sampling method and the EM algorithm. Second, the hypercube their approach constructs is based on the occurrences of events, which significantly limits its expressional flexibility. Lastly, this approach does not consider the relationship between events. In the next section, we propose our own approach to address these issues.

B. Continuous Interval-based Event Mining

In this section, we propose a new algorithm, the event space miner (ESM), used to data-mine interval-based event patterns from example sets. An event data set consists of multiple example sequences, all of the same total duration, which can contain any number of event types; e.g., an event data set of natural disasters could contain multiple examples, each a year long, with timing information (example sequences) of volcanic eruptions, earthquakes, and tsunamis. Given an event data set, our ESM consists of 3 steps: 1) First, the algorithm converts the data into an event space (see Figure-1(c)) and clustering is performed in the event space based on event types; 2) Second, these clusters become the first set of candidate patterns for an Apriori-like procedure to compute representative sequential prototypes (see Figure-1(d)); 3) Third, those prototype patterns are refined by considering their temporal relationships (see Figure-1(e)). Each step in the algorithm is described in detail in the following sections.

1) Clustering: Given example sequences, ESM converts the examples into points (2D vectors) embedded in the constructed 2D event space (e.g., see Figure-1(c)), for an EM algorithm to find clusters within each event type. We take the center points of the clusters as seed patterns for an Apriori-like algorithm. Each seed pattern is a 2D vector. Different from the classic EM algorithm, where the number of Gaussians is predefined, we only define the maximum number of Gaussians to be tried before we find the best-fitting number using a model selection method -- the normalized entropy criterion (NEC). (We used the package from www.mixmod.org.)
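To make this clustering step concrete, the sketch below casts interval events into 2D (start time, duration) points and fits a Gaussian mixture per event type. It is a minimal illustration only: the event-tuple format and the max_components parameter are assumptions, and scikit-learn's BIC is used here as a stand-in for the NEC model-selection criterion (via the mixmod package) used in the paper.

```python
# Minimal sketch of ESM step 1: cast interval events into a 2D event space
# (start time, duration) and cluster each event type with a Gaussian mixture.
# BIC is used here as a stand-in for the NEC criterion used in the paper.
from collections import defaultdict
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_event_space(examples, max_components=5):
    """examples: list of sequences; each sequence is a list of
    (event_type, start_time, end_time) tuples (hypothetical format)."""
    # Group all events by type across the example sequences.
    points_by_type = defaultdict(list)
    for sequence in examples:
        for event_type, start, end in sequence:
            points_by_type[event_type].append([start, end - start])

    seeds = {}  # event_type -> cluster means used as seed patterns
    for event_type, points in points_by_type.items():
        X = np.asarray(points)
        best_gmm, best_score = None, np.inf
        # Try 1..max_components Gaussians and keep the best-scoring model.
        for k in range(1, min(max_components, len(X)) + 1):
            gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
            score = gmm.bic(X)
            if score < best_score:
                best_gmm, best_score = gmm, score
        # Cluster centers (start, duration) become candidate seed patterns.
        seeds[event_type] = best_gmm.means_
    return seeds
```

The cluster means returned here play the role of the seed patterns handed to the Apriori-like step described next.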

2) Apriori-like Algorithm: Typically, Apriori-like algorithms first scan the database to find all the frequent single items, which are the initial input used as seed patterns, e.g., p_1, ..., p_i. Then they generate 1-item-longer candidate sequences, which are tested to see if they are more frequent than a given threshold. If they are frequent enough, they will be used as seed patterns for the next iteration. The search for candidate sequences is performed recursively until there is no new frequent pattern or no candidate can be generated. The final output is a set of patterns, e.g., ({p_1: t_1-t_2}, {p_1: t_1-t_2; p_2: t_3-t_4}), where each t is a timestamp.

Our customized Apriori-like algorithm first uses the mean vectors of the clusters as the candidates (e.g., see Figure-1(d)). Our algorithm then examines the occurrence frequency of each candidate against a threshold and re-generates new candidates in a new round by excluding those candidates that appeared less frequently. We note that our algorithm calculates the frequency of a pattern in a fuzzy way based on how much the pattern overlaps the examples. Let p be a pattern and x be an example event with starting time x.s and ending time x.e; the frequency is calculated by

f(p) = \frac{1}{|S|} \sum_{x \in S} \frac{h(p,x)}{p.e - p.s}, \qquad h(p,x) = \max\{\min(p.e, x.e) - \max(p.s, x.s),\ 0\},

where |S| is the size of the example set. In this way, if an example overlaps with the prototype pattern only 10% of the time, that example still adds 0.1 to the overall frequency count.

3) Probabilistic Adjustment: Given an event data set, ESM converts the data into the event space and finds frequent patterns. However, the temporal relationships between the events have not yet been considered in pattern discovery. Given the examples as in Figure-1(b), which were generated by adding Gaussian noise to a prototype, the estimated pattern is shown in Figure-1(d), wherein the three temporal events on the right seem to happen at slightly different time points due to noise in the examples.

We improved the results using the probability distributions of the events. Each event in the estimated pattern has its own distribution derived from the examples. We assume that the distributions are Gaussian. We then consider the starting time and the ending time of an event separately, as in point-based event mining algorithms. The procedure is as follows:
1. Calculate the mean and variance for each point (a starting or ending time of each event).
2. Sort the distances between each pair of the means.
3. Check the probability that the closest pair actually happens at the same time. If the probability is higher than a given threshold:
   a. Adjust all the connected points.
   b. Update the means and variances of all the connected points.
   c. Register the pair into the connected list.
4. Go to 3 unless the next closest pair's distance is bigger than a threshold.

More specifically, each point has the mean and the variance from the examples, which are enough to estimate its Gaussian distribution. Let G_1 and G_2 be the two Gaussian distributions for two points from two different events, respectively. They are described by G_1 ~ N(\mu_1, \sigma_1^2) and G_2 ~ N(\mu_2, \sigma_2^2). Then, the probability that the distance between the two means is zero is given by

p(\text{distance} = 0) = \frac{1}{\sqrt{2\pi}\,\sigma_d} \exp\!\left(-\frac{\mu_d^2}{2\sigma_d^2}\right),

where \sigma_d^2 = \sigma_1^2 + \sigma_2^2 and \mu_d = \mu_1 - \mu_2, assuming that the two distributions are independent (or uncorrelated). If they are correlated, \sigma_d^2 = \sigma_1^2 + \sigma_2^2 - 2\sigma_{12}, where \sigma_{12} is the cross-covariance.
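To make the two scoring rules above concrete -- the overlap-based frequency f(p) used to keep or prune candidates, and the zero-distance probability used to decide whether two time points should be merged -- here is a minimal sketch. It is an illustration under stated assumptions (hypothetical (start, end) tuples and a hand-picked threshold), not the authors' implementation.

```python
# Minimal sketch of the two scoring rules described above (hypothetical data format).
import math

def fuzzy_frequency(pattern, examples):
    """Overlap-based frequency of a candidate interval pattern.
    pattern: (p_start, p_end); examples: list of (x_start, x_end)."""
    p_start, p_end = pattern
    total = 0.0
    for x_start, x_end in examples:
        overlap = max(min(p_end, x_end) - max(p_start, x_start), 0.0)
        total += overlap / (p_end - p_start)  # partial overlaps count fractionally
    return total / len(examples)

def zero_distance_probability(mu1, var1, mu2, var2, cross_cov=0.0):
    """Density at zero of the difference of two (possibly correlated) Gaussians,
    used to test whether two time points should be treated as the same point."""
    var_d = var1 + var2 - 2.0 * cross_cov
    mu_d = mu1 - mu2
    return math.exp(-mu_d**2 / (2.0 * var_d)) / math.sqrt(2.0 * math.pi * var_d)

# Example usage with made-up numbers:
# fuzzy_frequency((1.0, 2.0), [(0.9, 2.1), (1.5, 3.0)])
# zero_distance_probability(1.00, 0.01, 1.05, 0.02) > 0.2  -> merge the points
```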
Given two sets of examples with G_1 and G_2 distributions and with N_1 and N_2 examples, respectively, if the probability of zero distance is higher than a given threshold, they are close enough to be considered the same point. The threshold is set to 0.2 in the experiments here. If they are close enough, they can be combined into a new distribution as follows. Once combined, they move together as one point with a new frequency score, mean, and standard deviation:

\mu_{com} = \frac{N_1 \mu_1 + N_2 \mu_2}{N_1 + N_2}, \qquad \sigma_{com}^2 = \frac{N_1(\sigma_1^2 + \mu_1^2) + N_2(\sigma_2^2 + \mu_2^2)}{N_1 + N_2} - \mu_{com}^2.

In this way, those two events are merged as a new composed sequential event in Figure-1(e). This is an optional step that requires some top-down knowledge of the data set, i.e., should events be completely synchronous, or is one type of event causally related to the prior event? One possible direction of future work is to estimate the parameters in this step based on the timing distribution of the overall event data set.

III. EXPERIMENT AND DATA

In this section, we introduce our human-robot interaction experimental paradigm. Figure-2 shows an overview of our multimodal real-time human-robot interaction. In a word-learning task we designed, a human teacher was asked to teach a robot the names of a set of novel objects in a shared environment. Participants were allowed to freely point at, hold, and manually manipulate those objects to attract the robot's attention and then name those objects using the artificial names we provided (e.g., "tema" and "dodi"). The robot is equipped with two cameras located in its forehead, capable of providing 640x480-resolution images at 30 fps. In particular, the robot was able to perform gaze-contingent actions by following a human participant's gaze in real time, as well as producing other responsive behaviors (e.g., nodding/shaking its head). Using this platform, we have designed and conducted several experimental conditions in which the robot may sometimes follow the human's attention or sometimes ignore the human's attention by looking at a spatial location to which the human was not attending. In addition, the robot may sometimes look more toward the human's face

to initialize mutual gaze or sometimes look less at the human's face. Our research goal was to record and analyze responsive behaviors from participants when they interacted with the same robot but with different gaze-contingent behaviors. Multimodal data streams were recorded from both human participants and the robot in those experimental conditions, including human participants' eye gaze data, speech acts, and first-person-view video, as well as the robot's body movements and head/gaze direction. We have developed various video, speech, and motion data processing tools to automatically annotate raw multimodal data. As a result, we derived a set of temporal event streams from the raw data, which is used for the following sequential pattern discovery.

Figure-2: An overview of the system structure. A participant and the robot sat across the table from each other and interacted in a shared environment. The human teacher attempted to teach the robot the (artificial) names of a set of objects, and the robot generated gaze-contingent responsive behaviors based on the real-time detection of the human teacher's attention. In addition, multiple data streams were recorded to analyze human behaviors in this joint task.

IV. SEQUENTIAL PATTERNS IN MULTIMODAL DATA STREAMS

Our primary goal in this section is to test and demonstrate that ESM can discover various kinds of meaningful patterns from multimodal human-robot interaction data. Toward this goal, we focus on four main types of sequential patterns that capture multimodal perception-action dynamics both within human participants and between participants and the robot: one-agent/unimodal, one-agent/multimodal, two-agent/unimodal, and two-agent/multimodal. As we mentioned earlier and demonstrate with multiple examples below, a key feature that distinguishes ESM from other existing algorithms is that we are able to extract exact timings and durations of sequential patterns from multiple data streams.

The set of temporal event streams comes from 15 subjects, each subject's data consisting of a total of 8 minutes across 4 conditions. In order to specify the questions we want to answer, we use top-down knowledge to chunk the data streams at key moments; e.g., to answer what the quantitative timings of gaze patterns around naming events are, we would chunk the data streams into 5-second sequences based on each naming event and use those sequences as the event sequences of the event data set (a minimal sketch of this windowing step appears below).

1) Unimodal patterns from a single agent: One-agent unimodal patterns can reveal sequential patterns within a single stream and how those patterns may evolve over time. In our human-robot interaction task, there are several unimodal streams, e.g., human speech, human gaze, and robot gaze. Here, we select human speech as an example. The speech data was transcribed and then coded as one of three types of speech acts: naming an object, describing an object, and attention getting.
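As a concrete illustration of the chunking step mentioned above, the sketch below cuts multimodal event streams into fixed-length windows anchored at each naming event, re-expressing every event's timing relative to the window start. The data layout, field names, window length, and the choice to start each window at the naming onset are assumptions for illustration, not the authors' code.

```python
# Minimal sketch of chunking event streams into windows around naming events.
# Event format (assumed): (event_type, start_time, end_time) in seconds.

def chunk_around_naming_events(streams, naming_events, window=5.0):
    """streams: dict mapping stream name -> list of (event_type, start, end).
    naming_events: list of (start, end) times of naming utterances.
    Returns one example sequence per naming event, with times shifted so
    each window starts at 0."""
    examples = []
    for naming_start, _ in naming_events:
        window_end = naming_start + window
        example = []
        for stream_name, events in streams.items():
            for event_type, start, end in events:
                # Keep any event that overlaps the window, clipped to it.
                if end > naming_start and start < window_end:
                    example.append((stream_name, event_type,
                                    max(start, naming_start) - naming_start,
                                    min(end, window_end) - naming_start))
        examples.append(example)
    return examples
```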
Naming events were sentences like "This is a wawa," in which an object name was mentioned in a spoken utterance; describing events included spoken sentences like "The wawa is blue and looks kind of like a 'y'," in which participants described physical properties of the object to the robot; attention-getting speech events were those in which participants used speech to attract the robot's attention, such as "Hi, robot, look over here."

Figure-3 shows the two most reliable patterns of speech events detected by ESM: 1) a 1.4-second naming sentence followed by a 3-second describing event with a 400 ms silence in between; 2) a shorter naming followed by a describing event, which is then followed by an attention-getting act. These sequential patterns reveal two key different strategies used by participants in teaching the robot object names. In fact, we found that the first pattern most likely appeared in the beginning of the interaction to provide the key information in this teaching task, while the second pattern was used more often later as a re-iteration of the information, with attention-getting speech acts as a confirmation or a query of whether the information was being received at the robot's end. In addition, the shorter duration of naming events (600 ms) in the second pattern indicates that participants were likely to use an object name alone in a naming utterance, without other spoken words, in the later part of teaching.

Figure-3: The two most reliable sequential patterns found in the human's speech data stream. (a) A sequential pattern showing a 1.4 s naming utterance followed by a 3 s describing utterance. (b) A sequential pattern showing a 0.6 s naming utterance followed by a 1.4 s describing utterance and a 1.4 s attention-getting utterance. Note how all of the pauses between utterances across both patterns last about 0.4 seconds.

2) Multimodal patterns from a single agent: Multimodal patterns within an agent can reveal the interplay and

synchronization of multimodal actions, i.e., how gaze fixations on a target object and the naming of that object co-occur over time. To demonstrate the usefulness of ESM in extracting these kinds of sequential patterns, we selected three event streams from participants -- human naming events, human object looking, and human face looking -- and fed them into ESM. Figure-4 shows the most reliable sequential pattern detected by the ESM data miner: prior to a naming event (around 500 ms), participants gazed at the to-be-named object first, and then started a naming event while looking at the target object. Around 500 ms after the onset of the naming event, participants quickly checked the robot's face for 700 ms and then switched their gaze back to the named object for 1.5 seconds.

This sequential pattern from human-robot interactions is in line with two previous findings on human behaviors. First, psycholinguistic studies of speech production showed that participants were likely to gaze at the object that will be referred to in the following spoken utterance before the onset of that utterance, to facilitate sentence planning [16]. Second, the most likely moment that a participant checked the robot's face was right after a naming utterance was started. This result is supported by [15], which found that during the onset of speech, humans tend to look away from the listener as they process the information required to utter the sentence. A brief look at the partner's face serves as a communicative signal to continue communication or to finish their social turn in a conversation. Thus, these results, consistent with the literature, demonstrate that ESM is able to successfully discover sequential patterns that are embedded in the data.

Figure-4: The most reliable sequential pattern detected from the human's speech and gaze streams. Participants gazed at the to-be-named object before naming and then checked the robot's face 500 ms after the onset of the naming event, followed by a 1.5-second look at the target object.

3) Unimodal patterns across two agents: We next take the human gaze stream and the robot gaze stream to analyze joint attention behaviors between two social partners. This analysis allows us to identify coupled and adaptive behaviors between two agents, to not only discover those joint attention moments wherein they paid attention to the same region of interest but also who was leading whom into a joint state. In particular, we focused on face gazing behaviors to ask how the human and the robot jointly attend to each other's face and what the temporal differences are between human-leading and robot-leading mutual gaze.

Figure-5 shows the most reliable mutual gaze patterns detected from the two gaze streams. Overall, the robot's face gazing duration was significantly longer than the human's face gazing events, and humans produced more face looks than the robot.

Figure-5: Two mutual gaze patterns discovered: (a) robot leading, (b) human leading. Moreover, the results showed that the robot's face gaze duration was significantly longer than the human's face gaze events and that humans produced more face looks.
Moreover, we found two different ways of leading to mutual gaze between humans and the robot: human leading (left) and robot leading (right). In the human leading pattern (left), the results show that the robot followed the human's face gazing after 0.6 seconds, which was comparable to the amount of time it took the human to follow a face gaze from the robot (that is, the robot leading pattern on the right). From the human perspective, a participant must first notice that the robot is looking at him, decide to look back, and then plan a saccade to switch his attention to the robot's face. This whole process seems to take around 600 ms in human-robot interactions. In our platform, it was our original design to program the robot's gaze-following behaviors to emulate human gaze behaviors. Therefore, the similar timing delay in the human-following and robot-following cases verified the real-time aspects of our human-robot interaction platform. In addition, we found that a continued face look from the robot (right) made participants look back at the robot's face periodically, showing humans' sensitivity to the robot's behaviors and their adaptive behaviors to further facilitate interaction.

4) Multimodal patterns across multiple agents: Multi-agent multimodal patterns are the most impenetrable (and arguably most interesting) aspect of the sensorimotor dynamics within multimodal interaction, since they encompass the dynamics both across modalities and across agents. Here, we select five temporal streams: the robot's face looking, the human's face looking, the robot's object gazing, the human's object gazing, and an object-moving event indicating whether the target object is moving or not. Figure-5 shows the most reliable pattern that ESM discovered from the five event streams. At the beginning, the robot looked at the human's face while the human looked at a moving target object, and then the robot switched its attention to the target object by following the human's attention. The robot's object-following behavior triggered the human to check the robot's face, which in turn led the robot to look at the human's face as a response to the human's face looking. After that, the human switched back to the target object and stayed focused on it, even while the robot continued to look at the human's face -- both went back to their initial state after this sequence of responsive actions. We also found that right at the moment after the human switched to the robot's face (presumably to check the robot's attention), the human seemed to stop moving the target object. One plausible reason is that it

might be easier for the human to check whether the robot was following at a non-moving target. Also, in this sequential pattern, we found two instances of the robot following the human's gaze. The first can be seen as the robot following the human's gaze to the target object within about 1 second after the onset of the human gaze. The second instance was the moment that the robot followed the gaze to the face about 0.5 seconds after the onset of the human gaze. More generally, this complex cascading sequence happening within a short window of time reflects the real-time dance of multimodal interactions, which we argue is key to understanding both human-human communication and human-robot interaction. Therefore, ESM is able to successfully discover rather complex patterns across multiple data streams, showing its potential utility in analyzing complex micro-level behavioral data.

Figure: Multimodal pattern from multiple agents (event streams shown include fast object speed, robot gaze on face, human gaze on face, and robot gaze on object).

We will continue our current research in various multimodal social scenarios, from human-robot interaction to child-parent interaction. We argue that understanding the coupling and sequential patterns of micro-level human behaviors can lead to a mechanistic account of fundamental aspects of human communication and interaction, and that the exact timings of multimodal behaviors within an agent and across two social partners can play a critical role in such smooth interaction.

ACKNOWLEDGMENT

We thank Hong-Wei Shen and Amanda Favata for help with running subjects, Computational Cognition and Learning Lab members for help in coding data, and Henry Choi and Hong-Wei Shen for programming assistance. Research was supported by NSF BCS 0924248 and AFOSR FA9550-09-1-0665.

REFERENCES

[1] Chen Yu, Matthias Scheutz, and Paul Schermerhorn, "Investigating multimodal real-time patterns of joint attention in an HRI word learning task," in Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction, 2010.
[2] Hui Zhang, Damian Fricker, Thomas G. Smith, and Chen Yu, "Real-time adaptive behaviors in multimodal human-avatar interactions," in International Co
