The Workshop Programme
Corpora for Research on Emotion and Affect
Tuesday 23rd May 2006, 14:30 – 20:00

14:30-14:50 Welcome (R. Cowie)

14:50-16:30 First Oral Session (15 min. per speaker, 5 min. for questions)

14:50 Nick Campbell
A Language-Resources Approach to Emotion: Corpora for the Analysis of Expressive Speech

15:10 Franck Enos, Julia Hirschberg
A Framework for Eliciting Emotional Speech: Capitalizing on the Actor's Process

15:30 Magalie Ochs, Catherine Pelachaud, David Sadek
A Coding Scheme for Designing: Computational Model of Emotion Elicitation

15:50 Tanja Bänziger, Hannes Pirker, Klaus Scherer
GEMEP - GEneva Multimodal Emotion Portrayals: A corpus for the study of multimodal emotional expressions

16:10 Laurence Vidrascu, Laurence Devillers
Real-life emotions in naturalistic data recorded in a medical call center

16:30-17:00 Coffee break

17:00-18:00 Poster Session

Poster 1: Noam Amir, Samuel Ron
Collection and evaluation of an emotional speech corpus using event recollection

Poster 2: M. Rita Ciceri, Stephania Balzarotti, F. Manzoni
MEED: the challenge towards a Multidimensional Ecological Emotion Database

Poster 3: Steffi Frigo
The relationships between acted and naturalistic emotional corpora

Poster 4: Nicolas Audibert, Damien Vincent, Véronique Aubergé, Olivier Rosec
Evaluation of expressive speech resynthesis

Poster 5: Isabella Poggi
Body and Music. An annotation scheme of the pianist multimodal behaviour

Poster 6: Aubergé Véronique, Rilliard Albert, Audibert Nicolas
Auto-annotation: an alternative method to label expressive corpora

Poster 7: Loyau Fanny, Aubergé Véronique
Expressions outside the talk turn: ethograms of the feeling of thinking

Poster 8: Valérie Maffiolo, G. Damnati, V. Botherel, E. Guimier de Neef and E. Maillebuau
Multilevel features for annotating application-driven spontaneous speech corpora

Poster 9: Maartje Schreuder, Laura van Eerten, Dicky Gilbers
Music as a method of identifying emotional speech

Poster 10: Aaron S. Master, Ing-Marie Jonsson, Clifford Nass, Peter X. Deng, Kristin L. Richards
A Framework for Generating and Indexing Induced Emotional Voice Data

Poster 11: Alexander Osherenko
Affect Sensing using Lexical Means: Comparison of a Corpus with Movie Reviews and a Corpus with Natural Language Dialogues

Poster 12: N. Mana, P. Cosi, G. Tisato, F. Cavicchio, E. Magno and F. Pianesi
An Italian Database of Emotional Speech and Facial Expressions

18:00-19:40 Second Oral Session (15 min. per speaker, 5 min. for questions)

18:00 Vaishnevi Varadarajan, John Hansen, Ikeno Ayako
UT-SCOPE – A corpus for Speech under Cognitive/Physical task Stress and Emotion

18:20 Chloé Clavel, Ioana Vasilescu, Laurence Devillers, Gaël Richard, Thibaut Ehrette
The SAFE Corpus: illustrating extreme emotions in dynamic situation

18:40 Janne Bondi Johannessen, Kristin Hagen, Joel Priestley and Lars Nygaard
A speech corpus with emotions

19:00 Dirk Heylen, Dennis Reidsma, Roeland Ordelman
Annotating State of Mind in Meeting Data

19:20 Marc Schröder, Hannes Pirker, Myriam Lamolle
First suggestions for an emotion annotation and representation language

19:40-20:00 Conclusion and Panel/Discussion (20 min.)

Workshop Organisers

Laurence Devillers / Jean-Claude Martin
Spoken Language Processing group / Architectures and Models for Interaction
LIMSI-CNRS, France

Roddy Cowie / School of Psychology
Ellen Douglas-Cowie / Dean of Arts, Humanities and Social Sciences
Queen's University, Belfast BT7 1NN, UK

Anton Batliner / Lehrstuhl für Mustererkennung (Informatik 5)
Universität Erlangen-Nürnberg, Germany

Workshop Programme Committee

Elisabeth André, Univ. Augsburg, D
Véronique Aubergé, CNRS-STIC, FR
Anton Batliner, Univ. Erlangen, D
Nadia Bianchi-Berthouze, Univ. Aizu, J
Nick Campbell, ATR, J
Roddy Cowie, QUB, UK
Laurence Devillers, LIMSI-CNRS, FR
Ellen Douglas-Cowie, QUB, UK
John Hansen, Univ. of Texas at Dallas, USA
Susanne Kaiser, UNIGE, S
Stephanos Kollias, ICCS, G
Christine Lisetti, EURECOM, FR
Valérie Maffiolo, France Telecom, FR
Jean-Claude Martin, LIMSI-CNRS, FR
Shrikanth Narayanan, USC Viterbi School of Engineering, USA
Catherine Pelachaud, Univ. Paris VIII, FR
Isabella Poggi, Univ. Roma Tre, I
Fiorella de Rosis, Univ. Bari, I
Izhak Shafran, Johns Hopkins Univ., CSLP, USA
Elisabeth Shriberg, SRI and ICSI, USA
Marc Schröder, DFKI Saarbrücken, D
Ioana Vasilescu, ENST, FR

Table of Contents

Introduction .... vii

A Language-Resources Approach to Emotion: Corpora for the Analysis of Expressive Speech .... 1
Nick Campbell

A Framework for Eliciting Emotional Speech: Capitalizing on the Actor's Process .... 6
Franck Enos, Julia Hirschberg

A Coding Scheme for Designing: Computational Model of Emotion Elicitation .... 11
Magalie Ochs, Catherine Pelachaud, David Sadek

GEMEP - GEneva Multimodal Emotion Portrayals: A corpus for the study of multimodal emotional expressions .... 15
Tanja Bänziger, Hannes Pirker, Klaus Scherer

Real-life emotions in naturalistic data recorded in a medical call center .... 20
Laurence Vidrascu, Laurence Devillers

Collection and evaluation of an emotional speech corpus using event recollection .... 25
Noam Amir, Samuel Ron

MEED: the challenge towards a Multidimensional Ecological Emotion Database .... 29
M. Rita Ciceri, Stephania Balzarotti, F. Manzoni

The relationships between acted and naturalistic emotional corpora .... 34
Steffi Frigo

Evaluation of expressive speech resynthesis .... 37
Nicolas Audibert, Damien Vincent, Véronique Aubergé, Olivier Rosec

Body and Music. An annotation scheme of the pianist multimodal behaviour .... 41
Isabella Poggi

Auto-annotation: an alternative method to label expressive corpora .... 45
Aubergé Véronique, Rilliard Albert, Audibert Nicolas

Expressions outside the talk turn: ethograms of the feeling of thinking .... 47
Loyau Fanny, Aubergé Véronique

Multilevel features for annotating application-driven spontaneous speech corpora .... 51
Valérie Maffiolo, G. Damnati, V. Botherel, E. Guimier de Neef and E. Maillebuau

Music as a method of identifying emotional speech .... 55
Maartje Schreuder, Laura van Eerten, Dicky Gilbers

A Framework for Generating and Indexing Induced Emotional Voice Data .... 60
Aaron S. Master, Ing-Marie Jonsson, Clifford Nass, Peter X. Deng, Kristin L. Richards

Affect Sensing using Lexical Means: Comparison of a Corpus with Movie Reviews and a Corpus with Natural Language Dialogues .... 64
Alexander Osherenko

An Italian Database of Emotional Speech and Facial Expressions .... 68
N. Mana, P. Cosi, G. Tisato, F. Cavicchio, E. Magno and F. Pianesi

UT-SCOPE – A corpus for Speech under Cognitive/Physical task Stress and Emotion .... 72
Vaishnevi Varadarajan, John Hansen, Ikeno Ayako

The SAFE Corpus: illustrating extreme emotions in dynamic situation .... 76
Chloé Clavel, Ioana Vasilescu, Laurence Devillers, Gaël Richard, Thibaut Ehrette

A speech corpus with emotions .... 80
Janne Bondi Johannessen, Kristin Hagen, Joel Priestley and Lars Nygaard

Annotating State of Mind in Meeting Data .... 84
Dirk Heylen, Dennis Reidsma, Roeland Ordelman

First suggestions for an emotion annotation and representation language .... 88
Marc Schröder, Hannes Pirker, Myriam Lamolle

Author Index

Amir 25
Aubergé 37, 45, 47
Audibert 37, 45
Ayako 72
Balzarotti 29
Bänziger 15
Beverina 29
Botherel 51
Caldognetto Magno 68
Campbell 1
Cavicchio 68
Ciceri 29
Clavel 76
Cosi 68
Damnati 51
de Neef 51
Deng 60
Devillers 20, 76
Ehrette 76
Enos 6
Gilbers 55
Hagen 80
Hansen 72
Heylen 84
Hirschberg 6
Johannessen 80
Jonsson 60
Lamolle 88
Loyau 47
Maffiolo 51
Maillebuau 51
Mana 68
Manzoni 29
Master 60
Nass 60
Nygaard 80
Ochs 11
Ordelman 84
Osherenko 64
Pelachaud 11
Pianesi 68
Piccini 29
Pirker 15, 88
Poggi 41
Priestley 80
Reidsma 84
Richard 76
Richards 60
Rilliard 45
Ron 25
Rosec 37
Sadek 11
Scherer 15
Schreuder 55
Schröder 88
Sedogbo 76
Tisato 68
Vaishnevi 72
van Eerten 55
Varadarajan 72
Vasilescu 76
Vidrascu 20
Vincent 37

Corpora for Research on Emotion and Affect

Introduction

This decade has seen an upsurge of interest in systems that register emotion (in a broad sense) and react appropriately to it. Emotion corpora are fundamental both to developing sound conceptual analyses and to training these 'emotion-oriented systems' at all levels - to recognise user emotion, to express appropriate emotions, to anticipate how a user in one state might respond to a possible kind of reaction from the machine, etc. Corpora have only begun to grow with the area, and much work is needed before they provide a sound foundation.

The HUMAINE network of excellence (http://emotion-research.net/) has brought together several groups working on the development of databases, and the workshop aims to broaden the interaction that has developed in that context.

Many models of emotion are common enough to affect the way teams go about collecting and describing emotion-related data. Some which are familiar and intuitively appealing are known to be problematic, either because they are theoretically dated or because they do not transfer to practical contexts. To evaluate the resources that are already available, and to construct valid new corpora, research teams need some sense of the models that are relevant to the area.

What are appropriate sources?

In the area of emotion, some of the hardest problems involve acquiring basic data. Four main types of source are commonly used. Their potential contributions and limitations need to be understood.

Acted: Many widely used emotion databases consist of acted representations of emotion (which may or may not be generated by actors). The method is extremely convenient, but it is known that systems trained on acted material may not transfer to natural emotion. It has to be established what kind of acted material is useful for what purposes.

Application-driven: A growing range of databases are derived from specific applications (e.g. call centres). These are ideal for some purposes, but access is often restricted for commercial reasons, and it is highly desirable to have more generic material that could underpin work on a wide range of applications.

General naturalistic: Data that is representative of everyday life is an attractive ideal, but very difficult to collect. Making special-purpose recordings of everyday life is a massive task, with the risk that recording changes behaviour. Several teams have used material from broadcasts, radio & TV (talk shows, current affairs). That raises issues of access, signal quality, and genuineness.

Induction: A natural ideal is to induce emotion of appropriate kinds under appropriate circumstances. Satisfying induction is an elusive ideal, but new techniques are gradually emerging.

Which modalities should be considered, in which combinations?

Emotion is reflected in multiple channels - linguistic content, paralinguistic expression, facial expression, eye movement, gesture, gross body movement, manner of action, visceral changes (heart rate, etc.), brain states (EEG activity, etc.). The obvious ideal is to cover all simultaneously, but that is impractical - and it is not clear how often all the channels are actually active. The community needs to clarify the relative usefulness of the channels, and of strategies for sampling combinations.

What are the realistic constraints on recording quality?

Naturalism tends to be at odds with ease of signal processing. Understanding of the relevant tradeoffs needs to be reached. That includes awareness of different applications (high quality may not be crucial for defining the expressive behaviours a virtual agent should show) and of timescales for solving particular signal processing issues (e.g. recovering features from images of heads in arbitrary poses).

How can the emotional content of episodes be described within a corpus?

Several broad approaches exist to transcribing the emotional content of an excerpt: using everyday emotion words; using dimensional descriptions rooted in psychological theory (intensity, evaluation, activation, power); or using concepts from appraisal theory (perceived goal-conduciveness of a development, potential for coping, etc.). These are being developed in specific ways, driven by goals such as elegance, inter-rater reliability, faithfulness to the subtlety of everyday emotion, relevance to agent decisions, etc. There seems to be a real prospect of achieving an agreed synthesis of the main schemes. (A toy annotation record combining the three approaches is sketched below.)

Which emotion-related features should a corpus describe, and how?

Corresponding to each emotion-related channel are one or more sets of signs relevant to conveying emotion. For instance, paralinguistic signs exist at the level of basic contours - F0, intensity, formant-related properties, and so on; at the level of linguistic features of prosody (such as tones and breaks in ToBI); and at more global levels (tune shapes, repetitions, etc.). Even for speech, inventories of relevant signs need to be developed, and for channels such as idle body movements, few descriptive systems have been proposed. Few teams have the expertise to annotate many types of sign competently, and so it is important to establish ways of allowing teams that do have the expertise to make their annotations available as part of a database. Mainly for lower-level features, automatic transcription methods exist, and their role needs to be clarified. In particular, tests of their reliability are needed, and that depends on data that can serve as a reference.

How should access to corpora be provided?

Practically, it is clearly important to find ways of establishing a sustainable and easily expandable multi-modal database for any sort of emotion-related data; to develop tools for easily importing and exporting data; to develop analysis tools and application programming interfaces to work on the stored data and meta-data; and to provide ready access to existing data from previous projects. Approaches to those goals need to be defined.
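As a purely illustrative sketch of how the three description styles above could coexist in a single record, the Python fragment below combines them for one annotated excerpt. Every field name, value range, and example value in it is an assumption made for illustration; it is not a proposed standard, nor a scheme taken from any of the workshop papers.

    # Illustrative only: field names and value ranges are assumptions,
    # not a proposed standard or an existing annotation scheme.
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class EmotionAnnotation:
        excerpt_id: str
        # Everyday emotion words, possibly several per excerpt
        labels: List[str] = field(default_factory=list)
        # Dimensional description (intensity, evaluation, activation, power),
        # here assumed to be normalised to the range -1.0 .. 1.0
        dimensions: Dict[str, float] = field(default_factory=dict)
        # Appraisal-theory concepts (goal conduciveness, coping potential, ...)
        appraisals: Dict[str, float] = field(default_factory=dict)
        annotator: str = "unknown"

    # One rater's description of a single (hypothetical) excerpt
    example = EmotionAnnotation(
        excerpt_id="clip_042",
        labels=["irritated"],
        dimensions={"evaluation": -0.4, "activation": 0.6},
        appraisals={"goal_conduciveness": -0.7},
        annotator="rater_1",
    )

Keeping several raters' records for the same excerpt side by side is what makes inter-rater reliability measurable, which is one of the goals named above.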

What level of standardisation is appropriate?

Standardisation is clearly desirable in the long term, but with so many basic issues unresolved, it is not clear where real consensus can be achieved and where it is better to encourage competition among different options.

How can quality be assessed?

It is clear that some existing corpora should not be used for serious research. The problem is to develop quality assurance procedures that can direct potential users toward those which can.

Ethical issues in database development and access

Corpora that show people behaving emotionally are very likely to raise ethical issues - not simply about signed release forms, but about the impact of appearing in a public forum talking (for instance) about topics that distress or excite them. Adequate guidelines need to be developed.

The number and quality of submissions were well above our expectations: out of 26 submitted papers, 10 were accepted for oral presentation and 12 for poster presentation. They enable the workshop to cover several dimensions of emotional corpora:

- Music and emotion
- Speech and emotion
- Multimodal behaviour
- Acted, simulated and real-life emotion
- Portrayed emotion
- Annotation schemes and languages of representation

We expect that the output of the workshop will contribute to the study of the practical, methodological and technical issues central to developing emotional corpora (such as the methodologies to be used for emotional database creation, the coding schemes to be defined, the technical settings to be used for the collection, and the selection of appropriate coders).

Looking forward to an exciting emotional workshop!

Laurence Devillers and Jean-Claude Martin, LIMSI-CNRS, France
devil@limsi.fr, martin@limsi.fr

Roddy Cowie and Ellen Douglas-Cowie, QUB, UK
r.cowie@qub.ac.uk, e.douglas-cowie@qub.ac.uk

Anton Batliner, University of Erlangen, Germany
batliner@informatik.uni-erlangen.de

A Language-Resources Approach to Emotion: Corpora for the Analysis of Expressive Speech

Nick Campbell
Acoustics & Speech Processing Department,
Spoken Language Communication Research Laboratory,
Advanced Telecommunications Research Institute International,
Keihanna Science City, Kyoto 619-0288, Japan.
nick@atr.jp

ABSTRACT

This paper presents a summary of some expressive speech data collected over a period of several years and suggests that its variation is not best described by the term "emotion"; further, that the term may be misleading when used as a descriptor for the creation of expressive speech corpora. The paper proposes that we might benefit from first considering what other dimensions of speech variation might be of more relevance for developing technologies related to the processing of normal everyday spoken interactions.

INTRODUCTION

Spoken language has been extensively studied through the use of corpora for several decades now, and the differences between the types of information that can be conveyed through written texts and those that are signalled through speech are beginning to be well understood.

The paralinguistic information, which is perhaps unique to speech communication, is largely carried through modulations of prosody, tone-of-voice, and speaking style, which enable the speakers to signal their feelings, intentions, and attitudes to the listener, in parallel with the linguistic content of the speech, in order to facilitate mutual understanding and to manage the dynamics of the discourse [1].

The different types of information that are signalled by different speaking styles are also well understood and are beginning to be modelled in speech technology applications. The more formal the speech, the more constrained the types of paralinguistic information that are conveyed.

As an example of one extreme, we might consider a public lecture, where the speaker is (sometimes literally) talking from a script, to a large number of listeners (or even to a recording device with no listeners physically present) and has minimal feedback from, or two-way interaction with, the audience. This type of 'spontaneous' speech is perhaps the most constrained, and most resembles text.

As an example of the other extreme, we might consider the mumblings of young lovers. Their conversation is largely phatic, and the words might carry little of linguistic content but are instead rich in feelings. For them, talk is almost a form of physical contact.

There are many steps along the continuum between these two hypothetical extremes of speaking-style variation. Perhaps they can be distinguished by the ratio of paralinguistic to linguistic content, i.e., the amount of 'personal' information that is included in the speech. The lecture, having almost no personal information and a very high amount of propositional content, will result in a very low value of this measure, while the phatic mutterings will score very high. (A toy version of this measure is sketched below.)

If we are to collect data that contains sufficient examples of natural spoken interactions along the whole range of this continuum of values, then low-scoring material will prove very easy to collect, but most lovers might object strongly to the suggestion of a recording device intruding into their privacy. Thus, by far the majority of speech corpora that have been used in previous research score very poorly on this scale, and as a result the speech that they contain is not very far removed from pure text in its style and content.

A CORPUS OF EXPRESSIVE SPEECH

We need more varied and representative corpora if we are to develop future speech technology that is capable of processing the more human aspects of interactive speech in addition to its propositional content. However, the difficulties of doing this are well known. Since Labov, the presence of an observer (human or device) has been known to have an effect on the speech and speaking style of the recorded subject, and unobtrusive recording is unethical, if not already illegal in most countries. Several approaches have been proposed to overcome this obstacle to future research. This section reports one of them, and discusses some of the conclusions that we reached on the basis of that experience. The JST/CREST Expressive Speech Corpus [2] was collected over a period of five years, by fitting a small number of volunteers with head-mounted high-quality microphones and small mini-disc walkman recorders to be worn while going about their ordinary daily social interactions.
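The paralinguistic-to-linguistic ratio suggested in the introduction is left informal in the paper. As a minimal sketch of one way to make it concrete - assuming, hypothetically, a transcript whose tokens have been hand-tagged as propositional ("lex") or personal/paralinguistic ("para") - it might be computed as follows:

    # Hypothetical sketch of the paralinguistic-to-linguistic ratio.
    # The tag set and token-counting approach are assumptions; the paper
    # does not define the measure formally.

    # Tokens from a transcribed interaction, with invented example tags:
    # "lex" = propositional content, "para" = backchannels, laughs, sighs, ...
    utterance = [
        ("honma", "para"), ("sou", "para"), ("kyou", "lex"),
        ("kaigi", "lex"), ("(laugh)", "para"),
    ]

    def para_ling_ratio(tokens):
        """Fraction of paralinguistic tokens: near 0 for a scripted
        lecture, high for phatic talk between intimates."""
        para = sum(1 for _, tag in tokens if tag == "para")
        return para / len(tokens) if tokens else 0.0

    print(para_ling_ratio(utterance))  # 0.6

On such a scale, the corpora the paper criticises as "not very far removed from pure text" would cluster near zero.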

Figure 1: A screenshot of the labelling spreadsheet for the word "honma". The columns include data described in more detail in Table 1. In this form of labelling, tokens are listened to in isolation, free of contextual influence, while in other forms of labelling they are annotated in time-aligned sequence, taking context into account. By clicking on a filename, the labeller can listen to each sample interactively.

Further groups of paid volunteers transcribed and annotated the speech data for a variety of characteristics, including speech-act, speaker-state, emotion, relationship to the interlocutor, etc. All the data were transcribed, and about 10% was further annotated. Figure 1 shows a sample of the annotation results, and Table 1 shows some of the categories that were used for annotation. These samples can be listened to at the project web-site, http://feast.atr.jp/non-verbal/. The material is in Japanese, but many of the findings hold for other languages as well. Japanese are people too, and many of the non-verbal speech sounds in this language can be equivalently understood by native speakers of other languages who have no experience of either the Japanese language or culture. A laugh is a laugh in any language. So is a sigh.

The data in figure 1 represent a few of the approximately 3,500 tokens of the Japanese word /honma/ from one speaker of the corpus. The word functions in much the same way as "really" does in English; both as a qualifying adjective (really hot!) and as an expressive exclamation (really?!). The word is typical of many that are repeated frequently throughout the corpus, and that are used by the speaker more for their discourse effect than for their linguistic or propositional content. No two pronunciations of this word are the same, and each carries subtle affective and interpersonal information that signals many kinds of different states and relationships, as will be described in more detail below.

These words proved most difficult for the labellers to adequately categorise. They function primarily as backchannel utterances, but also serve to display a wide range of attitudinal and affective states. We have reported elsewhere [3] studies that measure the extent to which their function can be similarly perceived by different groups of listeners belonging to different cultural backgrounds and languages. In terms of quantity, more than half of the utterances in the corpus were of this type; short words or simple syllables that occurred alone or were repeated several times in succession, often not appearing at all in a dictionary of the formal language, but forming essential components of a two-way spoken interaction.

ANNOTATING THE CORPUS FOR EMOTION

It is clear that these types of expression carry emotional information. They are very expressive, and revealing of the speaker's type(s) and degree(s) of arousal. We therefore attempted to label emotion in the corpus data. A version of the Feeltrace software was implemented (square, rather than round!) and each utterance was assigned a value within the valency/arousal space thus defined. The labellers understood the meaning and validity of these two dimensions, and felt easy about working with the mouse-based software for data entry, but most complained about the work after a short time. They claimed that the framework simply wasn't appropriate for describing the different types of variation that they perceived in the speech. They proposed instead the descriptive categories shown in Table 1.

While the speaker was clearly in a given state of emotional arousal during each utterance, the correspondence between what the labellers could determine about the speaker state, from various contextual and expressive clues, and how the speaker's utterance was performing in terms of her stance within the discourse, was often very small.

When labelling five years' worth of someone's speech, you become very familiar with that person's mannerisms and even those of their circle of acquaintances. For example, it might be clear from various such clues that the speaker is angry on a given day. Yet the presence or absence of anger in a person may have little or no relationship to the presence or absence of anger in the expression of a given speech utterance. How is this to be labelled in the simple valence/arousal framework? Specifically, let's examine three such cases:

(i) A schoolteacher walks into the classroom and the children continue to be noisy. The teacher gets angry with the children.

(ii) The same teacher has been wrongly accused of malpractice during the lunchbreak and continues to teach in the afternoon. She explai…

…an angry person who is being angry. The effect on the children is immediate. They are afraid.

The three types of speech illustrated above all contain anger, but they differ in whether it is felt or expressed. We could further differentiate by degree of anger, or degree of expression, or both, and with respect to degree of expression, also determine whether "something inside is being let out" or whether the voice is being made to sound as though it is, when in fact inside the feelings may be neutral (whatever that expression might mean).
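The classroom cases turn on a distinction that a single valence/arousal point cannot represent: whether the anger is felt, expressed, or both. The sketch below keeps the two apart; its field names and scales are illustrative assumptions, not the descriptive categories of the paper's Table 1 (which this excerpt does not reproduce).

    # Sketch of a label separating felt state from expressed behaviour,
    # as the classroom cases require. Field names and scales are assumed
    # for illustration; they are not the categories of Table 1.
    from dataclasses import dataclass

    @dataclass
    class SpeakerStateLabel:
        utterance_id: str
        felt_emotion: str       # what contextual clues suggest the speaker feels
        felt_degree: float      # 0.0 (none) to 1.0 (extreme)
        expressed_emotion: str  # what the voice is made to convey
        expressed_degree: float

    # Case (i): anger that is both felt and expressed
    case_i = SpeakerStateLabel("utt_001", "anger", 0.7, "anger", 0.7)
    # The converse described above: a voice made to sound angry
    # "when in fact inside the feelings may be neutral"
    acted = SpeakerStateLabel("utt_002", "neutral", 0.0, "anger", 0.8)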
