Robust Sociolinguistic Methodology - Ldc.upenn.edu

Robust Sociolinguistic Methodology: Tools, Data and Best Practices
Christopher Cieri, Stephanie Strassel
{ccieri, strassel}@ldc.upenn.edu
University of Pennsylvania
Linguistic Data Consortium and Department of Linguistics
3600 Market Street, Philadelphia, PA 19104 U.S.A.
www.ldc.upenn.edu
NWAVE 32, University of Pennsylvania, Philadelphia, October 2003

Background

Sponsors
National Science Foundation
– TalkBank (www.talkbank.org): an interdisciplinary research project funded by a 5-year grant (BCS-998009, KDI, SBE) to Carnegie Mellon University and the University of Pennsylvania.
– The TalkBank coordinators are Brian MacWhinney (CMU) and Christopher Cieri (Penn). Co-P.I.s are Mark Liberman (Penn) and Howard Wactlar (CMU). Steven Bird (Melbourne) consults.
– Foster fundamental research in the study of human and animal communication. TalkBank will provide standards and tools for creating, searching, and publishing primary materials via networked computers.
– 15 disciplinary groups were identified in the TalkBank proposal; six have received focused efforts: Animal Communication, Classroom Discourse, Conversation Analysis, Linguistic Exploration, Gesture, Text and Discourse and Technical Development. In 2002, Sociolinguistics was added as the seventh area on the strength of the DASL project.

Sponsors
Linguistic Data Consortium
– a not-for-profit activity of the University of Pennsylvania
– serving researchers, educators and technology developers in language-related fields
– by creating and collecting, archiving, distributing
– language resources, including data, tools, standards and best practices
Data Distribution
– organizations join per year, receiving ongoing rights to all data released that year
– data from funded projects at LDC or elsewhere, community or LDC initiatives
– broad data distribution across research communities
– funding agencies avoid distribution costs
– users receive vast amounts of data while avoiding enormous development costs
Data Collection, Annotation, Research Projects
– support NSF, DARPA programs
– other government and commercial technology development programs
– all results distributed through LDC

Who/What is LDC
In operation 11 years, 36 FT staff
248 corpora, 2/month
15,000 copies to 468 members
1197 organizations in 57 countries
[Chart: distribution by region (N/S America, Europe, Asia, ME/Africa, Aus/NZ)]

Investigate best practices in use of digital data and tools to support empirical linguistic inquiry and documentation. Now a TalkBank activity.
Vision for empirical, quantitative research that is
– robust: tackles new challenge conditions
– accountable: documents relationship between method and result
– repeatable: shares data, tools, methods to allow comparison
– collaborative: encourages researchers to build upon each other's work
Analysis of -t/d deletion in the published TIMIT (isbn:1-58563-019-5) and Switchboard (isbn:1-58563-121-3) corpora
Web-based annotation tool
SLX Corpus of Classic Sociolinguistic Interviews conducted by William Labov and his students
SLX Corpus toolkit
This workshop

Definitions
Corpus: a body of records of linguistic behavior collected and annotated for a specific purpose
– audio and video recordings of speech and gesture
– written text
– collected under naturalistic or experimental conditions
Annotation is any process of adding value to a corpus
– through the application of human judgment or
– (semi)automatic processing based upon human judgment or previous annotation
Segmentation and Transcription are special kinds of annotation
– segmentation defines the scope and granularity of future annotations
– transcription encodes subtle human judgments about what was said, who said it and what was intended
Coding of sociolinguistic variables is annotation

Evolution?
1963
Interviews are recorded but not always transcribed; when transcribed, transcripts are often only partial.
Analytical tools are not integrated.
The presentation is an independent artifact.
2003
After 40 years of technological advance, our use of data is largely unchanged; only the components differ.

So What?
Suboptimal methodologies lose information
– miss tokens, give an unbalanced view of corpus
– code information redundantly
– lose sequence and time of utterances, events
– ignore the style profile of an interview
Optimal methodology
– simplifies work so that researchers can address current topics more completely and with balance and can approach new topics
– improves consistency
– retains time and sequence information
– retains mapping between sound, transcript, selected tokens, their coding, the analysis and examples in publication
– encourages re-use of data
» each additional pass requires less effort than original

Vision
2003-

Case Study

The Study
Is the phonological variation observed better modeled as a small number of varieties with inherent variation or a larger number of invariant varieties?
Vowel system of a Regional Italian influenced by Standard Italian and two local dialects
Data
– 80 subjects stratified for age, gender, socioeconomic background
– Interviewers both native and non-native
– Subjects typically interviewed in pairs
– Multiple conversational situations (styles)
– Style as a function of time in the interview
– Objective and subjective analyses:
» vowel system, intervocalic /v/, "c" before high vowels
Need Tools, Formats
– Collect and Annotate data
– Manage layers of analysis
– Summarize and Present results

Before
Listen to tape for interesting tokens
Digitize individual tokens
Code tokens (using software where appropriate)
Mark tokens on score sheet
Reformat data for statistical analysis
Problems
– slow, labor intensive
– high risk of missed tokens
– tokens typically unbalanced, representation of styles poor
– time measured poorly
– effort for reanalysis nearly equal to effort for original
– only limited opportunities for re-use

After
Digitize entire interview & check audio quality.
Transcribe, segment & check format.
Query system for items of possible interest.
Where appropriate, preprocess for segmental analysis.
Label and analyze segments of interest.
Summarize.
Advantages
– fewer misses
– balanced coverage
– time measured accurately
– re-use & reanalysis profits from previous preparation

Digitize
Recorded on audio cassette using Sony Walkman Pro stereo recorder and two lavalier microphones.
– each subject on separate mike, interviewer typically off-mike
Digitized as two-channel, 16-bit, 32 kHz files via Sony DAT recorder; down-sampled to 16 kHz and transferred to computer via a Townshend DAT Link; saved in Entropic .sd format
– .wav and .sph formats also possible
Demultiplex, check signal levels & remove empty or clipped channels
Confirm recording length, trim beginning & ending silence

Segment
Time align transcript to audio file
– allows transcript to serve as index into audio
– focuses attention on units smaller than interview
One long file instead of many small files
– preserves integrity of original event, allows later resegmentation
– preserves time
Levels
– Initial Segmentation
» at each speaker turn
» within long turns at 8 seconds
» segmented into breath groups where convenient
– Further segmentation refines domain of analysis
» word level, phonetic segment level (for vowels)
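The initial segmentation rule for long turns can be sketched in a few lines. This is only an illustration under assumptions: the function name `split_turn` is hypothetical, and it cuts mechanically at 8-second intervals, whereas the slide notes that cuts were placed at breath groups where convenient.

```python
def split_turn(start, stop, max_len=8.0):
    """Split one speaker turn into consecutive segments of at most max_len seconds.

    Assumption: cuts fall at fixed intervals; a human segmenter would instead
    move each cut to the nearest breath group.
    """
    segments = []
    t = start
    while stop - t > max_len:
        segments.append((t, t + max_len))
        t += max_len
    segments.append((t, stop))  # final (possibly shorter) segment
    return segments


# A hypothetical 20.5-second turn starting at 100 s into the interview.
turn_segments = split_turn(100.0, 120.5)
```

Because every segment keeps its absolute start and stop time, the interview remains one long file and can be resegmented later.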

Transcribe
To transcribe or not?
– fewer misses
– balanced coverage
– re-use & reanalysis
Automatic or manual transcription?
Segmentation before Transcription
Orthographic transcription with interesting items & features transcribed phonetically
Who does 1st and 2nd pass?

Tools
Strans
– Emacs with menus modified and macros added to support transcription, talking to Xwaves through "send_xwaves"
Segment Helper
– Emacs running in server mode
– Client writes all commands to stdout, where Emacs either acts on them immediately or passes them on to Xwaves.
[Diagram: Segment Helper, Emacs, Xwaves]
– Segment Helper & all utilities hereafter written in Perl/Tk: free, available on Unix and NT, merges the Tk GUI capacity with Perl's flexibility and flow control.
– Now Transcriber does it all!

Strans
Next Segment shifts display so that 10% of last segment shows.
CreateSegment polls Xwaves for left, right cursor positions and writes those as time stamps with channel marker in text.
FindSegment finds position in waveform of segment defined in text.
Monaural recording with subject on single mike; interviewer off mike.
Segment defined by start & stop times plus channel marker and written by software based on cursor positions.
Speaker ID written by human and later normalized.
Situation code written semiautomatically and checked by human.
Interesting feature transcribed phonetically.

Transcription
Features
– Editing signal: -
– Non-lexemes: %m (English & Italian spelled differently)
– Truncation: n- non
– Non-Standard pronunciation: usciti [usci'i]
– Code switching: English Where are you from?
– Overlap/Back-channel: (CCXX: %mhm)
» favor subject over interviewer, turn-holder over others
ASR Transcription experiment
– native speaker trained Dragon Naturally Speaking Italian
– listened to tapes via foot-pedal controlled device
– repeated each utterance to Naturally Speaking & corrected its mistakes

              ASR       Manual
Experiment 1  13.1xRT   13.4xRT
Experiment 2  11xRT     7.8xRT

Quality Checking
After Segmentation and Transcription, files are checked by a second transcriptionist for
– bad segmentation
» too much silence in segment
» segment boundary too close to signal
» signal not contained within segment
– inaccurate transcription
– inaccurate situation code
– misspellings
– inaccurate phonetic transcription within [ ]
Format
– 628.67 633.94 X: MC01: 2: e m- -- a mezzanotte siamo rientrati %e -- in albergo
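As an illustration of how a line in the Format shown on this slide might be read by software, here is a minimal sketch. The field order (start time, stop time, channel, speaker ID, situation code, text) is inferred from the single example line; the names `Segment` and `parse_segment` are hypothetical and not part of the original toolkit.

```python
import re
from typing import NamedTuple


class Segment(NamedTuple):
    start: float
    stop: float
    channel: str
    speaker: str
    situation: int
    text: str


# Assumed layout: "<start> <stop> <channel>: <speaker>: <situation>: <text>"
LINE_RE = re.compile(
    r"^(?P<start>\d+(?:\.\d+)?)\s+(?P<stop>\d+(?:\.\d+)?)\s+"
    r"(?P<channel>\w+):\s+(?P<speaker>\w+):\s+(?P<situation>\d+):\s+(?P<text>.*)$"
)


def parse_segment(line: str) -> Segment:
    """Parse one transcript line; raise ValueError if it does not fit the layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a segment line: {line!r}")
    return Segment(
        float(m["start"]), float(m["stop"]), m["channel"],
        m["speaker"], int(m["situation"]), m["text"],
    )


seg = parse_segment(
    "628.67 633.94 X: MC01: 2: e m- -- a mezzanotte siamo rientrati %e -- in albergo"
)
```

Rejecting lines that do not match, rather than guessing, keeps formatting errors visible to the second transcriptionist.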

Syntax Check
After last human QC pass, use automatic process to flag
– segments that are too long
– time stamps out of order or internally inconsistent
– impossible channel marker, speaker ID or situation code
QC catches human formatting errors.
System controls all subsequent processing, avoiding most kinds of human error.
Format
– uttnum 77 speaker MC01 situation 2 channel X ustart 628.67 ustop 633.94 utterance e m- -- a mezzanotte siamo rientrati %e -- in albergo
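The automatic checks listed on this slide could look roughly like the sketch below. Everything concrete in it is an assumption made for illustration: the 10-second length limit, the channel markers "X" and "Y", the situation codes 1 to 9, and the function name `check_segments` are not documented in the source.

```python
def check_segments(segments, max_len=10.0, channels=("X", "Y"),
                   speakers=None, situations=range(1, 10)):
    """Return (segment number, problem) pairs for a list of
    (start, stop, channel, speaker, situation) tuples.

    All thresholds and code inventories are illustrative assumptions.
    """
    problems = []
    previous_start = None
    for number, (start, stop, channel, speaker, situation) in enumerate(segments, start=1):
        if stop <= start:
            problems.append((number, "stop time not after start time"))
        elif stop - start > max_len:
            problems.append((number, "segment too long"))
        if previous_start is not None and start < previous_start:
            problems.append((number, "time stamps out of order"))
        if channel not in channels:
            problems.append((number, "impossible channel marker"))
        if speakers is not None and speaker not in speakers:
            problems.append((number, "unknown speaker ID"))
        if situation not in situations:
            problems.append((number, "impossible situation code"))
        previous_start = start
    return problems


segments = [
    (628.67, 633.94, "X", "MC01", 2),   # well formed
    (633.94, 632.00, "X", "MC01", 2),   # stop before start
    (640.00, 655.00, "Q", "MC99", 12),  # too long, bad channel, speaker, situation
]
problems = check_segments(segments, speakers={"MC01"})
```

The check only reports problems; correcting them is left to a human, in keeping with the workflow described above.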

Token Selection
Software looks up each word in pronouncing lexicon to enable phonetic query, categorization.
Software searches reformatted transcript, identifies and numbers any words matching query. Each hit word is presented to user in context as text and audio.
Software guesses location of word in utterance based on simple assumption that all syllables are of roughly equal length; does surprisingly well.
Linguist adjusts word boundaries in waveform display, zooms and iterates until satisfied.
Format
– hitnum 276 pattern e/R] word albergo wstart 632.934813 wstop 633.778312 uttnum 77 speaker MC01 situation 2 channel X ustart 628.67 ustop 633.94 utterance e m -- a mezza notte siamo rientrati %e -- in albergo comments ""
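The equal-syllable-length guess can be sketched as follows. The original software used a pronouncing lexicon; this sketch substitutes a much cruder stand-in, counting one syllable per vowel group, and the names `syllable_count` and `guess_word_span` are hypothetical.

```python
import re

VOWEL_GROUP = re.compile(r"[aeiouàèéìòù]+", re.IGNORECASE)
HAS_LETTER = re.compile(r"[^\W\d_]")


def syllable_count(token):
    """Crude estimate (an assumption, not the lexicon lookup): one syllable per
    vowel group, at least one for any token with letters, zero for pure markup."""
    if not HAS_LETTER.search(token):
        return 0
    return max(1, len(VOWEL_GROUP.findall(token)))


def guess_word_span(ustart, ustop, tokens, index):
    """Guess start and stop of tokens[index], assuming all syllables in the
    utterance have the same duration."""
    counts = [syllable_count(t) for t in tokens]
    total = sum(counts)
    if total == 0:
        raise ValueError("utterance has no syllables")
    per_syllable = (ustop - ustart) / total
    wstart = ustart + per_syllable * sum(counts[:index])
    wstop = wstart + per_syllable * counts[index]
    return wstart, wstop


tokens = "e m- -- a mezzanotte siamo rientrati %e -- in albergo".split()
wstart, wstop = guess_word_span(628.67, 633.94, tokens, tokens.index("albergo"))
```

For the example utterance the guess for "albergo" falls within a few tenths of a second of the hand-corrected wstart and wstop in the record above, which is the starting point the linguist then adjusts in the waveform display.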

FindWords
GetSignal locates and plays utterance, guesses word position and sets cursors.
SegmentWord writes segmentation to new file and marks hit as done.
Retaining times allows user to balance samples over corpus.
Lexical item matching search. May be more than one per utterance.
Abstract label for search pattern.
Unique hit number.

Analysis
Automatically create analytic files for each token
Accepts word start and end times from previous step
Finds corresponding audio
Creates
– Wide band spectrogram
– Narrow band spectrogram
– Maximum entropy (LPC) spectrogram
– Formant tracks
– F0 analysis
Saves all files for later use by human annotator.

Label Formants
Time-aligned displays of waveform, F0 and spectrograms.
Software guesses position of segment within word. User adjusts segmentation and saves to file.
Software estimates formant values automatically. User selects or corrects.
All sound files, spectrograms, and F0 files processed ahead of time in batch and saved for later redisplay.

Format
speaker MC01
situation 8
channel X
hitnum 1267
word gabbia
utterance gabbia
uttnum 376
pattern a/BB
comments ""
ustart 2610.71       ustop 2611.54
wstart 2610.710000   wstop 2611.533687
sstart 2610.740000   sstop 2610.908000
mstart 2610.823500   mstop 2610.848500
F1 891.1739
F2 1706.9408
F3 2337.6178

Annotations
[Diagram: annotation layers over the timeline. Utterances U1 to U7 (U4: una donna bella); hit H1: bella; segment S1: E; analysis F1, F2, F3]

Relations
Subject: Speaker, Age, Sex, Ed Level, Profession, Region, Location
Utterance: Utterance #, U Start Time, U Stop Time, Channel, Speaker, Situation
Hit: Hit #, Pattern, Utterance #, Word, W Start Time, W Stop Time, Actual Pron
Lexicon: Word, Expected Pron, Stressed Vowel, Preceding Env, Following Env.
Segment: Hit #, Segment, S Start Time, S Stop Time
Analysis: Hit #, F1, F2, F3
Software flattens relations and exports to analytical software; R in this case.
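Flattening these relations amounts to joining them on their shared keys (Speaker, Utterance #, Hit #) and writing one row per hit. The sketch below is illustrative only: the function names are hypothetical, the subject's age and sex are invented placeholder values, and only a few columns of each relation are shown.

```python
import csv
import io

# Placeholder subject attributes (invented for illustration).
subjects = {"MC01": {"age": 40, "sex": "M"}}
utterances = {376: {"speaker": "MC01", "situation": 8, "channel": "X",
                    "ustart": 2610.71, "ustop": 2611.54}}
hits = [{"hitnum": 1267, "uttnum": 376, "word": "gabbia", "pattern": "a/BB",
         "wstart": 2610.71, "wstop": 2611.533687}]
analyses = {1267: {"F1": 891.1739, "F2": 1706.9408, "F3": 2337.6178}}


def flatten(hits, utterances, subjects, analyses):
    """Join the relations into one flat row per hit."""
    rows = []
    for hit in hits:
        utt = utterances[hit["uttnum"]]
        row = {**subjects[utt["speaker"]], **utt, **hit,
               **analyses.get(hit["hitnum"], {})}
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialize rows as CSV text with a header line, e.g. for R's read.csv."""
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


rows = flatten(hits, utterances, subjects, analyses)
table = to_csv(rows)
```

Keeping the relations separate until export means a corrected speaker attribute or time stamp is fixed once and propagates to every hit on the next export.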

Best Practices for Digital Methodology: Collection

Coding Experiment
Speakers utter phonetically rich sentences under a variety of circumstances.
1 2 3
Is "dark" r-ful?
Is fricative in "greasy" voiced?
Is there intrusive-r in "wash"?
What's the vowel in "water"?
How confident are you?

Recording
Commonly used: small portable recorder and lavalier microphone
– High quality is possible
– Cost is generally low
– Unobtrusive
– Highly portable
Obtrusiveness and quality are variables that can be managed.
Data collected under other conditions may be natural and valuable.
– Examples from CALLHOME, Switchboard, ROAR

Recording Experiment
Two subjects in sociolinguistic interviews with semantic differentials, phonetically rich sentences, word list.
Microphones and recording devices (microphone entries for rows 9 to 14 are truncated or missing in the source)

 #  Microphone                                  Recorder              Noise
 1  PZM on Subject's Chair                      Studio System         Low Frequency Hum
 2  Wireless, Cardioid Lavalier on Interviewer  Studio System         Nearly Inaudible
 3  Hypercardioid, Head Mounted                 Studio System         Very Little Noise
 4  Lavalier                                    Studio System         Very Little Noise
 5  Cardioid Lavalier                           Studio System         Very Little Noise
 6  Dynamic Studio on Stand                     Studio System         Faint Hiss
 7  Studio on Stand                             Studio System         Low Frequency Hum
 8  Shotgun (Hypercardioid) on Boom             Studio System         High Frequency Noise
 9  Built-in on ...                             Panasonic RQ-A70      Low Signal, High Noise
10  ...                                         Sony Walkman Pro      Low Frequency Hum
11  ...                                         Sony TCM5000EV        Faint Low Frequency Hum
12  ...                                         Sony Walkman DAT      Faint Low Frequency Hum
13  ...                                         Sony M2-R50 Minidisk  Low Signal, No Hum
14  ...                                         Computer              Hiss

Observations
Variables
– Really poor choices can affect coding of even highly salient variables.
Distance from mouth to microphone
– Low frequency is affected by even small differences.
– Room noise becomes more obvious with greater distances.
Unobtrusive collections
– Very unobtrusive microphones can still produce very useful recordings.
Motor Hum
– Recorders with motors
– But compare minidisk and TCM5000EV
Interference
– Recording from laptop's sound board.

Recording Quality
Two very poor choices and one good

Recording Quality
Lavalier microphone and minidisk
Lavalier microphone and computer sound board

Recording Quality
PZM Lavalier and Walkman DAT

Best Practices for Digital Methodology: Published Data

Using Published Data
Linguistic Corpus: a body of records of linguistic behavior collected and annotated for a specific purpose
Why should a sociolinguist want to use someone else's data?
– Exploratory study before doing individual data collection
– Broaden scope
– Locate 'rare' constructions
– Supplement individual data collection
– Lots more data, possibly greater range of data
– Low- or no-cost access to data
– Often highly searchable: get lots done quickly
– New perspective

Published Data
LDC: http://ldc.upenn.edu/Catalog
Free text search in catalog number, corpus name, author, corpus description, and/or select one or more search terms in language, membership year, corpus type, data source, sponsoring project or recommended application menus

Published Data
ELRA: http://www.elra.info/
Select: "Fast track to ELRA's Catalogue"
Search for words anywhere in catalog entry

Published Data
OLAC: http://www.language-archives.org/
Union catalog of 28 other providers of linguistic resources
Free text search in title, contributor and corpus description, and/or select one or more search terms in archive, language, corpus type menus

Role of Fieldwork
Original fieldwork will always be necessary, providing
– In-depth knowledge of the speech community
– New communities and language varieties
– Valuable researcher training and experience
– New methodological perspectives
– Potential new contributions of data to public archive
Corpus-based approaches can complement fieldwork
– Permits comparison of results across studies and over time
– Provides a stable benchmark for competing theories
– Allows re-annotation and reuse of existing data
– Supports measurement of inter-annotator consistency
– Reduces impediments facing new researchers
– Allows established scholars to tackle broader issues
– Demonstrates best practice in corpus creation
– Serves as a teaching tool
– Allows for multi-site collaboration

Using Public Data
(De)Compressing Audio
– Tony Robinson's Shorten
– Lossless (2:1) and lossy (3-5:1) modes
– Windows: http://www.softsound.com/Shorten.html
– Macintosh and Linux: http://www.hornig.net/shorten/
Converting from NIST Sphere audio to .wav, .aiff, .au
– Dave Graff's sph_convert
– Win32: ftp://ftp.ldc.upenn.edu/pub/ldc/misc_sw/sph_convert_v2_1.zip
– Mac: ftp://ftp.ldc.upenn.edu/pub/ldc/misc_sw/sph_convert_v2_0.sit
Other Conversions
– Chris Bagwell's SoX
– http://sox.sourceforge.net/
– Does audio type, sample rate and byte order conversions
Viewing text
– Internet Explorer 5 and later handle Unicode (http://www.microsoft.com/)
– Gaspar Sinai's Yudit (http://www.yudit.org/)
Cite the corpus as you would any publication
– But who is the author?

Best Practices for Digital Methodology: Code of Ethics

Code of Ethics
Assure that data users respect rights of participants, contributors
Participants sign Informed Consent release approved by local IRB
Data collected before IRB system, from non-funded work, or from speakers of indigenous, endangered languages may be exempted. Such data is still subject to the same ethical concerns.
Respect for Participants, who make an important, generous contribution to scientific research by permitting scholars to access and analyze their linguistic behavior
– avoid open public criticism of these individuals
– avoid comparisons in terms of intelligence, verbal facility, social skills, or physical appearance
Confidentiality by avoiding any identifying information apart from video and audio records and demographic information
On discovering personal acquaintance with a participant, either
– refrain from using the data, or
– acquire explicit permission from participant
This requirement does not extend to use of depersonalized data or to uses in which participants' identity is not examined.

Code of Ethics
Respect for Groups, who may be justifiably sensitive to criticism from the wider society.
– avoid making between-group comparisons that impact core features of social identity and worth.
Seek professional review in cases where data publication may compromise the principles of respect for participants or groups.
Share Data so that others can benefit as you have.
Sanctions: It is the responsibility of the entire community to counter misuse in public forums and through personal contact.
For more information, see: http://www.talkbank.org/share/ethics.html

Annotation: Adding value to the data

Audio Segmentation
Divides the corpus into manageable units
– To indicate structural boundaries in audio file
– To make subsequent transcription easier
– To provide time-alignment for transcripts and other annotations
Preserve integrity of original signal
– Virtual, not actual, chopping of digital signal
Segmentation for a specific purpose
– Speaker turn level, utterance level, breath/pause group
– Word level
– Phone level
– Finer-grained segmentation best handled as additional, specialized pass over data
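"Virtual, not actual, chopping" means a segment is just a pair of time stamps, and the audio is read from the intact file on demand. A minimal sketch with Python's standard `wave` module, assuming uncompressed WAV input; the function name `read_region` is hypothetical and the test signal is synthetic silence.

```python
import io
import wave


def read_region(wav_file, start, stop):
    """Return the raw frames between start and stop (seconds).

    The source file is only read, never cut or rewritten.
    """
    with wave.open(wav_file, "rb") as w:
        rate = w.getframerate()
        first = int(round(start * rate))
        count = int(round(stop * rate)) - first
        w.setpos(first)
        return w.readframes(count)


# Build a 2-second, 16 kHz, 16-bit mono file of silence in memory for the demo.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 32000)
buf.seek(0)

region = read_region(buf, 0.5, 1.25)  # 0.75 s of audio
```

Because the boundaries live in the annotation rather than in the audio, the same file can be resegmented at word or phone level in a later pass.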

Audio Segmentation
Requirements for any segmentation specification
– Specify level of granularity
– Treatment of multiple speakers on one channel
– Overlapping speech
– Pauses
Additional features
– Background or other non-speaker noise
– Speaker ID, speaker changes
– Fidelity
Cost
– Turn-level segmentation can proceed at close to 1 x Real Time
– Utterance, pause, breath group segments at 5 x Real Time
– Word, phone level segmentation
» Requires initial segmentation at broader granularity
» Much more difficult (and therefore costly)
» Imparts additional level of analysis and requires specialists
– Manual verification of automatic process can save time

Transcription
Why a full transcription?
– Index to speech
– Searchable
– Provides stable basis for subsequent annotations
Requirements for any transcription specification
– Conventions for capitalization, punctuation, spelling
– Description of any special markup
– Treatment of variation
» Distinguish production error from non-standard usage
» Use standard orthography with markup (need to find all occurrences of same word)
– Disfluencies
» Filled pauses, repetitions, restarts, etc.
– Overlapping speech on same channel
– Non-lexemes, interjections and other speaker noise
– Sections of transcriber uncertainty

Transcription Types
Quick Orthographic Transcription
– Speed over accuracy; close to verbatim; limited markup
– Adequate for some purposes; 5 x Real Time
Verbatim Orthographic Transcription
– Word-for-word accurate
– Limited additional markup
– Hesitations, disfluencies, overlaps not carefully handled
– Requires 2 passes minimum; 35 x Real Time per channel
Careful Orthographic Transcription
– Verbatim, plus
– Special treatment for range of features
» E.g., proper names, disfluencies, non-standard variants
» Background noise conditions, speaker ID, careful treatment of difficult sections
– Requires multiple passes; 50 x Real Time per channel
Phonetic Transcription
– Based on careful orthographic transcription
– Automatic transcription with human verification/correction
– Inter-annotator agreement rates at 70-90%
– Cost much higher (estimates?)
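The real-time factors quoted on this slide translate directly into a labor estimate. A small sketch, assuming the factors scale linearly with audio length and that the quick-transcription factor also applies per channel (the slide states "per channel" only for the verbatim and careful types); the names are hypothetical.

```python
# Real-time factors from the slide: hours of work per hour of audio.
REAL_TIME_FACTOR = {"quick": 5, "verbatim": 35, "careful": 50}


def transcription_hours(audio_minutes, kind, channels=1):
    """Estimated transcriber hours for a recording of the given length."""
    return audio_minutes * REAL_TIME_FACTOR[kind] * channels / 60


# A one-hour, two-channel interview transcribed verbatim.
estimate = transcription_hours(60, "verbatim", channels=2)
```

Under these assumptions the one-hour, two-channel interview comes to 70 hours of verbatim transcription, against 10 for a quick pass, which is why the choice of transcription type should follow from the research question.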

Token Selection
What parameters drive token selection?
– phonological, morphological, syntactic
– balance across extra-linguistic features
– Are there hidden parameters?
» Convenience
» Time
» Fatigue
Incomplete coverage, lack of balance affects the study itself
Variation across studies affects the ability to compare results
Pronouncing dictionaries can mediate token selection
What do we know about time as independent variable?

Time as Variable
[Plot: conversational situation (style), 0 to 9, against time in the interview, 0 to 3000]
Time is on the horizontal axis.
Conversational situation (style) is on the vertical.
Larger numbers mean greater formality.
– 4 and above are elicited styles
– 3 is the default interview situation
– 2 is for narratives and extended descriptions
– 1 is for speech to another party
The longer interview clearly provides greater opportunities to study style.

Coding
Coding Specification
– Difficulty of achieving fully explicit guidelines
– Coding of independent variables also a source of error
– E.g., DASL t/d deletion study
» Published studies vary in terms of detail in guidelines
» Complex factor groups, e.g. Morphology
» Passives, e.g. 'I was frightened'
» But also seemingly simple factor groups
What to do with nasal flaps? Glottalized segments?
How to measure pause?

Annotator Consistency
Measure of success for coding specification
– Can coding be re-applied by independent annotator with high agreement?
Determining inter-annotator agreement and consistency
– For both dependent and independent variables
– Raw percentages aren't enough: some agreement just due to chance
– More robust measures, e.g. Kappa scores
Why bother?
– Reveals ambiguities and unstated assumptions in spec
– Necessary for comparison of results across studies and over time
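To make the point about chance agreement concrete, here is a minimal sketch of Cohen's kappa for two annotators, one common form of the Kappa scores mentioned above. The codings are invented for illustration ("del" for deleted, "ret" for retained), and the function name is hypothetical.

```python
from collections import Counter


def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two non-empty codings of equal length")
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # Agreement expected by chance, from each coder's category frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented codings of ten -t/d tokens by two annotators.
coder1 = ["del", "del", "ret", "ret", "del", "ret", "ret", "ret", "del", "ret"]
coder2 = ["del", "ret", "ret", "ret", "del", "ret", "del", "ret", "del", "ret"]
kappa = cohen_kappa(coder1, coder2)
```

Here the two annotators agree on 8 of 10 tokens (80%), but with both using "ret" 60% of the time, 52% agreement is expected by chance, so kappa is only about 0.58.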

Annotation Tools Overview

Inventory
http://www.ldc.upenn.edu/annotation/

Transcriber
User-friendly GUI for segmentation, transcription and transcript labeling
Open-source; handles variety of audio, text formats; multi-platform
Limitations
– Requires full segmentation of audio
– Customized for single-channel broadcast news recordings
– Inelegant handling of overlapping speech

AGTK
Annotation Graph Toolkit: agtk.sourceforge.net
Suite of tools for various types of annotation
Developed by LDC
Open-source
Handles variety of audio, text formats
Multi-platform
SLX Corpus Tools utilize AGTK
– MultiTrans for transcription
– DASLTrans (version of TableTrans) for coding

MultiTrans
Transcription tool for transcribing multiparty conversations
Similar to Transcriber but MultiTrans has one transcription panel for each channel in the signal

TableTrans
Spreadsheet-style linguistic annotation tool
User-defined features (column headings)
Spreadsheet, audio are time-aligned
Each row corresponds to region of audio signal
Import existing annotation files in XML, table (csv) and LDC format
Export annotation files in table format for further analysis

Data Formats
Tools read most standard audio formats (via Snack library)
Transcriber
– Default format is .trs
– Accepts .typ format
– Default segment boundary format
» <Sync time="48.428"/>
MultiTrans
– Default is LDC-style format (.lcf)
– Segment boundary format
» 213.33 234.15 A:
TableTrans/DASLTrans
– Accepts MultiTrans .lcf files as input
» Start Time, End Time, Channel/Speaker, Transcription as first four columns
– Accepts table format as input
» Tab or comma delimited spreadsheet
» Exclude column headers
– Accepts ag-xml input (.aif)
» Native AGTK format
– Outputs table or ag-xml format
» Can import table to Excel or stats packages
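The relationship between the MultiTrans segment format and the table format can be illustrated with a short conversion sketch. It assumes each segment occupies one line of the form "start stop channel: text", as in the boundary example on this slide; the rest of the .lcf layout is not documented here, and the function name and sample sentences are hypothetical.

```python
import re

# Assumed line layout: "<start> <stop> <channel/speaker>: <transcription>"
SEGMENT = re.compile(r"^(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s+(\S+?):\s*(.*)$")


def lcf_to_table(lines):
    """Convert segment lines to tab-delimited rows with the four leading columns
    Start Time, End Time, Channel/Speaker, Transcription, and no header row."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between segments
        m = SEGMENT.match(line)
        if m is None:
            raise ValueError(f"not a segment line: {line!r}")
        rows.append("\t".join(m.groups()))
    return "\n".join(rows)


table_text = lcf_to_table([
    "213.33 234.15 A: buongiorno",
    "",
    "234.15 236.40 B: buongiorno a lei",
])
```

The output omits column headers, matching the "Exclude column headers" requirement for table input above.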

Publishing
Development, production methods fully documented
Complete audio available in standard format (AIFF, RIFF, SPH) uncompressed or with lossless compression
Transcripts in XML or other standard, non-proprietary platform-independent and application-independent format
Consistent naming conventions for audio, transcriptions and any annotations
All data formats specified and confirmed
Inter-annotator agreement measured and published
Coding practice fully documented
Results shared
– Not ju