Distributed Speech Recognition - W3

Transcription

Distributed Speech Recognition
“Where is 358 Madison Avenue”
David Pearce, Motorola Labs
bdp003@motorola.com

Voice & Multimodal
[Diagram labels: Voice-enabled Services; Multimodal-enabled Services]
User enters commands via: SPEECH, KEYPAD (Speech IN, Keypad IN)
System responds: GRAPHIC, TEXTS, SPEECH, SOUNDS (Screen OUT, Audio OUT)

Distributed Speech Recognition
[Diagram: two paths from the client device to the speech server]
Conventional: Speech Coder -> Circuit Switched Mobile Voice Channel -> Speech Decoder -> ISDN -> ASR Front-end -> ASR Decoder
DSR: ASR Front-end -> Packet Data Channel (e.g. GPRS or CDMA 1x) -> ASR Decoder
Server side: [Wireless] Packet Data Network; Voice Gateway / Server: VoiceXML / mm Browser; Speech Resources (ASR, TTS, etc.)

Benefits of DSR
[Chart: Word Accuracy (%), axis 80 to 100, for EFR Coded Speech and DSR against GSM signal strength: baseline error free, strong, medium, weak]
- Improves performance over wireless channels
- Minimises impact of codec & channel errors
- Consistent performance over coverage area
- Improved performance in background noise
- 53% reduction in error rate
- Ease of integration of combined speech and data applications
- Use packet data channel for both DSR and other data

DSR Standards
Distributed Speech Recognition
- DSR Advanced front-end (Oct 2002)
- DSR Extended Advanced Front-end (Nov 2003)
Speech Enabled Services
- Fixed point DSR standard created
- DSR selected as the recommended codec for SES (Approved June 04)
IETF
- RTP payload formats for DSR
- Specifications standardised: RFC 4060
3GPP2
- Speech Enabled Services New Work Item (Approved Jan 2005)

DSR Advanced Front-end (ES 202 050)
- Noise robust front-end
- Half the error rate compared with mel-cepstrum in background noise
- Double Wiener filtering noise suppression
- Waveform processing
- Blind equalisation
- Representation: 12 cepstral coeffs, C0, logE
- Compression gives bit rate of 4.8 kbit/s
[Block diagram labels: Feature Extraction; 8 & 16 [...]; Cepstrum Calculation; Blind Equalization; to feature compression]

DSR Extension (ES 202 212)
- Enables speech waveform reconstruction at server for human listening
- Adds 800 bps containing pitch (total 5.6 kbps)
- Assists recogniser with tonal language recognition (e.g. Mandarin, Cantonese)
[Block diagram labels: Speech In; ETSI Standard DSR Front-End (MFCC & log-E, @ 4800 bps); Pitch & Class Estimation (Pitch & Class, @ 800 bps); CHANNEL; DSR Back-End; Pitch Tracking and Smoothing; Tonal Information; Speech Reconstruction; Speech Out]
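The bit-rate arithmetic on this slide can be sketched as below. This is only an illustration of the 4800 bps + 800 bps = 5600 bps budget quoted above; the function name `dsr_payload_bytes` and the idea of sizing a whole utterance are assumptions made for the example, and RTP/UDP/IP header overhead is deliberately ignored.

```python
# Bit rates quoted on the slide (bits per second).
FRONT_END_BPS = 4800   # compressed MFCC & log-E features
PITCH_EXT_BPS = 800    # pitch and class information added by the extension


def dsr_payload_bytes(duration_s, with_extension=True):
    """Approximate feature payload size for an utterance of the given length.

    Hypothetical helper: counts only the feature bit stream, not any
    packet headers or framing overhead.
    """
    bps = FRONT_END_BPS + (PITCH_EXT_BPS if with_extension else 0)
    return bps * duration_s / 8


total_bps = FRONT_END_BPS + PITCH_EXT_BPS  # 5600 bps, i.e. 5.6 kbps
print(total_bps)
print(dsr_payload_bytes(2.0))          # 2 s utterance with pitch extension
print(dsr_payload_bytes(2.0, False))   # 2 s utterance, front-end features only
```

Under these assumptions a two-second utterance needs roughly 1.4 kB of feature data with the extension and 1.2 kB without it.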

Results of ASR vendor evaluations in 3GPP
[Table: column labels Number of db tested, AMR 4.75, DSR, Average; row labels (8 kHz) Digits, Sub-word, Tone confusability, Channel errors, Weighted %; values as transcribed: 1151; 30.0%, 14.8%, 52.8%, 36%]
- Extensive testing on 21 different speech databases
- Covering different languages, tasks and environments
- Tests performed with IBM and Scansoft commercial recognisers
- Results above are for low data-rate comparison for packet data ( 8kbit/s)

Packet Switched Channel Errors
[Chart: Robustness to block errors, narrow-band (8 kHz). Word accuracy (%), axis 86.0 to 98.0, against Block error rate (%), 0 to 4, for DSR, AMR 12.2 and AMR 4.75]
- Aurora-3 Italian speech database
- GPRS network simulation for distribution of errors
3GPP Feb 2004

Coded speech vs DSR (Aurora-3 Italian)

                 DSR     AMR 4.75   Degradation
Well matched     96.5    94.4       -57%
Med mismatch     90.4    83.9       -68%
High [...]

[...]                               Degradation
Well matched     96.5    90.6       -165%
Med mismatch     90.4    75.9       -151%
High mismatch    88.6    70.5       -160%
Average          92.4    80.4       -159%
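The Degradation figures appear to be the relative increase in word error rate when moving from DSR to coded speech, computed from the two word-accuracy values in each row. That reading is an inference from the numbers, not something the slide states, and the function name `relative_error_increase` is hypothetical. A minimal sketch under that assumption:

```python
def relative_error_increase(acc_reference, acc_other):
    """Relative increase in word error rate (%), given two word accuracies (%).

    Assumes error rate = 100 - word accuracy, and that the reference
    system (here DSR) has a non-zero error rate.
    """
    err_reference = 100.0 - acc_reference
    err_other = 100.0 - acc_other
    return 100.0 * (err_other - err_reference) / err_reference


# Medium-mismatch rows from the two blocks of the table (DSR accuracy 90.4).
print(round(relative_error_increase(90.4, 83.9)))  # 68
print(round(relative_error_increase(90.4, 75.9)))  # 151
```

Some rows differ from this calculation by a few points (for example, 96.5 against 94.4 gives about 60% rather than 57%), which would be consistent with the slide's percentages having been computed from unrounded accuracies.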

Distributed Multimodal Architecture
[Diagram labels: Handset device (DSR Front End, J2ME Application, RTP/SIP); GPRS or [...]; MM Gateway (RTP & SIP, DSR, ASR Decoder); HTTP; Multimodal applications and content]
Handset
- Input modalities (i.e., DSR, keypad input, pen entry)
- Output media (e.g., visual rendering, decoded speech output)
- Application Environment (Java or WAP Browser)
- Protocols (SIP / RTP, Multimodal remote control)
Multimodal Gateway
- DSR Decoder
- Multimodal VoiceXML browser
- Protocols
Content Server
- Applications and content
- Content authoring
- Content delivery

[Diagram labels, Distributed Speech Recognition architecture: IP Network; Content Servers; [Wireless] Packet Data Network; Voice Gateway / Server: VoiceXML / mm Browser; Speech Resources (ASR, TTS, etc.); Client Devices; Conventional: Circuit Switched Mobile Voice Channel, Speech Coder, Speech Decoder, ISDN, ASR Front-end, ASR Decoder; DSR: Packet Data Channel (e.g. GPRS or CDMA 1x), ASR Front-end, ASR Decoder]