Telephony Speech Recognition System: Challenges


National Conference on Communication Technologies & its impact on Next Generation Computing (CTNGC 2012). Proceedings published by International Journal of Computer Applications (IJCA).

Joyanta Basu, Rajib Roy, Milton S. Bepari, Soma Khan
CDAC, Salt Lake, Sector-V, Kolkata - 700091
joyanta.basu@cdac.in, rajib.roy@cdac.in

ABSTRACT
The present paper describes the challenges of designing a telephony Automatic Speech Recognition (ASR) system. Telephonic speech data are collected automatically from all geographical regions of West Bengal to cover the major dialectal variations of spoken Bangla. All incoming calls are handled by an Asterisk server, i.e. a Computer Telephony Interface (CTI). The system asks a set of queries, and the users' spoken responses are stored and transcribed manually for ASR system training. In a real-time scenario, the telephonic speech contains channel drops, silence or no-speech events, truncated speech, noisy signals etc. along with the desired speech event. This paper describes these challenges of a telephony ASR system and also outlines some techniques that handle such unwanted signals in telephonic speech to a certain extent, so that nearly clean speech can be delivered to the ASR system.

General Terms
Automatic Speech Recognition, Signal Processing, Telephony Application.

Keywords
Asterisk server, Interactive Voice Response, Transcription Tool, Temporal and Spectral features, Knowledge Base

1. INTRODUCTION
Modern human life depends heavily on technology, and the associated devices keep becoming more portable (mobile phones, PDAs, GPRS-enabled devices etc.). Besides this, there is a growing demand for hands-free, voice-controlled, public-purpose emergency information retrieval services such as weather forecasting, road-traffic reporting, travel enquiry and health informatics, accessible via hand-held devices (mobiles or telephones) to fulfil urgent, on-the-spot requirements. Real-life deployment of all these applications involves developing the modules needed for a voice-query based user interface and quick information retrieval over mobiles. In fact, throughout the world the number of telephone users is much higher than that of PC users. Moreover, human voice or speech is the fastest form of communication in our daily busy schedule, which further extends the usability of such voice-enabled mobile applications in emergency situations. In such a scenario, a speech-centric user interface on smart hand-held devices is currently foreseen as a desirable interaction paradigm, where Automatic Speech Recognition (ASR) is the only available enabling technology.
Interactive Voice Response (IVR) systems provide a simple yet efficient way of retrieving information from computers in speech form over telephones, but in most cases users still have to navigate the system via Dual Tone Multiple Frequency (DTMF) input and type their query on the telephone keypad.
A comparative study by K. M. Lee and J. Lai (2005) [1] revealed that, in spite of occasionally low accuracy rates, a majority of users preferred interacting with the system through the speech modality, finding it more satisfying, more entertaining and more natural than the touch-tone modality, which occupies the hands, is quite time consuming and requires at least a knowledge of the English alphabet.
Especially for a country like India, with its multilingual requirements and modest overall literacy, the development of IVR applications with back-end ASR support for recognizing voice queries and responses in native languages is of major importance, because such systems allow the common mass of the country to access the huge amount of information available on the Internet using telephones. At deployment time, however, such an IVR-based access system will have to cope with real-world speech and the related challenges. It is therefore very important to detect such challenging situations and find ways to overcome them gracefully without annoying the user.
The present paper addresses some real-time challenges of telephony ASR applications and provides a clear picture of the above tasks in a well-planned, sequential manner, aiming towards the development of an IVR application in spoken Bangla.

2. MOTIVATION OF THE WORK
A practical IVR system should be designed so that it is capable of handling real-time telephony hazards such as channel drop, clipping and speech truncation. It should also provide robust performance with respect to the following issues:
1. Speaker Variability: Handle speech from any arbitrary speaker of any age, i.e. it should be a speaker-independent ASR system.
2. Pronunciation and Accent Variability: Different pronunciations, dialectal variations and accents within a particular Indian language.
3. Channel Variability: Different channels such as landline versus cellular, and different cellular technologies such as GSM and CDMA.
4. Handset Variability: Variability across mobile handsets due to differences in their spectral characteristics.
5. Different Background Noise: Various kinds of environmental noise, so that the system is robust in real-world applications.

Considering the above requirements, the telephonic ASR system is being designed so that, to some extent, it can meet the above-mentioned capabilities. In the present study, speech data are mainly collected from all the geographical regions where the native Bangla-speaking population is considerably high. The collected speech data are then verified and used for ASR training.
The reason behind choosing a large geographical area for data collection is to cope with the problems of speaker variability and accentual variability. Additionally, various issues regarding the telephone channel, such as channel drop or packet loss during transmission, handset variability, service-provider variability and various types of background noise (cross-talk, vehicle noise etc.), have been observed, analysed and estimated from the collected speech data, and modelling them can improve ASR performance. These issues not only help to improve the system performance effectively, but also provide very good research motivation for other telephony applications.

3. BRIEF OVERALL SYSTEM OVERVIEW
The telephonic ASR system is designed in such a way that users get the relevant information in a convenient manner. First the system offers the user a language preference (among Hindi, Bangla and Indian English); from then on a directed question is asked each time, and the user replies with an appropriate response from a small set of words. The system is composed of three major parallel components: the IVR server (hardware and API), the Signal Processing Blocks with the ASR engine, and the Information Source. Fig. 1 represents an overall block diagram of the system.

Fig. 1: Block diagram of the Telephonic ASR System

3.1 IVR hardware and API
As shown in Fig. 2, the Interactive Voice Response (IVR) system consists of IVR hardware (generally telephony hardware such as a PC add-on card), a computer and application software running on the computer. The IVR hardware is connected in parallel to the telephone line. Its functionality is to lift the telephone automatically when the user calls, recognize the input information given by the user (such as dialled digits or speech), interact with the computer to obtain the necessary information, convert that information into speech form, and also convert the incoming speech into digital form and store it in the computer.

Fig. 2: Interactive voice response system (PSTN, IVR hardware as a PC add-on card, PC and Information Source)

In the development of the telephonic ASR system, Asterisk [2][3] is used as an open-source IVR server and converged telephony platform, designed primarily to run on Linux. It supports VoIP protocols such as SIP and H.323, interfaces with PSTN channels, supports various PCI cards, and open-source drivers and libraries are available for it.
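The call flow just described (answer the call, play a directed question, record the caller's response for later transcription) can be scripted in Asterisk through the Asterisk Gateway Interface (AGI). The listing below is a minimal Python sketch of such an AGI script, not the actual application: the prompt name ("ask-district"), the recording path and the dialplan hook are illustrative assumptions, and only standard AGI commands (ANSWER, STREAM FILE, RECORD FILE, HANGUP) exchanged over stdin/stdout are used.

#!/usr/bin/env python3
# Minimal AGI sketch: ask one directed question and record the spoken reply.
# Assumes the Asterisk dialplan hands the call to this script via AGI();
# the prompt name and recording path below are illustrative only.
import sys

def read_agi_environment():
    # Asterisk sends a block of "agi_*: value" lines, ended by a blank line.
    env = {}
    while True:
        line = sys.stdin.readline().strip()
        if not line:
            break
        key, _, value = line.partition(":")
        env[key.strip()] = value.strip()
    return env

def agi_command(cmd):
    # Send one AGI command and return Asterisk's "200 result=..." reply.
    sys.stdout.write(cmd + "\n")
    sys.stdout.flush()
    return sys.stdin.readline().strip()

if __name__ == "__main__":
    call = read_agi_environment()
    uid = call.get("agi_uniqueid", "unknown")

    agi_command("ANSWER")
    # Play the directed question (sound file name given without extension).
    agi_command('STREAM FILE ask-district ""')
    # Record the response as 8 kHz WAV: stop on '#', 10-second timeout.
    agi_command('RECORD FILE /var/spool/asr/resp-%s wav "#" 10000' % uid)
    agi_command("HANGUP")

Each recorded response can then be picked up by the Signal Processing Blocks described next.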
3.2 Signal Processing Block and ASR engine
This block consists of three major modules, namely the Speech Acquisition and Enhancement module, the Signal Analysis and Decision module, and the ASR engine. A block diagram of the Signal Processing Block is shown in Fig. 3.

Fig. 3: Basic block diagram of the Signal Processing Blocks including the Automatic Speech Recognition engine (training path: speech signal, speech acquisition, acoustic model trainer and language model trainer; testing path: decoder / speech recognizer engine producing the recognized word sequence)

3.2.1 Speech Acquisition and Enhancement module
The first block, which consists of the acoustic environment plus the transduction equipment, can have a strong effect on the generated speech representation, because additive noise, room reverberation, the type of recording device etc. are associated with the process. A speech enhancement module suppresses these effects so that the incoming speech can be recognized even in heavily perturbed conditions.
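The paper does not state which enhancement algorithm is applied; purely as an illustration of the idea, the sketch below performs simple magnitude spectral subtraction, estimating the noise spectrum from the first few frames of the recording (assumed to be pre-speech silence). All parameter values are assumptions chosen for 8 kHz telephone speech.

import numpy as np

def spectral_subtraction(signal, frame_len=256, hop=128,
                         noise_frames=10, floor=0.02):
    # Suppress stationary additive noise by subtracting an estimate of the
    # noise magnitude spectrum (taken from the first `noise_frames` frames,
    # assumed to contain no speech) from every frame.
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len, hop)]
    spectra = np.array([np.fft.rfft(f) for f in frames])
    noise_mag = np.mean(np.abs(spectra[:noise_frames]), axis=0)

    out = np.zeros(len(frames) * hop + frame_len)
    for i, spec in enumerate(spectra):
        mag, phase = np.abs(spec), np.angle(spec)
        clean_mag = np.maximum(mag - noise_mag, floor * mag)   # spectral floor
        frame = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len)
        out[i * hop:i * hop + frame_len] += frame               # overlap-add
    return out[:len(signal)]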

3.2.2 Signal Analysis and Decision module (SAD)
In this module every incoming speech waveform is analysed to decide whether it contains a valid speech signal or not; it is a very important module of the system. The module extracts temporal features such as the Zero Crossing Rate (ZCR) [4][5] and Short Time Energy (STE) [6][7], as well as spectral features such as formants. Using a predefined Knowledge Base (KB), it then produces decisions about the validity of the incoming speech signal. Based on these decisions the system determines whether the user needs to re-record the speech or not. If re-recording is not required, the recorded speech signal is passed to the ASR engine for decoding.
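As an illustration of how these temporal features can drive a knowledge-base decision, the sketch below computes frame-wise STE and ZCR for an 8 kHz telephone recording and applies simple threshold rules to decide whether the caller should be asked to re-record. The threshold values and the two-way rule are hypothetical; the actual Knowledge Base used in the system is not spelled out in the paper.

import numpy as np

def short_time_features(signal, frame_len=200, hop=80):
    # Frame-wise Short Time Energy (STE) and Zero Crossing Rate (ZCR)
    # for 8 kHz speech: 25 ms frames with a 10 ms hop.
    ste, zcr = [], []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len].astype(float)
        ste.append(np.sum(frame ** 2) / frame_len)
        crossings = np.sum(np.abs(np.diff(np.sign(frame)))) / 2.0
        zcr.append(crossings / frame_len)
    return np.array(ste), np.array(zcr)

def sad_decision(signal, energy_thresh=1e4, zcr_thresh=0.25,
                 min_speech_ratio=0.10):
    # Hypothetical Knowledge-Base rule: classify the recording so the IVR can
    # decide whether a re-recording request is needed.
    ste, zcr = short_time_features(signal)
    if len(ste) == 0:
        return "RERECORD"                 # too short to contain speech
    speechlike = (ste > energy_thresh) & (zcr < zcr_thresh)
    if np.mean(speechlike) < min_speech_ratio:
        # Either a silence / no-speech event or a heavily noisy signal.
        return "RERECORD"
    return "ACCEPT"                       # pass the signal to the ASR engine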
3.2.3 ASR engine
The basic task of Automatic Speech Recognition (ASR) is to derive a sequence of words from a stream of acoustic information. Automatic recognition of telephonic voice queries requires a robust back-end ASR system. CMU SPHINX [8], an open-source speech recognition engine, is used here; it typically consists of a speech feature extraction module, an acoustic model, a language model and a decoder [9].

3.3 Information Source
The repository of all relevant information is known as the trusted Information Source, and its design architecture typically depends on the type of information. In the current work, dynamic information is mainly referred from the trusted information source or online server, while all other information that does not change much over a considerable period of time is kept in a local database populated by a web crawler. The system response to any query on dynamic information must ensure delivery of the latest information; to accomplish this, a specific reference is made to the trusted information source. This approach ensures quick delivery of information.

4. CHALLENGES OF TELEPHONY ASR
Speech quality is affected by the transmission impairments found on telephone connections, and sometimes the intelligibility and naturalness of speech degrade to an intolerable extent [10]. These impairments include loudness loss, circuit noise, sidetone loudness loss, room noise, attenuation distortion, talker echo, listener echo, quantizing distortion, phase jitter etc. In addition, user interfaces terminating the transmission channel, such as mobile handsets and hands-free terminals, are likely to pick up background noise, mainly circuit noise, noise floor, impulse noise, ambient room noise and crosstalk noise.
To find out these problems we have designed a Semi-Automatic Transcription Tool [11]. This tool is used for offline transcription of the recorded speech data, so that all transcriptions made during data collection can be checked, corrected and verified manually by human experts. Automatic conversion of text to phonemes (phonetic transcription) is necessary to create the pronunciation lexicon that supports ASR system training.
The methodology for Grapheme to Phoneme (G2P) conversion in Bangla is based on orthographic rules. In Bangla, G2P conversion sometimes depends not only on orthographic information but also on Parts of Speech (POS) information and semantics [12]. G2P conversion is therefore an important task for data transcription.
From the transcription process, a great deal of information has been gathered about the telephonic speech data. At the time of transcription, transcription remark tags and noise tags have to be assigned; this is an entirely human-driven task. Descriptions and measurement of the remarks are given in Table 1, and Table 2 shows the different types of noise tags. From Table 1 it can be seen that S UTTR, CPR UTTR, TR UTTR and R UTTR are the rejection tags; speech files marked with any other tag may be accepted and considered at the time of ASR training.

Table 1: Description and measurement of Remarks (Transcription Remarks, TR)
A UTTR (Amplitude): Amplitude (partly or fully) will be modified
C UTTR (Clean): Speech utterance may contain some non-overlapping non-speech event
CLPD UTTR (Clipped): Clipping of the speech utterance
CPC UTTR (Channel Problem Consider): Channel drop occurs randomly in a silence region and has not affected the speech region
CPR UTTR (Channel Problem Reject): Some words or phonemes are dropped randomly
I UTTR (Improper): The utterance is slightly different from the prompt at the phoneme level
MN UTTR: Noise within speech
MP UTTR: Reasonable silence (pause) within speech
R UTTR (Reject): The speech is wrongly spoken, too noisy or contains nonsense words and cannot be understood
S UTTR (Silence): No speech utterance present
TA UTTR (Truncate Accept): Truncation of an insignificant amount (one or two phonemes) of the speech utterance
TR UTTR (Truncate Reject): Truncation of a significant amount of the speech utterance
W UTTR (Wrong): The utterance is …
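Because S UTTR, CPR UTTR, TR UTTR and R UTTR are the rejection remarks, while files carrying only the other remarks may enter ASR training, the verification step reduces to a simple filter over the transcribed files. The sketch below encodes that rule; the underscore spelling of the tags and the (file, tags) listing format are assumptions, not part of the described tool.

# Rejection remarks from Table 1: utterances carrying any of these tags are
# excluded from the ASR training set.
REJECT_TAGS = {"S_UTTR", "CPR_UTTR", "TR_UTTR", "R_UTTR"}

def keep_for_training(remark_tags):
    # Accept the utterance only if none of its remark tags is a rejection tag.
    return not any(tag in REJECT_TAGS for tag in remark_tags)

# Hypothetical usage: filter a transcription listing of (wav_file, tags) pairs.
transcribed = [
    ("call_0001_resp1.wav", ["C_UTTR"]),
    ("call_0001_resp2.wav", ["S_UTTR"]),            # silence only: rejected
    ("call_0002_resp1.wav", ["A_UTTR", "MN_UTTR"]), # amplitude fix + noise: kept
]
training_set = [wav for wav, tags in transcribed if keep_for_training(tags)]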
