DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Transcription

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
CHITKARA UNIVERSITY, PUNJAB

Declaration by the Student

I hereby certify that the work presented in this thesis entitled “ACOUSTIC FEATURES OPTIMIZATION FOR PUNJABI AUTOMATIC SPEECH RECOGNITION SYSTEM”, submitted in partial fulfilment of the requirements for the award of the Degree of Doctor of Philosophy in the Department of Computer Science & Engineering, Chitkara University, Punjab, is an authentic record of my own work carried out under the supervision of Dr. Archana Mantri, Professor, Department of Electronics and Communication Engineering, Chitkara University, Punjab, India, and Dr. R. K. Aggarwal, Associate Professor, Department of Computer Engineering, National Institute of Technology, Kurukshetra, Haryana, India.

The work has not formed the basis for the award of any other degree or diploma in this or any other Institution or University. In keeping with ethical practice in reporting scientific information, due acknowledgements have been made wherever the findings of others have been cited.

April, 2018
Chitkara University, Punjab
(Virender Kadyan)

Certificate by the Supervisor(s)

This is to certify that the thesis entitled “ACOUSTIC FEATURES OPTIMIZATION FOR PUNJABI AUTOMATIC SPEECH RECOGNITION SYSTEM”, submitted by Virender Kadyan (Roll No. 1410951014) to Chitkara University, Punjab, in partial fulfilment of the requirements for the award of the degree of Doctor of Philosophy, is a bona fide record of research work carried out by him under our supervision and guidance. The contents of this thesis, in full or in parts, have not been submitted to any other Institution or University for the award of any degree or diploma.

(Supervisor)
Dr. Archana Mantri
Professor
Department of Electronics & Communication Engineering
Chitkara University, Punjab, India

(Co-Supervisor)
Dr. R. K. Aggarwal
Associate Professor
Department of Computer Engineering
National Institute of Technology, Kurukshetra, Haryana, India

To
My grandmother, Ms. Malo Devi
My father, Mr. Kuldeep Singh Kadyan
My mother, Mrs. Munesh
My wife, Mrs. Amarjeet
My daughter, Manvi
My brother, Mr. Narender Kadyan
My brother’s wife, Mrs. Manju
My nephew, Kannav
My guide, Dr. Archana Mantri
My guide, Dr. R. K. Aggarwal

Acknowledgment

Success in life is always possible through support and encouragement from the people around us. This is especially true of my PhD work, and I owe my gratitude to the following people, without whom this journey would not have been possible.

My profound and deepest gratitude goes to my thesis supervisors, Dr. Archana Mantri and Dr. R. K. Aggarwal, for their continuous efforts and guidance. Words are not enough to express my thoughts; their deep knowledge of research, spirit in work, painstaking attention, and innovative and creative thinking are heartily admired and have nurtured me into a better researcher. Their proficiency and guidance provided all the insights for refining the content of my thesis.

I am sincerely thankful to Dr. Pankaj Kumar, Dean and Professor, Chairman of my DRC committee, along with the other committee members, for giving valuable feedback, technical discussion and suggestions on my presented work during the entire tenure of my research, helping me to accomplish my research goals.

Special gratitude to Dr. B. Yegnanarayana, whose contribution to speech recognition has inspired my research. I would also like to express my gratitude to Dr. Amitoj Singh, MRS PTU Bathinda, Mr. Mohit Dua, N.I.T. Kurukshetra, Dr. Jagpreet Sidhu, and Mr. Vinay Kukreja for their valuable input to my research work.

I am thankful to my Speech and Multimodal Laboratory alumni and current members, Ms. Nancy, Mr. Vivek, Ms. Mandeep and Ms. Sashi Bala, for their support and help in each phase. Writing the thesis made me tackle various ups and downs at each stage, and fortunately my team’s understanding helped me during this tenure.

Above all, I am extremely thankful to the Almighty, my parents, my wife and my children. Achieving the desired goals through my PhD work has been possible only with their grace, love, support, guidance and sacrifice.

(Virender Kadyan)

Abstract

With the many advances made in automatic speech recognition (ASR) technology over the past few decades, there is now an increasing demand for Indian-language ASR systems. A huge gap remains between the performance of machines and that of humans, owing to the lack of resources, the complexity of handling feature vectors, the decorrelation of feature information, and robustness against ever-changing input speech conditions. Different approaches have been examined to tackle these factors. The aim of the proposed research work is to cope with these issues through the refinement, combination and integration of front-end and back-end approaches with different methodologies.

One solution to these issues is to explore optimization techniques (GA HMM, DE HMM) for training on large corpora. The proposed method is applied to baseline acoustic modeling approaches in the training stage. We develop these strategies for a language that is widely spoken yet resource-scarce; Punjabi falls into this category, but a Punjabi speech dataset first has to be built. In this thesis, we therefore first build Punjabi speech corpora of isolated words and continuous sentences spoken by adult Punjabi speakers. Traditional approaches at the front and back ends of the system are unlikely to remain productive on a large corpus. To reduce feature vector complexity in the training stage, Mel Frequency Cepstral Coefficient (MFCC) feature vectors are combined with optimization algorithms, which refine the processed feature vectors before classification using the baseline hidden Markov model (HMM) approach. The experiments are then conducted on a large vocabulary of Punjabi isolated words.

Despite the improvement in performance, a large gap still exists due to the mismatch between training and testing conditions. We try to reduce this gap by using different combinations of front-end approaches (MFCC, Perceptual Linear Prediction (PLP), Relative Spectral Transform (RASTA)-PLP, or their fusion). We test them on each refined modeling approach, either as a baseline or as a combination of two feature approaches. Based on our studies, we propose a technique for the Punjabi speech recognition system that outperforms the baseline approaches.

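As a rough illustration of the kind of front-end refinement described above, the sketch below extracts MFCC vectors and searches per-coefficient scaling weights with differential evolution, scored by a small Gaussian HMM. This is a minimal sketch only, assuming Python with librosa, hmmlearn and SciPy; the function names, weight bounds and model sizes are illustrative choices and not the thesis's actual GA HMM / DE HMM formulation.

```python
# Minimal sketch (not the thesis implementation): MFCC extraction plus a
# differential-evolution search over per-coefficient weights, scored by a
# small GaussianHMM. Assumes librosa, hmmlearn and SciPy are installed.
import numpy as np
import librosa
from hmmlearn import hmm
from scipy.optimize import differential_evolution

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Load one utterance and return its MFCC frames as (n_frames, n_mfcc)."""
    y, sr = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def neg_log_likelihood(weights, train_feats):
    """Train a GaussianHMM on weighted MFCC frames and return the negative
    log-likelihood (differential_evolution minimises its objective)."""
    X = np.vstack([f * weights for f in train_feats])
    lengths = [len(f) for f in train_feats]
    model = hmm.GaussianHMM(n_components=5, covariance_type="diag",
                            n_iter=10, random_state=0)
    model.fit(X, lengths)
    return -model.score(X, lengths)

def refine_features(train_feats, n_mfcc=13):
    """Search per-coefficient scaling weights with DE; the bounds and DE
    settings here are illustrative, not tuned values from the thesis."""
    bounds = [(0.5, 1.5)] * n_mfcc
    result = differential_evolution(neg_log_likelihood, bounds,
                                    args=(train_feats,),
                                    maxiter=20, popsize=10, seed=0)
    return result.x  # weights applied to MFCC frames before classification
```

In practice one would refine the features per word model and evaluate on held-out data; the single pooled model above only keeps the example short.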
One of the main contributions of this thesis is the reduction of the high variance (at low bias) of the Hamming-window spectrum estimate through the integration of multitaper approaches, namely Thomson, Sine Weighted Cepstrum Estimator (SWCE) and multi-peak, into the front-end MFCC and Gammatone Frequency Cepstral Coefficient (GFCC) techniques. Additionally, their filterbanks are optimized using Principal Component Analysis (PCA) based Gammatone and Mel filterbank analysis. Both parallel integrated approaches prove to be highly effective on the isolated Punjabi speech dataset under mismatched training and testing conditions.

Most of the existing techniques presented here were initially constructed with modeling approaches that employ the HMM or the Gaussian Mixture Model (GMM). Systematic storage of frame coefficients and fitting of information is made possible through connectionist posterior probabilities. Furthermore, the Deep Neural Network (DNN)-HMM acoustic modeling technique is observed to be more effective and to reduce overfitting on the training corpora. It is applied to the feature vectors extracted with MFCC and GFCC to reduce their dimensionality (using Linear Discriminant Analysis (LDA)), to decorrelate the vector information, and to cope with speaker variability, analyzed using speaker adaptation methods such as Maximum Likelihood Linear Transforms (MLLT) and Feature Space Maximum Likelihood Linear Regression with Speaker Adaptive Training (fMLLR-SAT). fMLLR-SAT is found to be more effective on the continuous and connected Punjabi dataset. The parameters of these integrated approaches, such as the number of hidden layers, the number of Gaussians and the feature dimensions, are varied, which helps in tuning efficient parameter settings. Finally, our approach is found to be most effective with triphone units and tri-gram language modeling for the Punjabi ASR system on large training corpora.

Keywords: Punjabi Speech Recognition, Acoustic Feature Refinement, MFCC, PLP, MF-PLP, GFCC, Multitaper, PCA, HMM, GMM, DNN
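The sketch below illustrates, under stated assumptions, how a multitaper (Thomson/DPSS) spectrum estimate and a PCA rotation of the log mel-filterbank outputs could be wired into an MFCC-style front end using SciPy, librosa and scikit-learn. The unweighted averaging of tapered periodograms, the taper parameters and the placement of the PCA step are simplifications for illustration and are not taken from the thesis.

```python
# Minimal sketch (assumptions, not the thesis code): Thomson/DPSS multitaper
# spectrum estimation in place of a single Hamming window, followed by a
# mel filterbank, a PCA rotation of the log-filterbank outputs, and a DCT.
import numpy as np
import librosa
from scipy.signal.windows import dpss
from scipy.fft import dct
from sklearn.decomposition import PCA

def multitaper_mfcc(y, sr=16000, frame_len=400, hop=160,
                    n_tapers=6, n_mels=26, n_ceps=13):
    """MFCC-style features from an averaged multitaper spectrum estimate."""
    tapers = dpss(frame_len, NW=3.5, Kmax=n_tapers)      # (n_tapers, frame_len)
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop).T
    # Average the periodograms of the tapered frames: a lower-variance
    # estimate than a single Hamming-windowed periodogram.
    spec = np.mean([np.abs(np.fft.rfft(frames * t, axis=1)) ** 2
                    for t in tapers], axis=0)
    mel_fb = librosa.filters.mel(sr=sr, n_fft=frame_len, n_mels=n_mels)
    log_mel = np.log(spec @ mel_fb.T + 1e-10)
    # Illustrative PCA step on the filterbank outputs, standing in for the
    # PCA-based filterbank analysis described above.
    rotated = PCA(n_components=n_mels).fit_transform(log_mel)
    return dct(rotated, type=2, axis=1, norm="ortho")[:, :n_ceps]
```

A proper Thomson estimate weights each tapered periodogram by its eigenvalue; the plain average above only keeps the sketch short.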

List of Acronyms

Acronym       Description
ABFB          Adaptive Bands Filter Bank
AM            Acoustic Model
ANN           Artificial Neural Network
ASR           Automatic Speech Recognition
ATIS          Air Travel Information System
BGA           Baum Welch Genetic Algorithm
BW            Baum Welch
CA            Chaos Algorithm
C-DAC         Centre for Development of Advanced Computing
CDHMM         Continuous Density Hidden Markov Model
CD-DNN-HMM    Context Dependent Deep Neural Network HMM
CFCC          Cochlear Filter Cepstral Coefficient
CD            Context Dependent
CIIL          Central Institute of Indian Languages
CI            Context Independent
CNN           Convolutional Neural Network
CMU           Carnegie Mellon University
CR            Crossover Rate
DBN           Deep Belief Networks
DCT           Discrete Cosine Transform
DE            Differential Evolution
DFT           Discrete Fourier Transform
DNN           Deep Neural Network
DTW           Dynamic Time Warping
ELRA          European Language Resources Association
EM            Expectation-Maximization
fDLR          Feature Space Discriminative Linear Regression
fMLLR         Feature Space Maximum Likelihood Linear Regression
FFT           Fast Fourier Transform
GA            Genetic Algorithm
GcFB          Gammachirp Auditory Filterbank
GFCC          Gammatone Frequency Cepstral Coefficient
GMM           Gaussian Mixture Model
HFCC          Human Factor Cepstral Coefficient
HMM           Hidden Markov Model
HTK           Hidden Markov Model Toolkit
IIR           Infinite Impulse Response
LDA           Linear Discriminant Analysis
LDC           Linguistic Data Consortium
LIH           Linear Hidden Network
LIN           Linear Input Network
LM            Language Model
LPCC          Linear Prediction Cepstral Coefficient
LSTM          Long Short-Term Memory
LVCSR         Large Vocabulary Continuous Speech Recognition
MCE           Minimum Classification Error
MFCC          Mel Frequency Cepstral Coefficient
MGFCC         Modified-GFCC
MLLT          Maximum Likelihood Linear Transforms
MMD           Maximum Model Distance
NN            Neural Network
NSP           Native Signal Processing
ODT           Optimal Discriminative Training
PCA           Principal Component Analysis
PCS           Parallel Cepstral and Spectral
PGA           Parallel Genetic Algorithm
PLP           Perceptual Linear Prediction
PLDA          Probabilistic Linear Discriminant Analysis
PNCC          Power-Normalized Cepstral Coefficients
PSO           Particle Swarm Optimization
RASTA         Relative Spectral Transform
RHMM          Recurrent Hidden Markov Model
RNN           Recurrent Neural Network
ROVER         Recognizer Output Voting Error Reduction
RTF           Real Time Factor
SAT           Speaker Adaptive Training
SFVG          Speech Feature Vector Generation
SD            Speaker Dependent
SGMM          Subspace Gaussian Mixture Model
SI            Speaker Independent
SNR           Signal to Noise Ratio
SVM           Support Vector Machine
SWCE          Sine-Weighted Cepstrum Estimator
TB            Tabu Search
TDNN          Time Delay Neural Network
TDIL          Technology Development for Indian Languages
UBM           Universal Background Model
VLTN          Vocal Tract Length Normalisation
WA            Word Accuracy
WER           Word Error Rate

List of Figures

1.1  Basic block diagram of speech signal processing in an ASR system
1.2  Factors affecting the design of a standard speech corpus
1.3  Flow diagram displaying the proposed techniques implemented in different chapters
3.1  Punjabi isolated word recognizer
3.2  Architecture of the proposed Punjabi-ASR system
3.3  Basic block diagram of HMM, GA HMM, and DE HMM based Punjabi speech signal processing elements
3.4  Word recognition rate for varying iterations on the development corpus
3.5  Performance analysis obtained with different speaker regions
3.6  Word recognition rate with varying SNR levels
3.7  Training time complexity analysis for the number of spoken utterances
3.8  Performance comparison of various acoustic classifiers on varying corpus sizes with the proposed classifier approaches on the Punjabi-ASR system
4.1  Flow diagram of heterogeneous feature and hybrid acoustic modeling approaches for the proposed Punjabi-ASR system
4.2  A frame of speech feature vectors with their mean and variance information
4.3  Training time complexity versus spoken utterances
5.1  Block diagram of parallel multitaper, PCA integrated GFCC and MFCC feature extraction approaches with the DE HMM acoustic modeling classifier
6.1  Flow diagram of GMM HMM and DNN HMM modeling classifiers integrated with monophone and triphone modeling units
6.2 (a, b)  Comparison of WER (%) for varying feature dimensions with different system types
6.3 (a, b)  Comparison of WER (%) profiles for varying Gaussian mixtures obtained for different acoustic modeling classifiers

List of Tables

2.1  Outline of NN based ASR engines
2.2  Brief overview of Indian ASR engines
3.1  Training dataset selection criteria
3.2  Testing dataset selection criteria
3.3  Optimal values of different parameters employed in DE
3.4  Optimal values of different parameters employed in GA
3.5  Performance analysis obtained with different testing and training datasets using heterogeneous modeling techniques
3.6  Performance analysis obtained for various system types versus different environment conditions
3.7  Performance comparison of the proposed classifier approach with different existing implemented approaches
4.1  Custom GA (A, B) method for feature refinement
4.2  Control parameter values used in feature refinement using GA and DE
4.3  DE feature refinement sequence
4.4  Summary of training and testing information of the Punjabi speech dataset
4.5  Word performance analysis of different SFVG techniques with hybrid HMM classifiers in clean environment conditions
4.6  Word performance analysis of different SFVG techniques with hybrid HMM classifiers in real-time environment conditions
4.7  Word performance analysis with different speakers versus hybrid acoustic classifiers
5.1 (a)  Performance evaluation with various multitaper methods using MFCC on varying taper lengths
5.1 (b)  Performance evaluation of different multitaper methods using GFCC on varying taper lengths
5.2  Performance evaluation with different SNR levels on baseline and multitaper estimation methods in different feature sets
5.3  Recognition rate using PCA optimized GFCC and MFCC filters with different multitaper methods
5.4  Recognition performance obtained through single or multitaper methods based on GFCC and MFCC features with integ
