Foreign Accent Classification


CS 229, Fall 2011
Paul Chen (pochuan@stanford.edu), Julia Lee (juleea@stanford.edu), Julia Neidert (tjneid@stanford.edu)

ABSTRACT

We worked to create an effective classifier for foreign-accented English speech in order to determine the origins of the speaker. Using pitch features, we first classified between two accents, German and Mandarin, and then expanded to a set of twelve accents. We achieved a notable improvement over random performance and gained insights into the strengths of and relationships between the accents we classified.

1. INTRODUCTION

Accented speech poses a major obstacle for speech recognition algorithms [4]. Being able to accurately classify speech accents would enable automatic recognition of the origin and heritage of a speaker. This would allow for robust accent-specific speech recognition systems and is especially desirable for languages with multiple distinct dialects. Accent identification also has various other applications, such as automated customer assistance routing.

In addition, analyzing speech data of multiple accents can potentially hint at common linguistic origins. When an individual learns to speak a second language, there is a tendency to replace some syllables in the second language with more prominent syllables from the native language. Thus, accented speech can be seen as the result of a language being filtered by a second language, and the analysis of accented speech may uncover hidden resemblances among different languages.

Spoken accent recognition attempts to distinguish speech in a given language that contains residual attributes of another language. These attributes may include pitch, tonal, rhythmic, and phonetic features [3]. Given the scale constraints of this project and the difficulty of extracting phonemes as features, we start by extracting features that correspond to pitch differences in the accents. This is a common approach in speaker and language identification and calls for feature extraction techniques such as spectrograms, MFCCs, and LPC.

2. PREVIOUS WORK

A previous CS229 class project [6] experimented with Hierarchical Temporal Memory in attempting to classify different spoken languages in transcribed data. They preprocessed their data using a log-linear Mel spectrogram and classified it using support vector machines to achieve above 90% accuracy. Although their project focused on classifying completely different languages while we would like to classify different accents of one language, their results serve as a good frame of reference.

Research presented by Hansen and Arslan [3] used Hidden Markov Models and a framework they termed a "Source Generator," which attempts to minimize the deviation of accented speech from neutral speech. They used a large number of prosody-based features. In comparing accented speech to neutral speech, they found that pitch-based features are the most relevant. Their work suggests that it is possible to classify accented speech with good accuracy using just pitch-based features.

A paper by Gouws and Wolvaardt [2] presented research that also used Hidden Markov Models to construct a speech recognition system. Their results elucidated some of the relationships between training set size and different feature sets. They showed that the performance of LPC and FBANK features actually decreases with an increasing number of parameters, while LPCEPSTRA performance increases and MFCC performance stays the same. These results give us better guidance for our choice of feature sets and amount of data.
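For concreteness, the feature families compared in these studies can be computed with standard tools. The sketch below is illustrative only: it uses librosa, not the HTK toolchain this project actually used, and the file name is hypothetical.

```python
# Minimal sketch of extracting FBANK-, MFCC-, and LPC-style features.
# Assumes librosa is installed; "utterance.wav" is a hypothetical file.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=8000)  # telephone-quality audio

n_fft = int(0.025 * sr)  # 25 ms analysis frames
hop = int(0.010 * sr)    # 10 ms frame shift (a common convention)

# Filterbank (FBANK-style) energies: log mel spectrogram.
fbank = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                   hop_length=hop, n_mels=26))

# MFCCs: a perceptually motivated compression of the filterbank energies.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=n_fft, hop_length=hop)

# LPC: source-filter coefficients, computed here per 25 ms frame.
frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)
lpc = np.stack([librosa.lpc(f.astype(float), order=12) for f in frames.T])

print(fbank.shape, mfcc.shape, lpc.shape)
```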
Research by Chen, Huang, Chang, and Wang [1] used a Gaussian mixture model to classify accented speech and speaker gender. Using MFCCs as their feature set, they investigated the relationship between the number of utterances in the test data and accent identification error. The study displays very impressive results, which encourages us to think that non-prosodic feature sets can be promising for accent classification.

3. DATA AND PREPROCESSING

All training and testing were done with the CSLU: Foreign Accented English v1.2 dataset (Linguistic Data Consortium catalog number LDC2007S08) [5]. This corpus consists of American English utterances by non-native speakers: 4,925 telephone-quality utterances from native speakers of 23 languages. Three independent native American English speakers ranked and labeled the accent strength of each utterance.

We used the Hidden Markov Model Toolkit (HTK) for feature extraction, MATLAB for preprocessing, and LibSVM and the Waikato Environment for Knowledge Analysis (Weka) for classification.

Data points were taken from 25 ms clips of utterances and averaged over a window of multiple seconds to form features. Various preprocessing techniques were attempted, including sliding windows, various window lengths, standardization, and the removal of zeros from data points. Four-second, non-sliding windows with standardization were chosen for further work, as this configuration gave the best results on our baseline classifier.
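The windowing step can be summarized as follows. This is a minimal NumPy sketch, assuming frame-level features (e.g., exported from HTK) arrive as a (num_frames x num_coefficients) matrix with a hypothetical 10 ms frame shift; it averages frames over non-sliding four-second windows and then standardizes each feature dimension.

```python
import numpy as np

def window_features(frames: np.ndarray, frame_shift_s: float = 0.010,
                    window_s: float = 4.0) -> np.ndarray:
    """Average frame-level features over non-sliding windows.

    frames: (num_frames, num_coeffs) matrix of 25 ms frame features.
    Returns one averaged feature vector per complete window.
    """
    per_window = int(window_s / frame_shift_s)   # e.g., 400 frames per 4 s
    n_windows = len(frames) // per_window        # drop the ragged tail
    trimmed = frames[: n_windows * per_window]
    return trimmed.reshape(n_windows, per_window, -1).mean(axis=1)

def standardize(X: np.ndarray) -> np.ndarray:
    """Zero-mean, unit-variance scaling per feature dimension."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

# Hypothetical usage: a 30 s utterance at a 10 ms shift -> 3000 frames of 13 MFCCs.
X = standardize(window_features(np.random.randn(3000, 13)))
print(X.shape)  # (7, 13)
```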

4. CLASSIFYING TWO ACCENTS

We began by assessing feature set quality and classifier performance based on classification accuracy between two accents. Aiming to select accents that would be more easily differentiable, we initially selected the Mandarin and German accents. Our initial feature sets were Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), and filterbank energy (FBANK) features, as these were the features most frequently used in previous work, especially MFCC and LPC. FBANK features represent the prominence of different frequencies in a sound sample, while MFCCs normalize these to take human perception of sound into account. LPC features also represent sound as frequencies, but separate the sound into a base buzz and additional formants.

4.1 Establishing a Baseline

For our baseline classification, we ran Naive Bayes, logistic regression, and SMO classifiers¹ on each of the FBANK, MFCC, and LPC feature sets for German- and Mandarin-accented speech files. For each pair of classifier and feature set, we obtained the results shown in Table 1.

Table 1. Testing accuracy (%) of the ZeroR, Naive Bayes, logistic regression, and SMO baseline classifiers on the FBANK, MFCC, and LPC feature sets.

¹ Unless otherwise specified, default Weka values were used for classifier parameters.

4.2 Assessing Data Quality

To determine whether insufficient data was causing poor accuracy, we divided our feature data into a testing set (30%) and a training set (70%). We measured classification accuracy on the testing set when each classifier was trained on increasing fractions of the training data. We observed that accuracy increased when the classifier was trained with more data, but diminishing accuracy gains suggested that insufficient data was not the primary cause of poor accuracy (see Figure 1).

Figure 1. Significance of data set size.

We also tested whether the accent data was too subtle, as some speech samples barely sound accented even to a human listener. Each speech sample was previously rated by three judges on a scale from 1 (negligible or no accent) to 4 (very strong accent with hindered intelligibility) [5], so we extracted FBANK features (which produced higher baseline accuracies than MFCC and LPC) from subsets of the more heavily accented data and measured classification accuracy with our baseline classifiers. Specifically, we selected speech samples with average ratings greater than 2.5 and greater than 2.7. However, classification accuracy saw little improvement, perhaps due to the effect of a reduced data set size (see Table 2). Consequently, we continued to use all available data for Mandarin- and German-accented speech.

Table 2. Accuracy (%) of baseline classifiers using only the most heavily accented data (average accent strength above 2.5 and above 2.7) with FBANK features.

4.3 Improving Feature Set Selection

Next, we considered the quality of our features and expanded our MFCC feature set to include deltas, accelerations, and energies (TARGETKIND = MFCC_E_A_D in HTK configuration files). This again achieved little improvement over MFCC alone. By plotting training accuracy against testing accuracy (see Figure 2), we observed that training accuracy was also low, showing that we were under-fitting the data. Thus, we attempted to boost accuracy by first over-fitting our training data before trying any optimization. We merged the individual feature sets (expanded MFCC, LPC, and FBANK) into a single set, but found that training error still did not improve substantially (see Table 3).

Figure 2. Classifier training and testing accuracies vs. training set size.

Table 3. Accuracy (%) of baseline classifiers on the merged feature set containing MFCC, LPC, and FBANK features.
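A rough scikit-learn analogue of this diagnosis is sketched below (the project itself used Weka; the feature matrices and labels here are random placeholders standing in for the real windowed features). It merges the feature sets and compares training against testing accuracy to check for under-fitting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Placeholder feature matrices (MFCC+deltas, LPC, FBANK) and accent labels.
rng = np.random.default_rng(0)
X_mfcc, X_lpc, X_fbank = (rng.normal(size=(500, d)) for d in (39, 13, 26))
y = rng.integers(0, 2, size=500)  # 0 = German, 1 = Mandarin (hypothetical)

# Merge the individual feature sets into a single set, 70/30 split as above.
X = np.hstack([X_mfcc, X_lpc, X_fbank])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Rough stand-ins for Weka's NaiveBayes, Logistic, and SMO classifiers.
for name, clf in [("Naive Bayes", GaussianNB()),
                  ("Logistic regression", LogisticRegression(max_iter=1000)),
                  ("Linear SVM (SMO-like)", SVC(kernel="linear"))]:
    clf.fit(X_tr, y_tr)
    # Low *training* accuracy here would confirm under-fitting.
    print(f"{name}: train={clf.score(X_tr, y_tr):.3f} "
          f"test={clf.score(X_te, y_te):.3f}")
```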
We subsequently ran feature selection algorithms (including Correlated Features Subset Evaluation and subset evaluation using logistic regression and SMO) to try to remove all but the strongest features. This improved the accuracy on the training data but not on the testing data, which suggests that classifying stronger accents using a larger data set could help.
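Weka's subset evaluators have no exact scikit-learn counterpart, but a greedy wrapper search is a rough analogue. The sketch below (again with placeholder data, since the real features came from HTK) keeps the strongest features according to a cross-validated logistic regression:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 78))    # placeholder merged feature matrix
y = rng.integers(0, 2, size=500)  # placeholder accent labels

# Greedy forward selection, scoring candidate subsets by cross-validation;
# loosely mirrors wrapper-style subset evaluation in Weka.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=10, direction="forward", cv=5,
)
X_small = selector.fit_transform(X, y)
print(X_small.shape, np.flatnonzero(selector.get_support()))
```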

4.4 Selecting a Better Classifier

To improve training error, we tried K-Nearest Neighbors (KNN) as well as LibSVM. KNN performed poorly, but we observed dramatic improvements in training set classification accuracy using a LibSVM classifier with a Gaussian kernel (see Table 4).

Table 4. Training and testing accuracy (%) of the initial LibSVM classifiers with a Gaussian kernel.

Although training accuracy increased significantly, we did not see similar gains in testing accuracy. To boost testing accuracy, we optimized the parameters of our LibSVM classifier (see Figure 3). Optimizing gamma against C (the coefficient for the penalty of misclassification), we finally saw an improvement: a testing accuracy of 63.3% with C = 128 and gamma = 0.000488 as the parameters of the Gaussian kernel. We experimented with sigmoid and polynomial kernels and various parameter sets, but computing resources limited the range of parameters we could try, so we did not achieve better accuracy in these preliminary optimizations.

Figure 3. Optimizing the gamma and C parameters of the LibSVM Gaussian kernel.

5. CLASSIFICATION ACROSS MULTIPLE LANGUAGES

We proceeded to process twelve accents from our dataset, choosing only those with at least 200 utterances. By reselecting parameters for LibSVM, we obtained a classification accuracy of 13.26%, a significant improvement over the baseline accuracy of random guessing (approximately 8%). Further, the confusion matrix across these twelve accents displayed interesting results. Figure 4 plots the percentage of cases in which each language on the y-axis was classified as a language on the x-axis. While we do not see a particularly distinct diagonal indicating correct classifications, the plot does illuminate some interesting relationships in our accent database.

The resulting figure shows that the Cantonese accent is very distinctive in our dataset and is the easiest to classify with our features. It also suggests that our Hindi accent samples share many aspects with the other languages, since many instances of the other accents were misclassified as Hindi, while the opposite is true for German. This suggests that our initial choice of German and Mandarin for the two-class problem might have produced better results had we chosen other accents.

The figure also hints at the similarity of accents from geographically proximate countries. For example, the German accent is most frequently confused with the French and Swedish accents, and the Japanese accent was often confused with the Cantonese and Mandarin accents. However, it also reveals that geographic proximity does not absolutely determine accent resemblance: the French accent is actually least likely to be confused with the German accent, despite the fact that France and Germany are bordering countries.
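The parameter search and the confusion matrix of Figure 4 can be reproduced in outline with scikit-learn's SVC, which wraps LibSVM. This is a sketch on placeholder data; the grid bounds are assumptions chosen in the spirit of the C = 128, gamma ≈ 0.000488 optimum reported above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 78))     # placeholder features for 12 accents
y = rng.integers(0, 12, size=1200)  # placeholder accent labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Log-spaced grid over C and gamma for the Gaussian (RBF) kernel.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": 2.0 ** np.arange(-3, 11, 2), "gamma": 2.0 ** np.arange(-15, 1, 2)},
    cv=3,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))

# Row-normalized confusion matrix: entry (i, j) is the fraction of
# accent i's test samples classified as accent j (cf. Figure 4).
cm = confusion_matrix(y_te, grid.predict(X_te), normalize="true")
print(np.round(cm, 2))
```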

Figure 4. Confusion matrix for 12-way accent classification.

6. FUTURE WORK

We tried many different approaches in order to arrive at the best possible accent classifier using a set of features based solely on pitch. In the end, our testing error was still significantly higher than our training error, so these results might still be improved. To do this, we would want to use a larger data set with stronger accents. Performing more intensive feature selection using subset evaluation on LibSVM, which was infeasible with our limited computing and time resources, would likely prove helpful, as would performing more intensive parameter selection for different kernels.

In addition, the accent classification problem could be significantly different from other speech classification problems, and thus other feature sets might be more informative. At this point, we would need to work with linguists and sociologists to generate these relevant features from scratch.

Altering the problem slightly, we could cluster accents from a common geographic region and work to discriminate between those groups. Conversely, further analysis of our current classification results and of how they correlate with geographic and historical data could uncover or reinforce insights into the structures and origins of different languages and the histories of different peoples.

7. CONCLUSION

There is much need for improvement before an accent classifier could be used definitively in a speech recognition system. In our work, however, we have made progress in this area and have also uncovered insights into the relationships between accents and their origins. This suggests that in the future there is hope for further improvement and an increased understanding of how we speak and where we come from.

8. ACKNOWLEDGMENTS

Thanks to Andrew Maas for his support and advice throughout this project!

9. REFERENCES

[1] T. Chen, C. Huang, C. Chang, and J. Wang, "On the use of Gaussian mixture model for speaker variability analysis," presented at the Int. Conf. Spoken Language Processing, Denver, CO, 2002.

[2] E. Gouws, K. Wolvaardt, N. Kleynhans, and E. Barnard, "Appropriate baseline values for HMM-based speech recognition," in Proceedings of PRASA, November 2004, pp. 169–172.

[3] J. H. L. Hansen and L. M. Arslan, "Foreign accent classification using source generator based prosodic features," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 1, 1995, pp. 836–839.

[4] C. Huang, T. Chen, S. Li, E. Chang, and J. L. Zhou, "Analysis of speaker variability," in Proc. Eurospeech, vol. 2, 2001, pp. 1377–1380.

[5] T. Lander, CSLU: Foreign Accented English Release 1.2. Linguistic Data Consortium, Philadelphia, 2007.

[6] D. Robinson, K. Leung, and X. Falco, "Spoken Language Identification with Hierarchical Temporal Memories," CS229 class project, Stanford University (LeungRobinson.pdf).
