Automatic Speech Recognition System for Somali in the Interest of Reducing Maternal Morbidity and Mortality

Second-cycle thesis

Authors: Joycelyn Laryea (h19joyla@du.se) & Nipunika Jayasundara (h18rupja@du.se)
Supervisors: Nausheen Saeed & Kenneth Carling
Examiner: Moudud Alam
Subject/main field of study: Business Intelligence
Course code: MI4002
Higher education credits: 15 credits
Date of examination: 27th May 2020

Abstract

Developing an Automatic Speech Recognition (ASR) system for the Somali language, though not novel, has not been actively explored; hence there has been no successful model for conversational speech, nor are related works accessible as open source. The unavailability of digital data is what labels Somali a low-resource language and poses the greatest impediment to the development of an ASR for Somali. The incentive to develop an ASR system for the Somali language is to contribute to reducing the Maternal Mortality Rate (MMR) in Somalia. Researchers acquire interview audio data regarding maternal health and behaviour in the Somali language; to be able to engage the relevant stakeholders and bring about the needed change, these audios must be transcribed into text, which is an important step towards translation into any language. This work investigates available ASR for Somali and attempts to develop a prototype ASR system to convert Somali audios into Somali text. To achieve this target, we first identified the available open-source systems for speech recognition and selected the DeepSpeech engine for the implementation of the prototype. With three hours of audio data, the accuracy of transcription falls short of what is required and cannot be deployed for use. We attribute this to insufficient training data and estimate that the effort towards an ASR for Somali would be significantly advanced by acquiring about 1,200 hours of audio to train the DeepSpeech engine.

Key Words: Maternal Mortality and Morbidity Rate (MMMR), Automatic Speech Recognition (ASR), DeepSpeech, Natural Language Processing (NLP), Word Error Rate (WER), Character Error Rate (CER)

Table of Contents

1. Introduction
2. Research Objectives and Work Plan
3. State of the art
   3.1 Speech Recognition
      3.1.1 Types of Speech Recognition
      3.1.2 Hidden Markov Model (HMM) and End to End Model Automatic Speech Recognition
   3.2 ASR Performance Measures
      3.2.1 Word Error Rate
   3.3 Related work and Research strategy
   3.4 Open-Source ASR
      3.4.1 Kaldi
      3.4.2 DeepSpeech
      3.4.3 DeepSpeech2
   3.5 Somali as a Low Resource Language
   3.6 Proposed Automatic Speech Recognition System for Somali
4. Methodology
   4.1 Data Collection and Preprocessing
   4.2 Experimental Set up
   4.3 Proof of concepts
5. Results
   5.1 Results for ASR model
   5.2 Results for Interview and Youtube models
      5.2.1 Results for Interview model
      5.2.2 Results for Youtube model
6. Discussion
   6.1 Future Work
7. Conclusion
8. References
9. Appendix

List of Figures

Figure 1: Work plan to achieve the objective of this work
Figure 2: ASR architecture
Figure 3: DeepSpeech Diagram
Figure 4: Workflow Diagram to achieve a Somali ASR
Figure 5: Experimental setup with GPU-enabled Google Cloud resources
Figure 6: Experimental interface (Google Colaboratory) on importing audios and at start of training

List of Tables

Table 1: Summary of the reviewed literature
Table 2: Data collected
Table 3: Data as processed by the DeepSpeech Engine
Table 4: DeepSpeech installation process
Table 5: Table of results for all three models

1. Introduction

Globally, an improvement has been seen in child delivery attended by professional birth attendants. That of Sub-Saharan Africa remains at 59% in 2018 compared to the world score of 81% (Platform, 2019). Sub-Saharan Africa suffers the most maternal mortality, averaging 533 deaths per 100,000 live births. Out of 46 Sub-Saharan African countries, Somalia's maternal mortality ratio (MMR) is estimated at 829 deaths per 100,000 live births, ranked as very high and better only than Chad and South Sudan (WHO, 2019).

As its contribution to achieving the third Sustainable Development Goal (SDG) target of 70 deaths per 100,000 live births, Dalarna University, Sweden, is involved in training Somali health professionals. However, a low proportion of women is still observed to seek the services of the available skilled staff. It is also acknowledged that other factors contributing to maternal mortality transcend the management of delivery. Thus, a research team from Dalarna University has sought to find out the reason for the status quo, noting that reducing the maternal mortality rate will remain unlikely unless sentiments and attitudes towards seeking timely professional care change.

The female literacy rate in Somalia, estimated at 25.8% (Cline, 2018), limits the avenues for data collection by sampling or soliciting sentiments via social media or questionnaires. The viable option is to depend on audio data. The research team from Dalarna University conducted individual and focus group interviews with Somali women. These interviews were either in English or Somali.

There is a need to transcribe these Somali audios to text, as it is a good intermediary step to translation. It is easier and more widespread to automatically translate text into another language, for example, Somali text to Swedish or English, than it is to translate speech. This will enable qualitative analysis, sharing with fellow researchers, as well as engaging relevant policymakers and health professionals. The challenge, though, is that these audios are transcribed manually into text, which is time-consuming and laborious work; it takes an expert 30 to 40 minutes to transcribe a 10-minute Somali audio and about two hours for a novice (Osman, 2020).

To bridge this gap, it is necessary to find out the available open-source systems, tools, or applications which can transcribe audio into text, whether they can be applied to the Somali language, and what accuracy they achieve. Learnings from this could be relevant to other under-represented languages, not only in humanitarian settings but also in preserving endangered languages and traditions that have been passed on orally, which is an area of interest to the United Nations Educational, Scientific and Cultural Organization.

2. Research Objectives and Work Plan

As discussed in the previous section, learning what actions are necessary to increase the use of professional care by Somali pregnant women, and thereby reduce mortality, is hampered by the difficult and ineffective way of acquiring data from the relevant informants. These difficulties, we believe, can be mitigated if an automatic speech recognition (ASR) system for Somali were available. Hence, this thesis aims to explore the applicability of such systems to the Somali language with the hope of being able to implement a prototype.

Figure 1: Work plan to achieve the objective of this work

With reference to Figure 1, and as detailed in the sections to come, to achieve the stated target we:

· Review the existing open-source tools, if any, to transcribe Somali audio into text; peruse literature on speech recognition systems, their structure and approach. Is there an open-source ASR for Somali? In the absence of an open-source ASR for Somali, identify ASR tools that can be used for the Somali language.

· Assess the accuracy of these tools; based on the documented ways of evaluating ASR performance, identify the necessary evaluation parameters by which the ASR is assessed.

· Identify improvements that are needed for these tools, such as creating a language corpus (data set) for the Somali language.

· Outline the possible technical obstacles that could prevent the adoption of ASR for the stated purpose by noting the requirements for a successful ASR with the selected tool. Find infrastructure environments to test open-source tools, such as free cloud GPU-enabled services, operating systems, and required software and libraries.

· If feasible, develop a prototype with the most promising approach. Acquire and process data and the needed resources to build a model for a Somali ASR. Present results with findings and recommendations.

3. State of the art

The following paragraphs discuss speech recognition, the available literature on automatic speech recognition (ASR) for the Somali language or similar low-resource languages that can inform our work, and open-source automatic speech recognition tools.

3.1 Speech Recognition

Speech recognition is the ability to match a voice against an acquired or provided language or vocabulary set (Saini et al., 2013). Speech recognition systems, also known as Automatic Speech Recognition (ASR), Computer Speech Recognition, or Voice Recognition, are an application of Natural Language Processing (NLP), which is the understanding of human language using computers (Arumugam et al., 2018).

Speech recognition systems enable one to communicate with an electronic device by the use of voice as input. This could result in initiating an action, getting a voice response as would be expected in conversation with another person, or in the transcription of an utterance into text, as is experienced in the speech-to-text (STT) versions of speech recognition systems.

In the same vein as other technological systems, as an enabler it is meant to reduce effort and increase productivity, as well as provide a level playing field where otherwise one would have been disadvantaged. This refers to the initial adoption of speech recognition in civil society, which was to aid people with musculoskeletal disabilities to productively use computers (GlobalSecurity.org, 2011). Since then, speech recognition has been adopted widely in many fields, from telephony to medicine, to smart systems and the Internet of Things, among others.

A well-established speech recognition system comprises a microphone (for speaker input), speech recognition software (for a machine to interpret the speech), and a good-quality sound card for input and/or output (Saini et al., 2013).

3.1.1 Types of Speech Recognition

We discuss speech recognition as observed in literature along the lines of technology, size of vocabulary and bandwidth, and speaker characteristics.

I. Technology

Ghai et al. (2012) posit five different approaches to automatic speech recognition:

The Acoustic-Phonetic approach extracts speech with the "relevant acoustic properties such as nasality, frication, voiced-unvoiced classification and continuous features such as formant locations and the ratio of high and low frequencies". This approach has not been used commercially.

The Pattern recognition approach has become the dominant method of speech recognition. It requires pattern training and pattern comparison. There are two sub-approaches to pattern recognition: the Template-based approach and the Stochastic approach.

· In the Template-based approach, a group of speech patterns serves as a prototype and is stored as a reference file of the candidate's dictionary of words. In testing, the pattern that best matches the unknown spoken utterance is selected (a toy illustration follows this list).

· The Stochastic approach is based on the use of probabilistic models which cater for unsure or incomplete information such as misrepresented sounds, variation in speaker, effects due to context, and homophone words.
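To make the template-based idea concrete, the following is a minimal sketch of template matching with dynamic time warping (DTW). It is our illustrative addition, not code from any of the cited systems: the feature sequences are random stand-ins, and the two Somali vocabulary words ("haa"/yes and "maya"/no) are chosen only for the example.

```python
# Toy template matching: the stored template with the lowest DTW alignment cost wins.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Alignment cost between two feature sequences of shape (time, features)."""
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return float(d[n, m])

# One stored feature sequence per vocabulary word (random stand-ins here).
rng = np.random.default_rng(0)
templates = {"haa": rng.normal(size=(40, 13)), "maya": rng.normal(size=(55, 13))}

# An unknown utterance: a noisy copy of the "haa" template.
utterance = templates["haa"] + 0.1 * rng.normal(size=(40, 13))

best = min(templates, key=lambda w: dtw_distance(utterance, templates[w]))
print(best)  # prints "haa": the closest template is selected
```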

Knowledge-based is the third approach. This approach models how a person relies on her intelligence to visualize, analyze and characterize speech based on a set of measured acoustic features. This approach is a solution to drawbacks in the acoustic-phonetic and template-based approaches, since both models fail to discover a significant understanding of human speech processing by themselves. The knowledge-based approach can identify the variability of inter- and intra-speaker speech (Samoulian, 1994). Research by Tripathy et al. (2008) also indicates that the knowledge-based approach, using a fuzzy inference algorithm for the classification of English vowels, fares better than Mel-frequency cepstral coefficient (MFCC) feature analysis.

The Connectionist approach is the youngest development in speech recognition. In this approach, knowledge representation and knowledge source integrations are considered. The knowledge and constraints are distributed across many simple computing units. This simplicity and uniformity of the underlying processing elements make the connectionist approach more attractive.

The Support Vector Machine (SVM) is the last approach, and it is considered one of the state-of-the-art approaches for speech pattern recognition. However, one drawback of this approach is that it can only classify fixed-length data; thus it cannot be used efficiently for variable-length data classification. Nevertheless, variable-length data can be transformed into fixed-length vectors before being used in the SVM approach. Padrell-Sendra et al. (2006) have used the SVM approach and a token-passing algorithm to obtain a chain of recognized words. The result of this research showed that SVM improves the recognition accuracy of speech with a small database, but replicating this on a large database requires a huge computational effort (Ghai et al., 2012).

II. Vocabulary Size and Bandwidth

Vocabulary size and bandwidth specify the types of automatic speech recognition systems as:

Isolated Word Speech Recognition (IWR), which is defined by two assumptions: first, that it recognizes speech consisting of a single word or phrase, and second, that there should be a clearly defined beginning and end for each spoken word or phrase. Kumar et al. (2011) have created a Hindi IWR ASR on the Linux platform with a vocabulary of 30 words.

Connected Word Recognition (CWR) is the second type based on vocabulary size and bandwidth. This type can recognize words or phrases which are separated by pauses. CWR can recognize small to moderate size vocabularies, such as combinations of alphanumerics, spelt letter sequences, and digit strings, and therefore it is identified as a class of fluent speech strings.

Continuous Speech Recognition (CSR) is the third, and this deals with speech where words are connected instead of separated by pauses. However, the performance of these types of ASR is dependent on the unknown boundary information of words, rate of speech, and coarticulation.

Spontaneous speech recognition is the final and most sophisticated type, and it recognizes natural-sounding, not pre-learned, speech. This type of ASR can recognize speech with noise. Raza et al. (2009) have worked on a spontaneous speech recognition system for the Urdu language by employing a varying mixture of read and spontaneous data. This resulted in a reduction of Word Error Rate (WER) after incorporating read data at a 1:1 ratio with spontaneous data. However, thereafter WER increased with the increase in read data (Ghai et al., 2012).

Another school of thought classifies speech recognition based on vocabulary size simply as isolated word and continuous speech recognition (Arumugam et al., 2018).

III. Speaker Characteristic

Speech recognition defined by speaker dependence learns to identify the characteristics of the user's voice. It requires a set-up phase to familiarize itself with its user's voice traits before it can be used. Speaker-independent speech recognition software is designed to recognize any voice and should not require training to suit each user before use.

This, unfortunately, leads to poorer accuracy in comparison with speaker-dependent tools. To be effective, speaker-independent systems can be made to operate with a limited vocabulary by identifying only keywords (Saini et al., 2013).

Based on the purpose for its development, an Automatic Speech Recognition system can be broadly classified as either speaker-independent or speaker-dependent, irrespective of any of the other categories it may fall under. The adoption of speaker-independent ASR has gone beyond the transcription of keywords in well-resourced languages to keywords in under-resourced languages and, on demand, to identifying an unlimited number of words, i.e. large vocabulary continuous speech recognition (LVCSR).

In a Stanford University, University of Washington and Baidu study of phone input (Knight, 2015), not only was the speech recognition system using DeepSpeech2 three times faster than human beings in transcribing voice to text, it also had a lower transcription error rate than a human using a keyboard: 20.4% and 63.4% lower in English and Mandarin respectively. Its proliferation is to be expected on the heels of such research (Shahani, 2016).

3.1.2 Hidden Markov Model (HMM) and End to End Model Automatic Speech Recognition

The task of speech recognition is made complex by the variations observed in speech data. Notable variations are in speaker characteristics, the level of surrounding noise, and the size of the vocabulary to be identified, among others (Arumugam et al., 2018).

Large vocabulary continuous speech recognition can be divided into two categories: Hidden Markov Model (HMM) based and End to End models. HMM-based models were the main ASR systems until the introduction of deep learning techniques to accomplish end to end speech recognition, which substitutes engineering processes with learning processes and is simpler to construct and train (Wang et al., 2019).

An Automatic Speech Recognition System is made up of five main components:

· Signal processing and feature extraction: as suggested by its name, it takes a segmented speech signal as input and extracts features that are meant to reflect the spoken content, eradicating all non-relevant information such as noise and speaker tone, among others.

· Acoustic model: uses features from the component above to generate a series of potential phone transcripts and the probability of each sequence based on the extracted features. Phones are a subdivision (unit) of speech that captures the base set of sounds within a language. Other units could be words or syllables.

· Pronunciation model: this serves as a dictionary of valid ways of pronouncing words.

· Language model: the purpose of the language model is to estimate the next word given a sequence of preceding words.

· Hypothesis search decoder: establishes the goal of the ASR by returning the most probable word sequence given the voice input (Keshet, 2018).

An HMM-based automatic speech recognition system is distinguished by three independent parts: the acoustic, language and pronunciation models (dictionary) (Wang et al., 2019).

Figure 2: ASR architecture. Receives an acoustic signal as input. Each of the components outputs a probability that is used by the decoder to determine the most probable word sequence. Keshet (2008).

The end to end model consists of an encoder that maps the speech input to a feature sequence, an aligner which realizes the alignment between the feature sequence and the language, and a decoder that presents the identified results. However, unlike in the HMM models, it is not easy to clearly distinguish these parts from one another (Wang et al., 2019). A minimal sketch of such an end to end pipeline follows.
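To make the encoder/aligner/decoder roles tangible, below is a minimal, untrained end to end sketch in the CTC style used by systems such as DeepSpeech. It is our illustrative addition, not the thesis implementation: the character inventory, model sizes and synthetic audio are all assumptions, and the printed output is gibberish until the model has been trained with a CTC loss.

```python
# Minimal end-to-end sketch: MFCC features -> LSTM encoder -> per-frame character
# log-probabilities -> greedy CTC decode. Assumes numpy, librosa and torch are installed.
import numpy as np
import librosa
import torch
import torch.nn as nn

# Hypothetical character inventory; a real Somali model would use the full orthography.
CHARS = [" "] + list("abcdefghijklmnopqrstuvwxyz'")
BLANK = 0  # the CTC blank symbol takes index 0; characters occupy indices 1..len(CHARS)

class TinyEncoder(nn.Module):
    """Encoder: maps a sequence of MFCC frames to per-frame character scores."""
    def __init__(self, n_mfcc=13, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(n_mfcc, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, len(CHARS) + 1)  # +1 for the CTC blank

    def forward(self, x):                  # x: (batch, time, n_mfcc)
        h, _ = self.rnn(x)
        return self.fc(h).log_softmax(-1)  # (batch, time, len(CHARS) + 1)

def greedy_ctc_decode(log_probs):
    """Decoder at inference time: collapse repeated symbols, then drop blanks."""
    ids, out, prev = log_probs.argmax(-1).squeeze(0).tolist(), [], BLANK
    for i in ids:
        if i != prev and i != BLANK:
            out.append(CHARS[i - 1])
        prev = i
    return "".join(out)

# Feature extraction on one second of synthetic audio (stands in for a real recording).
sr = 16000
audio = np.random.randn(sr).astype(np.float32)
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13).T  # (time, n_mfcc)

model = TinyEncoder()
with torch.no_grad():
    log_probs = model(torch.from_numpy(mfcc).unsqueeze(0))
print(greedy_ctc_decode(log_probs))
# Training would minimize nn.CTCLoss(blank=BLANK) between these log-probabilities and
# reference transcripts; that loss is what realizes the feature-to-text alignment.
```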

3.2 ASR Performance Measures

The performance of an ASR is assessed with test data; this data is different from that which was used in training the model (Saini et al., 2013). Frequently occurring accuracy measures observed in literature are the Out of Vocabulary (OOV) rate, Perplexity (PPL) and Word Error Rate (WER) (Cucu et al., 2011; Menon et al., 2018; Biswas et al., 2019). Others are the Character Error Rate (CER), the loss function (or loss), word accuracy, and the confidence score, which indicates the probability that the result returned by the system is what the speaker said.

Perplexity (PPL) is a measure of how well the language model can predict an unseen text set, whilst the Out of Vocabulary (OOV) rate indicates the words in the test corpus that cannot be recognized because they are not in the vocabulary of the model.

The loss is a parameter used to assess classification tasks, and it indicates how much the prediction differs from the true class. The wider the deviation, the larger the value of the loss function (Hannun et al., 2014).

The choice among these measures to evaluate an ASR is determined by the purpose the ASR is to serve. The word error rate (WER) runs through literature as the means of assessing Automatic Speech Recognition (ASR) systems. Perplexity and OOV rates assess the language models within the ASR. The speed of the ASR could also be used as a measure when comparing one ASR to another.

3.2.1 Word Error Rate

The WER refers to the amount of text the model failed to transcribe correctly. Given the output of an ASR, the word error rate is computed as the sum of the errors (i.e. insertions, substitutions, and deletions) divided by the total number of words in the script. This is done in comparison with a reference text (Koo, 2019).

WER = (S + D + I) / N

Where:
S = number of substitutions
D = number of deletions
I = number of insertions
N = number of words in the reference text
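As a worked illustration of the formula (our own sketch, not code from the thesis), the error count S + D + I is exactly the word-level edit distance between the reference and the hypothesis, computable with dynamic programming; the Somali phrase in the usage line is an illustrative stand-in.

```python
# Word error rate via word-level edit distance (Levenshtein) dynamic programming.
def edit_distance(ref: list, hyp: list) -> int:
    """Minimum number of substitutions, deletions and insertions turning ref into hyp."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                 # delete all remaining reference tokens
    for j in range(len(hyp) + 1):
        d[0][j] = j                                 # insert all remaining hypothesis tokens
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + sub)    # match or substitution
    return d[len(ref)][len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

# One substituted word out of three reference words: WER = 1/3.
print(wer("waa maxay magacaagu", "waa maxay magaca"))  # 0.333...
```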

Though seemingly simple to compute, the WER is faulted for being reliant on the exact reproduction of the transcripts, and it could exceed 100% if the errors outnumber the words in the transcript. It is also argued that errors may vary in their level of impact on a successful outcome and thus should be weighted accordingly (Hunt, 1990).

The CER is computed in the same manner as the WER; however, it refers to all characters, be they alphabetic characters, spaces or punctuation marks, and it translates directly to the accuracy of the model, i.e. a CER of 5% indicates an accuracy of 95% (Alvermann, 2019).
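The character-level variant differs from the word-level one only in tokenization. The snippet below is again our illustration, assuming the edit_distance helper from the previous sketch is in scope; the example strings are made up.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over characters, spaces and punctuation included."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(cer("hooyo", "hooya"))  # 1 wrong character out of 5 -> 0.2, i.e. 80% accuracy
```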

3.3 Related work and Research strategy

Abdillahi et al. (2006) present a paper detailing their initial efforts to build an automatic speech-to-text transcription system to preserve African oral patrimony, particularly the Djibouti cultural heritage. They focus on Somali, which makes up half of the Djiboutian audio archives that are to be transcribed.

Via radio broadcasts, databases of cultural audio archives have been generated in most African countries. The need, therefore, is to save this patrimony by digitalizing the recordings and finding a way to process the data. Efforts on the former are well advanced; automatic transcription and indexing tools, however, come into play in accessing this rich store of data. The number of African languages involved also poses a challenge, as a system must be developed for each language.

The main problem the researchers faced in developing the Automatic Speech Recognition (ASR) system was the lack of a textual corpus in African languages. They overcame this by using internet documents with a total of three million Somali words. These were split up for speech corpus recording and language model training. The Somali texts were read by 10 native Somali speakers in a controlled environment, resulting in a 10-hour audio corpus they called Asaas, which translates as "beginnings" in Somali.

During data preprocessing, Abdillahi et al. (2006) undertook normalization; novel written languages such as Somali, that is, languages that have only recently been put into text, may have different ways of spelling the same word. The process of normalization takes away this variation and decides on a standard for the written text; in this instance, the most frequently occurring form of a word was accepted as its standard form. The researchers also created a series of transducers, which are machines that produce output strings determined by the input they receive (Mohri, 1997). The transducers transform into textual form the different abbreviations and numbers (dates, phone numbers, money, etc.) which appear in the corpus.

The learning process for the Somali acoustic model was aided by the Laboratoire Informatique d'Avignon (LIA) acoustic modeling toolkit. The researchers first built a baseline acoustic model for the Somali language from a French model by setting up a concordance table between French and Somali phonemes. They also acquired an automatic baseline model for Somali from the French model by using a confusion matrix between French and Somali phonemes. Both had similar outcomes.

A language model predicts the most likely sequence of words to form a sentence. Their language model consisted of both bigram and trigram sequences, with an extract from a trigram Somali language model that had been trained with the Carnegie Mellon (CMU) Statistical Language Modeling toolkit, a set of Unix software tools to aid language modeling. In defining a sequence of characters or words, an N-gram refers to a sequence made up of "N" elements; hence bigrams and trigrams refer to sequences of two or three words or characters respectively (Arumugam et al., 2018).
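As a toy illustration of how such N-gram estimates arise from raw counts (our own sketch; the miniature corpus is a stand-in, and real toolkits such as the CMU one add smoothing and back-off for unseen sequences):

```python
# Maximum-likelihood bigram probabilities from raw counts.
from collections import Counter

corpus = "hooyada iyo ilmaha hooyada iyo dhakhtarka".split()  # stand-in Somali text

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(word: str, nxt: str) -> float:
    """P(nxt | word) estimated from bigram counts; without smoothing, unseen pairs get 0."""
    return bigrams[(word, nxt)] / unigrams[word]

print(p_next("hooyada", "iyo"))  # 1.0: every occurrence of 'hooyada' is followed by 'iyo'
```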

With 9.5 hours of training audio, three million words for the language model, and the use of Laboratoire Informatique d'Avignon's large vocabulary speech recognition system, Speeral, the researchers obtained a word error rate (WER) of 20.9% on a 30-minute test corpus. Without the normalization process, however, the error rate was 32%, attesting to the necessity of normalization for the Somali language and other recently written languages (Abdillahi et al., 2006).

The next review is of work by Menon et al. (2018), part of a United Nations (UN) effort to spot keywords to aid in monitoring humanitarian relief programmes. Within constraints of time and linguistic expertise, which afforded them only 1.57 hours of speech data, they made an effort to produce the best possible Somali Automatic Speech Recognition (ASR) system, comparing different acoustic models in the process as well as augmenting the language model data.

This initiative was to replicate the success of the radio browsing system as deployed in Uganda. Ugandan, like Somali, is