
Research Article

Automatic Transliteration of Proper Names from Somali to English

Ahmed Muktar Omar*, Jian Qu and Sumeth Yuenyong
School of Information Technology, Shinawatra University
99 Moo 10, Bangtoey, Samkhok, Pathum Thani 12160, Thailand
*Correspondence: muktar7@gmail.com

Thammasat International Journal of Science and Technology, Vol.21, No.4, October-December 2016
DOI 10.14456/tijsat.2016.26

Abstract

Transliteration of proper names is the process of converting words from a source natural language (such as Somali) to a target natural language (such as English) while maintaining their pronunciation. Proper names and technical words are challenging for bilingual translation systems and for Cross-Language Information Retrieval (CLIR) applications because they are absent from most dictionaries. In this paper, we study automatic transliteration from Somali to English, which is an under-studied problem. Our Somali-English transliteration system uses transliteration rules based on the orthographic mapping of the source language characters to the characters of the target language. We also propose an alignment method that maps the Somali characters that have no direct matching character, in order to obtain an accurate transliteration in English. Our novel approach particularly enhances Somali-English transliteration.

Keywords: Somali-English; Somali Transliteration Table; Grapheme-Based

1. Introduction

Somali is a Cushitic language which belongs to the Afro-Asiatic (or Hamito-Semitic) family of languages. The Somali language is similar to Semitic languages such as Arabic and Hebrew. It is the mother tongue of ethnic Somalis in Greater Somalia and is by far the most well-documented of all Cushitic languages [1].

Somali is the official language of Somalia and Djibouti and a working language in the Somali regions of Ethiopia and Kenya. Somali uses different writing systems, and the Latin alphabet has been the official writing system in the Federal Republic of Somalia and Djibouti since 1972 [2]. It uses the Roman alphabet, except for the letters “p”, “v” and “z”, without diacritic signs or special characters, although the apostrophe “ ’ ” marks the glottal stop (the Arabic Hamza).

Somali also has three consonant digraphs, kh (خ), sh (ش) and dh (ط or ظ), which are based on similar Arabic sounds. Somali orthography corresponds mostly to the Roman alphabet, except that some characters are given a modified use: the letters “c” and “x” are designed to accommodate the voiced and voiceless pharyngeal fricatives, comparable to ع (ʕ) and ح (h), respectively [1]. Somali long vowels are usually written by doubling the vowel itself.

The purpose of general transliteration is not to introduce new sounds that the target language does not provide. Rather, its purpose is to substitute each original letter with the nearest letter in the target language. The concept of long and short vowels exists in Somali, and long vowels are usually indicated by repeating the vowel itself, as in “aa”, “ee”, “ii”, “oo” and “uu”.

For example, the Somali word Soomaaliya is usually transliterated into English as Somalia by shortening the long vowels. Figure 1 demonstrates a basic word transliteration.

Somali:   S  OO  M  AA  L  I  Y  A
English:  S  O   M  A   L  I     A

Figure 1. Basic Transliteration.

In this paper we propose a Somali-English transliteration system that uses transliteration rules based on the orthographic mapping of the source language characters to the characters of the target language. We also propose an alignment method that maps the Somali characters that have no direct matching character, in order to obtain an accurate transliteration in English.

The structure of the paper is as follows. In Section 2, we describe previous studies of machine transliteration. In Section 3, we describe our character mapping method for Somali-English transliteration; we also discuss our transliteration rules for Somali proper names. In Section 4, we detail our experiments, evaluation metrics and the results we obtained; Section 5 concludes the paper.

2. Related Work

Transliteration refers to an orthographical transformation or phonetic change across two languages with different scripts. Many different generative transliteration approaches have been studied in the literature, each of which brings out various processes in different languages. These methods vary by the direction of transliteration, the writing systems of the languages involved, or the intended applications. Classification of these works is not straightforward.

Earlier work on machine transliteration has followed either grapheme-based or phoneme-based approaches. Lee and Choi proposed a source channel model (SCM), a grapheme-based approach, for English-Korean transliteration [3]. They used a direct orthographical mapping from source graphemes to target graphemes. Knight and Graehl proposed Japanese-to-English back-transliteration using a similar SCM [4]. Wan and Verspoor modeled a technique to transliterate proper names from English to Chinese using a phonetic procedure [5]. They proposed an algorithm for mapping from English characters to Chinese characters based on heuristic relationships between English spelling and pronunciation, and stable relationships between English phonemes and Chinese characters.

Kang and Kim explored forward transliteration and back-transliteration for English-Korean using a direct and a pivot method, and then used chunks of phonemes to perform the transliteration and back-transliteration [6]. Kang and Choi also studied English-Korean back-transliteration using decision-tree learning [7]. The English-Korean word alignment procedure they used is similar to that of Lee and Choi [3].

Oh and Choi also studied a model for English-Korean transliteration using pronunciation and contextual rules [8]. Their method was composed of two phases: alignment and transliteration. In the first phase, they took an English pronunciation unit (EPU) from a pronunciation phrasebook and aligned it to Korean phonemes to find the possible correspondence between the EPU and the phonemes. Virga and Khudanpur presented English-Chinese transliteration using a phonetic representation of English names in Chinese to support Cross-Lingual Speech and Text Processing Applications [9].

AbdulJaleel and Larkey proposed a generative statistical transliteration model for English-Arabic transliteration using n-gram methods [10]. The n-gram model generates strings of Arabic characters from a string of English characters. Malik proposed a rule-based Punjabi machine transliteration that transliterates a word between two scripts of Punjabi [11].

Grapheme-based transliteration treats transliteration as an orthographic process rather than a phonetic process and maps groups of graphemes/characters in the source language word directly to groups of graphemes/characters in the target language word [12]. This approach is also known as the spelling-based or direct method, as it directly transforms the source language graphemes into the target language graphemes without any phonetic knowledge of the source and target languages. In contrast, the phoneme-based methods require some intermediate steps in the transliteration process, whereas most of the grapheme-based methods depend directly on the information that is attainable from the characters of the words.

Forward transliteration is transliterating a word as it is written in the source language, such as Somali, into a foreign language, such as English. For example, the forward transliteration of the Somali name “Ceelmacaan” to English is “Elma’an”. Back-transliteration is transliterating a word from its transliterated version back to the language of origin. For example, the back-transliteration of “Elma’an” from English to Somali is “Ceelmacaan”. This example is shown in Figure 2.

Somali:   C  EE  L  -  M  A  C  AA  N
English:     E   L  -  M  A  ’  A   N

Figure 2. Forward Transliteration.

Most transliteration methods have been proposed between English and other common languages such as Arabic, Chinese or Japanese. Somali, not being a common language, is under-studied for both transliteration systems and cross-language information retrieval applications.

3. Mapping and Transliteration Rules

To align Somali and English characters, we use a direct orthographic mapping between them; each Somali character is aligned with its orthographic equivalent in English to find the most probable letters.

We start by aligning the identical letters; in most cases, Somali words are longer than their corresponding English transliterations. The mapping type is either one-to-one or many-to-one, to avoid null mapping.

For example, as shown in Figure 3, the Somali word Ceel-cadde is usually transliterated into English as El-Adde.

Somali:   C  E  E  L  -  C  A  DD  E
English:     E     L  -     A  DD  E

Figure 3. Missing equivalent letters.

The drawback of the above direct mapping is the absence of some Somali letters and long vowels in English. This is the problem addressed by our method.

3.1 Somali Transliteration Table and its Problems

Table 1. Somali Transliteration Table (consonant mapping and vowel mapping). Entries that survive in the source: [b], [t], [j], [h], [d], [r], [s], [ʕ]; the digraphs sh and dh; the long vowels aa, ee, ii, oo, uu; the short vowels a, e, i, o, u.

As can be seen from Table 1, Somali and English both use the Roman alphabet, though Somali has 24 letters (19 consonant monographs and five vowel monographs) as well as three consonant digraphs and five long vowels.

The mapping in Table 1 maps only equivalent letters, which is not enough to obtain a good transliteration, so we developed the Somali Transliteration Table (STT), similar to the Buckwalter transliteration table [13]. It allows us to map the Somali letters whose sounds are either not present or used differently in English, and thereby to increase the performance of the transliteration.

Consonant mapping: Consonants can be divided into two groups: consonants that have similar phonetic properties in both languages, and consonants that are either not present in English or pronounced differently. For example, the Somali letter “b” matches the English letters “b” and “p”.

The consonants that are unique to Somali are “c”, “x” and the glottal stop “ ’ ” (the Arabic Hamza); these consonants occur frequently in words. Somali syllable structure is based on Consonant-Vowel-Consonant (CVC); clusters of two consonants do not occur at the beginning or the end of a word, but only at syllable boundaries. The Somali glottal stop “ ’ ” (the Arabic Hamza) is usually not written unless it occurs at the border of a syllable or in the middle of a word.

Vowel mapping: Somali has five vowel monographs, which have a one-to-one correspondence with the English vowels. However, Somali differs regarding long vowels, which are short vowels repeated twice. We map the Somali diphthong vowels to double English vowels.

As shown in Table 1, the total number of letters in Somali and English is not equal.
The Somali letter “C” and the glottal stop “ ’ ” have no equivalent mapping in English. These letters will never be mapped in Somali-to-English transliteration using a direct orthographic mapping. Another problem is the use of long vowels in Somali. We came up with novel dependency rules which address the problem of Somali letters that have no direct equivalent in Somali-to-English transliteration.
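The limits of a direct letter-for-letter mapping can be illustrated with a minimal Python sketch. It assumes lowercase Latin input and an ASCII apostrophe for the glottal stop; the names `tokenize` and `unmapped_units` are hypothetical and not part of the system described in this paper. The sketch splits a Somali word into graphemes, treating digraphs and long vowels as single units, and lists the units that a one-to-one mapping cannot handle.

```python
# Illustrative sketch only: the sets below restate what the text says about
# Somali orthography; the function names are hypothetical.
DIGRAPHS = {"kh", "sh", "dh"}
LONG_VOWELS = {"aa", "ee", "ii", "oo", "uu"}
NO_EQUIVALENT = {"c", "'"}          # "c" and the glottal stop (ASCII apostrophe)
MULTI_LETTER = DIGRAPHS | LONG_VOWELS


def tokenize(word):
    """Split a Somali word into graphemes, preferring two-letter units."""
    w = word.lower()
    units = []
    i = 0
    while i < len(w):
        pair = w[i:i + 2]
        if pair in MULTI_LETTER:
            units.append(pair)
            i += 2
        else:
            units.append(w[i])
            i += 1
    return units


def unmapped_units(word):
    """Return the graphemes that a direct one-to-one mapping cannot handle."""
    return [u for u in tokenize(word)
            if u in NO_EQUIVALENT or u in LONG_VOWELS]


print(tokenize("Ceelmacaan"))        # ['c', 'ee', 'l', 'm', 'a', 'c', 'aa', 'n']
print(unmapped_units("Ceelmacaan"))  # ['c', 'ee', 'c', 'aa']
```

For “Ceelmacaan” the units left over are exactly the two occurrences of “c” and the two long vowels, which are the cases the dependency rules are meant to cover.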

3.2 Dependency Rules

Character-to-character mapping alone is not sufficient to obtain an acceptable transliteration. We need to add particular dependency rules to construct an accurate transliteration.

Consonants: Somali consonants are transliterated into their corresponding English consonants; here we discuss the consonants that are unique to Somali and how to transliterate them into English.

We start with the letter “C”, called “Ceyn” in Somali. When “C” occurs at the beginning or the end of a word, “C” is omitted.

“C” is also omitted when it occurs between two different vowels, but it is replaced with the glottal stop “ ’ ” when it occurs between two identical vowels.

“C” is also transliterated into the glottal stop “ ’ ” if it appears at the end of a word-medial syllable and the next syllable begins with a consonant.

If “C” occurs at the beginning of a word and is followed by “U”, then “C” is omitted and “U” is transliterated into “O”.

The letter “X” is assigned to a Somali sound for which there is no equivalent English letter. Therefore, “X” is transliterated into “H”, which is the nearest English letter.

“X” is always transliterated into “H”, no matter the position in which it occurs.

If “X” occurs in the middle of a word next to the letter “U”, then “X” is replaced with “H” and “U” is replaced with “O”.

Hamza “ ’ ” is shown only if it occurs between two identical vowels, or when it occurs in a single-syllable word; in most cases it is not written.

The letter “Y” is treated as a consonant in Somali, but in English it is regarded as both a vowel and a consonant.

If “Y” occurs in the middle of a word and is followed by “I”, but not between two “I” vowels, then “Y” is omitted.

If “Y” occurs in the middle of a word after “I”, but not between two “I” vowels, then “Y” is omitted.

If “Y” occurs in the middle of a word after an “E” vowel, but not between two “E” vowels, then “Y” is omitted.

Finally, if “Y” occurs in the middle of a word after an “A” vowel, but not between two “A” vowels, then “Y” is omitted.

Long vowels: Somali long vowels are twice as long as short vowels and are written as double vowels.

Somali long vowels are transliterated by transforming them into short vowels.

For example, if the long vowel “AA” occurs in a word, it is substituted with the short vowel “A”, and the rest of the long vowels follow the same procedure.

The exact order of application of these dependency rules can be seen in Algorithm 1.
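Two of the rules above, the long-vowel rule and the rule for “Y” next to “I”, can be sketched with regular expressions. This is a minimal illustration assuming lowercase input; the function names are hypothetical, and the remaining “Y” rules (for “E” and “A”) are left out of the sketch.

```python
import re


def shorten_long_vowels(word: str) -> str:
    """Replace each doubled vowel (aa, ee, ii, oo, uu) with the single vowel."""
    return re.sub(r"([aeiou])\1", r"\1", word)


def drop_y_next_to_i(word: str) -> str:
    """Omit a mid-word "y" that has "i" on exactly one side."""
    # "i" + "y" + any letter other than "i"
    word = re.sub(r"(?<=i)y(?=[a-hj-z])", "", word)
    # any letter other than "i" + "y" + "i"
    return re.sub(r"(?<=[a-hj-z])y(?=i)", "", word)


step1 = shorten_long_vowels("soomaaliya")
print(step1)                    # somaliya
print(drop_y_next_to_i(step1))  # somalia
```

Applied in this order, the two rules reproduce the Soomaaliya to Somalia example from the Introduction; a “y” between two “i” vowels, or at the start of a word, is left untouched.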

3.3 Somali Transliteration Process

The Somali transliteration architecture and its functionality are discussed in this section. Figure 4 describes the structure of our proposed transliteration method, and Algorithm 1 gives the rules for transliteration and their implementation using regular expression pattern matching.

Algorithm 1: Somali transliteration rules

Require Somali dependency rules
Insert string S (Somali word)
Search for pattern in Regex
Foreach all dependency rules
    If S contains the letter C
        While C is matched Do
            If (C occurs at the beginning of a word) && next vowel is not "u"
                Omit C;
            Else if (next vowel is "u")
                Omit C && replace "u" with "o";
            Else if (C is between the same vowels) or (C is at the end of a syllable)
                Replace "c" with "'"
            Else
                Omit C;
        End while
    If S contains the letter X
        While X is matched Do
            If next vowel to X is "u"
                Replace "x" with "h" && "u" with "o"
            Else
                Replace "x" with "h".
    If S contains long vowels
        While long vowels "aa", "ee", "ii", "oo", "uu" are matched Do
            Replace all with their short vowels "a", "e", "i", "o" and "u"
    Else
        If (Y in mid-word, vowel before is "i" or vowel after Y is "i", not between two "i")
            Omit Y;
        Else if (Y in mid-word, vowel before is "e" or vowel after Y is "e", not between two "e")
            Replace "y" with "i"
        Else if (Y in mid-word, vowel before is "a" and next letter is a consonant)
            Replace "y" with "i"
    End while
    End if no pattern to match
End foreach
Output S as T (English word)

Figure 4. Somali Transliteration Architecture. Components: Somali OOV, Mapping, Transliteration Table, Check for rules, English OOV.

The system takes Somali text as input and searches for the letters in order to align each letter with its corresponding English letter, using it as a pattern to match in the transliteration table, which consists of the character mappings and the dependency rules. If there are no similar letters in the alignment, it looks through the dependency rules to check whether one is applicable, applies it and sends the result to the transliteration unit. The transliteration unit replaces the matched pattern and then outputs the word as English text.
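The ordered application of rules in Algorithm 1 can be sketched as a list of regular-expression substitutions applied one after another. The sketch below is an illustrative reading of the “C”, “X” and long-vowel steps only, not the authors' implementation: the `RULES` list and the `transliterate` function are hypothetical names, the glottal stop is written as an ASCII apostrophe, “end of a syllable” is approximated as “C” after a vowel and before a consonant, and the special cases for “X” next to “U” and for “Y” are omitted.

```python
import re

# Ordered (pattern, replacement) pairs: C rules, then X, then long vowels.
RULES = [
    (r"\bcu", "o"),                                 # initial C before U: drop C, U -> O
    (r"\bc", ""),                                   # any other word-initial C: omit
    (r"([aeiou])c(?=\1)", r"\1'"),                  # C between identical vowels -> '
    (r"(?<=[aeiou])c(?=[b-df-hj-np-tv-z])", "'"),   # C closing a syllable before a consonant
    (r"c", ""),                                     # any remaining C: omit
    (r"x", "h"),                                    # X -> H in every position
    (r"([aeiou])\1", r"\1"),                        # long vowel -> short vowel
]


def transliterate(somali_word: str) -> str:
    """Apply the rules in order and capitalise each hyphen-separated part."""
    s = somali_word.lower()
    for pattern, replacement in RULES:
        s = re.sub(pattern, replacement, s)
    return "-".join(part.capitalize() for part in s.split("-"))


print(transliterate("Ceelmacaan"))  # Elma'an
print(transliterate("Ceel-cadde"))  # El-Adde
```

Keeping the rules in a single ordered list makes the order of application explicit, which matters here because the “C” rules inspect the vowels around “C” before the long vowels are shortened.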

4. Experiments

In this section, we describe the experimental data, which includes the data used for testing as well as the data used for validation.

4.1 Data Set

We obtained 600 Somali words and their English transliterations from the SomaliNames website [16], which contains bilingual instances of the most frequently used Somali names with their English transliterations. The data were divided randomly into two sets: 400 names were used for developing the rules and the remaining 200 names were used for testing the transliteration rules. The English transliterated words were used as a reference to verify the correctly transliterated Somali names.

4.2 Evaluation Metrics

The results of Somali-to-English transliteration were measured by the number of transliterated words that correctly matched the transliterated words obtained from the Somali website, divided by the total number of words in the validation set.

Word accuracy (WA), also known as transliteration accuracy, measures the proportion of transliterations that are correct:

$$WA = \frac{\text{Number of correctly transliterated words}}{\text{Total number of reference words}}$$

The transliteration accuracy or word accuracy is thus the percentage of Somali words that are correctly transliterated into English.

BLEU (Bilingual Evaluation Understudy) allows for multiple reference translations and is used to evaluate many term-to-term bilingual translation studies [14].

The BLEU technique gives a score between 0 and 1, which shows how similar the target language word is to the reference word; 0 is the lowest score and 1 is the best score.

We evaluate the BLEU score using the Interactive BLEU (iBLEU) tool developed by Madnani [15]. The metric measures the n-gram precisions (n = 1 to 4) of the hypothesis against the reference.

4.3 Results

After the selected Somali input texts are transliterated into English using the Somali Transliteration tool, the transliterated English texts are checked for mistakes and accuracy. The measurement is carried out against the transliterated words retrieved from the SomaliNames list of Somali and English names [16].

Table 2. Results of Somali transliteration.

Type           Accuracy
DOM            34.62%
STT            64.73%
STT with DR    96.53%

The experimental results in Table 2 show that the Somali Transliteration Table (STT) with the dependency rules (DR) that we have developed gives 96.53% accuracy. The STT alone, at 64.73%, also gives better results than the direct orthographical mapping (DOM), which gave 34.62% accuracy on the Somali names list. Our transliteration system therefore meets the requirement of transliteration from Somali to English.
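The word accuracy defined in Section 4.2 can be computed with a few lines of Python. This is a minimal sketch: the function name `word_accuracy` is hypothetical, the sample words are for illustration only and are not taken from the test set, and the case-insensitive exact match is an assumption about how matches are counted.

```python
def word_accuracy(hypotheses, references):
    """Fraction of hypotheses that exactly match their reference (ignoring case)."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")
    if not references:
        raise ValueError("at least one reference word is required")
    correct = sum(
        hyp.strip().lower() == ref.strip().lower()
        for hyp, ref in zip(hypotheses, references)
    )
    return correct / len(references)


system_output = ["Somalia", "Elmaan", "El-Adde"]
reference = ["Somalia", "Elma'an", "El-Adde"]
print(round(100 * word_accuracy(system_output, reference), 2))  # 66.67
```

Two of the three illustrative outputs match their references, so the sketch reports a word accuracy of 66.67%.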

Figure 5. iBLEU Score. Series: DOM, STT, STT with DR; categories: Unigram, Bigram, Trigram, Four-gram. Values legible in the source include 98.87%, 99.15% and 85.58%.

Figure 5 shows the results given by the interactive BLEU tool, comparing the DOM with our proposed STT and dependency rules over the 600 proper names from the Somali Names data. For unigrams, bigrams, trigrams and four-grams, the STT with DR scores above 0.9664 (almost 1, given that the BLEU score lies between 0 and 1). The STT alone scored 0.6269, nearly twice as high as the DOM.

We see that the higher the n-gram order, the higher the result, with DOM scores of 0.3613, 0.6011, 0.7123 and 0.7753 for unigram, bigram, trigram and four-gram, respectively.

5. Conclusion

In this paper, we explored automatic forward transliteration of Somali proper names into English. This language pair has not been well studied in automatic transliteration work. All the transliteration rules and alignment procedures presented here are based on a direct orthographic transformation. Our method avoids the intermediate phonetic interpretation used in phoneme-based methods and reduces the transliteration error rate.

Spelling-based approaches are considered easier to implement and show better performance than phonetic-based approaches, because they do not depend on pronunciation dictionaries, which may not contain the pronunciations of all words. Last but not least, we chose transliteration rules rather than machine learning methods because no large corpus of Somali-English word pairs is available. The transliteration table, together with the dependency rules proposed in this paper, improved the accuracy of transliteration significantly.

6. References

[1] Lecarme, J. and Maury, C., A Software Tool for Research in Linguistics and Lexicography: Application to Somali, Computers and Translation, Vol.2, No.1, pp.21-36, 1987.
[2] Andrzejewski, B.W., The Introduction of a National Orthography for Somali, School of Oriental and African Studies, 1974.
[3] Lee, J.S. and Choi, K.S., English to Korean Statistical Transliteration for Information Retrieval, Computer Processing of Oriental Languages, Vol.12, No.1, pp.17-37, 1998.
[4] Knight, K. and Graehl, J., Machine Transliteration, Computational Linguistics, Vol.24, No.4, pp.599-612, 1998.
[5] Wan, S. and Verspoor, C.M., Automatic English-Chinese Name Transliteration for Development of Multilingual Resources, In Proceedings of the 17th Volume 2, pp. 1352-1356, 1998.
[6] Kang, I.H. and Kim, G., English-to-Korean Transliteration Using Multiple Unbounded Overlapping Phoneme Chunks, In Proceedings of the 18th Conference on Computational Linguistics-Volume 1, pp. 418-424, 2000.
[7] Kang, B.J. and Choi, K.S., Two Approaches for the Resolution of Word Mismatch Problem Caused by English Words and Foreign Words in Korean Information Retrieval, International Journal of Computer Processing of Oriental Languages, Vol.14, No.2, pp.109-131, 2001.
[8] Oh, J.H. and Choi, K.S., An English-Korean Transliteration Model Using Pronunciation and Contextual Rules, In tational linguistics-Volume 1, pp. 1-7, 2002.
[9] Virga, P. and Khudanpur, S., Transliteration of Proper Names in Cross-lingual Information Retrieval, In Proceedings of the ACL 2003 Workshop on Multilingual and Mixed-Language Named Entity Recognition-Volume 15, pp. 57-64, 2003.
[10] AbdulJaleel, N. and Larkey, L.S., Statistical Transliteration for English-Arabic Cross Language Information Retrieval, In Proceedings of the Twelfth International Conference on Information and Knowledge Management, pp. 139-146, 2003.
[11] Malik, M.G., Punjabi Machine Transliteration, In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pp. 1137-1144, 2006.
[12] Karimi, S., Machine Transliteration of Proper Names between English and Persian, Ph.D. Thesis, RMIT University, Melbourne, 216p., 2008.
[13] Buckwalter, T.A., Lexicographic Notation of Arabic Noun Pattern Morphemes and Their Inflectional Features, In Proceedings of the Second Cambridge Conference on Bilingual Computing in Arabic and English, pp. 5-7, 1990.
[14] Papineni, K., Roukos, S., Ward, T. and Zhu, W.J., BLEU: A Method for Automatic Evaluation of Machine Translation, In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311-318, 2002.
[15] Madnani, N., iBLEU: Interactively Debugging and Scoring Statistical Machine Translation Systems, In Semantic Computing (ICSC), 2011 Fifth IEEE International Conference, pp. 213-214, 2011.
[16] Somali Names List with Their English Transliteration, Retrieved Somalinames Directory. hp/somali-names/soma-names.csv
