Machine Translation Approaches: Issues And Challenges

1y ago

37 Views

1 Downloads

848.76 KB

7 Pages

Report/dmca

Download PDF

Transcription

IJCSI International Journal of Computer Science Issues, Vol. 11, Issue 5, No 2, September 2014ISSN (Print): 1694-0814 ISSN (Online): 1694-0784www.IJCSI.org159Machine Translation Approaches: Issues andChallengesM. D. OkporDepartment of Computer Science, Delta State Polytechnic, OzoroDelta State, NigeriaAbstractIn the modern world, there is an increased need for languagetranslations owing to the fact that language is an effectivemedium of communication. The demand for translation hasbecome more in recent years due to increase in the exchange ofinformation between various regions using different regionallanguages. Accessibility to web document in other languages,for instance, has been a concern for information Professionals.Machine translation (MT), a subfield under ArtificialIntelligence, is the application of computers to the task oftranslating texts from one natural (human) language to another.Many approaches have been used in the recent times to developan MT system. Each of these approaches has its ownadvantages and challenges. This paper takes a look at theseapproaches with the few of identifying their individual features,challenges and the best domain they are best suited to.Keywords: Machine Translation, Rule-based Approach,Corpus-based Approach, Statistical Approach, Transfer-BasedApproach1. IntroductionSciences formed the Automatic Language Processing AdvisoryCommittee (ALPAC) to study MT (1964). Real progress wasmuch slower, however, and after the ALPAC report (1966),which found that the ten-year-long research had failed to fulfillexpectations, funding was greatly reduced. The idea of usingdigital computers for translation of natural languages wasproposed as early as 1946 by A. D. Booth and possibly others.MT on the web started with SYSTRAN Offering freetranslation of small texts (1996), followed by AltaVistaBabelfish, which racked up 500,000 requests a day (1997).Franz-Josef Och (the future head of Translation DevelopmentAT Google) won DARPA‘s speed MT competition (2003).More innovations during this time included MOSES, the opensource statistical MT engine (2007), a text/SMS translationservice for mobiles in Japan (2008), and a mobile phone withbuilt-in speech-to-speech translation functionality for English,Japanese and Chinese (2009). Recently, Google announced thatGoogle Translate translates roughly enough text to fill 1 millionbooks in one day (2012).Machine translation sometimes referred to by the abbreviationMT (not to be confused with computer-aided translation,machine-aided human translation (MAHT) or interactivetranslation) is automated translation. It is the process by whichcomputer software is used to translate a text from one naturallanguage (such as English) to another (such as Ibo).1.1 Translation processThe idea of machine translation may be traced back to the 17thcentury. In 1629, René Descartes proposed a universallanguage, with equivalent ideas in different tongues sharing onesymbol. The field of ―machine translation‖ appeared in WarrenWeaver‘s Memorandum on Translation (1949). The firstresearcher in the field, Yehosha Bar-Hillel, began his researchat MIT (1951). A Georgetown MT research team followed(1951) with a public demonstration of its system in 1954. MTresearch programmes popped up in Japan and Russia (1955),and the first MT conference was held in London (1956).Researchers continued to join the field as the Association forMachine Translation and Computational Linguistics wasformed in the U.S. (1962) and the National Academy ofThe human translation process, for instance, may be describedas:To process any translation, human or automated, the meaningof a text in the original (source) language must be fully restoredin the target language, i.e., the translation. While on thesurface, this seems straightforward, it is far more complex,1.2.Decoding the meaning of the source text; andRe-encoding this meaning in the target language.Behind this ostensibly simple procedure lies a complexcognitive operation. To decode the meaning of the source textin its entirety, the translator must interpret and analyse all thefeatures of the text, a process that requires in-depth knowledgeof the grammar, semantics, syntax, idioms, etc., of the sourcelanguage, as well as the culture of its speakers. The translatorCopyright (c) 2014 International Journal of Computer Science Issues. All Rights Reserved.

IJCSI International Journal of Computer Science Issues, Vol. 11, Issue 5, No 2, September 2014ISSN (Print): 1694-0814 ISSN (Online): 1694-0784www.IJCSI.orgneeds the same in-depth knowledge to re-encode the meaningin the target language.1602. Literature Review2.1 Machine Translation ApproachesSince natural languages are highly complex, MT becomes adifficult task. Many words have multiple meanings, sentencesmay have various readings, and certain grammatical relations inone language might not exist in another language. Thefollowing diagram shows all the phases involved in the processof Machine Translation.A machine translation (MT) system first analyses the sourcelanguage input and creates an internal representation. Thisrepresentation is manipulated and transferred to a form suitablefor the target language. Then at last output is generated in thetarget language.MT systems can be classified according to their coremethodology. Under this classification, two main paradigmscan be found: the rule-based approach and the corpus-basedapproach. In the rule-based approach, human experts specify aset of rules to describe the translation process, so that anenormous amount of input from human experts is required. Onthe other hand, under the corpus-based approach the knowledgeis automatically extracted by analysing translation examplesfrom a parallel corpus built by human experts. Combining thefeatures of the two major classifications of MT systems gavebirth to the Hybrid Machine Translation Approach2.1.1Fig. 1 A Typical Machine Translation Process(source: worldofcomputing.net)A machine translation (MT) system first analyses the sourcelanguage input and creates an internal representation. Thisrepresentation is manipulated and transferred to a form suitablefor the target language. Then at last output is generated in thetarget language. On a basic level, MT performs simplesubstitution of words in one natural language for words inanother, but that alone usually cannot produce a goodtranslation of a text because recognition of whole phrases andtheir closest counterparts in the target language is needed.Therein lies the challenge in machine translation: how toprogram a computer that will "understand" a text as a persondoes, and that will "create" a new text in the target languagethat "sounds" as if it has been written by a person.This problem may be approached in a number of ways. Thispaper takes a look at these approaches and their attendantchallenges.Rule-Based Machine Translation (RBMT)ApproachRule-Based Machine Translation (RBMT), also known asKnowledge-Based Machine Translation and ClassicalApproach of MT, is a general term that denotes machinetranslation systems based on linguistic information aboutsource and target languages basically retrieved from (bilingual)dictionaries and grammars covering the main semantic,morphological, and syntactic regularities of each languagerespectively. Having input sentences (in some sourcelanguage), an RBMT system generates them to outputsentences (in some target language) on the basis ofmorphological, syntactic, and semantic analysis of both thesource and the target languages involved in a concretetranslation task.2.1.1.1 Basic Principles of RBMT ApproachRBMT methodology applies a set of linguistic rules in threedifferent phases: analysis, transfer and generation. Therefore, arule-based system requires: syntax analysis, semantic analysis,syntax generation and semantic generation. Speaking in generalterms, RBMT generates the target text given a source textfollowing the steps shown in Fig. 2Copyright (c) 2014 International Journal of Computer Science Issues. All Rights Reserved.

IJCSI International Journal of Computer Science Issues, Vol. 11, Issue 5, No 2, September 2014ISSN (Print): 1694-0814 ISSN (Online): 1694-0784www.IJCSI.orgSource text Morphological analyserPart of speech tagger161Failure to adapt to new domains. Although RBMTsystems usually provide a mechanism to create newrules and extend and adapt the lexicon, changes areusually very costly and the results, frequently, do notpay off.2.1.1.3 Sub Approaches of RBMTLexical selectionStructural transferlexical transferMorphological generatorPost-generatorTarget textThere are three different approaches under the rule-basedmachine translation Approach. They are Direct, Transfer-Basedand Interlingua Machine Translation Approaches respectively.Though they all belong to the RBMT, they differ in the depthof analysis of the source language and the extent to which theyattempt to reach a language-independent representation ofmeaning or intent between the source and target languages.Their dissimilarities can be obviously observed through theVauquois Triangle which illustrates these levels of analysis asshown in Fig. 3 below.Fig. 2 Architecture of the RBMT Approach (source:The main approach of RBMT systems is based on linking thestructure of the given input sentence with the structure of thedemanded output sentence, necessarily preserving their uniquemeaning. The following example can illustrate the generalframe of RBMT:A girl eats an apple. Source Language English;Demanded Target Language IboMinimally, to get an Ibo translation of this English sentenceone needs:1.2.3.A dictionary that will map each English word to anappropriate Ibo word.Rules representing regular English sentence structure.Rules representing regular Ibo sentence structure.And finally, we need rules according to which one can relatethese two structures together.2.1.1.2 Issues of RBMT ApproachThe following are the shortcomings that are associated withRBMT approach: Insufficient amount of really good dictionaries.Building new dictionaries is expensive.Some linguistic information still needs to be setmanually.It is hard to deal with rule interactions in big systems,ambiguity, and idiomatic expressions.Fig. 3: Bernard Vauquois‘ pyramid showing comparative depths ofintermediary representation, interlingual machine translation at the n.(Source:http://en.wikipedia.org/wiki/Machine translation#Interlingual)o Direct Machine Translation (DMT) ApproachStarting with the shallowest level at the bottom of the pyramidis the Direct Machine Translation Approach. DMT approach isthe oldest and less popular approach. Direct translation is madeat the word level. Machine translation systems that use thisapproach are capable of translating a language, called sourcelanguage (SL) directly to another language, called targetlanguage (TL). Words of the SL are translated without passingthrough an additional/intermediary representation. The analysisof SL texts is oriented to only one TL. Direct translationsystems are basically bilingual and uni-directional. Directtranslation approach needs only a little syntactic and semanticanalysis. SL analysis is oriented specifically to the productionof representations appropriate for one particular TL. DMT is aword-by-word translation approach with some simplegrammatical adjustments.Copyright (c) 2014 International Journal of Computer Science Issues. All Rights Reserved.

IJCSI International Journal of Computer Science Issues, Vol. 11, Issue 5, No 2, September 2014ISSN (Print): 1694-0814 ISSN (Online): 1694-0784www.IJCSI.orgChallenges of a DMT System The limitation of this approach is obvious. It can becharacterized as ‗word-for-word‘ translation with somelocal word-order adjustment. It gave the kind of translationquality that might be expected from someone with a verycheap bilingual dictionary and only the most rudimentaryknowledge of the grammar of the target language: frequentmistranslations at the lexical level and largelyinappropriate syntax structures which mirrored too closelythose of the source language. The linguistic and computational naivety of this approachis also an issue. From a linguistic point of view what ismissing is any analysis of the internal structure of thesource text, particularly the grammatical relationshipsbetween the principal parts of the sentences. The lack ofcomputational sophistication was largely a reflection of theprimitive state of computer science at the time, but it wasalso determined by the unsophisticated approach tolinguistics in MT projects of the late 1950s.oInterlingual Machine Translation ApproachThe failure of the first generation systems led to thedevelopment of more sophisticated linguistic models fortranslation. In particular, there was increasing support for theanalysis of source language texts into some kind ofintermediate representation — a representation of its ‗meaning‘in some respect — which could form the basis of generation ofthe target text. Interlingual machine translation is one instanceof rule-based machine-translation approaches. In this approach,the source language, i.e. the text to be translated, is transformedinto an interlingual language, i.e. a ―language neutral‖representation that is independent of any language. The targetlanguage is then generated out of the interlingua. One of themajor advantages of this system is that the interlingua becomesmore valuable as the amount of target languages it can beturned into increases. However, the only interlingual machinetranslation system that has been made operational at thecommercial level is the KANT system (Nyberg and Mitamura,1992), which is designed to translate Caterpillar TechnicalEnglish (CTE) into other languages. The interlingua approachis clearly most attractive for multilingual systems. Eachanalysis module can be independent, both of all other analysismodules and of all generation modules.Challenges of Interlingual Machine Translation Approach There are the difficulties in defining an interlingua, evenfor closely related languages (e.g. the Romance languages:French. Italian, Spanish, Portuguese). A truly ‗universal‘and language-independent interlingua has defied the bestefforts of linguists over the years.162 It is difficult to extract meaning from texts in the originallanguages to create the intermediate representation. Semantic differentiation is target-language specific andmaking such distinctions is comparable to lexical transfernot all distinctions needed for translationo Transfer-based Machine Translation ApproachBecause of the disadvantage of the Interlingua approach, abetter rule-based translation approach was discovered, calledthe Transfer-based Approach. Transfer-based machinetranslation is similar to interlingual machine translation in thatit creates a translation from an intermediate representation thatsimulates the meaning of the original sentence. Unlikeinterlingual MT, it depends partially on the language pairinvolved in the translation. On the basis of the structuraldifferences between the source and target language, a transfersystem can be broken down into three different stages: i)Analysis, ii) Transfer and iii) Generation. In the first stage, theSL parser is used to produce the syntactic representation of aSL sentence. In the next stage, the result of the first stage isconverted into equivalent TL-oriented representations. In thefinal step of this translation approach, a TL morphologicalanalyzer is used to generate the final TL texts. It is possiblewith this translation approach to obtain fairly high qualitytranslations, with accuracy in the region of of 90%.Challenges of Transfer-based Machine Translation One of the problems with transfer Based Machinetranslation approach is that rules must be applied at everystep of translation. There are rules for source languageanalysis (syntactic/semantic), rules for source-to-targettransfer and rules for target language generation It is difficult to do as much work as possible in reusablemodules of analysis and synthesis. It is difficult to keep transfer modules as simple as possible2.1.2 Corpus-based Machine TranslationApproachCorpus based machine translation (also referred as data drivenmachine translation) is an alternative approach for machinetranslation to overcome the problem of knowledge acquisitionproblem of rule based machine translation. Corpus BasedMachine Translation (CBMT) uses, as it name points, abilingual parallel corpus to obtain knowledge for new incomingtranslation. This approach uses a large amount of raw data inthe form of parallel corpora. This raw data contains text andtheir translations. These corpora are used for acquiringtranslation knowledge. Corpus based approach is furtherclassified into following two sub approaches: StatisticalMachine Translation and Example-based Machine TranslationApproach.Copyright (c) 2014 International Journal of Computer Science Issues. All Rights Reserved.

IJCSI International Journal of Computer Science Issues, Vol. 11, Issue 5, No 2, September 2014ISSN (Print): 1694-0814 ISSN (Online): 1694-0784www.IJCSI.orgo Statistical Machine Translation ApproachStatistical machine translation (SMT) is generated on the basisof statistical models whose parameters are derived from theanalysis of bilingual text corpora. The initial model of SMT,based on Bayes Theorem, proposed by Brown et al. takes theview that every sentence in one language is a possibletranslation of any sentence in the other and the mostappropriate is the translation that is assigned the highestprobability by the system. The idea behind SMT comes frominformation theory. A document is translated according to theprobability distribution function indicated by p(e f), which isthe Probability of translating a sentence f in the SL F (forexample, English) to a sentence e in the TL E (for example,Ibo). The problem of modeling the probability distributionp(e f) has been approached in a number of ways. One intuitiveapproach is to apply Bayes theorem. That is, if p(f e) and p(e)indicate translation model and language model, respectively,then the probability distribution p(e f) p(f e)p(e). Thetranslation model p(f e) is the probability that the sourcesentence is the translation of the target sentence or the waysentences in E get converted to sentences in F. The languagemodel p(e) is the probability of seeing that TL string or thekind of sentences that are likely in the language E. Thisdecomposition is attractive as it splits the problem into two subproblems. Finding the best translation is done by picking upthe one that gives the highest probability:Data Dilution: This is a common anomaly caused whenattempting to construct a new statistical model (engine) torepresent a distinct terminology (for a specific corporate brandor domain). Training sets used from alternative sources to thespecific brand to compensate for a limited quantity of brandspecific corpora may ‗dilute‘ brand terminology, choice ofwords, text format and style.Idioms: Depending on the corpora used, idioms may nottranslate "idiomatically".Different word orders: Word order in languages differ. Someclassification can be done by naming the typical order ofsubject (S), verb (V) and object (O) in a sentence and one cantalk, for instance, of SVO or VSO languages. There are alsoadditional differences in word orders, for instance, wheremodifiers for nouns are located, or where the same words areused as a question or a statement.Challenges of Statistical Machine Translation Approach . For a rigorous implementation of this one would have toperform an exhaustive search by going through all stringsinthe native language. Performing the search efficiently is thework of a machine translation decoder that uses the foreignstring, heuristics and other methods to limit the search spaceand at the same time keeping acceptable quality. Thus SMTdepends on a language model, a translation model and adecoding algorithm. The translation model ensures that themachine translation system produces target hypothesiscorresponding to the source sentence. The language modelensures the grammatically correct output.Issues with statistical machine translation include: SentenceAlignment: In parallel corpora single sentences in onelanguage can be found translated into several sentences in theother and vice versa. Sentence aligning can be performedthrough the Gale-Church alignment algorithm.Statistical Anomalies: Real-world training sets may overridetranslations of, say, proper nouns. An example would be that "Itook the train to Berlin" gets mis-translated as "I took the trainto Paris" due to an abundance of "train to Paris" in the trainingset.163oCorpus creation can be costly for users with limitedresources.The results are unexpected. Superficial fluency can bedeceiving.Statistical machine translation does not work wellbetween languages that have significantly differentword orders (e.g. Japanese and European languages).The benefits are overemphasized for Europeanlanguages.Example-based Machine Translation ApproachExample-based machine translation (EBMT) is characterizedby its use of bilingual corpus with parallel texts as its mainknowledge, in which translation by analogy is the main idea.An EBMT system is given a set of sentences in the sourcelanguage (from which one is translating) and correspondingtranslations of each sentence in the target language with pointto point mapping. These examples are used to translate similartypes of sentences of source language to the target language.There are four tasks in EBMT: example acquisition, examplebase and management, example application and synthesis. Atthe foundation of example-based machine translation is the ideaof translation by analogy. The principle of translation byanalogy is encoded to example-based machine translationthrough the example translations that are used to train such asystem.Copyright (c) 2014 International Journal of Computer Science Issues. All Rights Reserved.

IJCSI International Journal of Computer Science Issues, Vol. 11, Issue 5, No 2, September 2014ISSN (Print): 1694-0814 ISSN (Online): 1694-0784www.IJCSI.orgExample of bilingual corpusEnglishIboHow much is that red shoe?ego one bu akpukwu ododonuHow much is that smallcamera?Ego one bu obele kamera na.Table 1: Example of bilingual corpusExample-based machine translation systems are trained frombilingual parallel corpora, which contain sentence pairs like theexample shown in the table. Sentence pairs contain sentences inone language with their translations into another. The particularexample shows an example of a minimal pair, meaning that thesentences vary by just one element. These sentences make itsimple to learn translations of subsentential units.Challenges of EBMT approachEBMT is an attractive approach to translation because it avoidsthe need for manually derived rules. However, it requiresanalysis and generation modules to produce the dependencytrees needed for the examples database and for analyzing thesentence. Another problem with EBMT is computationalefficiency, especially for large databases, although parallelcomputation techniques can be applied.2.1.3 Hybrid Machine Translation ApproachBy taking the advantage of both statistical and rule-basedtranslation methodologies, a new approach was developed,called hybrid-based approach, which has proven to have betterefficiency in the area of MT systems. At present, severalgovernmental and private based MT sectors use this hybridbased approach to develop translation from source to targetlanguage, which is based on both rules and statistics. Thehybrid approach can be used in a number of different ways. Insome cases, translations are performed in the first stage using arule-based approach followed by adjusting or correcting theoutput using statistical information. In the other way, rules areused to pre-process the input data as well as post-process thestatistical output of a statistical-based translation system. Thistechnique is better than the previous and has more power,flexibility, and control in translation.Hybrid approaches integrating more than one MT paradigm arereceiving increasing attention. The METIS-II MT system is anexample of hybridization around the EBMT framework; itavoids the usual need for parallel corpora by using a bilingualdictionary (similar to that found in most RBMT systems) and amonolingual corpus in the TL (Dirix et al., 2005). An exampleof hybridization around the rule-based paradigm is given byOepen. It integrates statistical methods within an RBMT164system to choose the best translation from a set of competinghypotheses (translations) generated using rule-based methods.3.Results and DiscussionEach machine translation approach has its advantages anddisadvantages. What one approach possesses, the other oneseems to be lacking and vice versa. Rule-based methods focuson trying to understand the grammar rules while the statisticalapproach pays very minimal or no attention to the grammar of aparticular language. The rule-based machine translationapproach has been implemented in computational linguisticssince its very early days. Human involvement in this approachis significant as it is the human agent who creates the rules. Inother words, the humans use their knowledge and experience toprepare the rules. The benefit of this approach is that itanalyzes the input on the syntactic and - to an extent - semanticlevels. The downside to rule-based machine translation is that itrequires a deep linguistics knowledge as well as long time toprepare the rules. In the end, to capture all rules would beextremely hard. Nevertheless, the rule-based approach is veryvaluable for machine translation, especially from the syntacticpoint of view. Rule-based machine translation can beconstantly modified as one can analyze the rules that do notproduce a desirable output and thus focus on fixing theproblem. This approach can be a great starting point for thoselanguages where a parallel bilingual corpus does not exist yet.For Corpus-based approach the knowledge is automaticallyextracted by analysing translation examples from a parallelcorpus built by human experts. The advantage is that, once therequired techniques have been developed for a given languagepair, MT systems should – theoretically – be quickly developedfor new language pairs using provided training data. Addingmore examples to a Corpus-based system can improve thesystem since it is based on the data, though the accumulationand management of the huge bilingual data corpus can also becostly.Hybrid machine translation is a method of machine translationthat is characterized by the use of multiple machine translationapproaches within a single machine translation system. Themotivation for developing hybrid machine translation systemsstems from the failure of any single technique to achieve asatisfactory level of accuracy. Many hybrid machine translationsystems have been successful in improving the accuracy of thetranslations. Nowadays, the most widely used MT systems(Hybrid) use the rule-based and the statistical approaches.There have been several research works which combine bothapproaches.Copyright (c) 2014 International Journal of Computer Science Issues. All Rights Reserved.

IJCSI International Journal of Computer Science Issues, Vol. 11, Issue 5, No 2, September 2014ISSN (Print): 1694-0814 ISSN (Online): 1694-0784www.IJCSI.org4. ConclusionMachine translation has been an active research subfield ofartificial intelligence for years. Machine translation (MT) is ahard problem, because natural languages are highly complex,many words have various meanings and different possibletranslations, sentences might have various readings, and therelationships between linguistic entities are often vague. Inaddition, it is sometimes necessary to take world knowledgeinto account. The number of relevant dependencies is much toolarge and those dependencies are too complex to take them allinto account in a machine translation system. Given theseboundary conditions, a machine translation system has to makedecisions (produce translations) given incomplete knowledge.This problem may be approached in a number of ways. Thispaper took a look at these approaches and their attendantchallenges. The work shows that there is no perfect approach,though the problems associated with some of the approachesare very minimal. Combining some of the best features of someapproaches to form a hybrid approach helps in taking care ofthe challenges posed by many approaches.Reference[1]M.R. Costa-Jussa, M. Farrus, J.B. Marino and J.A.Fonollosa), ―Study and Comparison of Rule-based andStatistical Catalan- Spanish MachineTranslationSystems‖, Computing and Informatics, Vol. 31, 2011,pp 245-270[2]V. Laximi and H. Kaur. ―A Survey of MachineTranslation Approaches‖, International Journal ofScience, Engineering and Technology Research, Vol. 2,Issue 3, 2013, pp 716-719[3]P. Dirix, I. Schuurman and V. Vandeghinste, ―Metis II:Example-based MachineTranslationusingMonolingual Corpora - System Description‖, InProceedings of the 2nd Workshop on Example-BasedMachine Translation, 2005, pp 43-50.[4]L. Dugast, J. Senellart and P. Koehn, ―Statistical PostEditing on SYSTRAN‘s Rule-based TranslationSystem‖, In Proceedings of the Second Workshop onSMT, 2007, pp 220-223.[5} D. Groves and A.Way, ―Hybrid Example-based SMT:the Best of Both Worlds‖, In Proceedings of the ACLWorkshop on Building and Using Parallel Texts, pp 183190.[6]P. Koehn and H. Hoang, ―Factored Translation Models‖,In Proceedings of the 2007 Joint Conference onEmpirical Methods in NLP and Computational NaturalLanguage Learning, 2007, pp, 868-876165[7]P. Brown and B. Alli, "A Statistical Approach toMachine Translation", Computational Linguistics, Vol.16, No.2, 1990, pp 79-85.[8]S. Tripathi and J.K. Sarkhel, ―Approaches to MachineTranslation‖, Annals of Libr

Starting with the shallowest level at the bottom of the pyramid is the Direct Machine Translation Approach. DMT approach is the oldest and less popular approach. Direct translation is made at the word level. Machine translation systems that use this approach are capable of translating a language, called source