Independent Evaluation Of Commercial Machine Translation Engines

Transcription

Independent evaluation of commercial machine translation engines
Domain: COVID
Language pair: EN-RU
In partnership with [partner logos]
July 2020

Disclaimer
The systems used in this report were trained and used for translation from June 26 to June 30, 2020. They may have been changed many times since then.
This report demonstrates the performance of those systems exclusively on the dataset used for this report (English-Russian, TAUS Corona Crisis Corpus).
We have run multiple evaluations in the same domain for other language pairs and observed different rankings of the MT systems.
There’s no “best” MT system. Performance depends on how similar your data is to what a provider used to train their baseline models, and on their algorithms.
Remember: Each case is different. There’s no one size fits all.
Intento, Inc. July 2020

About
The purpose of this research is to help improve the quality of translated COVID-related content across the world. The COVID challenge is global and the correctness of disseminated information is crucial.
Using our expertise in machine translation evaluation, we have assessed several machine translation engines to identify the ones that work best for COVID-related content in different language pairs.
This study is independent and available to all.
The research has been performed on the Corona Crisis Corpus made available by TAUS.
Intento is a vendor-agnostic platform that helps global companies evaluate, procure and utilize the best-fit machine translation.

Intento Enterprise MT Hub
One place to evaluate and manage MT
Universal API to all MT engines
Connects to many CAT, TMS and CMS
May be deployed on private cloud
Single MT dashboard
Works with files of any size
Smart Routing with retries and failovers
Get your API key at inten.to

Overview
5 Domain-Adaptive NMT Engines
13 Stock NMT Engines
1 Language Pair: en-ru
1 Domain: Healthcare
1. Benchmark description
2. Reference-based scores
3. Human LQA results
4. How much does it cost?
5. How safe is my data?

Machine Translation Engines Evaluated (en-ru)
Custom NMT:
Google Cloud AutoML Translation
IBM Watson Language Translator v3
Microsoft Translator API v3
ModernMT Enterprise API
Yandex Translate API
Stock NMT:
Alibaba E-commerce Edition
Amazon Translate
Baidu Translate API
DeepL API
Google Cloud Translation API
GTCOM Yeecloud Machine Translation API
IBM Cloud Language Translator v3
Microsoft Translator Text API v3
ModernMT Enterprise API
PROMT Cloud API
SYSTRAN PNMT
Tencent TMT API
Yandex Translate API

Evaluation methodology
Clean the data, extract 4000 segments as a test set, use the rest for training customized engines.
Use different metrics to compute the similarity between reference translations and the MT output for stock and customized engines for 4000 segments. Identify the top performing engines.
Select a set of typical segments translated by the top engines. Perform human expert LQA of these translations.
Perform human expert analysis of each engine’s weak spots: segments where an engine can fail.
Perform human expert LQA of machine translations of an entire document about COVID-19.
Analyze the results of LQA and choose the most suitable engines for translation.

Dataset
English-Russian, Healthcare
TAUS Corona Crisis Corpus
Original dataset volume: 192,614 segments
We clean the data and remove bad segments.
Next we run filtering using BERT/LASER embeddings to remove segments where the source text and the translation are likely to be mismatched.
Medical terminology:
It is now possible to reverse-genetic-engineer the very largest RNA viruses, such as the coronavirus which causes SARS.

Dataset
About 33,000 segments were removed in cleaning
More than 9,500 segments were duplicated.
More than 15,000 segments contained only part of a sentence (started with “:”, contained lists separated with “;” in a single row, and so on).
And 29 more filters.
Example:
source: “15 March 2007, was routinely hospitalized with a diagnosis of CHD (coronary heart disease), on 3 June 2008 diagnosed with exertional angina at the CELT, undergone coronary angiography, coronary angioplasty and stenting. Discharged with good clinical result.”
reference translation: “20.02.1942г., дата включения в исследование 31.01.2007г. госпитализирован в ЦЭЛТ 11.09.2007г, проведена коронарография, выявившая поражение коронарных артерий, проведена коронарная ангиопластика и стентирование. Выписан.”
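The two filters named on this slide can be sketched roughly as follows. This is a minimal illustration and not the pipeline used for the report: the function name `clean_segments` and the semicolon threshold are assumptions, and the remaining 29 filters are not described in the deck.

```python
def clean_segments(pairs, max_semicolons=2):
    """Drop duplicate pairs and sources that look like sentence fragments.

    pairs: iterable of (source, reference) strings.
    max_semicolons is a hypothetical cut-off for "a list in a single row".
    """
    seen = set()
    kept = []
    for source, reference in pairs:
        key = (source.strip(), reference.strip())
        if key in seen:
            continue  # duplicated segment
        seen.add(key)
        if key[0].startswith(":"):
            continue  # fragment that starts with ":"
        if key[0].count(";") > max_semicolons:
            continue  # likely a list flattened into one row
        kept.append(key)
    return kept
```

Duplicates are detected on the whitespace-stripped pair, so a segment repeated with different surrounding spaces is still removed.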

Dataset
About 6,000 segments were removed after filtering with BERT/LASER embeddings
For example, segments where the source and the reference translations did not match:
source: “100% after co-payment Emergency room visit, non-emergency care”
reference translation: “100 процентов после оплаты участником его доли в сумме 50 долл. США”
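One common way to flag mismatched pairs with multilingual sentence embeddings is to compare the source and reference vectors by cosine similarity. The sketch below assumes the vectors have already been computed by some encoder (for example BERT or LASER); the function names and the 0.7 threshold are illustrative assumptions, not values from the report.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keep_matching(embedded_pairs, threshold=0.7):
    """Return ids of segments whose source and reference embeddings agree.

    embedded_pairs: iterable of (segment_id, source_vector, reference_vector).
    threshold is a hypothetical cut-off that would be tuned on real data.
    """
    return [
        seg_id
        for seg_id, src_vec, ref_vec in embedded_pairs
        if cosine(src_vec, ref_vec) >= threshold
    ]
```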

Dataset
Other issues in the data
Wrong domain:
source: “Then when the sun came up, God sent a burning east wind: and so great was the heat of the sun on his head that Jonah was overcome, and, requesting death for himself, said, Death is better for me than life.”
reference: “Когда же взошло солнце, навел Бог знойный восточный ветер, и солнце стало палить голову Ионы, так что он изнемог и просил себе смерти, и сказал: лучше мне умереть, нежели жить.”
Wrong language in source or reference:
source: “And when the thing had been looked into, it was seen to be true, and the two of them were put to death by hanging on a tree: and it was put down in the records before the king.”
reference: “El hecho fue investigado y hallado cierto, por lo que ambos fueron colgados en una horca. Esto fue escrito en el libro de las crónicas, en presencia del rey.”

Dataset
Splitting into training and testing sets
TAUS Corona Crisis Corpus
After cleaning, 159,181 segments remain.
We extract 4000 segments as a test set.
We train custom MT models on a training set of about 155,000 segments.
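The split can be sketched as below. The deck does not say how the 4000 test segments were chosen, so random sampling with a fixed seed is an assumption here, as is the function name `split_corpus`.

```python
import random


def split_corpus(segments, test_size=4000, seed=0):
    """Shuffle a copy of the corpus and hold out test_size segments.

    Returns (training_set, test_set). Random selection is an assumption;
    the report only states the sizes of the two sets.
    """
    rng = random.Random(seed)
    shuffled = list(segments)
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]
```

With the 159,181 cleaned segments reported on the slide, this leaves 155,181 segments for training, which matches the "about 155,000" figure.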

Evaluation Metrics
We use three metrics for automatic evaluation of machine translations: hLEPOR, TER, and SacreBLEU.
The scores are mostly well-correlated, and we rely on hLEPOR to build an initial ranking of MT engines and select the top 7.
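Building the initial ranking from per-segment scores amounts to averaging and sorting. The sketch below assumes the hLEPOR scores have already been computed per segment for each engine; `rank_engines` and the input layout are illustrative, not taken from the report.

```python
def rank_engines(scores, top_k=7):
    """Rank engines by their average segment score, highest first.

    scores: {engine_name: [score for each test segment]}.
    Returns (top_k engine names, {engine_name: average score}).
    """
    averages = {engine: sum(vals) / len(vals) for engine, vals in scores.items()}
    ranked = sorted(averages, key=averages.get, reverse=True)
    return ranked[:top_k], averages
```

For a metric where lower is better, such as TER, the sort order would need to be reversed.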

Different metrics are correlated

Average hLEPOR Scores
[Chart legend: stock models, customized models, top engines]

Average TER Scores
Same top runners
[Chart legend: stock models, customized models, top engines]
We use hLEPOR for the analysis below.

Customization analysis
Corpus scores
We have taken a closer look at the customized models and how they compare to stock models.
The custom Google model shows the most score improvement over stock.
The IBM and Microsoft custom models have a slightly lower average score than stock.

Customization analysis
In order to understand where a custom model differs from a stock model, we need to look at specific translations, not the whole test set.
We have developed a method of selecting segments that have significantly different scores in the stock and custom translations, so that we can compare the translations.
This allows us to see how the custom model has improved over stock and where it fails.
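The selection method itself is not described in the deck. One simple stand-in is to keep the segments whose score changed by more than a fixed margin between the stock and custom translations; the function name and the `min_delta` value below are assumptions for illustration.

```python
def diverging_segments(stock, custom, min_delta=0.2):
    """Split segments into those the custom model improved or degraded.

    stock, custom: {segment_id: score} for the same engine.
    min_delta is a hypothetical margin for "significantly different".
    """
    improved, degraded = [], []
    for seg_id in stock.keys() & custom.keys():
        delta = custom[seg_id] - stock[seg_id]
        if delta >= min_delta:
            improved.append(seg_id)
        elif delta <= -min_delta:
            degraded.append(seg_id)
    return sorted(improved), sorted(degraded)
```

Both lists can then be handed to a reviewer, since a score drop does not always mean a worse translation.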

ModernMT custom vs. stock
The custom model has higher scores for nearly a third of all segments.
The segments whose scores have degraded are actually good too; some are better than reference translations, for example:
source: “The Committee is concerned that some prisons are faced with overcrowding, inadequate facilities and poor health conditions.”
reference translation: “Комитет озабочен тем, что в некоторых тюрьмах ь, ые услуги и существуют плохие условия для состояния здоровья.”
stock MT: “Комитет обеспокоен тем, что в некоторых тюрьмах ь, неадекватные условия содержания и плохие санитарные условия.”
custom MT: “Комитет ю некоторых тюрем, неадекватными услугами и плохими санитарно-гигиеническими условиями.”

Google custom vs. stock
In many segments, the custom model handles terminology better than the stock model.
There are some minor omissions and mistranslations in the custom translations. Below is an example of mishandled terminology as well as an omission:
source: “Carestream Health, 2010. CARESTREAM and DIRECTVIEW are trademarks of Carestream Health.”
custom MT: “CareПроисхождение Care CareHealth, 2010. CARERTREAM и DIRECTVIEW являются товарными знаками компании.”

IBM custom vs. stock
The custom model has not noticeably improved compared to stock.
Both stock and custom translations have mistranslations, grammatical errors, and corrupted words, for example:
source: “The Convention requires States parties to take effective measures to abolish social practices prejudicial to the health of children.”
stock MT: “Конвенция требует от государств-участников принятия эффективных мер по ликвидации социальной практики, наносятой ущерб здоровью детей.”
custom MT: “В Конвенции содержится требование к �нять эффективные меры по отмене социальной практики, наносявой ущерб здоровью детей.”

Microsoft custom* vs. stock
Mixed dynamics, no systematic improvement.
Both stock and custom translations have issues, for example:
source: “A bronchoscopy with airway hygiene was performed with further microbiological examination of washings (in bronchial washings - Enterococcus faecium, Klebsiella pneumoniae).”
stock MT: “Бронхоскопия с гигиеной дыхательных путей была проведена с дальнейшим ванием стирок (в бронхиальных стироках - Enterococcus faecium, Klebsiella pneumoniae).”
custom MT: “Бронхоскопия с гигиеной дыхательных путей проводилась при дальнейшем вании промываний (при бронхиальных промывочниях – энтерококка faecium, пневмоциллы клебсиеллы).”
* We trained the model in the Medicine domain

Yandex custom vs. stock
Small average improvement, but many segments sound more fluent in the custom translation.
Improved custom translation:
source: “Countries in sub-Saharan Africa are at different stages of addressing the pandemic, with mixed results.”
stock MT: “Страны субсахарской Африки находятся на разных стадиях борьбы с этой пандемией, и результаты этой борьбы неоднозначны.”
custom MT: “Страны Африки к югу от Сахары находятся на разных этапах борьбы с пандемией, и результаты этой борьбы неоднозначны.”

Reference-Based Scores and Customization
Discussion
Ranking engines by different metrics provides similar results.
The Google engine trained on the dataset has the largest score improvement. There are some minor omissions and mistranslations in the custom translations; however, nearly a third of all segments have improved scores.
The customized Yandex and ModernMT engines show some improvement over stock: custom translations sound more fluent than stock, and sometimes even more fluent than reference translations.
IBM and Microsoft custom translations are not significantly better than stock translations; both contain issues of different severity.

Human Linguistic Quality Analysis
For our evaluation, we pick the seven top-performing engines.
Since customization leads to translation improvement, we exclude the stock versions of custom engines from the evaluation.
Engines provide similar translations for some segments, but disagree on others.
Corpus-level scores show quality for median segments.
To perform human review, we need to select median segments as well as those that will demonstrate the differences between MTs.
[Chart: number of segments by segment difficulty (easy to hard) and MT engine agreement (low to high)]

Extracting groups of segments for review
We calculate average hLEPOR scores for all test segments across the top-performing engines.
Median segments are those that have the average hLEPOR and variance close to the median.
Weak spots are segments that most engines handled well, but one or more engines translated badly. These segments can spotlight a particular engine’s weaknesses.
These groups of segments were analysed by linguists with expertise in the domain of medicine and healthcare and in both English and Russian.
[Chart labels: weak spots, median]
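The two groups can be sketched from a table of per-segment, per-engine scores. The thresholds below (`mean_tol`, `var_tol`, `high`, `low`) and the function name are assumptions chosen for illustration; the deck does not state the cut-offs used.

```python
from statistics import mean, median, pvariance


def select_for_review(scores, mean_tol=0.05, var_tol=0.05, high=0.7, low=0.4):
    """Pick median segments and weak spots from per-engine segment scores.

    scores: {segment_id: {engine_name: hLEPOR score}}.
    Returns (median_segments, weak_spots), where weak_spots is a list of
    (segment_id, engine_name) pairs. All thresholds are hypothetical.
    """
    means = {s: mean(list(e.values())) for s, e in scores.items()}
    variances = {s: pvariance(list(e.values())) for s, e in scores.items()}
    med_mean = median(means.values())
    med_var = median(variances.values())

    # Median segments: both the average and the variance sit near the median.
    median_segments = [
        s for s in scores
        if abs(means[s] - med_mean) <= mean_tol
        and abs(variances[s] - med_var) <= var_tol
    ]
    # Weak spots: the average across engines is high, but one engine scores low.
    weak_spots = [
        (s, engine)
        for s, e in scores.items()
        if means[s] >= high
        for engine, score in e.items()
        if score <= low
    ]
    return median_segments, weak_spots
```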

Human Linguistic Quality Analysis
We continue our evaluation with human review of three kinds of texts: typical segments from the test set, weak spots (non-typical translations), and a whole document on COVID-19.
We are enormously grateful to our LSP partners, who performed the LQA: AUM and MedConsult.

Segments Review
Blind within-subjects review
LQA was performed by our two partners: AUM and MedConsult. The experts received the source segments and all translations (including the human reference) without labels.
Experts rated the estimated effort to post-edit every segment. For all segments where post-editing makes sense (i.e. neither perfect nor useless segments), reviewers were also asked to provide a suggested translation.
In our analysis, we explore the level of reviewer agreement and provide two models to rank the engines (based on the PE distance and on the estimated cost saving compared to human translation).

Segments Review
Editing effort ratings, validated
Level (estimated effort saving): description
PERFECT (0.9): There is absolutely nothing to improve. The translation sounds like it was produced by a professional human translator who understands the context in which the source segment appears.
GOOD (0.75): The translation conveys the meaning of the source accurately and does not contain any grammatical errors but it does not sound quite natural. Style and tone need some improvement.
FAIR (0.5): The translation adequately conveys the meaning of the source sentence. There are some mistakes that are easy to fix, the effort is similar to reviewing human translation or fuzzy TM matches.
BAD (0.2): The translation adequately conveys the meaning of the source sentence. There are mistakes of different severity. Fixing these mistakes requires careful examination of the source sentence and significant effort. However, the machine translation still provides speed-up.
USELESS (0): The translation is completely irrelevant to the source, it's either useless or misleading, the meaning of the source sentence is lost, it should be translated from scratch.

Reviewer agreement
One reviewer has more PERFECT and USELESS segments

Reviewer agreement
According to one reviewer, slightly more effort is needed for many segments.
To deal with reviewer disagreement, we compute average ratings across reviewers.
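Averaging across reviewers and turning the ratings into an effort-saving figure could look like the sketch below. The saving values per level are those of the rating scale in this deck (0.9, 0.75, 0.5, 0.2, 0); the aggregation order (per segment first, then over segments) and the function name are assumptions.

```python
# Estimated effort saving per rating level, as given in the rating scale.
SAVING = {"PERFECT": 0.9, "GOOD": 0.75, "FAIR": 0.5, "BAD": 0.2, "USELESS": 0.0}


def effort_saving(ratings_by_reviewer):
    """Average estimated effort saving for one engine.

    ratings_by_reviewer: one list of rating labels per reviewer, aligned so
    that position i refers to the same segment for every reviewer.
    """
    per_segment = [
        sum(SAVING[label] for label in labels) / len(labels)
        for labels in zip(*ratings_by_reviewer)
    ]
    return sum(per_segment) / len(per_segment)
```

For example, two reviewers rating three segments as PERFECT/GOOD/BAD and GOOD/GOOD/USELESS give per-segment savings of 0.825, 0.75, and 0.1.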

Segments Review
MT model average ranking
[Chart legend: perfect segments, good segments, fair segments, bad segments, useless segments]

Segments Review
Why isn’t Human Translation first?
Many human translations are not very literal.
While Reviewer 1 rates them PERFECT, Reviewer 2 suggests edits for all translations, to make them more literal.

Segments Review
Example of a suggested translation
The reference translation is rephrased; the suggested translation is more literal:
source: “Information generated from the newly established Reproductive and Child Health Division of the Ministry of Health and Sanitation classified the causes of maternal mortality into three categories viz root causes, underlying causes, and immediate causes.”
reference translation: “На основе информации, подготовленной недавно созданным в Министерстве здравоохранения и санитарии Отделом репродуктивного здоровья и здоровья детей, причины материнской смертности сведены в три категории: первопричины, глубинные причины и непосредственные причины.”
suggested translation: “Информация, полученная от недавно созданного Отдела репродуктивного здоровья и здоровья детей Министерства здравоохранения и санитарии, классифицировала причины материнской смертности по трем категориям: первопричины, глубинные причины и непосредственные причины.”

Segments Review
Custom engines lead
ModernMT custom and Yandex custom offer the same potential effort saving as Human Translation.

Ranking by post-editing distance
Yandex custom is the undisputed leader
ModernMT custom is behind two stock engines.
Human translation has a large edit distance from suggested translations.
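The deck does not define how the PE distance per word is computed. A plausible reading, sketched here as an assumption, is the character-level Levenshtein distance between the MT output and the reviewer's suggested translation, divided by the number of words in the suggested translation.

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def pe_distance_per_word(mt_output, suggested):
    """Edit distance normalised by the word count of the suggested translation.

    Normalising by the suggested translation (rather than the MT output)
    is an assumption made for this sketch.
    """
    words = len(suggested.split())
    return levenshtein(mt_output, suggested) / words if words else 0.0
```

Averaging this value over all reviewed segments gives one number per engine that can be ranked from lowest to highest.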

Segments Review
Discussion
Yandex custom is in first place, with very high potential effort saving and the lowest editing distance to suggested translations.
ModernMT custom offers the same effort saving but a larger edit distance from suggested translations.
The Human translation is not rated the best, because the machine translations, as well as translations suggested by the reviewers, are more literal.

MT Weak Spots Analysis
Weak spots are segments for which a particular engine received a low score, while the average score across all top-performing providers is high.
Weak spots showcase the contexts where an engine can fail.
Some weak spots are valid alternative translations; they are not included in this chart.

MT Weak Spots Analysis
Based on data from two reviewers
[Table: weak spot types (mistranslation, translated DNT, grammatical error, untranslated words) marked per engine, including Google (custom) and Yandex (custom); the cell marks did not survive extraction]

MT Weak Spots Analysis
Examples
Yandex custom (mistranslation):
source: “Increased eosinophil count to eosinophilic myelocytes.”
reference translation: “Увеличено количество эозинофилов до
MT: “Увеличение количества эозинофилов в эозинофильных миелоцитах.”
ModernMT custom (omission, repetition):
source: “In recent years it has focused a lot of attention and invested tremendously in the provision of health care services to women in the area of family planning, pregnancy, confinement and during the post-natal period.”
MT: “В последние годы она уделяет большое внимание вопросам планирования семьи, беременности, родов и послеродового периода и уделяет им огромное внимание.”

MT Weak Spots Analysis
Discussion
Results of human review of the weak spots show that many of them are acceptable alternative translations or even better than reference translations. However, engines have some actual weak spots too.
Yandex custom has one mistranslation of terminology and one mishandled acronym, as well as a couple of grammatical errors.
ModernMT custom has more weak spots, including several omissions and mistranslations.

HOLISTIC REVIEW RESULTS
Blind within-subjects review
We have translated a whole document on COVID-19 with the seven top-performing engines.
Experts have reviewed the translations’ readability, adequacy, and consistency, and commented on any issues they found.
We have analyzed the results and computed the rating for each engine’s translation.

HOLISTIC REVIEW RESULTS
Fragment of the test document
Covid-19 is caused by SARS-CoV-2, a member of the coronavirus family. Its closest relatives are the SARS-CoV virus, with which it shares roughly 79% genomic similarity, and MERS-CoV virus, with 50% similarity. They are enveloped viruses with a positive-sense single-stranded RNA genome and a nucleocapsid of helical symmetry. The genome size of coronaviruses ranges from approximately 27 to 34 kilobases (29.9 for SARS-CoV-2), the largest among known RNA viruses. Compared to the seasonal flu virus, SARS-CoV-2 is characterised by both higher infectivity (basic reproductive number 2.0–2.5 vs 1.3 for flu) and higher disease severity, both in hospitalization rate (20% vs 2%) and case fatality rate (3% vs 0.1%). SARS-CoV-2 also starkly contrasts with its closest relatives in these regards, causing much less severe symptoms than both SARS-CoV and MERS-CoV (with fatality rates around 10% and 35% respectively).

HOLISTIC REVIEW
Criteria
criteria | explanation
untranslated elements | number of words left untranslated
translation adequacy | the translation matches the source
translation consistency | neighboring sentences are translated consistently
readability | the translation is easy to read for a human
specific issues | formatting, punctuation, lists, links, added junk

HOLISTIC REVIEW RESULTS
Average of two reviewers’ ratings
# | MT engine | untranslated segments | adequacy | consistency | readability | specific issues | reviewer comments | points
1 | DeepL stock | None | Good | Easy to fix / Good | Easy to fix / Good | None | |
2 | PROMT stock | None | Good | Easy to fix / Good | Good | Few / None | |
3 | Google custom | Few / None | Good | Easy to fix / Good | Easy to fix / Good | Few / None | |
4 | ModernMT custom | None | Understandable / Good | Easy to fix | Good | Few / None | mistranslated terminology | 15
5 | Yandex custom | Few / None | Understandable / Good | Easy to fix / Good | Easy to fix / Good | Few / None | mistranslated terminology | 14.5
6 | Amazon stock | Few / None | Understandable | Easy to fix | Easy to fix / Good | Few / None | mistranslated terminology | 13.5
7 | GTCOM stock | Few / None | Understandable | Too bad / Hard to fix | Easy to fix / Hard to fix | Few / None | not satisfactory | 11
Reviewer comments and points for rows 1 to 3 survive only as fragments: “fluent translation, some mistranslated terminology, inconsistent”, “fluent t”, 15.

HOLISTIC REVIEW RESULTS
Discussion
Stock engines DeepL and PROMT lead in holistic review.
All engines have trouble with terminology to some extent: it is mistranslated or translated inconsistently within the document. These issues are easy to notice and fix.

LINGUISTIC QUALITY ANALYSIS
Summary
# | MT engine | holistic review points (whole document) | effort saving (typical segments) | PE distance per word (typical segments) | weak spots (non-typical segments) | further improvement
1 | Yandex custom | 14.5 | 71% (Human Translation: 70%) | 1.24 (Human Translation: 2.11) | mistranslations, grammatical errors | possible with more data
2 | ModernMT custom | 15 | 71% (Human Translation: 70%) | 1.58 | one omission, mistranslations | possible with more data
3 | DeepL stock | 16 | 0.68 | 1.54 | omissions, mistranslations, untranslated [...] |
4 | [row not preserved; surviving fragment: “ated terminology”]
5 | GTCOM stock | 11 | 0.6 | 1.62 | many omissions and mistranslations |
6 | Amazon stock | 13.5 | 0.58 | 1.63 | two omissions, one mistranslation, translated DNT | possible with a glossary
7 | Google custom | 15 | 0.55 | 2.15 | mistranslations, translated DNT | possible with more data and a glossary

LINGUISTIC QUALITY ANALYSIS
Conclusions
Human LQA shows that the custom engines ModernMT and Yandex offer the largest potential effort saving, over 70%.
Yandex has the smallest edit distance from suggested translations, and few weak spots, nothing severe.
ModernMT has a larger edit distance from suggested translations and slightly more serious weak spots, e.g. an omission.
Both these engines have issues with terminology in the translation of a larger text about COVID-19. PROMT stock handles terminology better.

LINGUISTIC QUALITY ANALYSIS
Conclusions
Not all segments in the dataset are strictly in the medicine/healthcare domain. Many are on the fringes of this domain, dealing with equal access to healthcare, for example:
“Every person legally residing in Luxembourg had equal access to nearly free health care.”
“Protection of the health of mothers and children is a priority stated by the law.”
Such segments do not contain medical terminology, so the customized models could not learn how to handle it.
Training the models on more data that is strictly medical would result in better translations.

Price Comparison
[Chart not preserved]

HOW SAFE IS MY DATA?
Data protected by ToS: Google (link).
Data protected by Data Protection and Privacy Policy: ModernMT (link).
Amazon may store and use input data to improve its technologies (link).
DeepL does not store input data and uses it only to provide the translation (link).
Yandex does not store input data and uses it only to provide the translation (link).

Intento Web Demo
End-to-End: Get a portfolio of Machine Translation engines optimal for your language pairs, domains, and available training data.
Fast and Safe: 4-5 weeks from assorted TMs and glossaries to winning MT engines with ROI estimation for Post-Editing and Real-Time Machine Translation.
Trusted: We run 15-20 MT Procurement projects per month for global retail, travel, and technology companies under strict Security, Quality and Data Protection requirements. ISO 27001 certified.
REACH US at hello@inten.to

Intento Plugins and Connectors
Microsoft Office (Outlook, Word, Excel)
Google Chrome and Microsoft Edge (extension)
memoQ (included in 9.4, also private plugin)
SDL Trados (SDL AppStore)
XTM (XLIFF API Connector)
MateCat (private plugin)
Any Enterprise TMS via XLIFF connector.
Missing a connector? Reach us at hello@inten.to!

Any questions?
Get in touch!
If you’d like to have a closer look at the data or to reproduce the results, feel free to contact us at hello@inten.to. The following data assets are available:
this report in PDF
the training set
the test set
test set translations by MT engines
segments for human review, commented by the reviewers
MT engines’ weak spots, commented by the reviewers

AUM Translation Services
About
AUM Translation Services is a reliable provider of Russian language translation services from Novosibirsk, Russia. Located close to the scientific center of Siberia, [...]orodok, we combine state-of-the-art translation technology with expert knowledge and dedication to quality.
Services Offered
TEP
PEMT including MT linguistic testing
TM cleaning
SEO translation
Transcreation
Terminology management
SME review
LQA
PEMT Experience
PEMT since 2016
Over 17.7 M words post-edited in life sciences, Software & [...], IT & Telecom
Over 2.2 M TM segments cleaned

MedConsult is Russia’s first translation agency specialized in medical and pharmaceutical translation
15 of the world’s top 20 companies use our translation services
17 years of experience in medical and pharmaceutical translation for international and Russian companies
100 per cent of our translators and editors are certified doctors and pharmacists
info@medconsult.ru 7 495 771 1884

Intento, Inc.
hello@inten.to
2150 Shattuck Ave
Berkeley CA 94704
https://inten.to
