Corpus-based Chinese–Hungarian Dictionary Building - Nytud

Corpus-based Chinese–Hungarian dictionary building
Judit Ács, HAS
judit@aut.me.hu
Mathematical Linguistics Group
Nov 21, 2014

Motivation
- Machine-readable dictionaries
- Manual building takes too much effort
- Zhongxiu Liu (Aurora) and Yidi Zhang under the supervision of Attila Zséder

Method
1. Data collection
2. Preprocessing
3. Sentence alignment
4. Dictionary extraction

Data collection and preprocessing
- Freely available books
- Books enter the public domain 50 or 70 years after the author's death
- 50 books collected in Chinese and Hungarian
- Chinese text was segmented into words
- Hungarian text was stemmed
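The slides do not name the segmenter or stemmer that was used, so the following is only a toy sketch of what the two preprocessing steps do. It swaps in greedy forward maximum matching for Chinese segmentation and naive suffix stripping for Hungarian stemming; the lexicon, the suffix list, and the function names are hypothetical and much simpler than any real tool.

```python
def segment_max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching: take the longest lexicon word at each position."""
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Fall back to a single character when no lexicon word matches.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


def strip_suffix(word, suffixes, min_stem=3):
    """Crude stemmer: remove the longest matching suffix, keeping a minimal stem."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


# Hypothetical toy resources, for illustration only.
ZH_LEXICON = {"三百", "小麦", "马铃薯", "一百"}
HU_SUFFIXES = {"ak", "ek", "ban", "ben"}

zh_tokens = segment_max_match("三百亩小麦", ZH_LEXICON)
hu_stems = [strip_suffix(w, HU_SUFFIXES) for w in ["házak", "könyvek"]]
print(zh_tokens)
print(hu_stems)
```

Real Hungarian morphology (vowel lengthening, stacked suffixes) is well beyond what suffix stripping can handle, so this sketch only illustrates the shape of the input that the later steps expect: one list of word tokens per sentence.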

Sentence alignment
- One sentence is often translated as several sentences
- Hunalign (Varga et al. 2005)
- Many-to-many alignment between corresponding Chinese and Hungarian segments
- hu–zh segment pairs with approximately the same meaning
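To make the idea of many-to-many segments concrete, here is a minimal sketch that turns a list of alignment boundaries into segment pairs. The boundary-list ("ladder") representation, the function name, and the placeholder sentences are assumptions made for illustration; this is not a description of Hunalign's actual input or output handling.

```python
def ladder_to_segments(ladder, hu_sentences, zh_sentences):
    """Turn consecutive alignment boundaries into (hu_segment, zh_segment) pairs.

    Each ladder entry (i, j) marks a boundary before Hungarian sentence i
    and Chinese sentence j; the sentences between two consecutive
    boundaries form one aligned segment pair.
    """
    pairs = []
    for (hu_start, zh_start), (hu_end, zh_end) in zip(ladder, ladder[1:]):
        hu_segment = hu_sentences[hu_start:hu_end]
        zh_segment = zh_sentences[zh_start:zh_end]
        # Drop spans where one side is empty (a sentence with no counterpart).
        if hu_segment and zh_segment:
            pairs.append((hu_segment, zh_segment))
    return pairs


# Placeholder sentences; a real corpus would hold tokenized text here.
hu = ["hu-1", "hu-2", "hu-3", "hu-4"]
zh = ["zh-1", "zh-2", "zh-3"]
ladder = [(0, 0), (2, 1), (3, 3), (4, 3)]

segments = ladder_to_segments(ladder, hu, zh)
for hu_segment, zh_segment in segments:
    print(hu_segment, "<->", zh_segment)
```

In this toy input the first pair is a 2-to-1 alignment, the second is 1-to-2, and the last Hungarian sentence has no Chinese counterpart, so it is dropped.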

Dictionary extraction
- Similar words often appear in corresponding segments
- Frequent words appear in many segments
- Word co-occurrence statistics: Dice coefficient
- Hundict (Ács et al. 2013)
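The first two bullets can be illustrated with raw co-occurrence counts. The sketch below is not Hundict's implementation; the three segment pairs are invented toy data, and the function name is hypothetical. It shows why raw counts alone are not enough: a frequent word co-occurs with almost everything.

```python
from collections import Counter
from itertools import product

# Invented toy segment pairs: (Hungarian tokens, Chinese tokens).
segment_pairs = [
    (["a", "kutya"], ["的", "狗"]),
    (["a", "macska"], ["的", "猫"]),
    (["a", "kutya", "macska"], ["的", "狗", "猫"]),
]


def cooccurrence_counts(pairs):
    """Count in how many segment pairs each (hu_word, zh_word) pair co-occurs."""
    counts = Counter()
    for hu_words, zh_words in pairs:
        # Use sets so a word repeated inside one segment is counted once.
        counts.update(product(set(hu_words), set(zh_words)))
    return counts


counts = cooccurrence_counts(segment_pairs)
print(counts[("kutya", "狗")])  # translation pair
print(counts[("a", "狗")])      # frequent word paired with the same target
```

On this toy data the translation pair ("kutya", "狗") and the spurious pair ("a", "狗") both co-occur in two segment pairs, so the raw count cannot separate them. A normalized statistic that also accounts for how often each word occurs overall is needed.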

Dice coefficient
Between the Hungarian word $w_{hu}$ and the Chinese word $w_{zh}$:

$$\mathrm{Dice}(w_{hu}, w_{zh}) = \frac{2\,|W_{hu} \cap W_{zh}|}{|W_{hu}| + |W_{zh}|},$$

where
- $W_{hu}$ is the set of segment pairs in which $w_{hu}$ appears
- $W_{zh}$ is the set of segment pairs in which $w_{zh}$ appears
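A minimal sketch of the definition above, assuming segment pairs are identified by their index in a list. The toy data and helper names are hypothetical, and this is not meant to reproduce Hundict itself.

```python
from collections import defaultdict

# Invented toy segment pairs: (Hungarian tokens, Chinese tokens).
segment_pairs = [
    (["a", "kutya"], ["的", "狗"]),
    (["a", "macska"], ["的", "猫"]),
    (["a", "kutya", "macska"], ["的", "狗", "猫"]),
]


def occurrence_sets(pairs):
    """Map each word to the set of segment-pair indices in which it appears."""
    hu_occ, zh_occ = defaultdict(set), defaultdict(set)
    for idx, (hu_words, zh_words) in enumerate(pairs):
        for word in hu_words:
            hu_occ[word].add(idx)
        for word in zh_words:
            zh_occ[word].add(idx)
    return hu_occ, zh_occ


def dice(set_hu, set_zh):
    """Dice coefficient of two occurrence sets: 2|A ∩ B| / (|A| + |B|)."""
    if not set_hu and not set_zh:
        return 0.0  # Neither word occurs anywhere; avoid dividing by zero.
    return 2 * len(set_hu & set_zh) / (len(set_hu) + len(set_zh))


hu_occ, zh_occ = occurrence_sets(segment_pairs)
for hu_word in ["kutya", "a", "macska"]:
    print(hu_word, "狗", dice(hu_occ[hu_word], zh_occ["狗"]))
```

On this toy data "kutya" and "狗" appear in exactly the same two segment pairs, giving a score of 1.0. The frequent word "a" shares those two pairs but also appears elsewhere, so its score with "狗" drops to 0.8, and "macska" scores 0.5.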

Example
于是 三百 亩 小麦 , 一百 亩 马铃薯 , 一百五十 亩 苜蓿 , 没有 一 亩 地 荒废 了 。
Akkor lesz háromszáz gyeszjatyina búzája , száz burgonyája , százötven lóheréje , s egyetlen kimerült gyeszjatyinája sem .

Results
- 34k words
- Scores shown: 1.0, 1.0, 0.93, 0.75
- Sample entries (Hungarian, Chinese, English):
  csillagász, 天文学家, astronomer
  mérnök, 工程师, engineer
  huszonnegyedik, 二十四日, (on the) 24th

Conclusions
- Medium to high quality translations
- Suitable for machine translation, information retrieval etc.
- Evaluation is needed
- Larger corpora: web pages, Wikipedia
- Comparable corpora

Bibliography
- Judit Ács, Katalin Pajkossy, and András Kornai. Building basic vocabulary across 40 languages. In Proceedings of the Sixth Workshop on Building and Using Comparable Corpora, pages 52–58, Sofia, Bulgaria, August 2013. Association for Computational Linguistics.
- Daniel Varga, Peter Halacsy, Andras Kornai, Viktor Nagy, Laszlo Nemeth, and Viktor Tron. Parallel corpora for medium density languages. In N Nicolov, K Bontcheva, G Angelova, and R Mitkov, editors, Recent Advances in Natural Language Processing IV. Selected papers from RANLP-05, pages 247–258. Benjamins, Amsterdam, 2007.
