A Parallel-corpora Study of Morpheme Preferences in Machine Translation (Graduation Thesis)
This paper compares the lexical preferences of human and machine translation through statistical analysis of a 300-million-word parallel corpus. Multiple corpus sources were examined and filtered, after which a large body of English text from the United Nations Parallel Corpus was selected, pretreated, and translated into Chinese using Google Translate. The machine output was segmented and POS-tagged for comparison with its human-translated counterpart and the English original. Results show a high level of similarity between the two translations, with differences lying mostly in the use of structural or function words. The comparison of morpheme frequencies also revealed syntactic variation. Several notable trends and correlations are identified and analyzed in detail, with examples. This study can help computational linguists plan improvements to machine translation algorithms and natural language processing systems.
Key Words: Machine Translation; Parallel corpus; Morpheme
1 Introduction
2 Research method and data source
2.1 Overview
2.2 Tools used
2.3 Data source
3 Pretreatment and processing
3.1 Pretreatment
3.2 Processing and obtaining results
4 Analysis
4.1 Overall
4.2 Specific cases
5 Conclusion
A Parallel-corpora Study of Morpheme Preferences in Machine Translation
The evaluation of machine translation (MT) is an important field of work for both computer scientists and linguists. Exploring how MT algorithms process natural languages makes it clearer how machines can be improved and how human language works (Doddington, 2002). Past research has made considerable progress on different approaches to MT evaluation.
Manual evaluation by professional linguists is a mature method characterized by high levels of accuracy but low efficiency (Papineni, 2002).
To the same end, computer scientists have pursued approaches ranging from optimizing evaluation criteria to using specially trained AI for error detection.
As more recent research suggests, combining the efforts of human language professionals with computer evaluation can yield more promising results (Popović, 2014). Both morphological and syntactic studies have been conducted before, yet none has used the United Nations Parallel Corpus, and not all of them reflect the current state of neural machine translation, whose machine learning process differs substantially from that of earlier statistical or rule-based systems (Wolk, 2015). The author therefore conducts this research using the most recent version of Google Translate (Google Translate 2018) on a sample of more than 300 million words, in an attempt to reveal the patterns by which the MT system chooses morphemes differently from human translators.
Research method and data source
The goal of this study is to characterize, on a large scale, how human translators and machines differ in their use of different types of Chinese morphemes. On a sample of over 300 million words, it is impractical to measure the exact usage difference of each morpheme within each context (Papineni, 2002): doing so would be extremely time-consuming and would scatter the focus across thousands of different words and phrases whose connections are unclear. Hence, the subject of this research is the parts of speech (POS) used in translation, which effectively represent morpheme types. POS differences do not necessarily reflect the correctness of a text, but they are evident enough to distinguish stylistic bias (Tanawongsuwan 2010).
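A POS-level comparison of this kind can be illustrated with a minimal sketch. It assumes both translations have already been segmented and tagged into (word, tag) pairs. The function names, the tag labels ("n", "v", "u", "r") and the toy sentences are hypothetical and chosen only for illustration; they are not the tools or data used in this study.

```python
from collections import Counter


def pos_distribution(tagged_tokens):
    """Return the relative frequency of each POS tag in a list of (word, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


def pos_differences(human, machine):
    """Compare two tagged texts.

    Returns rows of (tag, human_share, machine_share, machine_minus_human),
    sorted by the absolute size of the difference (largest first), with ties
    broken alphabetically by tag.
    """
    h = pos_distribution(human)
    m = pos_distribution(machine)
    rows = [
        (tag, h.get(tag, 0.0), m.get(tag, 0.0), m.get(tag, 0.0) - h.get(tag, 0.0))
        for tag in set(h) | set(m)
    ]
    rows.sort(key=lambda row: (-abs(row[3]), row[0]))
    return rows


# Toy, hand-tagged data (hypothetical): the machine version uses one more
# particle ("u") and one fewer noun ("n") than the human version.
human_tagged = [("我们", "r"), ("的", "u"), ("目标", "n"), ("是", "v"), ("和平", "n")]
machine_tagged = [("我们", "r"), ("的", "u"), ("目标", "n"), ("是", "v"), ("的", "u")]

for tag, h_share, m_share, diff in pos_differences(human_tagged, machine_tagged):
    print(f"{tag}\thuman={h_share:.2f}\tmachine={m_share:.2f}\tdiff={diff:+.2f}")
```

Working with relative shares rather than raw counts keeps the comparison meaningful when the two translations differ in length, which is the usual situation for human and machine output of the same source text.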
The language pair being studied is English-Chinese.