[Graduation Project, Day 02]

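This post discusses preprocessing a comparable corpus for a graduation project and why that preprocessing matters for the translation step. The corpus must go through stopword handling before it becomes a useful resource. A comparable corpus consists of texts that share certain attributes; a bilingual comparable corpus in particular avoids the limitations of a parallel corpus and is a promising source for extracting corresponding bilingual word pairs.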


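I don't fully understand this yet; still feeling my way through it...

Translation can only start once the corpus has been preprocessed.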


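Three kinds of objects:

- Corpus: the corpus text must be preprocessed before TreeTagger tags its parts of speech.
- Dictionary file: the words in the dictionary are numbered and merged.
- Stopwords: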
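To make the "numbered and merged" step for the dictionary file concrete, here is a minimal sketch. The function name `merge_vocab` and the choice of consecutive integer IDs are my own assumptions for illustration, not necessarily how the project's scripts do it.

```python
def merge_vocab(vocab, words):
    """Add unseen words to vocab, giving each one the next free integer ID.

    Hypothetical helper: words that are already present keep their old ID.
    """
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


vocab = {}
merge_vocab(vocab, ["corpus", "word", "corpus"])   # first word list
merge_vocab(vocab, ["word", "translation"])        # second word list, merged in
print(vocab)  # {'corpus': 0, 'word': 1, 'translation': 2}
```

Because existing words keep their IDs, merging in another word list never renumbers anything that was seen earlier.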

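One function:

- Translating new words:
  - tag the extracted words and update their word-ID information;
  - at the document level, collect all word-count & word-context information;
  - merge the corpus information;
  - at the corpus level, extract the relevant word-count & word-context information, in preparation for the later dictionary extraction.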
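The document-level and corpus-level steps above can be sketched roughly as follows. This is only my guess at what "word-count & word-context" means in practice: I assume that "context" is the set of words within a fixed window around each token, and the names `document_stats`, `merge_stats`, and `window` are hypothetical.

```python
from collections import Counter, defaultdict


def document_stats(tokens, window=2):
    """Return (word counts, word -> context-word counts) for one document."""
    counts = Counter(tokens)
    contexts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                contexts[word][tokens[j]] += 1
    return counts, contexts


def merge_stats(per_document):
    """Sum document-level statistics into corpus-level statistics."""
    total_counts = Counter()
    total_contexts = defaultdict(Counter)
    for counts, contexts in per_document:
        total_counts.update(counts)
        for word, ctx in contexts.items():
            total_contexts[word].update(ctx)
    return total_counts, total_contexts


# Two toy "documents", already tokenized.
docs = [["bank", "loan", "rate"], ["bank", "river", "bank"]]
counts, contexts = merge_stats(document_stats(d, window=1) for d in docs)
print(counts["bank"])             # 3
print(contexts["bank"]["river"])  # 2
```

Keeping the per-document statistics separate until the merge means a single document can be re-processed without recounting the whole corpus.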

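Tools:
- TreeTagger: a part-of-speech tagger that labels each word in a sentence with its part of speech.
- Strawberry Perl: a Perl distribution for Windows, used to run script files with the .pl extension.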
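As a rough illustration of working with the tagger's output, the sketch below parses TreeTagger-style lines. It assumes the tagger was run so that each line holds a token, a tag, and a lemma separated by tabs; the sample text is hand-written for this example, not real output from my corpus, and `parse_treetagger` is a hypothetical helper name.

```python
def parse_treetagger(output):
    """Parse tab-separated lines into (token, tag, lemma) tuples."""
    rows = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:  # skip blank or malformed lines
            rows.append(tuple(parts))
    return rows


# Hand-written sample in the assumed token<TAB>tag<TAB>lemma layout.
sample = "The\tDT\tthe\ncorpora\tNNS\tcorpus\nare\tVBP\tbe\nlarge\tJJ\tlarge\n"
tagged = parse_treetagger(sample)

# Keep only noun lemmas (tags starting with "N": NN, NNS, NP, NPS).
nouns = [lemma for _, tag, lemma in tagged if tag.startswith("N")]
print(nouns)  # ['corpus']
```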

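Comparable-corpus new-word translation system:
1. Import the corpora for the two languages from the corpus directory and preprocess each one separately.
2. Remove each language's stopwords from its corpus, which makes preprocessing more efficient.
3. In the dictionary file,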
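The stopword-removal step can be sketched as below. The stopword lists here are tiny toy sets I made up for the example; a real run would load one stopword file per language. The name `remove_stopwords` is hypothetical, and the input is assumed to be tokenized already.

```python
def remove_stopwords(tokens, stopwords):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]


# Toy stopword sets, one per language (assumptions for illustration only).
english_stopwords = {"the", "of", "is", "a"}
chinese_stopwords = {"的", "了", "是"}

english_kept = remove_stopwords(["The", "size", "of", "a", "corpus"], english_stopwords)
chinese_kept = remove_stopwords(["语料库", "的", "预处理"], chinese_stopwords)
print(english_kept)  # ['size', 'corpus']
print(chinese_kept)  # ['语料库', '预处理']
```

Storing the stopwords in a set keeps each lookup constant-time, which matters once the corpus is large.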


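Background on corpora:

1. A corpus stores language material that has actually occurred in real language use; it is a basic resource that carries linguistic knowledge in electronic, computer-readable form; and the raw material must be processed (analyzed and annotated) before it becomes a useful resource.

2. A comparable corpus consists of texts that share certain attributes. A bilingual comparable corpus consists of texts in two languages that are similar in some respect, for example Chinese and English news stories on the same topic published on the same day by different websites. Both the Chinese and the English texts are originals: each is a description of the event written in the author's native language by a different writer or reporter. A comparable corpus therefore does not have the drawback of a parallel corpus, where the translation is constrained by the source text, which makes it a very promising source for extracting genuinely corresponding bilingual word pairs.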
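To see how word pairs might be pulled out of a comparable corpus, here is a small sketch of one common idea: compare context vectors across languages through a seed dictionary and rank candidates by cosine similarity. I have not confirmed that this project's system works this way, so treat it as an assumption; the seed dictionary, the context counts, and the helper names `translate_vector` and `cosine` are all invented for the example.

```python
import math


def translate_vector(vector, seed_dict):
    """Map source-language context words to target words via a seed dictionary."""
    translated = {}
    for word, weight in vector.items():
        if word in seed_dict:  # context words without a known translation are dropped
            target = seed_dict[word]
            translated[target] = translated.get(target, 0) + weight
    return translated


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy data: a seed dictionary, the context of one untranslated Chinese word,
# and the contexts of two English candidate words.
seed_dict = {"河": "river", "水": "water", "钱": "money"}
source_context = {"河": 3, "水": 2}
candidates = {
    "shore": {"river": 4, "water": 1},
    "loan": {"money": 5, "rate": 2},
}

query = translate_vector(source_context, seed_dict)
best = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(best)  # shore
```

The two corpora never need to be sentence-aligned for this to work; only the seed dictionary links the two languages.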



TreeTagger text annotation

Appendix 2: TreeTagger tagset

- CC: Coordinating conjunction
- CD: Cardinal number
- DT: Article and determiner
- EX: Existential there
- FW: Foreign word
- IN: Preposition or subordinating conjunction
- JJ: Adjective
- JJR: Comparative adjective
- JJS: Superlative adjective
- LS: List item marker
- MD: Modal verb
- NN: Common noun, singular or mass
- NNS: Common noun, plural
- NP: Proper noun, singular
- NPS: Proper noun, plural
- PDT: Predeterminer
- POS: Possessive ending
- PP: Personal pronoun
- PP$: Possessive pronoun
- RB: Adverb
- RBR: Comparative adverb
- RBS: Superlative adverb
- RP: Particle
- SYM: Symbol
- TO: to
- UH: Exclamation or interjection
- VB: BE verb, base form (be)
- VBD: Past tense verb of BE (was, were)
- VBG: Gerund or present participle of BE verb (being)
- VBN: Past participle of BE verb (been)
- VBP: Present tense (other than 3rd person singular) of BE verb (am, are)
- VBZ: Present tense (3rd person singular) of BE verb (is)
- VD: DO verb, base form (do)
- VDD: Past tense verb of DO (did)
- VDG: Gerund or present participle of DO verb (doing)
- VDN: Past participle of DO verb (done)
- VDP: Present tense (other than 3rd person singular) of DO verb (do)
- VDZ: Present tense (3rd person singular) of DO verb (does)
- VH: HAVE verb, base form (have)
- VHD: Past tense verb of HAVE (had)
- VHG: Gerund or present participle of HAVE verb (having)
- VHN: Past participle of HAVE verb (had)
- VHP: Present tense (other than 3rd person singular) of HAVE verb (have)
- VHZ: Present tense (3rd person singular) of HAVE verb (has)
- VV: Lexical verb, base form (e.g. live)
- VVD: Past tense verb of lexical verb (e.g. lived)
- VVG: Gerund or present participle of lexical verb (living)
- VVN: Past participle of lexical verb (lived, shown)
- VVP: Present tense (other than 3rd person singular) of lexical verb (live)
- VVZ: Present tense (3rd person singular) of lexical verb (lives)
- WDT: Wh-determiner
- WP: Wh-pronoun
- WP$: Possessive wh-pronoun
- WRB: Wh-adverb