mosesdecoder/tokenizer.perl usage notes

License: CC 4.0 BY-SA

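tokenizer.perl is a small utility from the Moses statistical machine translation system that tokenizes text in English, German, and other languages. By default it separates punctuation from words and escapes quotes, but leaves hyphenated words intact. Options adjust this behavior: `-l` sets the language, `-a` enables aggressive hyphen splitting, and `-no-escape` turns off the escaping of special characters. It is typically used to preprocess text for machine translation and other natural language processing tasks.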

Usage:

$ perl tokenizer.perl -l en < [input file] > [tokenized output]

Here -l en indicates that the input file is English.

For example:

$ perl tokenizer.perl -l en < train.en > train.tok.en
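
When the tokenizer has to run from a Python preprocessing pipeline, the same command can be assembled and launched with subprocess. The following is only a minimal sketch: `build_tokenizer_cmd` and `tokenize_file` are hypothetical helper names, and the script path is whatever your own mosesdecoder checkout uses. The flags themselves are the ones listed in the script's help text.

```python
import subprocess


def build_tokenizer_cmd(script, lang="en", aggressive=False, no_escape=False, threads=1):
    """Assemble the argument list for tokenizer.perl (hypothetical helper)."""
    cmd = ["perl", str(script), "-l", lang]
    if aggressive:
        cmd.append("-a")            # aggressive hyphen splitting
    if no_escape:
        cmd.append("-no-escape")    # leave quotes and similar characters unescaped
    if threads > 1:
        cmd += ["-threads", str(threads)]
    return cmd


def tokenize_file(script, src, dst, **options):
    """Stream src through tokenizer.perl into dst, like `< src > dst` in the shell."""
    cmd = build_tokenizer_cmd(script, **options)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        subprocess.run(cmd, stdin=fin, stdout=fout, check=True)


if __name__ == "__main__":
    # Only prints the command; call tokenize_file(...) to actually run it.
    print(" ".join(build_tokenizer_cmd("tokenizer.perl", lang="en")))
```

Passing the arguments as a list avoids shell quoting problems, and `check=True` raises an exception if Perl exits with an error instead of silently leaving an empty output file.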

Options, as printed by the script's help text:

if ($HELP)
{
    print "Usage ./tokenizer.perl (-l [en|de|...]) (-threads 4) < textfile > tokenizedfile\n";
    print "Options:\n";
    print "  -q     ... quiet.\n";
    print "  -a     ... aggressive hyphen splitting.\n";
    print "  -b     ... disable Perl buffering.\n";
    print "  -time  ... enable processing time calculation.\n";
    print "  -penn  ... use Penn treebank-like tokenization.\n";
    print "  -protected FILE  ... specify file with patters to be protected in tokenisation.\n";
    print "  -no-escape ... don't perform HTML escaping on apostrophy, quotes, etc.\n";
    exit;
}

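Explanation:

With no options, the tokenizer assumes English, separates punctuation from words, and escapes quotes as entities: an apostrophe becomes &apos; and a double quote becomes &quot;. Hyphenated words are not split.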

$ echo "A Republican 'strategy' to counter the re-election of Obama." | perl ~/script/mosesdecoder/scripts/tokenizer/tokenizer.perl

>>> A Republican &apos; strategy &apos; to counter the re-election of Obama .
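
To make the escaping step concrete, here is a rough Python sketch of the character-to-entity mapping. The table reflects my reading of tokenizer.perl and should be treated as an assumption; check it against the version of the script you actually run. `moses_escape` is a hypothetical name, and the function only escapes text, it does not tokenize.

```python
# Escape table as I understand it from tokenizer.perl (an assumption, not
# an authoritative list). Order matters: "&" has to be handled first.
MOSES_ESCAPES = [
    ("&", "&amp;"),    # first, so entities produced below are not re-escaped
    ("|", "&#124;"),   # Moses factor separator
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def moses_escape(text):
    """Replace each special character with its entity, in table order."""
    for char, entity in MOSES_ESCAPES:
        text = text.replace(char, entity)
    return text


print(moses_escape("A Republican ' strategy ' to counter the re-election of Obama ."))
# A Republican &apos; strategy &apos; to counter the re-election of Obama .
```

The characters escaped here are the ones that carry special meaning in Moses input formats (factor separators, XML markup, syntax non-terminals), which is why the tokenizer escapes them unless told otherwise.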

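-l: sets the language. I don't know the full list of supported languages; I have only used English and German. As far as I can tell, the language code selects a nonbreaking-prefix (abbreviation) file shipped with Moses, and a code without such a file falls back to the English rules.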

-a: splits hyphenated words in addition to separating punctuation, and marks each split hyphen as @-@. For example:

$ echo "A Republican strategy to counter the re-election of Obama." | perl ~/script/mosesdecoder/scripts/tokenizer/tokenizer.perl -a

>>> A Republican strategy to counter the re @-@ election of Obama .
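
The @-@ marker can be reproduced with a single regular expression. The sketch below is a Python approximation of the aggressive splitting rule, not the script's own code: `split_hyphens` is a hypothetical name, and `[^\W_]` stands in for the Perl alphanumeric class that the real tokenizer uses.

```python
import re

# A hyphen with a letter or digit on both sides. [^\W_] approximates
# Perl's alphanumeric class; the lookahead keeps the following character
# available for the next match in chains like "state-of-the-art".
HYPHEN_BETWEEN_ALNUM = re.compile(r"([^\W_])-(?=[^\W_])")


def split_hyphens(text):
    """Rewrite word-internal hyphens as the standalone token @-@."""
    return HYPHEN_BETWEEN_ALNUM.sub(r"\1 @-@ ", text)


print(split_hyphens("to counter the re-election of Obama ."))
# to counter the re @-@ election of Obama .
```

Because the marker is distinct from a free-standing hyphen, the split can be undone after translation by joining the tokens around @-@ again.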

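-no-escape: punctuation is still separated, but quotes and other special characters are left unescaped. Hyphenated words stay intact, as in the default mode:

$ echo "A Republican 'strategy' to counter the re-election of Obama." | perl ~/script/mosesdecoder/scripts/tokenizer/tokenizer.perl -no-escape

>>> A Republican ' strategy ' to counter the re-election of Obama .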
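
If a corpus was already tokenized with escaping turned on, the entities can also be reversed after the fact instead of re-running the tokenizer. This is a minimal Python sketch under the assumption that the entity table below matches what your version of the script emits; `moses_unescape` is a hypothetical helper, not part of Moses.

```python
# Entity table assumed from my reading of the Moses scripts; verify it
# against your own output before relying on it.
MOSES_UNESCAPES = [
    ("&#124;", "|"),
    ("&lt;", "<"),
    ("&gt;", ">"),
    ("&quot;", '"'),
    ("&apos;", "'"),
    ("&#91;", "["),
    ("&#93;", "]"),
    ("&amp;", "&"),    # last, so "&amp;lt;" turns into "&lt;" and not "<"
]


def moses_unescape(text):
    """Turn escaped entities back into the original characters."""
    for entity, char in MOSES_UNESCAPES:
        text = text.replace(entity, char)
    return text


print(moses_unescape("A Republican &apos; strategy &apos; to counter the re-election of Obama ."))
# A Republican ' strategy ' to counter the re-election of Obama .
```

Handling `&amp;` last mirrors the escaping order in reverse, so text that originally contained a literal entity such as `&lt;` survives the round trip.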

That's all for now; I'll add more as new issues come up.

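P.S. To re-tokenize a corpus without escaping, first detokenize it, then run the tokenizer with -no-escape: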

# detokenizer
cat train.en | perl ~/script/mosesdecoder/scripts/tokenizer/detokenizer.perl -threads 40 > train.raw.en

# tokenizer
perl ~/script/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en -no-escape < train.raw.en > train.tok.en