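FastText Tool Usage Example

This article walks through training word vectors with Python and the FastText library on Windows. It first downloads English Wikipedia data from Matt Mahoney's website and cleans it, then trains with FastText's train_unsupervised method while adjusting hyperparameters such as the model type, embedding dimension, and learning rate. Finally, it saves the model and validates it by looking up similar words.

FastText usage example

Installing the Python environment on Windows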

pip install fasttext

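Paper: https://arxiv.org/abs/1607.01759
Code: https://github.com/facebookresearch/fastText
Test case

https://zhuanlan.zhihu.com/p/32965521 explains the principles and core ideas of fastText.

Training word vectors with fastText

Word vector background

Modern machine learning commonly represents the words (or characters) of a text as vectors. Vectors capture the relationships between words more effectively, which improves performance across word-based NLP tasks.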
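
To make "capturing relationships" concrete: a common way to compare two word vectors is cosine similarity, where related words score closer to 1. The sketch below is only an illustration. The three-dimensional vectors are made-up toy values, not the output of any trained model, and `cosine_similarity` is a helper written for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

# "cat" should score higher against "dog" than against "car"
print(cosine_similarity(toy_vectors["cat"], toy_vectors["dog"]))
print(cosine_similarity(toy_vectors["cat"], toy_vectors["car"]))
```

Real word vectors work the same way, only with far more dimensions (100 by default in fastText).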

The fastText word-vector training workflow

Get the data
Train the vectors
Set the model hyperparameters
Validate the model
Save and load the model

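Step 1: Prepare the data

# Here we work with part of the English Wikipedia's web page data, about 300 MB in size
# The corpus is already prepared and can be downloaded from Matt Mahoney's website.
# First create a directory named data to hold the dataset
$ mkdir data
# Use wget to download the zip archive; it is stored in the data directory
$ wget -c http://mattmahoney.net/dc/enwik9.zip -P data
# Extract with unzip; if your server does not have the unzip command yet, run: yum install unzip -y
# After extraction, a file named enwik9 appears in the data directory
$ unzip data/enwik9.zip -d data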
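
The commands above assume a Unix-like shell. On a Windows machine without wget and unzip, roughly the same steps can be done with the Python standard library. This is a minimal sketch, not part of the original workflow: `download` and `extract` are hypothetical helper names, and unlike `wget -c` the download helper does not resume a partial file, it only skips the download when the file already exists.

```python
import urllib.request
import zipfile
from pathlib import Path

def download(url, dest):
    """Download url to dest unless a file already exists at dest."""
    dest = Path(dest)
    # Create the target directory (the equivalent of `mkdir data`)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest

def extract(zip_path, out_dir):
    """Extract every member of zip_path into out_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()
```

Calling `download("http://mattmahoney.net/dc/enwik9.zip", "data/enwik9.zip")` followed by `extract("data/enwik9.zip", "data")` would mirror the shell commands.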

Clean the wiki data, which is in XML/HTML format

perl wikifil.pl data/enwik9 > data/fil9

# Contents of the wikifil.pl script
#!/usr/bin/perl

# Program to filter Wikipedia XML dumps to "clean" text consisting only of lowercase
# letters (a-z, converted from A-Z), and spaces (never consecutive).  
# All other characters are converted to spaces.  Only text which normally appears 
# in the web browser is displayed.  Tables are removed.  Image captions are 
# preserved.  Links are converted to normal text.  Digits are spelled out.

# Written by Matt Mahoney, June 10, 2006.  This program is released to the public domain.

$/=">";                     # input record separator
while (<>) {
  if (/<text /) {$text=1;}  # remove all but between <text> ... </text>
  if (/#redirect/i) {$text=0;}  # remove #REDIRECT
  if ($text) {

    # Remove any text not normally visible
    if (/<\/text>/) {$text=0;}
    s/<.*>//;               # remove xml tags
    s/&amp;/&/g;            # decode URL encoded chars
    s/&lt;/</g;
    s/&gt;/>/g;
    s/<ref[^<]*<\/ref>//g;  # remove references <ref...> ... </ref>
    s/<[^>]*>//g;           # remove xhtml tags
    s/\[http:[^] ]*/[/g;    # remove normal url, preserve visible text
    s/\|thumb//ig;          # remove images links, preserve caption
    s/\|left//ig;
    s/\|right//ig;
    s/\|\d+px//ig;
    s/\[\[image:[^\[\]]*\|//ig;
    s/\[\[category:([^|\]]*)[^]]*\]\]/[[$1]]/ig;  # show categories without markup
    s/\[\[[a-z\-]*:[^\]]*\]\]//g;  # remove links to other languages
    s/\[\[[^\|\]]*\|/[[/g;  # remove wiki url, preserve visible text
    s/\{\{[^\}]*\}\}//g;         # remove {{icons}} and {tables}
    s/\{[^\}]*\}//g;
    s/\[//g;                # remove [ and ]
    s/\]//g;
    s/&[^;]*;/ /g;          # remove URL encoded chars

    # convert to lowercase letters and spaces, spell digits
    $_=" $_ ";
    tr/A-Z/a-z/;
    s/0/ zero /g;
    s/1/ one /g;
    s/2/ two /g;
    s/3/ three /g;
    s/4/ four /g;
    s/5/ five /g;
    s/6/ six /g;
    s/7/ seven /g;
    s/8/ eight /g;
    s/9/ nine /g;
    tr/a-z/ /cs;
    chop;
    print $_;
  }
}
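
For readers who do not use Perl, the last stage of the script (lowercasing, spelling out digits, and collapsing everything else into single spaces) can be sketched in Python. This covers only that final normalization step, not the wiki markup removal earlier in the script, and `normalize` and `DIGIT_WORDS` are names chosen for this illustration.

```python
import re
import string

DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}
# Lowercase only ASCII A-Z, mirroring the Perl tr/A-Z/a-z/
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

def normalize(text):
    """Apply the lowercase / spell-digits / squeeze-spaces stage to one chunk of text."""
    text = " " + text + " "
    text = text.translate(ASCII_LOWER)
    # Spell out each ASCII digit, padded with spaces
    text = re.sub(r"[0-9]", lambda m: " " + DIGIT_WORDS[m.group()] + " ", text)
    # Replace every run of characters outside a-z with a single space
    text = re.sub(r"[^a-z]+", " ", text)
    # Drop the final character, like Perl's chop
    return text[:-1]

print(repr(normalize("Hello, World 42!")))  # ' hello world four two'
```

The leading space in the result is intentional: the Perl script pads each chunk with spaces and removes only the trailing one.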

Inspect the cleaned data: it consists of words separated by spaces.

# View the first 80 characters
head -c 80 data/fil9

# The output is a sequence of space-separated words
 anarchism originated as a term of abuse first used against early working class
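
The `head` command is not available in a plain Windows command prompt. A rough Python equivalent is sketched below, where `head_bytes` is a hypothetical helper name; it reads raw bytes, which for this all-ASCII file is the same as reading characters.

```python
def head_bytes(path, count=80):
    """Return the first `count` bytes of a file, decoded as ASCII."""
    with open(path, "rb") as handle:
        return handle.read(count).decode("ascii", errors="replace")
```

For example, `print(head_bytes("data/fil9"))` would show the beginning of the cleaned corpus.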

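Step 2: Train the word vectors

# Run the following in the Python interpreter
# Import fasttext
>>> import fasttext
# Train word vectors with fasttext's train_unsupervised (unsupervised training) method
# Its argument is the path to the dataset file, 'data/fil9'
>>> model = fasttext.train_unsupervised('data/fil9')

Inspect a word vector

# Use the get_word_vector method to get the vector representation of an input word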
>>> model.get_word_vector("the")

array([-0.03087516,  0.09221972,  0.17660329,  0.17308897,  0.12863874,
        0.13912526, -0.09851588,  0.00739991,  0.37038437, -0.00845221,
        ...
       -0.21184735, -0.05048715, -0.34571868,  0.23765688,  0.23726143],
      dtype=float32)

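Step 3: Tune the model hyperparameters

# When training word vectors, several commonly used hyperparameters can be set to tune the model, for example:
# Unsupervised training mode (model): 'skipgram' or 'cbow', default 'skipgram'. In practice, skipgram makes better use of subword information than cbow.
# Embedding dimension (dim): default 100; larger corpora often call for a larger dimension.
# Number of passes over the data (epoch): default 5; with a large enough dataset, fewer passes may suffice.
# Learning rate (lr): default 0.05; as a rule of thumb, choose a value in the range [0.01, 1].
# Number of threads (thread): defaults to 12 in the fastText command-line tool; matching the number of CPU cores is generally recommended.

>>> model = fasttext.train_unsupervised('data/fil9', model='cbow', dim=300, epoch=1, lr=0.1, thread=8)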

Read 124M words
Number of words:  218316
Number of labels: 0
Progress: 100.0% words/sec/thread:   49523 lr:  0.000000 avg.loss:  1.777205 ETA:   0h 0m 0s
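
The "subword information" mentioned in the hyperparameter notes refers to character n-grams: fastText wraps each word in `<` and `>` boundary markers and learns vectors for its substrings, with the `minn` and `maxn` options controlling the n-gram lengths. The sketch below only lists those n-grams for ASCII words. It is a simplified illustration, not the library's implementation, which also hashes the n-grams into buckets and keeps a vector for the whole word; `char_ngrams` is a name made up for this example.

```python
def char_ngrams(word, minn=3, maxn=6):
    """List the character n-grams of a word wrapped in < and > markers."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

print(char_ngrams("where", minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Words that share many n-grams, such as "sport" and "sports", therefore share part of their representation.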

Step 4: Save the model

model.save_model("model/wiki_file.bin")
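
As far as I know, save_model does not create missing directories, so the call above fails if the model folder does not exist yet. A small helper can create it first. This is a sketch under that assumption, and `prepare_model_path` is a hypothetical name, not part of fasttext.

```python
from pathlib import Path

def prepare_model_path(path):
    """Create the parent directory of path if needed and return path as a string."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return str(target)
```

It could then be used as `model.save_model(prepare_model_path("model/wiki_file.bin"))`.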

Step 5: Validate the model

import fasttext

def test_wiki_model():
    # Load the saved model
    model = fasttext.load_model("model/wiki_file.bin")
    # Find the nearest neighbors of the word "sports"
    res = model.get_nearest_neighbors("sports")
    for val in res:
        print("Nearest neighbor: ", val)

test_wiki_model()
======================================================
Nearest neighbor:  (0.9347376823425293, 'sportscars')
Nearest neighbor:  (0.9008225798606873, 'sportsmen')
Nearest neighbor:  (0.8989560008049011, 'sportivo')
Nearest neighbor:  (0.8979462385177612, 'sportsplex')
Nearest neighbor:  (0.8970243334770203, 'sport')
Nearest neighbor:  (0.894062340259552, 'sportsnet')
Nearest neighbor:  (0.8835816979408264, 'sportscasters')
Nearest neighbor:  (0.8737423419952393, 'sportswear')
Nearest neighbor:  (0.8702874779701233, 'sportiva')
Nearest neighbor:  (0.8670809864997864, 'sportscast')
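
Each result is a (similarity, word) pair. As I understand it, get_nearest_neighbors ranks vocabulary words by the cosine similarity of their vectors to the query vector and leaves out the query word itself. The sketch below reproduces that idea on a tiny hand-written vocabulary. The two-dimensional vectors are invented for illustration, and `nearest_neighbors` is a name chosen here, not fastText's implementation.

```python
import math

def nearest_neighbors(query, vectors, k=10):
    """Return up to k (similarity, word) pairs, most similar first."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    scored = [
        (cosine(vectors[query], vec), word)
        for word, vec in vectors.items()
        if word != query  # the query word itself is excluded
    ]
    scored.sort(reverse=True)
    return scored[:k]

# Toy two-dimensional vectors, invented for illustration only
toy = {
    "sports": [1.0, 0.1],
    "sport": [0.9, 0.2],
    "music": [0.1, 1.0],
}
for pair in nearest_neighbors("sports", toy, k=2):
    print("Nearest neighbor: ", pair)
```

With these toy values "sport" ranks ahead of "music", because its vector points in nearly the same direction as the vector for "sports".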