InferSent: Learning Universal Sentence Representations

孔卿菡Warrior

Published 2025-03-26 13:43:49

Licensed under CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/gitblog_00160/article/details/146527482

InferSent (sentence embeddings) project repository: https://gitcode.com/gh_mirrors/in/InferSent

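1. Project Overview

InferSent is a method for learning semantic representations of English sentences. It is trained on natural language inference data and generalizes well to many different tasks. The project provides pre-trained English sentence encoders, along with the SentEval toolkit for evaluating sentence representations.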

2. Quick Start

The steps below get InferSent up and running.

First, make sure the following dependencies are installed in your environment:

Python 2/3
PyTorch (recent version)
NLTK >= 3
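
As a quick sanity check, the dependency list above can be verified from Python before downloading anything. The sketch below is only an illustration: `missing_dependencies` is a hypothetical helper, not part of InferSent, and it assumes the packages are importable under the names `torch` and `nltk`. It relies only on the standard-library `importlib.util.find_spec`.

```python
import importlib.util


def missing_dependencies(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for PyTorch and NLTK.
    missing = missing_dependencies(["torch", "nltk"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies found.")
```

This only confirms that the packages can be located; it does not check their versions.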

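Next, download the pre-trained models and word vectors:

# Create a directory for the models
mkdir encoder

# Download the pre-trained models (version 1 was trained with GloVe, version 2 with fastText)
curl -Lo encoder/infersent1.pkl https://dl.fbaipublicfiles.com/infersent/infersent1.pkl
curl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl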

# Create the word-vector directory and download GloVe (used with infersent1)
mkdir GloVe
curl -Lo GloVe/glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
unzip GloVe/glove.840B.300d.zip -d GloVe/

# Or download the fastText vectors (used with infersent2)
mkdir fastText
curl -Lo fastText/crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
unzip fastText/crawl-300d-2M.vec.zip -d fastText/
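
The two encoders pair with different word vectors: `infersent1.pkl` was trained with GloVe and `infersent2.pkl` with fastText, so the model version and the vector file should be chosen together. The sketch below records that pairing using the paths from the download commands above and reports which files are still missing. `PATHS_BY_VERSION` and `missing_files` are hypothetical names introduced for illustration, not part of the InferSent code, and the directory layout is assumed to match the commands above.

```python
import os

# Assumed layout, matching the download commands above:
# version -> (encoder checkpoint, unzipped word-vector file)
PATHS_BY_VERSION = {
    1: ("encoder/infersent1.pkl", "GloVe/glove.840B.300d.txt"),
    2: ("encoder/infersent2.pkl", "fastText/crawl-300d-2M.vec"),
}


def missing_files(version, root="."):
    """Return the expected files for `version` that do not exist under `root`."""
    return [path for path in PATHS_BY_VERSION[version]
            if not os.path.isfile(os.path.join(root, path))]


if __name__ == "__main__":
    for version in sorted(PATHS_BY_VERSION):
        print(version, missing_files(version) or "ready")
```

Running it from the directory that contains `encoder/`, `GloVe/`, and `fastText/` shows which version is ready to load.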

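Then run the following Python code to load the model and prepare it for sentence encoding:

# Make sure the NLTK tokenizer data has been downloaded
import nltk
nltk.download('punkt')

# Load the pre-trained model (models.py lives in the InferSent repository)
import torch
from models import InferSent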
V = 2
MODEL_PATH = 'encoder/infersent%s.pkl' % V
params_model = {
    'bsize': 64,
    'word_emb_dim': 300,
    'enc_lstm_dim': 2048,
    'pool_type': 'max',
    'dpout_model': 0.0,
    'version': V
}
infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))

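# Set the word-vector path (fastText vectors go with version 2)
W2V_PATH = 'fastText/crawl-300d-2M.vec'
infersent.set_w2v_path(W2V_PATH)

# Build the vocabulary from the sentences to be encoded
sentences = ["This is an example sentence.", "Each sentence is converted."]
infersent.build_vocab(sentences, tokenize=True)

# Encode the sentences (returns a NumPy array with one row per sentence)
embeddings = infersent.encode(sentences, tokenize=True)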
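
A common next step is to compare sentence embeddings with cosine similarity. The sketch below is a minimal, dependency-free illustration: `cosine_similarity` is a hypothetical helper, not an InferSent function, and the short vectors are made-up stand-ins for rows of `embeddings`, which are much longer in practice.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Made-up stand-ins for two rows of `embeddings`.
    first = [0.2, 0.1, 0.7]
    second = [0.1, 0.3, 0.6]
    print(round(cosine_similarity(first, second), 3))
```

With the real output, the equivalent call would be `cosine_similarity(embeddings[0], embeddings[1])`; values closer to 1 indicate more similar sentence vectors.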