Open-Source Project Tutorial: Instructor Embedding


instructor-embedding: [ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings. Project repository: https://gitcode.com/gh_mirrors/in/instructor-embedding

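1. Project Overview

Instructor Embedding is a deep-learning text embedding model. Given a task instruction alongside the input text, it produces embeddings tailored to a wide range of tasks (such as classification, retrieval, clustering, and text evaluation) and domains (such as science and finance), without any additional fine-tuning. According to the project, the model achieves state-of-the-art (SOTA) performance across a broad set of embedding tasks.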

2. Quick Start

First, make sure Python is installed in your environment (version 3.7 is recommended). The steps below set the project up on a local machine:

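# Create the virtual environment and activate it, so that the
# packages below are installed into it
conda create -n instructor python=3.7
conda activate instructor

# Clone the project repository
git clone https://github.com/xlang-ai/instructor-embedding.git

# Install dependencies
cd instructor-embedding
pip install -r requirements.txt

# Install the InstructorEmbedding package in editable mode
pip install -e .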
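A quick way to confirm that the installation landed in the active environment is to check whether the modules can be found. The following is a minimal sketch using only the standard library. The helper name `missing_modules` is hypothetical, and the list of module names is an assumption: `InstructorEmbedding` is the import name used in this tutorial, and `torch` is assumed to be installed as a dependency.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "InstructorEmbedding" is the import name used in this tutorial;
    # "torch" is assumed here to be one of its dependencies.
    missing = missing_modules(["InstructorEmbedding", "torch"])
    if missing:
        print("Not installed in this environment:", ", ".join(missing))
    else:
        print("All modules found.")
```

If a module is reported as missing, the most likely cause is that `pip install` ran while a different environment was active.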

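Once the pretrained model is available (loading it by name downloads it from the Hugging Face Hub on first use), you can compute text embeddings with the following code: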

from InstructorEmbedding import INSTRUCTOR

# Load the pretrained model
model = INSTRUCTOR('hkunlp/instructor-large')

# Prepare the texts together with their instructions
text_instruction_pairs = [
    {
        "instruction": "Represent the Science title:",
        "text": "3D ActionSLAM: wearable person tracking in multi-floor environments"
    },
    {
        "instruction": "Represent the Medicine sentence for retrieving a duplicate sentence:",
        "text": "Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear."
    }
]

# Compute the embeddings; encode() expects a list of [instruction, text] pairs
customized_embeddings = model.encode(
    [[pair["instruction"], pair["text"]] for pair in text_instruction_pairs]
)

# Print the results
for pair, embedding in zip(text_instruction_pairs, customized_embeddings):
    print("Instruction:", pair["instruction"])
    print("Text:", pair["text"])
    print("Embedding:", embedding)
    print()

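3. Use Cases and Best Practices

The following are some use cases and best practices for Instructor Embedding:

Computing embeddings for custom text

To compute a customized embedding for a specific sentence, write the instruction according to the following template: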

Represent the <domain> <text_type> for <task_objective>:

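Here <domain> is the domain of the text (optional), <text_type> is the unit of text being encoded, such as a sentence, title, or document (required), and <task_objective> is the goal of the embedding (optional).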
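The template can be assembled programmatically so that optional parts are skipped cleanly. The helper below is a minimal sketch; `build_instruction` is a hypothetical name introduced for illustration and is not part of the InstructorEmbedding package.

```python
def build_instruction(text_type, domain=None, task_objective=None):
    """Assemble 'Represent the <domain> <text_type> for <task_objective>:'.

    Only `text_type` is required; the optional parts are left out when not given.
    """
    if not text_type or not text_type.strip():
        raise ValueError("text_type is required")
    words = ["Represent the"]
    if domain:
        words.append(domain.strip())
    words.append(text_type.strip())
    if task_objective:
        words.append("for " + task_objective.strip())
    return " ".join(words) + ":"


print(build_instruction("title", domain="Science"))
print(build_instruction("sentence", domain="Medicine",
                        task_objective="retrieving a duplicate sentence"))
print(build_instruction("sentence"))
```

The first two calls reproduce the instruction strings used in the quick-start example; the third shows the shortest valid instruction, with only the text type supplied.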

Computing similarity between texts

You can use Instructor Embedding to compute the similarity between two groups of sentences:

from InstructorEmbedding import INSTRUCTOR
from sklearn.metrics.pairwise import cosine_similarity

# Load the pretrained model (skip this if `model` is already loaded)
model = INSTRUCTOR('hkunlp/instructor-large')

# Each item is an [instruction, text] pair
sentences_a = [
    ['Represent the Science sentence:', 'Parton energy loss in QCD matter'],
    ['Represent the Financial statement:', 'The Federal Reserve on Wednesday raised its benchmark interest rate.']
]

sentences_b = [
    ['Represent the Science sentence:', 'The Chiral Phase Transition in Dissipative Dynamics'],
    ['Represent the Financial statement:', 'The central bank has decided to keep the interest rates unchanged.']
]

# Compute the embeddings; the pairs can be passed to encode() directly
embeddings_a = model.encode(sentences_a)
embeddings_b = model.encode(sentences_b)

# Compute the similarity matrix: entry [i][j] compares sentences_a[i] with sentences_b[j]
similarities = cosine_similarity(embeddings_a, embeddings_b)

print(similarities)
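To make the similarity matrix easier to interpret, here is a minimal pure-Python sketch of the cosine similarity definition, applied to toy 3-dimensional vectors that stand in for real embeddings. The function names `cosine` and `similarity_matrix` are hypothetical; this illustrates the formula and is not how scikit-learn implements it.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_matrix(rows_a, rows_b):
    """Entry [i][j] compares rows_a[i] with rows_b[j]."""
    return [[cosine(u, v) for v in rows_b] for u in rows_a]


# Toy vectors standing in for real embeddings.
toy_a = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
toy_b = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
for row in similarity_matrix(toy_a, toy_b):
    print([round(x, 3) for x in row])
```

Vectors pointing in the same direction score 1 regardless of their length, and orthogonal vectors score 0, which is why cosine similarity is a common choice for comparing embeddings.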

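4. Typical Ecosystem Projects

Instructor Embedding has been applied in a number of areas, including but not limited to natural language processing, recommender systems, and knowledge graphs. Typical application areas include:

Natural language processing: tasks such as sentiment analysis, text classification, and machine translation.
Recommender systems: using text embeddings to recommend content that is more relevant to the user.
Knowledge graphs: converting the entities and relations of a knowledge graph into vector representations to make querying and analysis easier.

The flexibility of Instructor Embedding and the quality of its embeddings make it a good fit for many scenarios and give developers considerable room to build on it.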
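The recommendation and retrieval scenarios above usually come down to ranking candidate items by their similarity to a query vector. The following is a minimal sketch of that ranking step with toy vectors. The name `top_k` and the sample data are hypothetical, and in practice the vectors would be produced by `model.encode(...)` as shown earlier.

```python
import math


def top_k(query_vec, candidate_vecs, k=3):
    """Return (index, score) pairs for the k candidates most similar to the query."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        # Treat a zero vector as having no similarity to anything.
        return dot / norms if norms else 0.0

    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(candidate_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for item embeddings and a user-interest embedding.
catalog = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
user_interest = [1.0, 0.2, 0.0]
for index, score in top_k(user_interest, catalog, k=2):
    print(index, round(score, 3))
```

For large catalogs an approximate nearest-neighbor index would normally replace this exhaustive scan; the brute-force version is shown only because it makes the ranking logic explicit.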


Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.