Student handbook: https://aicarrier.feishu.cn/wiki/QtJnweAW1iFl8LkoMKGcsUS9nld
Course video: https://www.bilibili.com/video/BV13U1VYmEUr/
Course docs: https://github.com/InternLM/Tutorial/tree/camp4/docs/L0/Python
Assignment: https://github.com/InternLM/Tutorial/blob/camp4/docs/L0/Python/task.md
Dev machine platform: https://studio.intern-ai.org.cn/
Dev machine platform introduction: https://aicarrier.feishu.cn/wiki/GQ1Qwxb3UiQuewk8BVLcuyiEnHe
InternLM official site: https://internlm.intern-ai.org.cn/
GitHub: https://github.com/internLM/
InternThinker: https://internlm-chat.intern-ai.org.cn/internthinker
Feishu docs quick start: https://www.feishu.cn/hc/zh-CN/articles/945900971706-%E5%BF%AB%E9%80%9F%E4%B8%8A%E6%89%8B%E6%96%87%E6%A1%A3
Assignment submission: https://aicarrier.feishu.cn/share/base/form/shrcnUqshYPt7MdtYRTRpkiOFJd
Assignment grading results: https://aicarrier.feishu.cn/share/base/query/shrcnkNtOS9gPPnC9skiBLlao2c
InternLM-Chat agent: https://github.com/InternLM/InternLM/blob/main/agent/README_zh-CN.md
Lagent: https://lagent.readthedocs.io/zh-cn/latest/tutorials/action.html#id2
Web search API: https://serper.dev/, https://serper.dev/login
HuixiangDou: https://github.com/InternLM/HuixiangDou/
HuixiangDou
HuixiangDou features:
Three-stage pipeline (preprocess, rejection, response), which improves response accuracy and safety
Integrates with WeChat and Feishu group chats, well suited to knowledge Q&A scenarios in China
HuixiangDou is an LLM-based RAG application framework, covering multi-source knowledge retrieval, hybrid LLM backends, a multi-score rejection workflow, and full-chain security checks
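The three-stage pipeline above can be sketched as follows. This is a simplified illustration under stated assumptions, not HuixiangDou's actual code: the scoring function, threshold value, and all names here are hypothetical.

```python
# Toy three-stage pipeline: preprocess -> rejection -> response.
# Everything here is illustrative; HuixiangDou's real rejection uses
# learned embeddings and thresholds tuned on positive/negative examples.

def preprocess(query):
    """Stage 1: normalize the incoming group-chat message."""
    return query.strip().lower()

def relevance_score(query, knowledge_base):
    """Toy relevance score: fraction of query words found in the knowledge base."""
    words = query.split()
    if not words:
        return 0.0
    hits = sum(any(w in doc for doc in knowledge_base) for w in words)
    return hits / len(words)

def answer(query, knowledge_base, threshold=0.5):
    query = preprocess(query)                      # stage 1: preprocess
    if relevance_score(query, knowledge_base) < threshold:
        return "REJECT"                            # stage 2: rejection (refuse off-topic chatter)
    return f"Answer grounded in retrieved docs for: {query}"  # stage 3: response

kb = ["how to install mmpose", "huixiangdou deployment guide"]
print(answer("How to install mmpose ?", kb))   # on-topic -> answered
print(answer("what's for lunch today", kb))    # off-topic -> REJECT
```

The rejection stage is what makes the assistant usable in a noisy group chat: most messages are chatter, and refusing them is as important as answering the rest.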
Web version: https://openxlab.org.cn/apps/detail/tpoisonooo/huixiangdou-web
Self-hosted deployment: https://github.com/InternLM/HuixiangDou
Web version HuixiangDou features
Add/remove documents: supports pdf, word, markdown, excel, ppt, html and txt files
Edit positive/negative examples
Connect WeChat and Feishu groups: pip install -r requirements-lark-group.txt, tutorial: https://github.com/InternLM/HuixiangDou/blob/main/docs/add_lark_group_zh.md
Enable web search
Chat testing
Set up your own web version of HuixiangDou
Tutorial: https://github.com/InternLM/HuixiangDou/blob/main/web/README.md
Image: Cuda11.7-conda; resource type: 30% A100
# 1. Create the environment, download HuixiangDou, install dependencies, download model files
studio-conda -o internlm-base -t huixiangdou
conda activate huixiangdou
cd /root
# clone the repository
git clone https://github.com/internlm/huixiangdou && cd huixiangdou
git checkout 79fa810
# parsing `word` format requirements
apt update
apt install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig libpulse-dev
# python requirements
pip install BCEmbedding==0.1.5 cmake==3.30.2 lit==18.1.8 sentencepiece==0.2.0 protobuf==5.27.3 accelerate==0.33.0
pip install -r requirements.txt
# for python3.8, install faiss-gpu instead of faiss
# create the model directory
cd /root && mkdir models
# link the BCE models
ln -s /root/share/new_models/maidalun1020/bce-embedding-base_v1 /root/models/bce-embedding-base_v1
ln -s /root/share/new_models/maidalun1020/bce-reranker-base_v1 /root/models/bce-reranker-base_v1
# link the LLM weights (pick **one** of the models below, depending on your assignment progress and task)
ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-7b /root/models/internlm2-chat-7b
# edit the config file so HuixiangDou uses the local models
sed -i '9s#.*#embedding_model_path = "/root/models/bce-embedding-base_v1"#' /root/huixiangdou/config.ini
sed -i '15s#.*#reranker_model_path = "/root/models/bce-reranker-base_v1"#' /root/huixiangdou/config.ini
sed -i '43s#.*#local_llm_path = "/root/models/internlm2-chat-7b"#' /root/huixiangdou/config.ini
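The sed commands above overwrite an entire line by its number: `9s#.*#NEW#` means "on line 9, replace everything with NEW". The same edit can be sketched in Python; this version operates on a throwaway file rather than the real config.ini, and the helper name is ours, not part of HuixiangDou.

```python
# Python equivalent of `sed -i 'Ns#.*#NEW#' file`: replace line N in place.
import os
import tempfile
from pathlib import Path

def replace_line(path, lineno, new_text):
    lines = Path(path).read_text().splitlines()
    lines[lineno - 1] = new_text          # sed line numbers are 1-based
    Path(path).write_text("\n".join(lines) + "\n")

# Try it on a throwaway ini fragment
fd, tmp = tempfile.mkstemp(suffix=".ini")
os.close(fd)
Path(tmp).write_text('embedding_model_path = ""\nreranker_model_path = ""\n')
replace_line(tmp, 1, 'embedding_model_path = "/root/models/bce-embedding-base_v1"')
print(Path(tmp).read_text())
os.remove(tmp)
```

Note the sed edits are keyed to exact line numbers (9, 15, 43), so they only work against the pinned commit `79fa810`; after a `git pull` the lines may shift and the paths should be edited by hand instead.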
# 2. Build the knowledge base
conda activate huixiangdou
cd /root/huixiangdou && mkdir repodir
git clone https://github.com/internlm/huixiangdou --depth=1 repodir/huixiangdou
git clone https://github.com/open-mmlab/mmpose --depth=1 repodir/mmpose
# Save the features of repodir to workdir, and update the positive and negative example thresholds into `config.ini`
mkdir workdir
python3 -m huixiangdou.service.feature_store
# edit positive/negative examples: positive examples live in /root/huixiangdou/resource/good_questions.json, negative examples in /root/huixiangdou/resource/bad_questions.json
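Positive/negative examples can also be edited programmatically. This sketch assumes the files hold a flat JSON array of question strings; inspect /root/huixiangdou/resource/good_questions.json to confirm the format before pointing it at the real file. The helper below is ours and runs against a demo file.

```python
# Append a question to a good_questions.json-style file (assumed: flat JSON
# array of strings -- verify against the real resource file before use).
import json
from pathlib import Path

def add_question(path, question):
    p = Path(path)
    questions = json.loads(p.read_text(encoding="utf-8")) if p.exists() else []
    if question not in questions:      # avoid duplicate examples
        questions.append(question)
    p.write_text(json.dumps(questions, ensure_ascii=False, indent=2),
                 encoding="utf-8")

add_question("good_questions_demo.json", "How do I install mmpose?")
print(Path("good_questions_demo.json").read_text(encoding="utf-8"))
```

After changing the examples, rerun `python3 -m huixiangdou.service.feature_store` so the rejection thresholds are recomputed.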
# 3.1 Run the knowledge assistant test from the command line
conda activate huixiangdou
cd /root/huixiangdou
python3 -m huixiangdou.main --standalone
# 3.2 Test with the Gradio UI
# port forwarding: ssh -CNg -L 7860:127.0.0.1:7860 root@ssh.intern-ai.org.cn -p <your ssh port>
conda activate huixiangdou
cd /root/huixiangdou
python3 -m huixiangdou.gradio
HuixiangDou advanced usage
Enable web search
# web search uses an API from Serper: https://serper.dev/, https://serper.dev/login. Copy your API key and replace ${YOUR-API-KEY} in /huixiangdou/config.ini with it
[web_search]
# check https://serper.dev/api-key to get a free API key
x_api_key = "${YOUR-API-KEY}"
domain_partial_order = ["openai.com", "pytorch.org", "readthedocs.io", "nvidia.com", "stackoverflow.com", "juejin.cn", "zhuanlan.zhihu.com", "www.cnblogs.com"]
save_dir = "logs/web_search_result"
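To sanity-check a key outside HuixiangDou, the Serper search endpoint can be called directly. To my knowledge this is a POST to https://google.serper.dev/search with the key in an `X-API-KEY` header; verify the details against the serper.dev docs, as the endpoint and field names here are based on that assumption.

```python
# Minimal Serper web-search call (endpoint and headers per serper.dev docs;
# verify there -- this sketch is not part of HuixiangDou).
import json
import os
import urllib.request

def serper_search(query, api_key):
    req = urllib.request.Request(
        "https://google.serper.dev/search",
        data=json.dumps({"q": query}).encode(),
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

api_key = os.environ.get("SERPER_API_KEY", "")
if api_key:  # only hit the network when a key is configured
    results = serper_search("huixiangdou RAG", api_key)
    print(results.get("organic", [])[:1])
```

If this returns results, the same key should work once pasted into `x_api_key` in config.ini.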
Remote models
HuixiangDou calls a model in 3 places: the embedding model, the reranker model, and the large language model (LLM)
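Where those three models fit in a single query can be shown with a toy sketch: the embedding model does coarse retrieval, the reranker re-orders the candidates, and the LLM writes the final answer. All three stand-ins below are dummies of our own, not HuixiangDou's code or the BCE/InternLM models.

```python
# Toy RAG flow: embedding -> coarse retrieval, reranker -> fine ordering,
# LLM -> final grounded answer. All three functions are stand-ins.
import math

def embed(text):
    """Stand-in for the embedding model: a crude character-frequency vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

def rerank(query, docs):
    """Stand-in for the reranker: re-order by shared word count."""
    qwords = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(qwords & set(d.lower().split())))

def llm(prompt):
    """Stand-in for the chat LLM."""
    return f"[answer grounded in]: {prompt}"

docs = ["install mmpose with pip", "huixiangdou config guide", "feishu group setup"]
query = "how to install mmpose"
qv = embed(query)
candidates = sorted(docs, key=lambda d: -cosine(qv, embed(d)))[:2]  # embedding retrieval
best = rerank(query, candidates)[0]                                 # rerank
print(llm(f"{query}\ncontext: {best}"))                             # LLM response
```

Because the three roles are independent, each can be swapped for a remote API separately, which is exactly what the config switches below control.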
Remote embedding & reranker models
https://siliconflow.cn/zh-cn/, https://account.siliconflow.cn/zh/login?redirect=https%3A%2F%2Fcloud.siliconflow.cn%2Faccount%2Fak%3F
Fill the API key into api_token in /huixiangdou/config.ini, and update the embedding and reranker model endpoints (embedding_model_path, reranker_model_path) as shown in the figure
Remote LLM
enable_local = 0 # disable the local model
enable_remote = 1 # enable the remote model
Multimodal features
# 1. Download/update HuixiangDou
conda activate huixiangdou
cd huixiangdou
git stash # discard earlier changes; to keep them, save the conflicting files under new names first
git checkout main
git pull
git checkout bec2f6af9 # lowest version that supports multimodal
# 2. Install the multimodal models and dependencies
# set environment variables
export HF_ENDPOINT='https://hf-mirror.com' # use the Hugging Face China mirror to speed up downloads; skip this if you are outside China
# download the models
## the model files are large; if a download fails, just rerun the command
huggingface-cli download BAAI/bge-m3 --local-dir /root/models/bge-m3
huggingface-cli download BAAI/bge-visualized --local-dir /root/models/bge-visualized
huggingface-cli download BAAI/bge-reranker-v2-minicpm-layerwise --local-dir /root/models/bge-reranker-v2-minicpm-layerwise
# the visual model has to be moved into the bge-m3 folder manually
mv /root/models/bge-visualized/Visualized_m3.pth /root/models/bge-m3/
# 3. Install the latest FlagEmbedding
conda activate huixiangdou
cd /root/
# install the latest version from the official GitHub
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install .
# copy the files missing from the FlagEmbedding install. Note: huixiangdou/lib/python3.10/site-packages belongs to the environment created at the start of this tutorial; if yours differs, substitute your own path
cp -r ~/FlagEmbedding/FlagEmbedding/visual/eva_clip/model_configs /root/.conda/envs/huixiangdou/lib/python3.10/site-packages/FlagEmbedding/visual/eva_clip/
cp ~/FlagEmbedding/FlagEmbedding/visual/eva_clip/bpe_simple_vocab_16e6.txt.gz /root/.conda/envs/huixiangdou/lib/python3.10/site-packages/FlagEmbedding/visual/eva_clip/
# other dependencies
pip install timm ftfy peft
# 4. Edit the config file
sed -i '6s#.*#embedding_model_path = "/root/models/bge-m3"#' /root/huixiangdou/config-multimodal.ini
sed -i '7s#.*#reranker_model_path = "/root/models/bge-reranker-v2-minicpm-layerwise"#' /root/huixiangdou/config-multimodal.ini
sed -i '31s#.*#local_llm_path = "/root/models/internlm2-chat-7b"#' /root/huixiangdou/config-multimodal.ini
sed -i '20s#.*#enable_local = 1#' /root/huixiangdou/config-multimodal.ini
sed -i '21s#.*#enable_remote = 0#' /root/huixiangdou/config-multimodal.ini
# change the location of the multimodal vector knowledge base
sed -i '8s#.*#work_dir = "workdir-multi"#' /root/huixiangdou/config-multimodal.ini
sed -i '61s#.*#enable_cr = 0#' /root/huixiangdou/config-multimodal.ini # disable coreference resolution
# 5. Build the multimodal knowledge base
# new vector knowledge base folder
mkdir workdir-multi
# extract the multimodal vector knowledge base
python3 -m huixiangdou.service.feature_store --config_path config-multimodal.ini
# 6. Launch the Gradio UI and try the multimodal features
conda activate huixiangdou
cd /root/huixiangdou
python3 -m huixiangdou.gradio --config_path config-multimodal.ini