Building a Local AI Assistant

Original post · last modified 2025-10-29 16:55:31

License: CC 4.0 BY-SA

First published 2025-10-21 15:45:16

ollama

Installation

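Downloads are slow from mainland China, so use a mirror to speed them up:

# 1. Use the gh.llkk.cc mirror
export OLLAMA_MIRROR="https://gh.llkk.cc/https://github.com/ollama/ollama/releases/latest/download"
# 2. Run the official script, with sed rewriting its download URL to the mirror
curl -fsSL https://ollama.com/install.sh | sed "s|https://ollama.com/download|$OLLAMA_MIRROR|g" | sh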

If the download fails, retrying once or twice usually gets it through.

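Downloading a model

Set the model mirror site (this variable is read by Hugging Face tooling such as sentence-transformers below; ollama pull itself fetches from the Ollama registry):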

export HF_ENDPOINT=https://hf-mirror.com

Download the model:

ollama pull qwen2.5:1.5b

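The model is:
Qwen 2.5 (Tongyi Qianwen) series · 1.5 billion parameters · 4-bit K-quants, medium precision · a lightweight LLM with good Chinese support

Run the model: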

ollama run qwen2.5:1.5b

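At the interactive prompt, type:

Explain the accept function of the third-party Python package javalang

The model then prints its response.

Type /bye to exit the interactive prompt.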
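
Besides the interactive prompt, a running Ollama instance also serves a local HTTP API. The sketch below assumes the default address http://localhost:11434 and the non-streaming form of the /api/generate endpoint; the helper names build_generate_request and ask are made up for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local address


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str) -> str:
    """Send the request and return the generated text (the server must be running)."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]


# Usage, once the model has been pulled and the server is up:
# print(ask("qwen2.5:1.5b", "Explain the accept function of the Python package javalang"))
```

Building the request separately from sending it makes the payload easy to inspect without a server.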

Download errors

If you see this error:

Error: pull model manifest: file does not exist

check https://ollama.com/library to confirm that the model name exists.

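sentence-transformers

BERT-based models that turn sentences into vectors.

Installation

A full install pulls in many unnecessary packages and costs a lot of time and disk space. A slimmer install looks like this: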

# Temporarily use the Tsinghua PyPI mirror plus the official PyTorch CPU wheel index
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple \
            -U sentence-transformers torch --extra-index-url https://download.pytorch.org/whl/cpu

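This installs only the CPU build of torch, skipping the GPU build and the pile of NVIDIA packages that come with it.

Usage

The model is downloaded automatically on first use (this sometimes fails; retrying a few times fixes it):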

import os

os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'  # mirror reachable from mainland China
os.environ['HF_HUB_DISABLE_SYMLINKS'] = '1'  # avoid symlink errors on Windows

from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # downloaded automatically on first use
emb = model.encode("今天天气真好")  # "The weather is really nice today"
print(emb.shape)  # (384,)

By default the model is stored at:

~/.cache/huggingface/hub/models--sentence-transformers--paraphrase-multilingual-MiniLM-L12-v2/

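Vector databases

Unlike the fuzzy text matching of a traditional database, a vector database returns the top-N records that are semantically closest to the query. Two sentences can share no words at all and still be close in meaning, for example:
"computer won't boot" and "motherboard failure"
A traditional database lookup cannot find the link between them, but a vector database can.
For local testing, chromadb is the usual choice. It is built on sqlite but requires a fairly recent sqlite version, so either upgrade Python to a newer release or install pysqlite3-binary and manually swap out Python's bundled sqlite3 module: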

import sys

sys.modules["sqlite3"] = __import__("pysqlite3")  # replace the old bundled module
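
Going back to the semantic top-N lookup described above, here is a minimal sketch of the idea using cosine similarity. The three-dimensional vectors are hand-made stand-ins for real embeddings, and the helper names cosine and top_n are hypothetical; a real vector database uses indexed search rather than a full sort.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n(query_vec, records, n=2):
    """Return the n record texts whose vectors are closest to query_vec."""
    ranked = sorted(records, key=lambda r: cosine(query_vec, r[1]), reverse=True)
    return [text for text, _ in ranked[:n]]


# Illustrative values only: not produced by any real embedding model.
records = [
    ("nice weather today", [0.0, 0.2, 0.9]),
    ("power supply is dead", [0.6, 0.6, 0.1]),
    ("motherboard failure", [0.9, 0.1, 0.0]),
]
query = [1.0, 0.1, 0.0]  # pretend embedding of "computer won't boot"

print(top_n(query, records))
```

The two hardware-related records rank first even though they share no words with the query, which is the behaviour the text describes.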

RAG(Retrieval-Augmented Generation)

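RAG was first proposed by researchers at Facebook to address LLM hallucination and stale knowledge. The key idea is to attach an external vector knowledge base to the LLM: for a question the LLM cannot answer on its own, the application first queries the knowledge base, then hands the retrieved results to the LLM to digest, summarize, and turn into the final reply.
Here RAG is implemented with langchain. The code: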

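# Assumed import paths (the original snippet omits them); adjust to your LangChain version.
from chromadb.config import Settings
from langchain.chains import RetrievalQA
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma

# EMBEDDING_MODEL, CHROMA_PERSIST_DIR and LOGGER are module-level names defined elsewhere in the script.


def rag_and_qa():
    # --------- 1. Load the vector store ---------
    embeddings = HuggingFaceEmbeddings(
        model_name=EMBEDDING_MODEL)  # downloaded automatically on first use
    LOGGER.info("loading sbert model success")

    # embedding_function fixes the vector space; saving and loading must use the same one!
    vectorstore = Chroma(
        persist_directory=CHROMA_PERSIST_DIR,
        embedding_function=embeddings,
        client_settings=Settings(
            persist_directory=CHROMA_PERSIST_DIR,
            is_persistent=True
        )
    )
    # During RAG, first pick the 3 semantically closest records from the vector store
    retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

    # --------- 2. LLM (local Ollama model) ---------
    llm = Ollama(model="qwen2.5:1.5b")  # make sure `ollama pull qwen2.5:1.5b` has been run
    LOGGER.info("run llm success")

    # --------- 3. Build the RAG chain ---------
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # stuff simply concatenates the context
        retriever=retriever,
        return_source_documents=True)
    LOGGER.info("attach rag success")

    # --------- 4. Run the Q&A loop ---------
    while True:
        q = input("\nQuestion (type exit to quit): ")
        if q.strip().lower() == "exit":
            break
        res = qa_chain.invoke({"query": q})
        print("AI:", res["result"])
        print("\nSources:")
        for doc in res["source_documents"]:
            print("  -", doc.metadata["source"], "page", doc.metadata.get("page", 0))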

How RAG works

User question ──► retriever.invoke(query) ──► k relevant text chunks
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│ prompt = f"""Use the following context to answer       │
│ the question at the end.                               │
│ Context: {all retrieved chunks joined together}        │
│ Question: {query}                                      │
│ Answer:"""                                             │
└────────────────────────────────────────────────────────┘
                            │
                            ▼
                    LLM.invoke(prompt)
                            │
                            ▼
                 Returns the answer string

The diagram above uses chain_type=stuff as the example; other chain types differ slightly.
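
The stuff flow can be sketched in a few lines of plain Python. This is only an illustration of the idea, not LangChain's implementation: build_stuff_prompt, answer, fake_retrieve, and fake_llm are hypothetical names, and the retriever and model are replaced by stand-ins so the sketch runs on its own.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Join all retrieved chunks and place them in a single prompt."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question at the end.\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


def answer(question, retrieve, llm, k=3):
    """One retrieval step followed by exactly one LLM call."""
    chunks = retrieve(question)[:k]
    return llm(build_stuff_prompt(question, chunks))


# Stand-ins so the sketch runs without a vector store or a model.
fake_retrieve = lambda q: ["The board has a blown capacitor.", "The PSU tests fine."]
fake_llm = lambda prompt: "LLM saw: " + prompt.splitlines()[1]

print(answer("Why won't the computer boot?", fake_retrieve, fake_llm))
```

Because every chunk lands in one prompt, the whole context must fit in the model's context window, which is why stuff suits short inputs.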

Source code walkthrough

Retrieval
retriever.get_relevant_documents(query) → List[Document]

The retriever here is backed by a vector database (chromadb, for example); the implementation is in langchain/chains/retrieval_qa/base.py:

def _get_docs(
    self,
    question: str,
    *,
    run_manager: CallbackManagerForChainRun,
) -> list[Document]:
    """Get docs."""
    if self.search_type == "similarity":
        docs = self.vectorstore.similarity_search(
            question,
            k=self.k,
            **self.search_kwargs,
        )
    elif self.search_type == "mmr":
        docs = self.vectorstore.max_marginal_relevance_search(
            question,
            k=self.k,
            **self.search_kwargs,
        )
    else:
        msg = f"search_type of {self.search_type} not allowed."
        raise ValueError(msg)
    return docs
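
The "mmr" branch refers to maximal marginal relevance, which trades relevance to the query against redundancy among the documents already picked. The following is a rough sketch of that selection rule on toy vectors, not the library's actual code; the function names and the example vectors are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k docs, balancing relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected doc (0 if none selected yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]  # docs 0 and 1 are near-duplicates
query = [1.0, 0.1]

print(mmr(query, docs, k=2))                   # diversity-aware selection
print(mmr(query, docs, k=2, lambda_mult=1.0))  # pure relevance ranking
```

With lambda_mult=1.0 the redundancy term vanishes and the result matches a plain similarity search; lower values push near-duplicates out of the result.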

The vectorstore search is implemented as follows (langchain_core/vectorstores/in_memory.py):

@override
def similarity_search_with_score(
    self,
    query: str,
    k: int = 4,
    **kwargs: Any,
) -> list[tuple[Document, float]]:
    embedding = self.embedding.embed_query(query)
    return self.similarity_search_with_score_by_vector(
        embedding,
        k,
        **kwargs,
    )

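The query is first converted to a vector by the embedding model, the vector database is then searched for the top K most similar vectors, and finally the original text behind those K vectors is returned.

Filling the context
StuffDocumentsChain joins the page_content of the k Documents with \n\n into one long context string.
Building the prompt

The default template is determined by the chain_type parameter; for chain_type=stuff, see langchain/chains/question_answering/stuff_prompt.py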
```
Use the following pieces of context to answer ...
Context:
{context}

Question: {question}
Helpful Answer:
```
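There are other chain types such as refine and map_reduce, discussed in detail in the sections below.
LLM generation
The prompt is passed to the LLM, which returns the answer string.
Returning the result
RetrievalQA returns {"result": "the answer", "source_documents": [the retrieved original Documents]}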

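Different chain types

refine (iterative refinement)

How it works
Documents are processed one at a time, in order:

The first chunk produces an initial answer.
Each later chunk is fed to the LLM together with the current answer, asking it to "refine the existing answer".

Flow diagram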

doc1 ──► prompt ──► LLM ──► ans1
doc2 + ans1 ──► refine prompt ──► LLM ──► ans2
doc3 + ans2 ──► refine prompt ──► LLM ──► ans3
...

The prompt used in refine mode is:

DEFAULT_REFINE_PROMPT_TMPL = (
    "The original question is as follows: {question}\n"
    "We have provided an existing answer: {existing_answer}\n"
    "We have the opportunity to refine the existing answer "
    "(only if needed) with some more context below.\n"
    "------------\n"
    "{context_str}\n"
    "------------\n"
    "Given the new context, refine the original answer to better "
    "answer the question. "
    "If the context isn't useful, return the original answer."
)

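Compared with stuff, the refine prompt clearly has an extra {existing_answer} placeholder, which holds the previous answer that is still to be refined.

refine is also much slower than stuff, and the "refined" result is not necessarily better: it may pick up extra details that look precise but are actually wrong.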
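
To make the cost concrete, here is a small sketch of the refine loop: one LLM call per document, each carrying the previous answer forward. The shortened template, refine_answer, and the call-counting fake_llm are hypothetical stand-ins for illustration, not LangChain's code.

```python
REFINE_TMPL = (
    "The original question is as follows: {question}\n"
    "We have provided an existing answer: {existing_answer}\n"
    "------------\n{context_str}\n------------\n"
    "Given the new context, refine the original answer."
)


def refine_answer(question, docs, llm):
    """Call the LLM once per document, carrying the running answer forward."""
    answer = llm(f"Context: {docs[0]}\nQuestion: {question}\nAnswer:")
    for doc in docs[1:]:
        answer = llm(REFINE_TMPL.format(
            question=question, existing_answer=answer, context_str=doc))
    return answer


calls = []


def fake_llm(prompt):
    """Stand-in model: records each prompt and returns a numbered answer."""
    calls.append(prompt)
    return f"ans{len(calls)}"


# Usage: refine_answer("question", ["doc1", "doc2", "doc3"], fake_llm)
# makes three sequential calls, one per document.
```

The calls cannot run in parallel because each prompt depends on the previous answer, which accounts for the extra latency.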

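map_reduce (answer each chunk separately, then combine)

How it works
Two stages:

Map: each chunk independently produces a "candidate answer".
Reduce: all candidate answers are then merged into the final reply.

Flow diagram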

doc1 ──► prompt1 ──► LLM ──► ans1
doc2 ──► prompt2 ──► LLM ──► ans2   } called in parallel or sequentially
...
[ans1,ans2,...] ──► combine prompt ──► LLM ──► final answer

map_reduce mode uses two prompts. The first is the map prompt, similar to the one used in stuff mode:

question_prompt_template = """Use the following portion of a long document to see if any of the text is relevant to answer the question.
Return any relevant text verbatim.
{context}
Question: {question}
Relevant text, if any:"""

The second is the reduce prompt:

system_template = """Given the following extracted parts of a long document and a question, create a final answer.
If you don't know the answer, just say that you don't know. Don't try to make up an answer.
______________________
{summaries}"""

It combines the multiple answers into a new prompt.

map_reduce does not necessarily beat stuff either; in my own tests it once produced a completely wrong result.
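
The two stages can be sketched as follows. As with the other sketches, this is an illustration under simplifying assumptions: the prompts are abbreviated, and map_reduce_answer and fake_llm are hypothetical names rather than LangChain APIs.

```python
def map_reduce_answer(question, docs, llm):
    """Map: one LLM call per document. Reduce: one more call over the partial results."""
    partials = [
        llm(f"Return any text relevant to the question.\n{doc}\nQuestion: {question}")
        for doc in docs
    ]
    summaries = "\n".join(partials)
    return llm(
        f"Create a final answer from these extracts.\n{summaries}\nQuestion: {question}")


prompts_seen = []


def fake_llm(prompt):
    """Stand-in model: records each prompt and returns a numbered answer."""
    prompts_seen.append(prompt)
    return f"ans{len(prompts_seen)}"


# Usage: map_reduce_answer("question", ["doc1", "doc2", "doc3"], fake_llm)
# makes k map calls plus one reduce call.
```

The map calls are independent of each other, so unlike refine they could be issued in parallel; the reduce step only sees the extracted answers, not the original documents, which is one way information can get lost.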

Comparing chain types

Here is a chain_type comparison produced by Kimi, recorded at the end for what it is worth.

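chain_type	LLM calls	Breaks on over-long context?	Accuracy/detail	Latency/cost	Best suited for
stuff	1	✅ Yes	Medium	Lowest	Short documents, quick Q&A
map_reduce	k+1	❌ No	Medium	High	Long documents that need synthesis
refine	k	❌ No	Highest	Highest	High-precision legal/medical text
map_rerank	k	❌ No	Best single chunk	Medium	Answers with an interpretable confidence score
compact	≤k+1	Adaptive	High	Medium	General-purpose services, variable length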

agent

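LLMs have a tool calling mechanism: based on the input, the model decides on its own whether a tool is needed (by matching against each tool's description). If so, it returns to the application the name of the tool method to call and its arguments, and the application performs the call itself (locally or remotely) to obtain a result R1. The application then adds R1 to the prompt and makes a second LLM call, and the LLM returns the final result to present to the user.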
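
The two-round exchange can be sketched like this. The reply format (a dict with "tool", "args", and "content" keys), the get_weather tool, and fake_llm are all invented for illustration; real models each use their own structure for tool calls.

```python
def get_weather(city: str) -> str:
    """A local tool; in practice this could just as well be a remote call."""
    return f"Sunny in {city}"


TOOLS = {"get_weather": get_weather}


def run_agent(user_input, llm):
    """Round 1: the LLM may request a tool. Round 2: it answers using the tool result."""
    reply = llm(user_input)
    if reply.get("tool") is None:
        return reply["content"]  # no tool needed, answer directly
    result = TOOLS[reply["tool"]](**reply["args"])  # the app performs the call (R1)
    followup = f"{user_input}\nTool result: {result}"
    return llm(followup)["content"]


def fake_llm(prompt):
    """Stand-in model with a hypothetical reply format."""
    if "Tool result:" in prompt:
        return {"tool": None, "content": "Answer: " + prompt.split("Tool result: ")[1]}
    if "weather" in prompt:
        return {"tool": "get_weather", "args": {"city": "Beijing"}}
    return {"tool": None, "content": "Hello!"}


print(run_agent("What's the weather in Beijing?", fake_llm))
```

Note that the model never executes anything itself: it only names the tool and its arguments, and the application stays in control of the actual call.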

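The format in which an LLM returns the tool method name and arguments varies from model to model. To reduce the work of adapting an application to multiple models, Anthropic proposed MCP (Model Context Protocol), which standardizes how tools are described to, and invoked by, LLM applications.

In a sense, the LLM deciding which tool to call on behalf of the agent is itself a form of intent classification.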