```python
# Print the title and content of each retrieved document
for doc in docs:
    print(f"{doc.metadata['title']}, Content: {doc.page_content}")
```


**Features**:
- Integrates NLP capabilities and supports natural-language queries (see the query sketch below).
- Well suited to internal enterprise document retrieval scenarios.
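
As a minimal sketch of what a natural-language query looks like, the snippet below uses `AmazonKendraRetriever` from `langchain_community`; the index ID, region, and question are placeholders rather than values from this article:

```python
from langchain_community.retrievers import AmazonKendraRetriever

# Placeholder index ID and region -- replace with your own Kendra index
retriever = AmazonKendraRetriever(
    index_id="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    region_name="us-east-1",
    top_k=3,
)

# The query is plain natural language; Kendra handles the semantic matching
docs = retriever.invoke("How do I submit a reimbursement request?")
for doc in docs:
    print(doc.metadata.get("title"), doc.page_content[:200])
```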

---

### Example 4: Loading Files from AWS S3

AWS S3 is an object storage service; documents stored in S3 buckets can be loaded through LangChain.

Install the packages:

```bash
pip install langchain-community boto3
```

Code example:

```python
from langchain_community.document_loaders import S3FileLoader

# Initialize the S3 file loader (the first parameter is the bucket name)
s3_loader = S3FileLoader(bucket="your-bucket-name", key="path/to/your/file.pdf")

# Load the document
docs = s3_loader.load()

# Print the content of each loaded document
for doc in docs:
    print(doc.page_content)
```

**Features**:
- Well suited to dynamically loading unstructured internal enterprise data (a directory-level loading sketch follows below).
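
When an entire prefix of internal documents needs to be loaded rather than a single object, `S3DirectoryLoader` from the same package can be used. This is a sketch only; the bucket name and prefix are placeholders:

```python
from langchain_community.document_loaders import S3DirectoryLoader

# Load every object under a prefix; bucket and prefix are placeholders
dir_loader = S3DirectoryLoader(bucket="your-bucket-name", prefix="internal-docs/")

docs = dir_loader.load()
print(f"Loaded {len(docs)} documents from S3")
```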

---

## 4. Application Scenarios

1. **Intelligent customer service**
   - Use models hosted on Amazon Bedrock to add conversational features quickly.
   - Use Amazon Kendra to retrieve content from the internal knowledge base and return precise answers (see the combined sketch after this list).
2. **Enterprise internal search**
   - Store corporate documents in S3 and combine Kendra and OpenSearch for fast lookup.
   - Use Bedrock's generation capabilities to make answers more natural and fluent.
3. **Personalized recommendation systems**
   - Serve real-time responses from recommendation models deployed on SageMaker.
   - Use MemoryDB for session storage and context management.
4. **AI application development**
   - Deploy the LangChain toolchain on AWS Lambda for serverless scalability (see the handler sketch after this list).
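
As a rough sketch of how scenarios 1 and 2 fit together (a Kendra retriever feeding a Bedrock-hosted model), the snippet below assumes the `langchain-aws` package is installed; the index ID, model ID, and region are placeholders:

```python
from langchain_aws import ChatBedrock
from langchain_community.retrievers import AmazonKendraRetriever
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Placeholder index ID, model ID, and region -- replace with your own resources
retriever = AmazonKendraRetriever(index_id="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", top_k=3)
llm = ChatBedrock(model_id="anthropic.claude-3-haiku-20240307-v1:0", region_name="us-east-1")

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Concatenate retrieved passages into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(chain.invoke("What is the company's reimbursement process?"))
```

For scenario 4, a serverless deployment typically just wraps such a chain in a standard Lambda handler. The handler below is a sketch that assumes the `chain` above is built at module import time, so it is reused across warm invocations:

```python
import json

def lambda_handler(event, context):
    # Expect a JSON body like {"question": "..."}; `chain` is assumed to be defined at module level
    question = json.loads(event.get("body") or "{}").get("question", "")
    if not question:
        return {"statusCode": 400, "body": json.dumps({"error": "question is required"})}

    answer = chain.invoke(question)
    return {"statusCode": 200, "body": json.dumps({"answer": answer})}
```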

## 5. Practical Recommendations

1. **Security and privacy**: make sure the access permissions in your AWS configuration (IAM roles) follow the Least Privilege Principle.
2. **Cost optimization**: rely on AWS managed services (such as Bedrock Serverless) to scale on demand and avoid keeping idle resources around.
3. **Debugging and monitoring**: use AWS CloudWatch to monitor the application's runtime state and pinpoint potential issues.
4. **Dependency management**: keep boto3 and the langchain-related libraries up to date to get the latest features and performance optimizations (see the command below).
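
A routine upgrade is usually enough for item 4; the package names below reflect the split LangChain distribution and are the commonly used ones, not a list taken from this article:

```bash
pip install -U boto3 langchain-community langchain-aws
```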

Backed by AWS's powerful infrastructure, LangChain helps developers move quickly from concept to implementation in generative AI applications. If you run into any issues in practice, feel free to discuss them in the comments!

—END—
