LLM padding left or right

文章讨论了大模型在处理输入序列长度不一致时,为何选择左填充而非右填充进行Batch推理。作者指出,尽管BERT时代习惯右填充,但大模型倾向于左填充以方便操作,即使右填充也需额外步骤。实例如Gemma的推理代码展示了如何在推理过程中处理paddingtoken。

参考博客:
大部分的大模型(LLM)采用左填充(left-padding)的原因
注:文章主要内容参考以上博客,及其评论区,如有侵权,联系删除。

最近在看大模型相关内容的时候,突然想到我实习时候一直一知半解的问题,大模型采用padding left还是right。以上内容,根据我自身的理解,如有错误,请大家批评指正。这里的padding 的位置,仅仅考虑推理时候的left or right。

为什么需要padding?

因为输入的序列长度不一致,为了能够在Batch内进行数据推理,所以需要增加padding,使输入序列的长度是一致的。

为什么要考虑padding left or right?

在BERT时代,通常padding的方式为right,即在右侧进行padding,因为BERT在初始位置有个特殊token,[CLS]左侧进行padding,不好操作。
在大模型时代,可能更偏向于左侧padding, 为什么进行左侧padding,我理解主要原因可能是为了更好也是为了更好操作。
最直观的想法,如果右侧进行padding,生成的序列中间会存在padding token,还需要进一步处理padding token。如下图所示:
在这里插入图片描述

如果采用左侧的padding 的方式则是比较方便处理或者操作。在进行batch推理的时候左侧,进行操作,非常的方便, 如下图所示:
在这里插入图片描述

摘取一些比较好的解释

在这里插入图片描述

大模型batch推理时只能padding left?

大模型在推理时候,同样可以采用padding right,只不过需要增加一些步骤,没有padding left这么直观。
由于只找到LLama和Gemma的推理代码,所以仅仅参考这两个代码进行解释。
参考代码:
LLama
Gemma
下面是Gemma推理代码:

 def generate(
        self,
        prompts: Union[str, Sequence[str]],
        device: Any,
        output_len: int = 100,
        temperature: Union[float, None] = 0.95,
        top_p: float = 1.0,
        top_k: int = 100,
    ) -> Union[str, Sequence[str]]:
        """Generates responses for given prompts using Gemma model."""
        # If a single prompt is provided, treat it as a batch of 1.
        is_str_prompt = isinstance(prompts, str)
        if is_str_prompt:
            prompts = [prompts]

        batch_size = len(prompts)
        prompt_tokens = [self.tokenizer.encode(prompt) for prompt in prompts]
        min_prompt_len = min(len(p) for p in prompt_tokens)
        max_prompt_len = max(len(p) for p in prompt_tokens)
        max_seq_len = max_prompt_len + output_len
        assert max_seq_len <= self.config.max_position_embeddings

        # build KV caches
        kv_caches = []
        for _ in range(self.config.num_hidden_layers):
            size = (batch_size, max_seq_len, self.config.num_key_value_heads,
                    self.config.head_dim)
            dtype = self.config.get_dtype()
            k_cache = torch.zeros(size=size, dtype=dtype, device=device)
            v_cache = torch.zeros(size=size, dtype=dtype, device=device)
            kv_caches.append((k_cache, v_cache))

        # prepare inputs
        token_ids_tensor = torch.full((batch_size, max_seq_len),
                                      self.tokenizer.pad_id, dtype=torch.int64)
        input_token_ids_tensor = torch.full((batch_size, min_prompt_len),
                                            self.tokenizer.pad_id,
                                            dtype=torch.int64)
        for i, p in enumerate(prompt_tokens):
            token_ids_tensor[i, :len(p)] = torch.tensor(p)
            input_token_ids_tensor[i, :min_prompt_len] = torch.tensor(
                p[:min_prompt_len])
        token_ids_tensor = token_ids_tensor.to(device)
        input_token_ids_tensor = input_token_ids_tensor.to(device)
        prompt_mask_tensor = token_ids_tensor != self.tokenizer.pad_id
        input_positions_tensor = torch.arange(0, min_prompt_len,
                                              dtype=torch.int64).to(device)
        mask_tensor = torch.full((1, 1, max_seq_len, max_seq_len),
                                 -2.3819763e38).to(torch.float)
        mask_tensor = torch.triu(mask_tensor, diagonal=1).to(device)
        curr_mask_tensor = mask_tensor.index_select(2, input_positions_tensor)
        output_positions_tensor = torch.LongTensor([min_prompt_len - 1]).to(
            device)
        temperatures_tensor = None if not temperature else torch.FloatTensor(
            [temperature] * batch_size).to(device)
        top_ps_tensor = torch.FloatTensor([top_p] * batch_size).to(device)
        top_ks_tensor = torch.LongTensor([top_k] * batch_size).to(device)
        output_index = torch.tensor(min_prompt_len, dtype=torch.int64).to(
            device)

        # Prefill up to min_prompt_len tokens, then treat other prefill as
        # decode and ignore output.
        for i in range(max_seq_len - min_prompt_len):
            next_token_ids = self(
                input_token_ids=input_token_ids_tensor,
                input_positions=input_positions_tensor,
                kv_write_indices=None,
                kv_caches=kv_caches,
                mask=curr_mask_tensor,
                output_positions=output_positions_tensor,
                temperatures=temperatures_tensor,
                top_ps=top_ps_tensor,
                top_ks=top_ks_tensor,
            )

            curr_prompt_mask = prompt_mask_tensor.index_select(
                1, output_index).squeeze(dim=1)
            curr_token_ids = token_ids_tensor.index_select(
                1, output_index).squeeze(dim=1)
            output_token_ids = torch.where(curr_prompt_mask, curr_token_ids,
                                           next_token_ids).unsqueeze(dim=1)
            token_ids_tensor.index_copy_(1, output_index, output_token_ids)

            input_token_ids_tensor = output_token_ids
            input_positions_tensor = output_index.unsqueeze(dim=-1)
            curr_mask_tensor = mask_tensor.index_select(2,
                                                        input_positions_tensor)
            output_positions_tensor = torch.tensor(0, dtype=torch.int64).to(
                device)
            output_index = output_index + 1

        # Detokenization.
        token_ids = token_ids_tensor.tolist()
        results = []
        for i, tokens in enumerate(token_ids):
            trimmed_output = tokens[len(prompt_tokens[i]):len(prompt_tokens[i])
                                    + output_len]
            if self.tokenizer.eos_id in trimmed_output:
                eos_index = trimmed_output.index(self.tokenizer.eos_id)
                trimmed_output = trimmed_output[:eos_index]
            results.append(self.tokenizer.decode(trimmed_output))

        # If a string was provided as input, return a string as output.
        return results[0] if is_str_prompt else results

以下面的图为例子进行讲解:
在这里插入图片描述
在推理的时候,先进行右侧padding,使长度一致。选择最短的长度同时进行处理,上图为1, 那么我们同时处理batch min(即1), 然后开始逐个token进行推理,怎么避免下图的形式呢在这里插入图片描述
核心代码为下面内容:output_token_ids = torch.where(curr_prompt_mask, curr_token_ids, next_token_ids).unsqueeze(dim=1),主要的思路就是,如何当前的token为padding token 则填充上一步预测的token的结果,否则填充当前的token。
例如:
在这里插入图片描述
当前位置的token不为pading,则token还是3,不为2这个位置预测的token。
在这里插入图片描述
当前token为padding,则为2这个位置预测的token。
以上就是大模型采用right padding的方法。

总结

感觉pading left or right, 其实无所谓,主要就是为了方便。根据实际情况的具体需求,进行使用,用的正确,方便即可。

#请作如下修改——1.加入访问访问本地知库时的请求文档数量显示和干预控件;2.进一步优化界面色彩;3.加入GPU使用情况监控 import tkinter as tk from tkinter import ttk, filedialog, messagebox, scrolledtext import ollama import os import time import threading import numpy as np import matplotlib.pyplot as plt from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg from matplotlib.figure import Figure import pandas as pd import seaborn as sns import PyPDF2 import docx import markdown from bs4 import BeautifulSoup import openpyxl from PIL import Image import pytesseract import io import psutil from ttkthemes import ThemedTk # 设置中文字体 plt.rcParams['font.sans-serif'] = ['SimHei'] plt.rcParams['axes.unicode_minus'] = False class RAGApplication: def __init__(self, root): self.root = root self.root.title("✨智能RAG应用系统✨") self.root.geometry("1400x900") self.root.configure(bg="#f0f0f0") # 淡灰色背景 # 使用现代主题 self.style = ttk.Style() self.style.theme_use('arc') # 现代主题 # 自定义样式 - 淡色调 self.style.configure('TFrame', background='#f0f0f0') self.style.configure('TLabel', background='#f0f0f0', foreground='#333333') self.style.configure('TLabelframe', background='#f0f0f0', foreground='#333333', borderwidth=1) self.style.configure('TLabelframe.Label', background='#f0f0f0', foreground='#4dabf5') # 淡蓝色标题 self.style.configure('TButton', background='#4dabf5', foreground='#333333', borderwidth=1) # 深色文字按钮 self.style.map('TButton', background=[('active', '#3b99e0')]) self.style.configure('TNotebook', background='#f0f0f0', borderwidth=0) self.style.configure('TNotebook.Tab', background='#e6f0ff', foreground='#333333', padding=[10, 5]) # 淡蓝色标签 self.style.map('TNotebook.Tab', background=[('selected', '#4dabf5')]) # 初始化数据 self.documents = [] self.chunks = [] self.embeddings = [] self.qa_history = [] # 获取 Ollama 中已安装的模型列表 try: models_response = ollama.list() self.all_models = [model['model'] for model in models_response['models']] # 使用 'model' 字段 except Exception as e: print(f"获取 Ollama 模型列表失败: {e}") self.all_models = [] self.default_llm_model = "qwen2:7b" self.default_embedding_model = "bge-m3:latest" # 默认参数 self.params = { "temperature": 0.7, "top_p": 0.9, "max_length": 2048, "num_context_docs": 3, "chunk_size": 500, "chunk_overlap": 100, "chunk_strategy": "固定大小", "separators": "\n\n\n。\n!\n?\n", "embed_batch_size": 1, "enable_stream": True, "show_progress": True, "show_visualization": True, "ocr_enabled": True } # 创建界面 self.create_ui() def create_ui(self): # 主框架 self.main_frame = ttk.Frame(self.root) self.main_frame.pack(fill=tk.BOTH, expand=True, padx=20, pady=20) # 标题 title_frame = ttk.Frame(self.main_frame) title_frame.pack(fill=tk.X, pady=(0, 20)) ttk.Label(title_frame, text="✨ 智能RAG应用系统 ✨", font=('Arial', 24, 'bold'), foreground="#4dabf5").pack(side=tk.LEFT) # 淡青色标题 # 状态指示器 status_frame = ttk.Frame(title_frame) status_frame.pack(side=tk.RIGHT) self.status_label = ttk.Label(status_frame, text="● 就绪", foreground="#28a745") # 绿色状态 self.status_label.pack(side=tk.RIGHT, padx=10) # 参数控制面板 self.create_sidebar() # 主内容区域 self.create_main_content() def create_sidebar(self): # 侧边栏框架 self.sidebar = ttk.LabelFrame(self.main_frame, text="⚙️ 参数控制面板", width=300) self.sidebar.pack(side=tk.LEFT, fill=tk.Y, padx=10, pady=10) # 大模型参数 ttk.Label(self.sidebar, text="🔧 大模型参数", font=('Arial', 10, 'bold'), foreground="#333333").pack(pady=(15, 5)) self.temperature = tk.DoubleVar(value=self.params["temperature"]) ttk.Label(self.sidebar, text="温度(temperature)").pack(anchor=tk.W, padx=10) temp_frame = ttk.Frame(self.sidebar) temp_frame.pack(fill=tk.X, padx=10, pady=(0, 5)) ttk.Scale(temp_frame, from_=0.0, to=2.0, variable=self.temperature, length=180, command=lambda v: self.update_param("temperature", float(v))).pack(side=tk.LEFT) self.temp_label = ttk.Label(temp_frame, text=f"{self.temperature.get():.1f}", width=5) self.temp_label.pack(side=tk.RIGHT, padx=5) self.top_p = tk.DoubleVar(value=self.params["top_p"]) ttk.Label(self.sidebar, text="Top P").pack(anchor=tk.W, padx=10) top_p_frame = ttk.Frame(self.sidebar) top_p_frame.pack(fill=tk.X, padx=10, pady=(0, 5)) ttk.Scale(top_p_frame, from_=0.0, to=1.0, variable=self.top_p, length=180, command=lambda v: self.update_param("top_p", float(v))).pack(side=tk.LEFT) self.top_p_label = ttk.Label(top_p_frame, text=f"{self.top_p.get():.2f}", width=5) self.top_p_label.pack(side=tk.RIGHT, padx=5) # 添加大模型名称选择(来自 Ollama) ttk.Label(self.sidebar, text="大模型名称").pack(anchor=tk.W, padx=10) self.llm_model_var = tk.StringVar(value=self.default_llm_model) llm_combobox = ttk.Combobox(self.sidebar, textvariable=self.llm_model_var, values=self.all_models) llm_combobox.pack(padx=10, pady=5) # 嵌入模型选择(来自 Ollama) ttk.Label(self.sidebar, text="嵌入模型名称").pack(anchor=tk.W, padx=10) self.embedding_model_var = tk.StringVar(value=self.default_embedding_model) embed_combobox = ttk.Combobox(self.sidebar, textvariable=self.embedding_model_var, values=self.all_models) embed_combobox.pack(padx=10, pady=5) # RAG参数 ttk.Label(self.sidebar, text="🔧 RAG参数", font=('Arial', 10, 'bold'), foreground="#333333").pack(pady=(15, 5)) self.chunk_size = tk.IntVar(value=self.params["chunk_size"]) ttk.Label(self.sidebar, text="分块大小(字符)").pack(anchor=tk.W, padx=10) chunk_frame = ttk.Frame(self.sidebar) chunk_frame.pack(fill=tk.X, padx=10, pady=(0, 5)) ttk.Scale(chunk_frame, from_=100, to=2000, variable=self.chunk_size, length=180, command=lambda v: self.update_param("chunk_size", int(v))).pack(side=tk.LEFT) self.chunk_label = ttk.Label(chunk_frame, text=f"{self.chunk_size.get()}", width=5) self.chunk_label.pack(side=tk.RIGHT, padx=5) # OCR开关 self.ocr_var = tk.BooleanVar(value=self.params["ocr_enabled"]) ttk.Checkbutton(self.sidebar, text="启用OCR扫描", variable=self.ocr_var, command=lambda: self.update_param("ocr_enabled", self.ocr_var.get())).pack(pady=(15, 5), padx=10, anchor=tk.W) # 使用说明 ttk.Label(self.sidebar, text="📖 使用说明", font=('Arial', 10, 'bold'), foreground="#333333").pack(pady=(15, 5)) instructions = """1. 在"文档上传"页上传您的文档 2. 在"文档处理"页对文档进行分块和嵌入 3. 在"问答交互"页提问并获取答案 4. 在"系统监控"页查看系统状态""" ttk.Label(self.sidebar, text=instructions, justify=tk.LEFT, background="#e6f0ff", # 淡蓝色背景 foreground="#333333", padding=10).pack(fill=tk.X, padx=10, pady=5) def create_main_content(self): # 主内容框架 self.content_frame = ttk.Frame(self.main_frame) self.content_frame.pack(side=tk.RIGHT, fill=tk.BOTH, expand=True) # 创建选项卡 self.notebook = ttk.Notebook(self.content_frame) self.notebook.pack(fill=tk.BOTH, expand=True) # 文档上传页 self.create_upload_tab() # 文档处理页 self.create_process_tab() # 问答交互页 self.create_qa_tab() # 系统监控页 self.create_monitor_tab() def create_upload_tab(self): self.upload_tab = ttk.Frame(self.notebook) self.notebook.add(self.upload_tab, text="📤 文档上传") # 标题 title_frame = ttk.Frame(self.upload_tab) title_frame.pack(fill=tk.X, pady=(10, 20)) ttk.Label(title_frame, text="📤 文档上传与管理", font=('Arial', 14, 'bold'), foreground="#4dabf5").pack(side=tk.LEFT) # 淡青色标题 # 上传区域 upload_frame = ttk.Frame(self.upload_tab) upload_frame.pack(fill=tk.X, pady=10) # 上传按钮 upload_btn = ttk.Button(upload_frame, text="📁 上传文档", command=self.upload_files, style='Accent.TButton') upload_btn.pack(side=tk.LEFT, padx=10) # 清除按钮 clear_btn = ttk.Button(upload_frame, text="🗑️ 清除所有", command=self.clear_documents) clear_btn.pack(side=tk.RIGHT, padx=10) # 文档列表 self.doc_list_frame = ttk.LabelFrame(self.upload_tab, text="📋 已上传文档") self.doc_list_frame.pack(fill=tk.BOTH, expand=True, padx=10, pady=10) # 创建带滚动条的树状视图 tree_frame = ttk.Frame(self.doc_list_frame) tree_frame.pack(fill=tk.BOTH, expand=True, padx=5, pady=5) # 创建滚动条 tree_scroll = ttk.Scrollbar(tree_frame) tree_scroll.pack(side=tk.RIGHT, fill=tk.Y) # 创建树状视图 columns = ("name", "size", "time", "type") self.doc_tree = ttk.Treeview(tree_frame, columns=columns, show="headings", yscrollcommand=tree_scroll.set, height=8) # 设置列标题 self.doc_tree.heading("name", text="文件名") self.doc_tree.heading("size", text="大小") self.doc_tree.heading("time", text="上传时间") self.doc_tree.heading("type", text="类型") # 设置列宽 self.doc_tree.column("name", width=250) self.doc_tree.column("size", width=80) self.doc_tree.column("time", width=150) self.doc_tree.column("type", width=80) self.doc_tree.pack(side=tk.LEFT, fill=tk.BOTH, expand=True) tree_scroll.config(command=self.doc_tree.yview) # 文档统计 self.doc_stats_frame = ttk.Frame(self.upload_tab) self.doc_stats_frame.pack(fill=tk.X, pady=10, padx=10) stats_style = ttk.Style() stats_style.configure('Stats.TLabel', background='#e6f0ff', foreground='#333333', padding=5) # 淡蓝色背景 ttk.Label(self.doc_stats_frame, text="📊 文档统计:", style='Stats.TLabel').pack(side=tk.LEFT, padx=5) self.doc_count_label = ttk.Label(self.doc_stats_frame, text="0", style='Stats.TLabel') self.doc_count_label.pack(side=tk.LEFT, padx=5) ttk.Label(self.doc_stats_frame, text="总字符数:", style='Stats.TLabel').pack(side=tk.LEFT, padx=5) self.char_count_label = ttk.Label(self.doc_stats_frame, text="0", style='Stats.TLabel') self.char_count_label.pack(side=tk.LEFT, padx=5) ttk.Label(self.doc_stats_frame, text="总页数:", style='Stats.TLabel').pack(side=tk.LEFT, padx=5) self.page_count_label = ttk.Label(self.doc_stats_frame, text="0", style='Stats.TLabel') self.page_count_label.pack(side=tk.LEFT, padx=5) def clear_documents(self): if not self.documents: return if messagebox.askyesno("确认", "确定要清除所有文档吗?"): self.documents = [] self.update_doc_list() # ================== 文件读取函数 ================== def read_pdf(self, filepath): """读取PDF文件内容,支持扫描版OCR""" content = "" pages = 0 try: with open(filepath, 'rb') as f: reader = PyPDF2.PdfReader(f) num_pages = len(reader.pages) pages = num_pages for page_num in range(num_pages): page = reader.pages[page_num] text = page.extract_text() # 如果是扫描版PDF,使用OCR识别 if not text.strip() and self.params["ocr_enabled"]: try: # 获取页面图像 images = page.images if images: for img in images: image_data = img.data image = Image.open(io.BytesIO(image_data)) text += pytesseract.image_to_string(image, lang='chi_sim+eng') except Exception as e: print(f"OCR处理失败: {str(e)}") content += text + "\n" except Exception as e: print(f"读取PDF失败: {str(e)}") return content, pages def read_docx(self, filepath): """读取Word文档内容""" content = "" pages = 0 try: doc = docx.Document(filepath) for para in doc.paragraphs: content += para.text + "\n" pages = len(doc.paragraphs) // 50 + 1 # 估算页数 except Exception as e: print(f"读取Word文档失败: {str(e)}") return content, pages def read_excel(self, filepath): """读取Excel文件内容,优化内存使用""" content = "" pages = 0 try: # 使用openpyxl优化大文件读取 wb = openpyxl.load_workbook(filepath, read_only=True) for sheet_name in wb.sheetnames: content += f"\n工作表: {sheet_name}\n" sheet = wb[sheet_name] for row in sheet.iter_rows(values_only=True): row_content = " | ".join([str(cell) if cell is not None else "" for cell in row]) content += row_content + "\n" pages = len(wb.sheetnames) except Exception as e: print(f"读取Excel文件失败: {str(e)}") return content, pages def read_md(self, filepath): """读取Markdown文件内容""" content = "" pages = 0 try: with open(filepath, 'r', encoding='utf-8') as f: html = markdown.markdown(f.read()) soup = BeautifulSoup(html, 'html.parser') content = soup.get_text() pages = len(content) // 2000 + 1 # 估算页数 except Exception as e: print(f"读取Markdown文件失败: {str(e)}") return content, pages def read_ppt(self, filepath): """读取PPT文件内容(简化版)""" content = "" pages = 0 try: # 实际应用中应使用python-pptx库 # 这里仅作演示 content = f"PPT文件内容提取: {os.path.basename(filepath)}" pages = 10 # 假设有10页 except Exception as e: print(f"读取PPT文件失败: {str(e)}") return content, pages def upload_files(self): filetypes = [ ("文本文件", "*.txt"), ("PDF文件", "*.pdf"), ("Word文件", "*.docx *.doc"), ("Excel文件", "*.xlsx *.xls"), ("Markdown文件", "*.md"), ("PPT文件", "*.pptx *.ppt"), ("所有文件", "*.*") ] filenames = filedialog.askopenfilenames(title="选择文档", filetypes=filetypes) if filenames: self.status_label.config(text="● 正在上传文档...", foreground="#ffc107") # 黄色状态 total_pages = 0 for filename in filenames: try: ext = os.path.splitext(filename)[1].lower() if ext == '.txt': with open(filename, 'r', encoding='utf-8') as f: content = f.read() pages = len(content) // 2000 + 1 elif ext == '.pdf': content, pages = self.read_pdf(filename) elif ext in ('.docx', '.doc'): content, pages = self.read_docx(filename) elif ext in ('.xlsx', '.xls'): content, pages = self.read_excel(filename) elif ext == '.md': content, pages = self.read_md(filename) elif ext in ('.pptx', '.ppt'): content, pages = self.read_ppt(filename) else: messagebox.showwarning("警告", f"不支持的文件类型: {ext}") continue # 处理字符编码问题 if not isinstance(content, str): try: content = content.decode('utf-8') except: content = content.decode('latin-1', errors='ignore') self.documents.append({ "name": os.path.basename(filename), "content": content, "size": len(content), "upload_time": time.strftime("%Y-%m-%d %H:%M:%S"), "type": ext.upper().replace(".", ""), "pages": pages }) total_pages += pages # 更新文档列表 self.update_doc_list() except Exception as e: messagebox.showerror("错误", f"无法读取文件 {filename}: {str(e)}") self.status_label.config(text=f"● 上传完成! 共{len(filenames)}个文档", foreground="#28a745") # 绿色状态 self.page_count_label.config(text=str(total_pages)) def update_doc_list(self): # 清空现有列表 for item in self.doc_tree.get_children(): self.doc_tree.delete(item) # 添加新文档 for doc in self.documents: size_kb = doc["size"] / 1024 size_str = f"{size_kb:.1f} KB" if size_kb < 1024 else f"{size_kb / 1024:.1f} MB" self.doc_tree.insert("", tk.END, values=( doc["name"], size_str, doc["upload_time"], doc["type"] )) # 更新统计信息 self.doc_count_label.config(text=str(len(self.documents))) self.char_count_label.config(text=str(sum(d['size'] for d in self.documents))) def create_process_tab(self): self.process_tab = ttk.Frame(self.notebook) self.notebook.add(self.process_tab, text="🔧 文档处理") # 标题 title_frame = ttk.Frame(self.process_tab) title_frame.pack(fill=tk.X, pady=(10, 20)) ttk.Label(title_frame, text="🔧 文档处理与分块", font=('Arial', 14, 'bold'), foreground="#4dabf5").pack(side=tk.LEFT) # 淡青色标题 # 处理按钮 btn_frame = ttk.Frame(self.process_tab) btn_frame.pack(fill=tk.X, pady=10) process_btn = ttk.Button(btn_frame, text="🔄 处理文档", command=self.process_documents, style='Accent.TButton') process_btn.pack(side=tk.LEFT, padx=10) visualize_btn = ttk.Button(btn_frame, text="📊 更新可视化", command=self.show_visualizations) visualize_btn.pack(side=tk.LEFT, padx=10) # 主内容区域 content_frame = ttk.Frame(self.process_tab) content_frame.pack(fill=tk.BOTH, expand=True) # 左侧:可视化区域 self.visual_frame = ttk.LabelFrame(content_frame, text="📈 文档分析") self.visual_frame.pack(side=tk.LEFT, fill=tk.BOTH, expand=True, padx=10, pady=10) # 右侧:分块列表 self.chunk_frame = ttk.LabelFrame(content_frame, text="📋 分块结果") self.chunk_frame.pack(side=tk.RIGHT, fill=tk.BOTH, expand=True, padx=10, pady=10) # 创建带滚动条的树状视图 tree_frame = ttk.Frame(self.chunk_frame) tree_frame.pack(fill=tk.BOTH, expand=True, padx=5, pady=5) # 创建滚动条 tree_scroll = ttk.Scrollbar(tree_frame) tree_scroll.pack(side=tk.RIGHT, fill=tk.Y) # 创建树状视图 columns = ("doc_name", "start", "end", "content") self.chunk_tree = ttk.Treeview( tree_frame, columns=columns, show="headings", yscrollcommand=tree_scroll.set, height=15 ) # 设置列标题 self.chunk_tree.heading("doc_name", text="来源文档") self.chunk_tree.heading("start", text="起始位置") self.chunk_tree.heading("end", text="结束位置") self.chunk_tree.heading("content", text="内容预览") # 设置列宽 self.chunk_tree.column("doc_name", width=150) self.chunk_tree.column("start", width=80) self.chunk_tree.column("end", width=80) self.chunk_tree.column("content", width=300) self.chunk_tree.pack(side=tk.LEFT, fill=tk.BOTH, expand=True) tree_scroll.config(command=self.chunk_tree.yview) # 初始显示占位图 self.show_placeholder() def show_placeholder(self): """显示可视化占位图""" for widget in self.visual_frame.winfo_children(): widget.destroy() placeholder = ttk.Label(self.visual_frame, text="文档处理后将显示分析图表", font=('Arial', 12), foreground="#7f8c8d") placeholder.pack(expand=True, pady=50) def process_documents(self): if not self.documents: messagebox.showwarning("警告", "请先上传文档") return # 在新线程中处理文档 threading.Thread(target=self._process_documents_thread, daemon=True).start() def _process_documents_thread(self): # 显示进度条 self.progress_window = tk.Toplevel(self.root) self.progress_window.title("处理进度") self.progress_window.geometry("400x150") self.progress_window.resizable(False, False) self.progress_window.transient(self.root) self.progress_window.grab_set() self.progress_window.configure(bg="#f0f0f0") # 淡灰色背景 # 设置窗口居中 x = self.root.winfo_x() + (self.root.winfo_width() - 400) // 2 y = self.root.winfo_y() + (self.root.winfo_height() - 150) // 2 self.progress_window.geometry(f"+{x}+{y}") # 进度窗口内容 ttk.Label(self.progress_window, text="正在处理文档...", font=('Arial', 11)).pack(pady=(20, 10)) progress_frame = ttk.Frame(self.progress_window) progress_frame.pack(fill=tk.X, padx=20, pady=10) self.progress_var = tk.DoubleVar() progress_bar = ttk.Progressbar(progress_frame, variable=self.progress_var, maximum=100, length=360) progress_bar.pack() self.progress_label = ttk.Label(progress_frame, text="0%") self.progress_label.pack(pady=5) self.status_label.config(text="● 正在处理文档...", foreground="#ffc107") # 黄色状态 self.progress_window.update() try: # 分块处理 self.chunks = self.chunk_documents( self.documents, self.params["chunk_strategy"], self.params["chunk_size"], self.params["chunk_overlap"] ) # 生成嵌入 self.embeddings = self.generate_embeddings(self.chunks) # 更新UI self.root.after(0, self.update_chunk_list) self.root.after(0, self.show_visualizations) self.root.after(0, lambda: messagebox.showinfo("完成", "文档处理完成!")) self.status_label.config(text="● 文档处理完成", foreground="#28a745") # 绿色状态 except Exception as e: self.root.after(0, lambda: messagebox.showerror("错误", f"处理文档时出错: {str(e)}")) self.status_label.config(text="● 处理出错", foreground="#dc3545") # 红色状态 finally: self.root.after(0, self.progress_window.destroy) def chunk_documents(self, documents, strategy, size, overlap): chunks = [] total_docs = len(documents) for doc_idx, doc in enumerate(documents): content = doc['content'] if strategy == "固定大小": for i in range(0, len(content), size - overlap): chunk = content[i:i + size] chunks.append({ "doc_name": doc['name'], "content": chunk, "start": i, "end": min(i + size, len(content)) }) # 更新进度 progress = (doc_idx + 1) / total_docs * 100 self.progress_var.set(progress) self.progress_label.config(text=f"{int(progress)}%") self.progress_window.update() return chunks def generate_embeddings(self, chunks): """单批次处理每个分块,避免API参数类型错误""" embeddings = [] total_chunks = len(chunks) for idx, chunk in enumerate(chunks): try: # 传递单个字符串而不是列表 response = ollama.embeddings( model=self.embedding_model_var.get(), prompt=chunk['content'] # 单个字符串 ) embeddings.append({ "chunk_id": idx, "embedding": response['embedding'], "doc_name": chunk['doc_name'] }) except Exception as e: print(f"生成嵌入时出错: {str(e)}") # 添加空嵌入占位符 embeddings.append({ "chunk_id": idx, "embedding": None, "doc_name": chunk['doc_name'] }) # 更新进度 progress = (idx + 1) / total_chunks * 100 self.progress_var.set(progress) self.progress_label.config(text=f"{int(progress)}%") self.progress_window.update() # 添加延迟避免请求过快 time.sleep(0.1) return embeddings def update_chunk_list(self): # 清空现有列表 for item in self.chunk_tree.get_children(): self.chunk_tree.delete(item) # 添加新分块 for chunk in self.chunks: preview = chunk['content'][:50] + "..." if len(chunk['content']) > 50 else chunk['content'] self.chunk_tree.insert("", tk.END, values=( chunk['doc_name'], chunk['start'], chunk['end'], preview )) def show_visualizations(self): # 清空可视化区域 for widget in self.visual_frame.winfo_children(): widget.destroy() if not self.params["show_visualization"] or not self.chunks: self.show_placeholder() return # 创建图表框架 fig = plt.Figure(figsize=(10, 8), dpi=100) fig.set_facecolor('#f0f0f0') # 淡灰色背景 # 分块大小分布 ax1 = fig.add_subplot(221) ax1.set_facecolor('#e6f0ff') # 淡蓝色背景 chunk_sizes = [len(c['content']) for c in self.chunks] sns.histplot(chunk_sizes, bins=20, ax=ax1, color='#4dabf5') # 淡青色 ax1.set_title("分块大小分布", color='#333333') ax1.set_xlabel("字符数", color='#333333') ax1.set_ylabel("数量", color='#333333') ax1.tick_params(axis='x', colors='#333333') ax1.tick_params(axis='y', colors='#333333') ax1.spines['bottom'].set_color('#333333') ax1.spines['left'].set_color('#333333') # 文档分块数量 ax2 = fig.add_subplot(222) ax2.set_facecolor('#e6f0ff') # 淡蓝色背景 doc_chunk_counts = {} for chunk in self.chunks: doc_chunk_counts[chunk['doc_name']] = doc_chunk_counts.get(chunk['doc_name'], 0) + 1 # 只显示前10个文档 doc_names = list(doc_chunk_counts.keys()) counts = list(doc_chunk_counts.values()) if len(doc_names) > 10: # 按分块数量排序,取前10 sorted_indices = np.argsort(counts)[::-1][:10] doc_names = [doc_names[i] for i in sorted_indices] counts = [counts[i] for i in sorted_indices] sns.barplot(x=counts, y=doc_names, hue=doc_names, ax=ax2, palette='Blues', orient='h', legend=False) ax2.set_title("各文档分块数量", color='#333333') ax2.set_xlabel("分块数", color='#333333') ax2.set_ylabel("") ax2.tick_params(axis='x', colors='#333333') ax2.tick_params(axis='y', colors='#333333') ax2.spines['bottom'].set_color('#333333') ax2.spines['left'].set_color('#333333') # 内容词云(模拟) ax3 = fig.add_subplot(223) ax3.set_facecolor('#e6f0ff') # 淡蓝色背景 ax3.set_title("内容关键词分布", color='#333333') ax3.text(0.5, 0.5, "关键词可视化区域", horizontalalignment='center', verticalalignment='center', color='#333333', fontsize=12) ax3.axis('off') # 处理进度 ax4 = fig.add_subplot(224) ax4.set_facecolor('#e6f0ff') # 淡蓝色背景 ax4.set_title("处理进度", color='#333333') # 模拟数据 stages = ['上传', '分块', '嵌入', '完成'] progress = [100, 100, 100, 100] # 假设都已完成 ax4.barh(stages, progress, color=['#4dabf5', '#20c997', '#9b59b6', '#ffc107']) # 淡色系 ax4.set_xlim(0, 100) ax4.set_xlabel("完成百分比", color='#333333') ax4.tick_params(axis='x', colors='#333333') ax4.tick_params(axis='y', colors='#333333') ax4.spines['bottom'].set_color('#333333') ax4.spines['left'].set_color('#333333') # 调整布局 fig.tight_layout(rect=[0, 0, 1, 0.95], pad=3.0) # 添加总标题 fig.suptitle("文档分析概览", fontsize=16, color='#333333') # 在Tkinter中显示图表 canvas = FigureCanvasTkAgg(fig, master=self.visual_frame) canvas.draw() canvas.get_tk_widget().pack(fill=tk.BOTH, expand=True) def create_qa_tab(self): self.qa_tab = ttk.Frame(self.notebook) self.notebook.add(self.qa_tab, text="💬 问答交互") # 标题 title_frame = ttk.Frame(self.qa_tab) title_frame.pack(fill=tk.X, pady=(10, 20)) ttk.Label(title_frame, text="💬 问答交互", font=('Arial', 14, 'bold'), foreground="#4dabf5").pack(side=tk.LEFT) # 淡青色标题 # 主内容区域 main_frame = ttk.Frame(self.qa_tab) main_frame.pack(fill=tk.BOTH, expand=True, padx=10, pady=5) # 左侧:问答区域 left_frame = ttk.Frame(main_frame) left_frame.pack(side=tk.LEFT, fill=tk.BOTH, expand=True, padx=5, pady=5) # 问题输入 self.question_frame = ttk.LabelFrame(left_frame, text="❓ 输入问题") self.question_frame.pack(fill=tk.X, padx=5, pady=5) self.question_text = scrolledtext.ScrolledText(self.question_frame, height=8, wrap=tk.WORD, font=('Arial', 11)) self.question_text.pack(fill=tk.BOTH, expand=True, padx=5, pady=5) self.question_text.focus_set() # 提交按钮 btn_frame = ttk.Frame(left_frame) btn_frame.pack(fill=tk.X, pady=10) submit_btn = ttk.Button(btn_frame, text="🚀 提交问题", command=self.submit_question, style='Accent.TButton') submit_btn.pack(side=tk.LEFT, padx=5) clear_btn = ttk.Button(btn_frame, text="🗑️ 清除问题", command=self.clear_question) clear_btn.pack(side=tk.LEFT, padx=5) # 回答显示 self.answer_frame = ttk.LabelFrame(left_frame, text="💡 回答") self.answer_frame.pack(fill=tk.BOTH, expand=True, padx=5, pady=5) self.answer_text = scrolledtext.ScrolledText(self.answer_frame, state=tk.DISABLED, wrap=tk.WORD, font=('Arial', 11)) self.answer_text.pack(fill=tk.BOTH, expand=True, padx=5, pady=5) # 右侧:问答历史 right_frame = ttk.Frame(main_frame, width=400) # 设置宽度 right_frame.pack(side=tk.RIGHT, fill=tk.BOTH, expand=False, padx=5, pady=5) # expand=False self.history_frame = ttk.LabelFrame(right_frame, text="🕒 问答历史") self.history_frame.pack(fill=tk.BOTH, expand=True, padx=5, pady=5) # 创建带滚动条的树状视图 tree_frame = ttk.Frame(self.history_frame) tree_frame.pack(fill=tk.BOTH, expand=True, padx=5, pady=5) # 创建滚动条 tree_scroll = ttk.Scrollbar(tree_frame) tree_scroll.pack(side=tk.RIGHT, fill=tk.Y) # 创建树状视图 columns = ("question", "time") self.history_tree = ttk.Treeview( tree_frame, columns=columns, show="headings", yscrollcommand=tree_scroll.set, height=20 ) # 设置列标题 self.history_tree.heading("question", text="问题") self.history_tree.heading("time", text="时间") # 设置列宽 self.history_tree.column("question", width=250) self.history_tree.column("time", width=120) self.history_tree.pack(side=tk.LEFT, fill=tk.BOTH, expand=True) tree_scroll.config(command=self.history_tree.yview) # 历史操作按钮 history_btn_frame = ttk.Frame(right_frame) history_btn_frame.pack(fill=tk.X, pady=10) view_btn = ttk.Button(history_btn_frame, text="👁️ 查看详情", command=lambda: self.show_history_detail(None)) view_btn.pack(side=tk.LEFT, padx=5, fill=tk.X, expand=True) clear_history_btn = ttk.Button(history_btn_frame, text="🗑️ 清除历史", command=self.clear_history) clear_history_btn.pack(side=tk.LEFT, padx=5, fill=tk.X, expand=True) # 绑定双击事件查看历史详情 self.history_tree.bind("<Double-1>", self.show_history_detail) def clear_question(self): self.question_text.delete("1.0", tk.END) def clear_history(self): if not self.qa_history: return if messagebox.askyesno("确认", "确定要清除所有问答历史吗?"): self.qa_history = [] self.update_history_list() def submit_question(self): question = self.question_text.get("1.0", tk.END).strip() if not question: messagebox.showwarning("警告", "问题不能为空") return # 在新线程中处理问题 threading.Thread(target=self._submit_question_thread, args=(question,), daemon=True).start() def _submit_question_thread(self, question): try: # 显示进度窗口 self.progress_window = tk.Toplevel(self.root) self.progress_window.title("处理中...") self.progress_window.geometry("400x150") self.progress_window.resizable(False, False) self.progress_window.transient(self.root) self.progress_window.grab_set() self.progress_window.configure(bg="#f0f0f0") # 淡灰色背景 # 设置窗口居中 x = self.root.winfo_x() + (self.root.winfo_width() - 400) // 2 y = self.root.winfo_y() + (self.root.winfo_height() - 150) // 2 self.progress_window.geometry(f"+{x}+{y}") # 进度窗口内容 ttk.Label(self.progress_window, text="正在思考中...", font=('Arial', 11)).pack(pady=(20, 10)) progress_frame = ttk.Frame(self.progress_window) progress_frame.pack(fill=tk.X, padx=20, pady=10) self.progress_var = tk.DoubleVar() progress_bar = ttk.Progressbar(progress_frame, variable=self.progress_var, maximum=100, length=360) progress_bar.pack() self.progress_label = ttk.Label(progress_frame, text="0%") self.progress_label.pack(pady=5) self.status_label.config(text="● 正在处理问题...", foreground="#ffc107") # 黄色状态 self.progress_window.update() # 检索相关文档块 relevant_chunks = self.retrieve_relevant_chunks(question, self.params["num_context_docs"]) # 构建上下文 context = "\n\n".join([ f"文档: {c['doc_name']}\n内容: {c['content']}\n相关性: {c['similarity']:.4f}" for c in relevant_chunks ]) # 调用大模型生成回答 prompt = f"""基于以下上下文,回答问题。如果答案不在上下文中,请回答"我不知道"。 上下文: {context} 问题: {question} 回答:""" # 更新进度 self.progress_var.set(50) self.progress_label.config(text="50%") self.progress_window.update() # 流式输出或一次性输出 self.root.after(0, self.answer_text.config, {'state': tk.NORMAL}) self.root.after(0, self.answer_text.delete, "1.0", tk.END) if self.params["enable_stream"]: full_response = "" for chunk in ollama.generate( model=self.llm_model_var.get(), # 使用用户选择的模型 prompt=prompt, stream=True, options={ 'temperature': self.params["temperature"], 'top_p': self.params["top_p"], 'num_ctx': self.params["max_length"] } ): full_response += chunk['response'] self.root.after(0, self.answer_text.insert, tk.END, chunk['response']) self.root.after(0, self.answer_text.see, tk.END) self.root.after(0, self.answer_text.update) # 更新进度 if len(full_response) > 0: progress = min(50 + len(full_response) / 200, 99) self.progress_var.set(progress) self.progress_label.config(text=f"{int(progress)}%") self.progress_window.update() else: response = ollama.generate( model=self.llm_model_var.get(), # 使用用户选择的模型 prompt=prompt, options={ 'temperature': self.params["temperature"], 'top_p': self.params["top_p"], 'num_ctx': self.params["max_length"] } ) full_response = response['response'] self.root.after(0, self.answer_text.insert, tk.END, full_response) # 记录问答历史 self.qa_history.append({ "question": question, "answer": full_response, "context": context, "time": time.strftime("%Y-%m-%d %H:%M:%S") }) # 更新历史列表 self.root.after(0, self.update_history_list) # 完成 self.progress_var.set(100) self.progress_label.config(text="100%") self.status_label.config(text="● 问题处理完成", foreground="#28a745") # 绿色状态 self.root.after(1000, self.progress_window.destroy) except Exception as e: self.root.after(0, lambda: messagebox.showerror("错误", f"处理问题时出错: {str(e)}")) self.root.after(0, self.progress_window.destroy) self.status_label.config(text="● 处理出错", foreground="#dc3545") # 红色状态 def retrieve_relevant_chunks(self, query, k): """处理嵌入为None的情况""" # 生成查询的嵌入 query_embedding = ollama.embeddings( model=self.embedding_model_var.get(), # 使用用户选择的模型 prompt=query )['embedding'] # 注意:返回的是字典中的'embedding'字段 # 计算相似度 similarities = [] for emb in self.embeddings: # 跳过无效的嵌入 if emb['embedding'] is None: continue # 计算余弦相似度 similarity = np.dot(query_embedding, emb['embedding']) similarities.append({ 'chunk_id': emb['chunk_id'], 'similarity': similarity, 'doc_name': emb['doc_name'] }) # 按相似度排序并返回前k个 top_chunks = sorted(similarities, key=lambda x: x['similarity'], reverse=True)[:k] return [{ **self.chunks[c['chunk_id']], 'similarity': c['similarity'] } for c in top_chunks] def update_history_list(self): # 清空现有列表 for item in self.history_tree.get_children(): self.history_tree.delete(item) # 添加新历史记录 for i, qa in enumerate(reversed(self.qa_history)): # 截断长问题 question = qa["question"] if len(question) > 50: question = question[:47] + "..." self.history_tree.insert("", tk.END, values=(question, qa["time"])) def show_history_detail(self, event): selected_item = self.history_tree.selection() if not selected_item: return item = self.history_tree.item(selected_item) question = item['values'][0] # 查找对应的问答记录 for qa in reversed(self.qa_history): if qa["question"].startswith(question) or question.startswith(qa["question"][:50]): # 显示详情窗口 detail_window = tk.Toplevel(self.root) detail_window.title("问答详情") detail_window.geometry("900x700") detail_window.configure(bg='#f0f0f0') # 淡灰色背景 # 设置窗口居中 x = self.root.winfo_x() + (self.root.winfo_width() - 900) // 2 y = self.root.winfo_y() + (self.root.winfo_height() - 700) // 2 detail_window.geometry(f"+{x}+{y}") # 问题 ttk.Label(detail_window, text="问题:", font=('Arial', 12, 'bold'), foreground="#4dabf5").pack(pady=(15, 5), padx=20, anchor=tk.W) question_frame = ttk.Frame(detail_window) question_frame.pack(fill=tk.X, padx=20, pady=(0, 10)) question_text = scrolledtext.ScrolledText(question_frame, wrap=tk.WORD, height=3, font=('Arial', 11)) question_text.insert(tk.INSERT, qa["question"]) question_text.config(state=tk.DISABLED) question_text.pack(fill=tk.X) # 回答 ttk.Label(detail_window, text="回答:", font=('Arial', 12, 'bold'), foreground="#4dabf5").pack(pady=(15, 5), padx=20, anchor=tk.W) answer_frame = ttk.Frame(detail_window) answer_frame.pack(fill=tk.BOTH, expand=True, padx=20, pady=(0, 10)) answer_text = scrolledtext.ScrolledText(answer_frame, wrap=tk.WORD, font=('Arial', 11)) answer_text.insert(tk.INSERT, qa["answer"]) answer_text.config(state=tk.DISABLED) answer_text.pack(fill=tk.BOTH, expand=True) # 上下文 ttk.Label(detail_window, text="上下文:", font=('Arial', 12, 'bold'), foreground="#4dabf5").pack(pady=(15, 5), padx=20, anchor=tk.W) context_frame = ttk.Frame(detail_window) context_frame.pack(fill=tk.BOTH, expand=True, padx=20, pady=(0, 20)) context_text = scrolledtext.ScrolledText(context_frame, wrap=tk.WORD, font=('Arial', 10)) context_text.insert(tk.INSERT, qa["context"]) context_text.config(state=tk.DISABLED) context_text.pack(fill=tk.BOTH, expand=True) break def create_monitor_tab(self): self.monitor_tab = ttk.Frame(self.notebook) self.notebook.add(self.monitor_tab, text="📊 系统监控") # 标题 title_frame = ttk.Frame(self.monitor_tab) title_frame.pack(fill=tk.X, pady=(10, 20)) ttk.Label(title_frame, text="📊 系统监控", font=('Arial', 14, 'bold'), foreground="#4dabf5").pack(side=tk.LEFT) # 主内容区域 main_frame = ttk.Frame(self.monitor_tab) main_frame.pack(fill=tk.BOTH, expand=True, padx=10, pady=5) # 左侧:资源监控 left_frame = ttk.Frame(main_frame) left_frame.pack(side=tk.LEFT, fill=tk.BOTH, expand=True, padx=5, pady=5) # 资源使用 self.resource_frame = ttk.LabelFrame(left_frame, text="📈 资源使用") self.resource_frame.pack(fill=tk.BOTH, expand=True, padx=5, pady=5) # CPU使用 cpu_frame = ttk.Frame(self.resource_frame) cpu_frame.pack(fill=tk.X, padx=10, pady=10) ttk.Label(cpu_frame, text="CPU使用率:").pack(side=tk.LEFT) self.cpu_value = ttk.Label(cpu_frame, text="0%", width=5) self.cpu_value.pack(side=tk.RIGHT, padx=10) self.cpu_usage = ttk.Progressbar(self.resource_frame, length=400, mode='determinate') self.cpu_usage.pack(fill=tk.X, padx=10, pady=(0, 10)) # 内存使用 mem_frame = ttk.Frame(self.resource_frame) mem_frame.pack(fill=tk.X, padx=10, pady=10) ttk.Label(mem_frame, text="内存使用率:").pack(side=tk.LEFT) self.mem_value = ttk.Label(mem_frame, text="0%", width=5) self.mem_value.pack(side=tk.RIGHT, padx=10) self.mem_usage = ttk.Progressbar(self.resource_frame, length=400, mode='determinate') self.mem_usage.pack(fill=tk.X, padx=10, pady=(0, 10)) # 磁盘使用 disk_frame = ttk.Frame(self.resource_frame) disk_frame.pack(fill=tk.X, padx=10, pady=10) ttk.Label(disk_frame, text="磁盘使用率:").pack(side=tk.LEFT) self.disk_value = ttk.Label(disk_frame, text="0%", width=5) self.disk_value.pack(side=tk.RIGHT, padx=10) self.disk_usage = ttk.Progressbar(self.resource_frame, length=400, mode='determinate') self.disk_usage.pack(fill=tk.X, padx=10, pady=(0, 10)) # 右侧:模型状态 right_frame = ttk.Frame(main_frame) right_frame.pack(side=tk.RIGHT, fill=tk.BOTH, expand=True, padx=5, pady=5) # 模型状态 self.model_frame = ttk.LabelFrame(right_frame, text="🤖 模型状态") self.model_frame.pack(fill=tk.BOTH, expand=True, padx=5, pady=5) btn_frame = ttk.Frame(self.model_frame) btn_frame.pack(fill=tk.X, padx=10, pady=10) ttk.Button(btn_frame, text="🔄 检查模型状态", command=self.check_model_status).pack() self.model_status_text = scrolledtext.ScrolledText(self.model_frame, height=15, state=tk.DISABLED, font=('Consolas', 10)) self.model_status_text.pack(fill=tk.BOTH, expand=True, padx=10, pady=(0, 10)) # 性能统计 self.perf_frame = ttk.LabelFrame(left_frame, text="⚡ 性能统计") self.perf_frame.pack(fill=tk.BOTH, expand=True, padx=5, pady=5) # 创建图表 fig = Figure(figsize=(8, 4), dpi=100) fig.set_facecolor('#f0f0f0') # 淡灰色背景 self.ax = fig.add_subplot(111) self.ax.set_facecolor('#e6f0ff') # 淡蓝色背景 self.ax.set_title("CPU使用率历史", color='#333333') self.ax.set_xlabel("时间", color='#333333') self.ax.set_ylabel("使用率(%)", color='#333333') self.ax.tick_params(axis='x', colors='#333333') self.ax.tick_params(axis='y', colors='#333333') self.ax.spines['bottom'].set_color('#333333') self.ax.spines['left'].set_color('#333333') self.cpu_history = [] self.line, = self.ax.plot([], [], color='#4dabf5', marker='o', markersize=4) # 淡青色线条 self.ax.set_ylim(0, 100) canvas = FigureCanvasTkAgg(fig, master=self.perf_frame) canvas.draw() canvas.get_tk_widget().pack(fill=tk.BOTH, expand=True, padx=10, pady=10) # 开始更新资源使用情况 self.update_resource_usage() def update_resource_usage(self): # 获取真实资源数据 cpu_percent = psutil.cpu_percent() mem_percent = psutil.virtual_memory().percent disk_percent = psutil.disk_usage('/').percent # 更新进度条 self.cpu_usage['value'] = cpu_percent self.mem_usage['value'] = mem_percent self.disk_usage['value'] = disk_percent # 更新数值标签 self.cpu_value.config(text=f"{cpu_percent}%") self.mem_value.config(text=f"{mem_percent}%") self.disk_value.config(text=f"{disk_percent}%") # 更新CPU历史图表 self.cpu_history.append(cpu_percent) if len(self.cpu_history) > 20: self.cpu_history.pop(0) self.line.set_data(range(len(self.cpu_history)), self.cpu_history) self.ax.set_xlim(0, max(10, len(self.cpu_history))) self.ax.figure.canvas.draw() # 5秒后再次更新 self.root.after(5000, self.update_resource_usage) def check_model_status(self): try: self.model_status_text.config(state=tk.NORMAL) self.model_status_text.delete("1.0", tk.END) # 添加加载动画 self.model_status_text.insert(tk.INSERT, "正在检查模型状态...") self.model_status_text.update() # 模拟检查过程 time.sleep(1) # 清空并插入真实信息 self.model_status_text.delete("1.0", tk.END) llm_info = ollama.show(self.llm_model_var.get()) # 使用用户选择的模型 embed_info = ollama.show(self.embedding_model_var.get()) # 使用用户选择的模型 status_text = f"""✅ 大模型信息: 名称: {self.llm_model_var.get()} 参数大小: {llm_info.get('size', '未知')} 最后使用时间: {llm_info.get('modified_at', '未知')} 支持功能: {llm_info.get('capabilities', '未知')} ✅ 嵌入模型信息: 名称: {self.embedding_model_var.get()} 参数大小: {embed_info.get('size', '未知')} 最后使用时间: {embed_info.get('modified_at', '未知')} 支持功能: {embed_info.get('capabilities', '未知')} ⏱️ 最后检查时间: {time.strftime("%Y-%m-%d %H:%M:%S")} """ self.model_status_text.insert(tk.INSERT, status_text) self.model_status_text.config(state=tk.DISABLED) self.status_label.config(text="● 模型状态检查完成", foreground="#28a745") # 绿色状态 except Exception as e: self.model_status_text.config(state=tk.NORMAL) self.model_status_text.delete("1.0", tk.END) self.model_status_text.insert(tk.INSERT, f"❌ 检查模型状态时出错: {str(e)}") self.model_status_text.config(state=tk.DISABLED) self.status_label.config(text="● 模型检查出错", foreground="#dc3545") # 红色状态 def update_param(self, param, value): self.params[param] = value # 更新标签显示 if param == "temperature": self.temp_label.config(text=f"{value:.1f}") elif param == "top_p": self.top_p_label.config(text=f"{value:.2f}") elif param == "chunk_size": self.chunk_label.config(text=f"{value}") # 运行应用程序 if __name__ == "__main__": root = ThemedTk(theme="arc") # 使用现代主题 app = RAGApplication(root) root.mainloop()
最新发布
08-07
根据你的需求,我对代码进行了以下修改: 1. 在文档处理页面添加了访问本地知识库时的请求文档数量显示和干预控件 2. 优化了界面色彩搭配(使用更柔和的淡色调) 3. 在系统监控页面添加了GPU使用情况监控功能 以下是完整的修改后的代码: ```python import tkinter as tk from tkinter import ttk, filedialog, messagebox, scrolledtext import ollama import os import time import threading import numpy as np import matplotlib.pyplot as plt from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg from matplotlib.figure import Figure import pandas as pd import seaborn as sns import PyPDF2 import docx import markdown from bs4 import BeautifulSoup import openpyxl from PIL import Image import pytesseract import io import psutil from ttkthemes import ThemedTk # 设置中文字体 plt.rcParams['font.sans-serif'] = ['SimHei'] plt.rcParams['axes.unicode_minus'] = False class RAGApplication: def __init__(self, root): self.root = root self.root.title("✨智能RAG应用系统✨") self.root.geometry("1600x1000") self.root.configure(bg="#f5f7fa") # 更柔和的背景色 # 使用现代主题 self.style = ttk.Style() self.style.theme_use('arc') # 现代主题 # 自定义样式 - 淡色调 self.style.configure('TFrame', background='#f5f7fa') self.style.configure('TLabel', background='#f5f7fa', foreground='#333333') self.style.configure('TLabelframe', background='#f5f7fa', foreground='#333333', borderwidth=1) self.style.configure('TLabelframe.Label', background='#f5f7fa', foreground='#4dabf5') # 淡蓝色标题 self.style.configure('TButton', background='#4dabf5', foreground='white', borderwidth=1) # 深色文字按钮 self.style.map('TButton', background=[('active', '#3b99e0')]) self.style.configure('TNotebook', background='#f5f7fa', borderwidth=0) self.style.configure('TNotebook.Tab', background='#e6f0ff', foreground='#333333', padding=[10, 5]) # 淡蓝色标签 self.style.map('TNotebook.Tab', background=[('selected', '#4dabf5')]) # 初始化数据 self.documents = [] self.chunks = [] self.embeddings = [] self.qa_history = [] # 获取 Ollama 中已安装的模型列表 try: models_response = ollama.list() self.all_models = [model['model'] for model in models_response['models']] # 使用 'model' 字段 except Exception as e: print(f"获取 Ollama 模型列表失败: {e}") self.all_models = [] self.default_llm_model = "qwen2:7b" self.default_embedding_model = "bge-m3:latest" # 默认参数 self.params = { "temperature": 0.7, "top_p": 0.9, "max_length": 2048, "num_context_docs": 3, "chunk_size": 500, "chunk_overlap": 100, "chunk_strategy": "固定大小", "separators": "\n\n\n。\n!\n?\n", "embed_batch_size": 1, "enable_stream": True, "show_progress": True, "show_visualization": True, "ocr_enabled": True } # 创建界面 self.create_ui() def create_ui(self): # 主框架 self.main_frame = ttk.Frame(self.root) self.main_frame.pack(fill=tk.BOTH, expand=True, padx=20, pady=20) # 标题 title_frame = ttk.Frame(self.main_frame) title_frame.pack(fill=tk.X, pady=(0, 20)) ttk.Label(title_frame, text="✨ 智能RAG应用系统 ✨", font=('Arial', 24, 'bold'), foreground="#4dabf5").pack(side=tk.LEFT) # 淡青色标题 # 状态指示器 status_frame = ttk.Frame(title_frame) status_frame.pack(side=tk.RIGHT) self.status_label = ttk.Label(status_frame, text="● 就绪", foreground="#28a745") # 绿色状态 self.status_label.pack(side=tk.RIGHT, padx=10) # 参数控制面板 self.create_sidebar() # 主内容区域 self.create_main_content() def create_sidebar(self): # 侧边栏框架 self.sidebar = ttk.LabelFrame(self.main_frame, text="⚙️ 参数控制面板", width=300) self.sidebar.pack(side=tk.LEFT, fill=tk.Y, padx=10, pady=10) # 大模型参数 ttk.Label(self.sidebar, text="🔧 大模型参数", font=('Arial', 10, 'bold'), foreground="#333333").pack(pady=(15, 5)) self.temperature = tk.DoubleVar(value=self.params["temperature"]) ttk.Label(self.sidebar, text="温度(temperature)").pack(anchor=tk.W, padx=10) temp_frame = ttk.Frame(self.sidebar) temp_frame.pack(fill=tk.X, padx=10, pady=(0, 5)) ttk.Scale(temp_frame, from_=0.0, to=2.0, variable=self.temperature, length=180, command=lambda v: self.update_param("temperature", float(v))).pack(side=tk.LEFT) self.temp_label = ttk.Label(temp_frame, text=f"{self.temperature.get():.1f}", width=5) self.temp_label.pack(side=tk.RIGHT, padx=5) self.top_p = tk.DoubleVar(value=self.params["top_p"]) ttk.Label(self.sidebar, text="Top P").pack(anchor=tk.W, padx=10) top_p_frame = ttk.Frame(self.sidebar) top_p_frame.pack(fill=tk.X, padx=10, pady=(0, 5)) ttk.Scale(top_p_frame, from_=0.0, to=1.0, variable=self.top_p, length=180, command=lambda v: self.update_param("top_p", float(v))).pack(side=tk.LEFT) self.top_p_label = ttk.Label(top_p_frame, text=f"{self.top_p.get():.2f}", width=5) self.top_p_label.pack(side=tk.RIGHT, padx=5) # 添加大模型名称选择(来自 Ollama) ttk.Label(self.sidebar, text="大模型名称").pack(anchor=tk.W, padx=10) self.llm_model_var = tk.StringVar(value=self.default_llm_model) llm_combobox = ttk.Combobox(self.sidebar, textvariable=self.llm_model_var, values=self.all_models) llm_combobox.pack(padx=10, pady=5) # 嵌入模型选择(来自 Ollama) ttk.Label(self.sidebar, text="嵌入模型名称").pack(anchor=tk.W, padx=10) self.embedding_model_var = tk.StringVar(value=self.default_embedding_model) embed_combobox = ttk.Combobox(self.sidebar, textvariable=self.embedding_model_var, values=self.all_models) embed_combobox.pack(padx=10, pady=5) # RAG参数 ttk.Label(self.sidebar, text="🔧 RAG参数", font=('Arial', 10, 'bold'), foreground="#333333").pack(pady=(15, 5)) self.chunk_size = tk.IntVar(value=self.params["chunk_size"]) ttk.Label(self.sidebar, text="分块大小(字符)").pack(anchor=tk.W, padx=10) chunk_frame = ttk.Frame(self.sidebar) chunk_frame.pack(fill=tk.X, padx=10, pady=(0, 5)) ttk.Scale(chunk_frame, from_=100, to=2000, variable=self.chunk_size, length=180, command=lambda v: self.update_param("chunk_size", int(v))).pack(side=tk.LEFT) self.chunk_label = ttk.Label(chunk_frame, text=f"{self.chunk_size.get()}", width=5) self.chunk_label.pack(side=tk.RIGHT, padx=5) # OCR开关 self.ocr_var = tk.BooleanVar(value=self.params["ocr_enabled"]) ttk.Checkbutton(self.sidebar, text="启用OCR扫描", variable=self.ocr_var, command=lambda: self.update_param("ocr_enabled", self.ocr_var.get())).pack(pady=(15, 5), padx=10, anchor=tk.W) # 使用说明 ttk.Label(self.sidebar, text="📖 使用说明", font=('Arial', 10, 'bold'), foreground="#333333").pack(pady=(15, 5)) instructions = """1. 在"文档上传"页上传您的文档 2. 在"文档处理"页对文档进行分块和嵌入 3. 在"问答交互"页提问并获取答案 4. 在"系统监控"页查看系统状态""" ttk.Label(self.sidebar, text=instructions, justify=tk.LEFT, background="#e6f0ff", # 淡蓝色背景 foreground="#333333", padding=10).pack(fill=tk.X, padx=10, pady=5) def create_main_content(self): # 主内容框架 self.content_frame = ttk.Frame(self.main_frame) self.content_frame.pack(side=tk.RIGHT, fill=tk.BOTH, expand=True) # 创建选项卡 self.notebook = ttk.Notebook(self.content_frame) self.notebook.pack(fill=tk.BOTH, expand=True) # 文档上传页 self.create_upload_tab() # 文档处理页 self.create_process_tab() # 问答交互页 self.create_qa_tab() # 系统监控页 self.create_monitor_tab() def create_upload_tab(self): self.upload_tab = ttk.Frame(self.notebook) self.notebook.add(self.upload_tab, text="📤 文档上传") # 标题 title_frame = ttk.Frame(self.upload_tab) title_frame.pack(fill=tk.X, pady=(10, 20)) ttk.Label(title_frame, text="📤 文档上传与管理", font=('Arial', 14, 'bold'), foreground="#4dabf5").pack(side=tk.LEFT) # 淡青色标题 # 上传区域 upload_frame = ttk.Frame(self.upload_tab) upload_frame.pack(fill=tk.X, pady=10) # 上传按钮 upload_btn = ttk.Button(upload_frame, text="📁 上传文档", command=self.upload_files, style='Accent.TButton') upload_btn.pack(side=tk.LEFT, padx=10) # 清除按钮 clear_btn = ttk.Button(upload_frame, text="🗑️ 清除所有", command=self.clear_documents) clear_btn.pack(side=tk.RIGHT, padx=10) # 文档列表 self.doc_list_frame = ttk.LabelFrame(self.upload_tab, text="📋 已上传文档") self.doc_list_frame.pack(fill=tk.BOTH, expand=True, padx=10, pady=10) # 创建带滚动条的树状视图 tree_frame = ttk.Frame(self.doc_list_frame) tree_frame.pack(fill=tk.BOTH, expand=True, padx=5, pady=5) # 创建滚动条 tree_scroll = ttk.Scrollbar(tree_frame) tree_scroll.pack(side=tk.RIGHT, fill=tk.Y) # 创建树状视图 columns = ("name", "size", "time", "type") self.doc_tree = ttk.Treeview(tree_frame, columns=columns, show="headings", yscrollcommand=tree_scroll.set, height=8) # 设置列标题 self.doc_tree.heading("name", text="文件名") self.doc_tree.heading("size", text="大小") self.doc_tree.heading("time", text="上传时间") self.doc_tree.heading("type", type="类型") # 设置列宽 self.doc_tree.column("name", width=250) self.doc_tree.column("size", width=80) self.doc_tree.column("time", width=150) self.doc_tree.column("type", width=80) self.doc_tree.pack(side=tk.LEFT, fill=tk.BOTH, expand=True) tree_scroll.config(command=self.doc_tree.yview) # 文档统计 self.doc_stats_frame = ttk.Frame(self.upload_tab) self.doc_stats_frame.pack(fill=tk.X, pady=10, padx=10) stats_style = ttk.Style() stats_style.configure('Stats.TLabel', background='#e6f0ff', foreground='#333333', padding=5) # 淡蓝色背景 ttk.Label(self.doc_stats_frame, text="📊 文档统计:", style='Stats.TLabel').pack(side=tk.LEFT, padx=5) self.doc_count_label = ttk.Label(self.doc_stats_frame, text="0", style='Stats.TLabel') self.doc_count_label.pack(side=tk.LEFT, padx=5) ttk.Label(self.doc_stats_frame, text="总字符数:", style='Stats.TLabel').pack(side=tk.LEFT, padx=5) self.char_count_label = ttk.Label(self.doc_stats_frame, text="0", style='Stats.TLabel') self.char_count_label.pack(side=tk.LEFT, padx=5) ttk.Label(self.doc_stats_frame, text="总页数:", style='Stats.TLabel').pack(side=tk.LEFT, padx=5) self.page_count_label = ttk.Label(self.doc_stats_frame, text="0", style='Stats.TLabel') self.page_count_label.pack(side=tk.LEFT, padx=5) def clear_documents(self): if not self.documents: return if messagebox.askyesno("确认", "确定要清除所有文档吗?"): self.documents = [] self.update_doc_list() # ================== 文件读取函数 ================== def read_pdf(self, filepath): """读取PDF文件内容,支持扫描版OCR""" content = "" pages = 0 try: with open(filepath, 'rb') as f: reader = PyPDF2.PdfReader(f) num_pages = len(reader.pages) pages = num_pages for page_num in range(num_pages): page = reader.pages[page_num] text = page.extract_text() # 如果是扫描版PDF,使用OCR识别 if not text.strip() and self.params["ocr_enabled"]: try: # 获取页面图像 images = page.images if images: for img in images: image_data = img.data image = Image.open(io.BytesIO(image_data)) text += pytesseract.image_to_string(image, lang='chi_sim+eng') except Exception as e: print(f"OCR处理失败: {str(e)}") content += text + "\n" except Exception as e: print(f"读取PDF失败: {str(e)}") return content, pages def read_docx(self, filepath): """读取Word文档内容""" content = "" pages = 0 try: doc = docx.Document(filepath) for para in doc.paragraphs: content += para.text + "\n" pages = len(doc.paragraphs) // 50 + 1 # 估算页数 except Exception as e: print(f"读取Word文档失败: {str(e)}") return content, pages def read_excel(self, filepath): """读取Excel文件内容,优化内存使用""" content = "" pages = 0 try: # 使用openpyxl优化大文件读取 wb = openpyxl.load_workbook(filepath, read_only=True) for sheet_name in wb.sheetnames: content += f"\n工作表: {sheet_name}\n" sheet = wb[sheet_name] for row in sheet.iter_rows(values_only=True): row_content = " | ".join([str(cell) if cell is not None else "" for cell in row]) content += row_content + "\n" pages = len(wb.sheetnames) except Exception as e: print(f"读取Excel文件失败: {str(e)}") return content, pages def read_md(self, filepath): """读取Markdown文件内容""" content = "" pages = 0 try: with open(filepath, 'r', encoding='utf-8') as f: html = markdown.markdown(f.read()) soup = BeautifulSoup(html, 'html.parser') content = soup.get_text() pages = len(content) // 2000 + 1 # 估算页数 except Exception as e: print(f"读取Markdown文件失败: {str(e)}") return content, pages def read_ppt(self, filepath): """读取PPT文件内容(简化版)""" content = "" pages = 0 try: # 实际应用中应使用python-pptx库 # 这里仅作演示 content = f"PPT文件内容提取: {os.path.basename(filepath)}" pages = 10 # 假设有10页 except Exception as e: print(f"读取PPT文件失败: {str(e)}") return content, pages def upload_files(self): filetypes = [ ("文本文件", "*.txt"), ("PDF文件", "*.pdf"), ("Word文件", "*.docx *.doc"), ("Excel文件", "*.xlsx *.xls"), ("Markdown文件", "*.md"), ("PPT文件", "*.pptx *.ppt"), ("所有文件", "*.*") ] filenames = filedialog.askopenfilenames(title="选择文档", filetypes=filetypes) if filenames: self.status_label.config(text="● 正在上传文档...", foreground="#ffc107") # 黄色状态 total_pages = 0 for filename in filenames: try: ext = os.path.splitext(filename)[1].lower() if ext == '.txt': with open(filename, 'r', encoding='utf-8') as f: content = f.read() pages = len(content) // 2000 + 1 elif ext == '.pdf': content, pages = self.read_pdf(filename) elif ext in ('.docx', '.doc'): content, pages = self.read_docx(filename) elif ext in ('.xlsx', '.xls'): content, pages = self.read_excel(filename) elif ext == '.md': content, pages = self.read_md(filename) elif ext in ('.pptx', '.ppt'): content, pages = self.read_ppt(filename) else: messagebox.showwarning("警告", f"不支持的文件类型: {ext}") continue # 处理字符编码问题 if not isinstance(content, str): try: content = content.decode('utf-8') except: content = content.decode('latin-1', errors='ignore') self.documents.append({ "name": os.path.basename(filename), "content": content, "size": len(content), "upload_time": time.strftime("%Y-%m-%d %H:%M:%S"), "type": ext.upper().replace(".", ""), "pages": pages }) total_pages += pages # 更新文档列表 self.update_doc_list() except Exception as e: messagebox.showerror("错误", f"无法读取文件 {filename}: {str(e)}") self.status_label.config(text=f"● 上传完成! 共{len(filenames)}个文档", foreground="#28a745") # 绿色状态 self.page_count_label.config(text=str(total_pages)) def update_doc_list(self): # 清空现有列表 for item in self.doc_tree.get_children(): self.doc_tree.delete(item) # 添加新文档 for doc in self.documents: size_kb = doc["size"] / 1024 size_str = f"{size_kb:.1f} KB" if size_kb < 1024 else f"{size_kb / 1024:.1f} MB" self.doc_tree.insert("", tk.END, values=( doc["name"], size_str, doc["upload_time"], doc["type"] )) # 更新统计信息 self.doc_count_label.config(text=str(len(self.documents))) self.char_count_label.config(text=str(sum(d['size'] for d in self.documents))) def create_process_tab(self): self.process_tab = ttk.Frame(self.notebook) self.notebook.add(self.process_tab, text="🔧 文档处理") # 标题 title_frame = ttk.Frame(self.process_tab) title_frame.pack(fill=tk.X, pady=(10, 20)) ttk.Label(title_frame, text="🔧 文档处理与分块", font=('Arial', 14, 'bold'), foreground="#4dabf5").pack(side=tk.LEFT) # 淡青色标题 # 处理按钮 btn_frame = ttk.Frame(self.process_tab) btn_frame.pack(fill=tk.X, pady=10) process_btn = ttk.Button(btn_frame, text="🔄 处理文档", command=self.process_documents, style='Accent.TButton') process_btn.pack(side=tk.LEFT, padx=10) visualize_btn = ttk.Button(btn_frame, text="📊 更新可视化", command=self.show_visualizations) visualize_btn.pack(side=tk.LEFT, padx=10) # 主内容区域 content_frame = ttk.Frame(self.process_tab) content_frame.pack(fill=tk.BOTH, expand=True) # 左侧:可视化区域 self.visual_frame = ttk.LabelFrame(content_frame, text="📈 文档分析") self.visual_frame.pack(side=tk.LEFT, fill=tk.BOTH, expand=True, padx=10, pady=10) # 右侧:分块列表 self.chunk_frame = ttk.LabelFrame(content_frame, text="📋 分块结果") self.chunk_frame.pack(side=tk.RIGHT, fill=tk.BOTH, expand=True, padx=10, pady=10) # 创建带滚动条的树状视图 tree_frame = ttk.Frame(self.chunk_frame) tree_frame.pack(fill=tk.BOTH, expand=True, padx=5, pady=5) # 创建滚动条 tree_scroll = ttk.Scrollbar(tree_frame) tree_scroll.pack(side=tk.RIGHT, fill=tk.Y) # 创建树状视图 columns = ("doc_name", "start", "end", "content") self.chunk_tree = ttk.Treeview( tree_frame, columns=columns, show="headings", yscrollcommand=tree_scroll.set, height=15 ) # 设置列标题 self.chunk_tree.heading("doc_name", text="来源文档") self.chunk_tree.heading("start", text="起始位置") self.chunk_tree.heading("end", text="结束位置") self.chunk_tree.heading("content", text="内容预览") # 设置列宽 self.chunk_tree.column("doc_name", width=150) self.chunk_tree.column("start", width=80) self.chunk_tree.column("end", width=80) self.chunk_tree.column("content", width=300) self.chunk_tree.pack(side=tk.LEFT, fill=tk.BOTH, expand=True) tree_scroll.config(command=self.chunk_tree.yview) # 初始显示占位图 self.show_placeholder() def show_placeholder(self): """显示可视化占位图""" for widget in self.visual_frame.winfo_children(): widget.destroy() placeholder = ttk.Label(self.visual_frame, text="文档处理后将显示分析图表", font=('Arial', 12), foreground="#7f8c8d") placeholder.pack(expand=True, pady=50) def process_documents(self): if not self.documents: messagebox.showwarning("警告", "请先上传文档") return # 在新线程中处理文档 threading.Thread(target=self._process_documents_thread, daemon=True).start() def _process_documents_thread(self): # 显示进度条 self.progress_window = tk.Toplevel(self.root) self.progress_window.title("处理进度") self.progress_window.geometry("400x150") self.progress_window.resizable(False, False) self.progress_window.transient(self.root) self.progress_window.grab_set() self.progress_window.configure(bg="#f5f7fa") # 淡灰色背景 # 设置窗口居中 x = self.root.winfo_x() + (self.root.winfo_width() - 400) // 2 y = self.root.winfo_y() + (self.root.winfo_height() - 150) // 2 self.progress_window.geometry(f"+{x}+{y}") # 进度窗口内容 ttk.Label(self.progress_window, text="正在处理文档...", font=('Arial', 11)).pack(pady=(20, 10)) progress_frame = ttk.Frame(self.progress_window) progress_frame.pack(fill=tk.X, padx=20, pady=10) self.progress_var = tk.DoubleVar() progress_bar = ttk.Progressbar(progress_frame, variable=self.progress_var, maximum=100, length=360) progress_bar.pack() self.progress_label = ttk.Label(progress_frame, text="0%") self.progress_label.pack(pady=5) self.status_label.config(text="● 正在处理文档...", foreground="#ffc107") # 黄色状态 self.progress_window.update() try: # 分块处理 self.chunks = self.chunk_documents( self.documents, self.params["chunk_strategy"], self.params["chunk_size"], self.params["chunk_overlap"] ) # 生成嵌入 self.embeddings = self.generate_embeddings(self.chunks) # 更新UI self.root.after(0, self.update_chunk_list) self.root.after(0, self.show_visualizations) self.root.after(0, lambda: messagebox.showinfo("完成", "文档处理完成!")) self.status_label.config(text="● 文档处理完成", foreground="#28a745") # 绿色状态 except Exception as e: self.root.after(0, lambda: messagebox.showerror("错误", f"处理文档时出错: {str(e)}")) self.status_label.config(text="● 处理出错", foreground="#dc3545") # 红色状态 finally: self.root.after(0, self.progress_window.destroy) def chunk_documents(self, documents, strategy, size, overlap): chunks = [] total_docs = len(documents) for doc_idx, doc in enumerate(documents): content = doc['content'] if strategy == "固定大小": for i in range(0, len(content), size - overlap): chunk = content[i:i + size] chunks.append({ "doc_name": doc['name'], "content": chunk, "start": i, "end": min(i + size, len(content)) }) # 更新进度 progress = (doc_idx + 1) / total_docs * 100 self.progress_var.set(progress) self.progress_label.config(text=f"{int(progress)}%") self.progress_window.update() return chunks def generate_embeddings(self, chunks): """单批次处理每个分块,避免API参数类型错误""" embeddings = [] total_chunks = len(chunks) for idx, chunk in enumerate(chunks): try: # 传递单个字符串而不是列表 response = ollama.embeddings( model=self.embedding_model_var.get(), prompt=chunk['content'] # 单个字符串 ) embeddings.append({ "chunk_id": idx, "embedding": response['embedding'], "doc_name": chunk['doc_name'] }) except Exception as e: print(f"生成嵌入时出错: {str(e)}") # 添加空嵌入占位符 embeddings.append({ "chunk_id": idx, "embedding": None, "doc_name": chunk['doc_name'] }) # 更新进度 progress = (idx + 1) / total_chunks * 100 self.progress_var.set(progress) self.progress_label.config(text=f"{int(progress)}%") self.progress_window.update() # 添加延迟避免请求过快 time.sleep(0.1) return embeddings def update_chunk_list(self): # 清空现有列表 for item in self.chunk_tree.get_children(): self.chunk_tree.delete(item) # 添加新分块 for chunk in self.chunks: preview = chunk['content'][:50] + "..." if len(chunk['content']) > 50 else chunk['content'] self.chunk_tree.insert("", tk.END, values=( chunk['doc_name'], chunk['start'], chunk['end'], preview )) def show_visualizations(self): # 清空可视化区域 for widget in self.visual_frame.winfo_children(): widget.destroy() if not self.params["show_visualization"] or not self.chunks: self.show_placeholder() return # 创建图表框架 fig = plt.Figure(figsize=(10, 8), dpi=100) fig.set_facecolor('#f5f7fa') # 淡灰色背景 # 分块大小分布 ax1 = fig.add_subplot(221) ax1.set_facecolor('#e6f0ff') # 淡蓝色背景 chunk_sizes = [len(c['content']) for c in self.chunks] sns.histplot(chunk_sizes, bins=20, ax=ax1, color='#4dabf5') # 淡青色 ax1.set_title("分块大小分布", color='#333333') ax1.set_xlabel("字符数", color='#333333') ax1.set_ylabel("数量", color='#333333') ax1.tick_params(axis='x', colors='#333333') ax1.tick_params(axis='y', colors='#333333') ax1.spines['bottom'].set_color('#333333') ax1.spines['left'].set_color('#333333') # 文档分块数量 ax2 = fig.add_subplot(222) ax2.set_facecolor('#e6f0ff') # 淡蓝色背景 doc_chunk_counts = {} for chunk in self.chunks: doc_chunk_counts[chunk['doc_name']] = doc_chunk_counts.get(chunk['doc_name'], 0) + 1 # 只显示前10个文档 doc_names = list(doc_chunk_counts.keys()) counts = list(doc_chunk_counts.values()) if len(doc_names) > 10: # 按分块数量排序,取前10 sorted_indices = np.argsort(counts)[::-1][:10] doc_names = [doc_names[i] for i in sorted_indices] counts = [counts[i] for i in sorted_indices] sns.barplot(x=counts, y=doc_names, hue=doc_names, ax=ax2, palette='Blues', orient='h', legend=False) ax2.set_title("各文档分块数量", color='#333333') ax2.set_xlabel("分块数", color='#333333') ax2.set_ylabel("") ax2.tick_params(axis='x', colors='#333333') ax2.tick_params(axis='y', colors='#333333') ax2.spines['bottom'].set_color('#333333') ax2.spines['left'].set_color('#333333') # 内容词云(模拟) ax3 = fig.add_subplot(223) ax3.set_facecolor('#e6f0ff') # 淡蓝色背景 ax3.set_title("内容关键词分布", color='#333333') ax3.text(0.5, 0.5, "关键词可视化区域", horizontalalignment='center', verticalalignment='center', color='#333333', fontsize=12) ax3.axis('off') # 处理进度 ax4 = fig.add_subplot(224) ax4.set_facecolor('#e6f0ff') # 淡蓝色背景 ax4.set_title("处理进度", color='#333333') # 模拟数据 stages = ['上传', '分块', '嵌入', '完成'] progress = [100, 100, 100, 100] # 假设都已完成 ax4.barh(stages, progress, color=['#4dabf5', '#20c997', '#9b59b6', '#ffc107']) # 淡色系 ax4.set_xlim(0, 100) ax4.set_xlabel("完成百分比", color='#333333') ax4.tick_params(axis='x', colors='#333333') ax4.tick_params(axis='y', colors='#333333') ax4.spines['bottom'].set_color('#333333') ax4.spines['left'].set_color('#333333') # 调整布局 fig.tight_layout(rect=[0, 0, 1, 0.95], pad=3.0) # 添加总标题 fig.suptitle("文档分析概览", fontsize=16, color='#333333') # 在Tkinter中显示图表 canvas = FigureCanvasTkAgg(fig, master=self.visual_frame) canvas.draw() canvas.get_tk_widget().pack(fill=tk.BOTH, expand=True) def create_qa_tab(self): self.qa_tab = ttk.Frame(self.notebook) self.notebook.add(self.qa_tab, text="💬 问答交互") # 标题 title_frame = ttk.Frame(self.qa_tab) title_frame.pack(fill=tk.X, pady=(10, 20)) ttk.Label(title_frame, text="💬 问答交互", font=('Arial', 14, 'bold'), foreground="#4dabf5").pack(side=tk.LEFT) # 淡青色标题 # 主内容区域 main_frame = ttk.Frame(self.qa_tab) main_frame.pack(fill=tk.BOTH, expand=True, padx=10, pady=5) # 左侧:问答区域 left_frame = ttk.Frame(main_frame) left_frame.pack(side=tk.LEFT, fill=tk.BOTH, expand=True, padx=5, pady=5) # 问题输入 self.question_frame = ttk.LabelFrame(left_frame, text="❓ 输入问题") self.question_frame.pack(fill=tk.X, padx=5, pady=5) self.question_text = scrolledtext.ScrolledText(self.question_frame, height=8, wrap=tk.WORD, font=('Arial', 11)) self.question_text.pack(fill=tk.BOTH, expand=True, padx=5, pady=5) self.question_text.focus_set() # 提交按钮 btn_frame = ttk.Frame(left_frame) btn_frame.pack(fill=tk.X, pady=10) submit_btn = ttk.Button(btn_frame, text="🚀 提交问题", command=self.submit_question, style='Accent.TButton') submit_btn.pack(side=tk.LEFT, padx=5) clear_btn = ttk.Button(btn_frame, text="🗑️ 清除问题", command=self.clear_question) clear_btn.pack(side=tk.LEFT, padx=5) # 回答显示 self.answer_frame = ttk.LabelFrame(left_frame, text="💡 回答") self.answer_frame.pack(fill=tk.BOTH, expand=True, padx=5, pady=5) self.answer_text = scrolledtext.ScrolledText(self.answer_frame, state=tk.DISABLED, wrap=tk.WORD, font=('Arial', 11)) self.answer_text.pack(fill=tk.BOTH, expand=True, padx=5, pady=5) # 右侧:问答历史 right_frame = ttk.Frame(main_frame, width=400) # 设置宽度 right_frame.pack(side=tk.RIGHT, fill=tk.BOTH, expand=False, padx=5, pady=5) # expand=False self.history_frame = ttk.LabelFrame(right_frame, text="🕒 问答历史") self.history_frame.pack(fill=tk.BOTH, expand=True, padx=5, pady=5) # 创建带滚动条的树状视图 tree_frame = ttk.Frame(self.history_frame) tree_frame.pack(fill=tk.BOTH, expand=True, padx=5, pady=5) # 创建滚动条 tree_scroll = ttk.Scrollbar(tree_frame) tree_scroll.pack(side=tk.RIGHT, fill=tk.Y) # 创建树状视图 columns = ("question", "time") self.history_tree = ttk.Treeview( tree_frame, columns=columns, show="headings", yscrollcommand=tree_scroll.set, height=20 ) # 设置列标题 self.history_tree.heading("question", text="问题") self.history_tree.heading("time", text="时间") # 设置列宽 self.history_tree.column("question", width=250) self.history_tree.column("time", width=120) self.history_tree.pack(side=tk.LEFT, fill=tk.BOTH, expand=True) tree_scroll.config(command=self.history_tree.yview) # 历史操作按钮 history_btn_frame = ttk.Frame(right_frame) history_btn_frame.pack(fill=tk.X, pady=10) view_btn = ttk.Button(history_btn_frame, text="👁️ 查看详情", command=lambda: self.show_history_detail(None)) view_btn.pack(side=tk.LEFT, padx=5, fill=tk.X, expand=True) clear_history_btn = ttk.Button(history_btn_frame, text="🗑️ 清除历史", command=self.clear_history) clear_history_btn.pack(side=tk.LEFT, padx=5, fill=tk.X, expand=True) # 绑定双击事件查看历史详情 self.history_tree.bind("<Double-1>", self.show_history_detail) def clear_question(self): self.question_text.delete("1.0", tk.END) def clear_history(self): if not self.qa_history: return if messagebox.askyesno("确认", "确定要清除所有问答历史吗?"): self.qa_history = [] self.update_history_list() def submit_question(self): question = self.question_text.get("1.0", tk.END).strip() if not question: messagebox.showwarning("警告", "问题不能为空") return # 在新线程中处理问题 threading.Thread(target=self._submit_question_thread, args=(question,), daemon=True).start() def _submit_question_thread(self, question): try: # 显示进度窗口 self.progress_window = tk.Toplevel(self.root) self.progress_window.title("处理中...") self.progress_window.geometry("400x150") self.progress_window.resizable(False, False) self.progress_window.transient(self.root) self.progress_window.grab_set() self.progress_window.configure(bg="#f5f7fa") # 淡灰色背景 # 设置窗口居中 x = self.root.winfo_x() + (self.root.winfo_width() - 400) // 2 y = self.root.winfo_y() + (self.root.winfo_height() - 150) // 2 self.progress_window.geometry(f"+{x}+{y}") # 进度窗口内容 ttk.Label(self.progress_window, text="正在思考中...", font=('Arial', 11)).pack(pady=(20, 10)) progress_frame = ttk.Frame(self.progress_window) progress_frame.pack(fill=tk.X, padx=20, pady=10) self.progress_var = tk.DoubleVar() progress_bar = ttk.Progressbar(progress_frame, variable=self.progress_var, maximum=100, length=360) progress_bar.pack() self.progress_label = ttk.Label(progress_frame, text="0%") self.progress_label.pack(pady=5) self.status_label.config(text="● 正在处理问题...", foreground="#ffc107") # 黄色状态 self.progress_window.update() # 检索相关文档块 relevant_chunks = self.retrieve_relevant_chunks(question, self.params["num_context_docs"]) # 显示检索到的文档数量 docs_count = len(relevant_chunks) self.status_label.config(text=f"● 已找到 {docs_count} 个相关文档", foreground="#28a745") # 绿色状态 self.progress_window.update() # 构建上下文 context = "\n\n".join([ f"文档: {c['doc_name']}\n内容: {c['content']}\n相关性: {c['similarity']:.4f}" for c in relevant_chunks ]) # 调用大模型生成回答 prompt = f"""基于以下上下文,回答问题。如果答案不在上下文中,请回答"我不知道"。 上下文: {context} 问题: {question} 回答:""" # 更新进度 self.progress_var.set(50) self.progress_label.config(text="50%") self.progress_window.update() # 流式输出或一次性输出 self.root.after(0, self.answer_text.config, {'state': tk.NORMAL}) self.root.after(0, self.answer_text.delete,
评论 4
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值