134 A Deep Dive into the Text Splitting Utilities: utils.py (llama_index.core.node_parser.text.utils)

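In natural language processing (NLP), text splitting is a basic but critical step: it breaks long text into smaller units so that they can be processed and analyzed further. In this article we take a close look at utils.py, a module of text splitting helpers that offers several approaches, including splitting by character, by sentence, and by a specific separator. These approaches are useful for different kinds of text because they let us split along the text's own structure and keep related context together.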

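Prerequisites

Before diving into utils.py, we need to be familiar with the following concepts:

  1. Text splitting: breaking long text into smaller units such as sentences, paragraphs, or words.
  2. Regular expression: a pattern used to match combinations of characters in a string.
  3. NLTK (Natural Language Toolkit): a Python library for natural language processing.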

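Implementation of utils.py

utils.py provides several text splitting methods. The following is a detailed walkthrough of its implementation:

Importing the required modules

First, we import the modules and names the file depends on: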

import logging
from typing import Callable, List

from llama_index.core.node_parser.interface import TextSplitter

Defining the logger

Next, we define a module-level logger for recording runtime information:

logger = logging.getLogger(__name__)
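
As a small illustration of how a logger obtained this way can be used and inspected, the sketch below attaches a handler that stores messages in a list. The logger name text_utils_demo and the ListHandler class are hypothetical and exist only for this example.

```python
import logging
from typing import List


class ListHandler(logging.Handler):
    """Hypothetical handler that keeps formatted log messages in memory."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(self.format(record))


# Hypothetical logger name; utils.py itself uses __name__.
demo_logger = logging.getLogger("text_utils_demo")
demo_logger.setLevel(logging.DEBUG)

handler = ListHandler()
handler.setFormatter(logging.Formatter("%(name)s - %(levelname)s - %(message)s"))
demo_logger.addHandler(handler)

# Lazy %-style arguments are only formatted if the message is actually emitted.
demo_logger.debug("split text into %d chunks", 3)
print(handler.messages)  # ['text_utils_demo - DEBUG - split text into 3 chunks']
```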

Defining the text splitting functions

Now we define the text splitting functions themselves:

def truncate_text(text: str, text_splitter: TextSplitter) -> str:
    """Truncate text to fit within the chunk size."""
    chunks = text_splitter.split_text(text)
    return chunks[0]
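
The following sketch shows how truncate_text behaves. To keep it runnable without installing llama_index, it uses FixedWidthSplitter, a hypothetical stand-in that only mimics the split_text method of a TextSplitter; it is not part of the library.

```python
from typing import List


class FixedWidthSplitter:
    """Hypothetical stand-in for a TextSplitter: cuts text into fixed-size chunks."""

    def __init__(self, chunk_size: int) -> None:
        self.chunk_size = chunk_size

    def split_text(self, text: str) -> List[str]:
        step = self.chunk_size
        return [text[i:i + step] for i in range(0, len(text), step)]


def truncate_text(text: str, text_splitter: FixedWidthSplitter) -> str:
    """Same logic as the function above: keep only the first chunk."""
    chunks = text_splitter.split_text(text)
    return chunks[0]


truncated = truncate_text("abcdefghij", FixedWidthSplitter(chunk_size=4))
print(truncated)  # abcd
```

With this stand-in, an empty input produces no chunks, so chunks[0] would raise an IndexError. Whether a real TextSplitter returns an empty list for empty input is not stated in the source, so it may be worth guarding against that case before calling truncate_text.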

def split_text_keep_separator(text: str, separator: str) -> List[str]:
    ...  # the function body is truncated in the original listing
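
The listing breaks off after this signature, so the actual body is not available here. The sketch below is one plausible way to split on a separator while keeping it, written under the assumption that the separator is re-attached to the start of each piece after the first. The name split_keep_separator_sketch is hypothetical, and this should not be read as the library's implementation.

```python
from typing import List


def split_keep_separator_sketch(text: str, separator: str) -> List[str]:
    """Split on `separator`, keeping it at the start of every piece after the first."""
    parts = text.split(separator)
    # Assumption: the separator is prepended to each later piece.
    pieces = [part if i == 0 else separator + part for i, part in enumerate(parts)]
    # Drop empty strings, which appear when the text starts with the separator.
    return [piece for piece in pieces if piece]


pieces = split_keep_separator_sketch("a\n\nb\n\nc", "\n\n")
print(pieces)  # ['a', '\n\nb', '\n\nc']
```

Because nothing is discarded except empty strings, joining the pieces reproduces the original text, which is the main reason to keep the separator when chunks are later recombined.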