LangChain: chunking a txt document by character count or by specified separators

木下瞳

Last modified 2024-04-24 22:53:59

License: CC 4.0 BY-SA

Category: langchain. Tags: langchain

First published 2024-04-13 20:31:34

Article link: https://blog.youkuaiyun.com/zjkpy_5/article/details/137724235

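This article shows how to use RecursiveCharacterTextSplitter to chunk text by character count and by specified separators, and how to use CharacterTextSplitter to chunk on a single separator, with particular attention to handling newlines and adjusting chunk_size for different cases.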

RecursiveCharacterTextSplitter: chunking by character count

CharacterTextSplitter: chunking on a single separator

The txt file has multiple lines; my data has 67 lines. A sample:

field1\tvalue1\n

field2\tvalue2\n

...

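RecursiveCharacterTextSplitter: chunking by character count

Split into chunks of up to 100 characters, allowing up to 20 characters of overlap between adjacent chunks. The result is 135 chunks.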

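from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

filepath = 'data/专业描述.txt'

# Option 1: read the file into a plain string
with open(filepath, encoding='utf8') as f:
    state_of_the_union = f.read()

# Option 2: load the file as a list of Document objects
raw_documents = TextLoader(filepath, encoding='utf8').load()

# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,  # target maximum size of each chunk
    chunk_overlap=20,  # number of characters adjacent chunks may share
    length_function=len,  # chunk size is measured in characters
    is_separator_regex=False,
)
# create_documents takes a list of strings (option 1)
texts = text_splitter.create_documents([state_of_the_union])
# split_documents takes a list of Documents (option 2)
# texts = text_splitter.split_documents(raw_documents)
print(texts)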
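
To make chunk_size and chunk_overlap more concrete, here is a minimal pure-Python sketch of a sliding window with overlap. It is only an illustration under simplifying assumptions: the function name `sliding_window_chunks` and the 250-character sample are hypothetical, and this is not LangChain's actual algorithm, which splits on separators first and then merges the pieces.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Hypothetical 250-character sample text (not the author's data)
sample = "".join(str(i % 10) for i in range(250))
chunks = sliding_window_chunks(sample, chunk_size=100, chunk_overlap=20)
print([len(c) for c in chunks])           # [100, 100, 90]
print(chunks[0][-20:] == chunks[1][:20])  # True: 20 shared characters
```

With a window of 100 and an overlap of 20, each window starts 80 characters after the previous one, so the last 20 characters of one chunk reappear at the start of the next.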

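This splitter can also split on specified characters: it tries the separators in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]; pass the separators parameter to use your own.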

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

filepath = 'data/专业描述.txt'

# Option 1: read the file into a plain string
with open(filepath, encoding='utf8') as f:
    state_of_the_union = f.read()

# Option 2: load the file as a list of Document objects
raw_documents = TextLoader(filepath, encoding='utf8').load()

# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,  # target maximum size of each chunk
    chunk_overlap=20,  # number of characters adjacent chunks may share
    length_function=len,
    is_separator_regex=False,
    # Separators to try, in order. If omitted, the default list
    # ["\n\n", "\n", " ", ""] is used.
    separators=['\n', '\n\n'],
)
# create_documents takes a list of strings (option 1)
texts = text_splitter.create_documents([state_of_the_union])
# split_documents takes a list of Documents (option 2)
# texts = text_splitter.split_documents(raw_documents)
print(texts)

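Here I split on the newline character \n, and the result has 67 chunks, matching the 67 lines in my data.

Why is there also a \n\n in the list? Suppose some line in my data ends with two newlines, \n\n; specifying only \n could then cause problems, so the list acts as a set of alternatives: the text is split on \n or \n\n. Keep in mind that the separators are tried in order, which is why the default list puts \n\n before \n.

To add more separators, just append them to the list. The advantage is that multiple separators can be specified!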
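
The "try the separators in order" idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` and the sample strings are hypothetical, the empty-string separator is ignored, and the real splitter additionally merges small pieces back together up to chunk_size and applies chunk_overlap.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the first separator found in text; re-split oversized pieces with the rest."""
    for i, sep in enumerate(separators):
        if sep and sep in text:  # skip "" (not supported in this sketch)
            remaining = separators[i + 1:]
            break
    else:
        # None of the separators occurs: keep the text whole, even if it is too long.
        return [text]

    chunks = []
    for piece in text.split(sep):
        if not piece:
            continue  # drop empty pieces, e.g. between two adjacent separators
        if len(piece) <= chunk_size or not remaining:
            chunks.append(piece)
        else:
            # Piece is still too long: fall back to the next separators in the list.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    return chunks


# Hypothetical sample in the same field/value layout, with one double newline
sample = "field1\tvalue1\nfield2\tvalue2\n\nfield3\tvalue3\n"
rows = recursive_split(sample, ["\n", "\n\n"], chunk_size=100)
print(rows)  # ['field1\tvalue1', 'field2\tvalue2', 'field3\tvalue3']

# An oversized piece falls back to the next separator (a space here)
words = recursive_split("alpha beta\ngamma", ["\n", " "], chunk_size=6)
print(words)  # ['alpha', 'beta', 'gamma']
```

In this sketch the double newline simply produces an empty piece that is dropped, so the three rows come out the same as if every line ended in a single \n.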

The approach below can only take a single separator.

CharacterTextSplitter: chunking on a single separator

With \n as the separator, I get 67 chunks.

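from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

filepath = 'data/专业描述.txt'

# Option 1: read the file into a plain string
with open(filepath, encoding='utf8') as f:
    state_of_the_union = f.read()

# Option 2: load the file as a list of Document objects
raw_documents = TextLoader(filepath, encoding='utf8').load()

text_splitter = CharacterTextSplitter(
    separator="\n",  # the single separator to split on
    chunk_size=100,  # target maximum size of each chunk
    chunk_overlap=20,  # number of characters adjacent chunks may share
    length_function=len,
    is_separator_regex=True,  # treat the separator as a regular expression
)
# create_documents takes a list of strings (option 1)
texts = text_splitter.create_documents([state_of_the_union])
# split_documents takes a list of Documents (option 2)
# texts = text_splitter.split_documents(raw_documents)
print(texts)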
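
How a single separator and chunk_size interact can be sketched in plain Python: split on the separator, then greedily merge neighbouring pieces while they still fit. This is a rough illustration only; the function name `split_and_merge` and the sample text are hypothetical, overlap is left out, and LangChain's own merging logic has more detail.

```python
def split_and_merge(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on one separator, then merge adjacent pieces while they fit in chunk_size."""
    pieces = [p for p in text.split(separator) if p]  # drop empty pieces
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the merged text still fits, or this is the first piece of a
            # new chunk (which is kept whole even if it exceeds chunk_size).
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample: three short lines and one line longer than chunk_size
sample = "aaaa\nbb\ncc\n" + "d" * 12
merged = split_and_merge(sample, "\n", chunk_size=8)
print(merged)  # ['aaaa\nbb', 'cc', 'dddddddddddd']
```

Short lines are merged into one chunk, while a line longer than chunk_size stays whole because there is no other separator to split it on. If every line in the data is longer than chunk_size (an assumption about the file, not something stated above), this kind of logic yields one chunk per line.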