Contents
RecursiveCharacterTextSplitter: splitting by character count
CharacterTextSplitter: splitting on a single separator
The txt file has multiple lines. My data has 67 lines, which look like this:
field1\tvalue1\n
field2\tvalue2\n
...
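RecursiveCharacterTextSplitter: splitting by character count
Split the text into chunks of at most 100 characters, allowing up to 20 characters of overlap between adjacent chunks. On my data this produces 135 chunks.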
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

filepath = 'data/专业描述.txt'

# Option 1: read the file into a plain string
with open(filepath, encoding='utf8') as f:
    state_of_the_union = f.read()

# Option 2: load the file as a list of Document objects
raw_documents = TextLoader(filepath, encoding='utf8').load()

# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,  # maximum size of each chunk
    chunk_overlap=20,  # number of characters adjacent chunks may share
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.create_documents([state_of_the_union])  # use with the string from option 1
# texts = text_splitter.split_documents(raw_documents)  # use with the Documents from option 2
print(texts)
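To make chunk_size and chunk_overlap more concrete, here is a minimal pure-Python sketch of a fixed-size sliding window with overlap. It is only an illustration: the function name sliding_window_chunks is made up for this example, and it is not how RecursiveCharacterTextSplitter works internally, since the real splitter prefers to break on separators first.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after
    the previous one, so neighbouring windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks
```

For example, sliding_window_chunks("abcdefghij", 4, 1) gives ['abcd', 'defg', 'ghij']: each chunk repeats the last character of the previous one.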
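RecursiveCharacterTextSplitter can also split on specified characters: it tries the separators in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. To override it, pass the separators parameter.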
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

filepath = 'data/专业描述.txt'

# Option 1: read the file into a plain string
with open(filepath, encoding='utf8') as f:
    state_of_the_union = f.read()

# Option 2: load the file as a list of Document objects
raw_documents = TextLoader(filepath, encoding='utf8').load()

# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,  # maximum size of each chunk
    chunk_overlap=20,  # number of characters adjacent chunks may share
    length_function=len,
    is_separator_regex=False,
    # Separators to try, in order. If omitted, the default list
    # ["\n\n", "\n", " ", ""] is used.
    separators=['\n', '\n\n'],
)
texts = text_splitter.create_documents([state_of_the_union])  # use with the string from option 1
# texts = text_splitter.split_documents(raw_documents)  # use with the Documents from option 2
print(texts)
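Here I split on the newline character \n, and the text is split into 67 chunks, matching the number of lines in my data.
Why is \n\n also in the list? If some line in my data ends with two newlines (\n\n), I want that case covered as well, so the list lets the splitter break on either \n or \n\n. Note that the separators are tried in the order listed, so the list works as a priority order.
To support other separators, just add them to the list. The advantage of this splitter is that it accepts multiple separators.
The next approach accepts only a single separator.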
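The priority-order behaviour of the separators list can be sketched in plain Python. This is a simplified illustration based on my understanding of the splitter, not LangChain's actual code; pick_separator is a hypothetical helper name, and the sketch only shows which separator would be chosen for the first split.

```python
def pick_separator(text, separators):
    """Return the first separator in the list that occurs in text.

    An empty string means "split between every character" and always
    matches. If nothing matches, fall back to the last separator.
    """
    for sep in separators:
        if sep == "" or sep in text:
            return sep
    return separators[-1]


# Order matters: "\n" also occurs inside "\n\n", so when "\n" is listed
# first it is chosen even for text that contains blank lines.
print(repr(pick_separator("a\n\nb", ["\n", "\n\n"])))
print(repr(pick_separator("a\n\nb", ["\n\n", "\n"])))
```

Under this reading, listing "\n\n" before "\n" (as the default list does) is what lets paragraph breaks take priority over single line breaks.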
CharacterTextSplitter: splitting on a single separator
Using \n as the separator gives 67 chunks.
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

filepath = 'data/专业描述.txt'

# Option 1: read the file into a plain string
with open(filepath, encoding='utf8') as f:
    state_of_the_union = f.read()

# Option 2: load the file as a list of Document objects
raw_documents = TextLoader(filepath, encoding='utf8').load()

# Split on a single separator
text_splitter = CharacterTextSplitter(
    separator="\n",  # the one separator to split on
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,  # "\n" is a literal newline, not a regex pattern
)
texts = text_splitter.create_documents([state_of_the_union])  # use with the string from option 1
# texts = text_splitter.split_documents(raw_documents)  # use with the Documents from option 2
print(texts)
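When splitting on a specified character, you may find that the text is not split even though a separator was clearly specified. The cause is chunk_size.
For example, suppose I want to split every line on \n, with chunk_size=100 and chunk_overlap=20,
and each of my lines reads: 123456789
That is 9 characters, far fewer than 100. The splitter merges adjacent pieces back together as long as the result still fits within chunk_size, so the chunks no longer follow the separator and are governed by chunk_size instead. Reducing chunk_size fixes this. For example, with
chunk_size=3, chunk_overlap=0, the text is split on the specified character as expected.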
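The merging effect described above can be reproduced with a small pure-Python sketch. It assumes a simplified model (split on the separator, then greedily merge neighbours while they fit in chunk_size) and ignores chunk_overlap entirely; split_and_merge is a hypothetical name, not a LangChain function.

```python
def split_and_merge(text, separator, chunk_size):
    """Split on separator, then greedily merge neighbouring pieces
    as long as the merged chunk stays within chunk_size."""
    pieces = [p for p in text.split(separator) if p]  # drop empty pieces
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # First piece of a chunk is always accepted, even if oversized
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


lines = "\n".join(["123456789"] * 5)  # five 9-character lines

print(len(split_and_merge(lines, "\n", chunk_size=100)))  # all lines merged
print(len(split_and_merge(lines, "\n", chunk_size=3)))    # one chunk per line
```

In this simplified model, any chunk_size smaller than two lines plus one separator (here, below 19) keeps every line in its own chunk, so a value just above the longest line length may be a gentler choice than a very small one.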