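What is "chunked_pooling"?

Setting chunked_pooling_enabled=True does not merge the raw text of multiple chunks back into one long document. Instead:

The semantic vector of each chunk is merged into a single semantic representation vector for the whole document.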


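📦 What is a chunk?

Suppose you have a 4000-token article, but your model can process at most 512 tokens at a time. You then have to cut the article into several smaller segments (chunks):

  • chunk 1: the first 512 tokens
  • chunk 2: tokens 256 to 768 (overlapping chunk 1)
  • chunk 3: tokens 512 to 1024
  • ... and so on until the end of the article

These chunks are simply slices of the original document, and each one can be fed to the model on its own to obtain a vector representation.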
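
The overlapping split described above can be sketched as a sliding window over token positions. This is a minimal sketch, not the splitter of any particular library: the function name `chunk_token_ids` and the `stride` parameter are hypothetical, and integer positions stand in for real token ids.

```python
def chunk_token_ids(token_ids, chunk_size=512, stride=256):
    """Split a token sequence into overlapping windows.

    chunk_size and stride are illustrative defaults (assumptions),
    chosen to match the 512-token limit and 256-token overlap above.
    """
    if chunk_size <= 0 or stride <= 0:
        raise ValueError("chunk_size and stride must be positive")
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(token_ids):
            break
    return chunks


# Stand-in for a tokenized 4000-token article: positions 0..3999.
token_ids = list(range(4000))
chunks = chunk_token_ids(token_ids)
print(len(chunks), chunks[1][0], chunks[1][-1])
```

A stride smaller than the chunk size is what produces the overlap; setting `stride` equal to `chunk_size` would give non-overlapping chunks.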


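📐 What is a "semantic representation vector"?

When a chunk is fed into a model, for example BERT or another Transformer encoder, the model outputs a vector for it, such as:

chunk1_embedding = [0.1, 0.2, 0.3, ..., 0.7]  # e.g. 768 dimensions
chunk2_embedding = [0.4, 0.3, 0.2, ..., 0.5]
...

These vectors capture the semantic features of each chunk.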
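
As background on where one vector per chunk can come from: Transformer encoders produce one vector per token, and a common way to reduce them to a single chunk vector is masked mean pooling. The text above does not say which reduction its model uses, so the sketch below is an assumption for illustration only; `mean_pool_tokens` is a hypothetical helper and the token vectors are made up.

```python
import numpy as np


def mean_pool_tokens(token_vectors, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    token_vectors = np.asarray(token_vectors, dtype=float)   # (num_tokens, dim)
    mask = np.asarray(attention_mask, dtype=float)[:, None]  # (num_tokens, 1)
    total = (token_vectors * mask).sum(axis=0)
    count = max(mask.sum(), 1.0)  # avoid division by zero for an all-padding input
    return total / count


# Made-up example: two real tokens and one padding token, 2 dimensions each.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]
chunk_embedding = mean_pool_tokens(token_vectors, attention_mask)
print(chunk_embedding)
```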


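✅ What is "chunked_pooling"?

The goal of chunked_pooling is:
to combine the embeddings of multiple chunks into one embedding vector that represents the whole document, so that this vector can be matched, ranked, and scored against a query.

🔧 There are many ways to combine them, for example:

Method | Meaning | Example
avg pooling | Average each dimension | (0.1 + 0.4) / 2 = 0.25
max pooling | Take the maximum of each dimension | max(0.1, 0.4) = 0.4
attention pooling | Weighted average; important chunks get higher weights | Weights derived from model scores
MLP aggregation | A neural network learns a suitable aggregation | Trainable

In this way, a list [chunk1_embedding, chunk2_embedding, ..., chunkN_embedding]
is turned into one final document_embedding.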
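
Attention pooling from the list above can be sketched as a softmax-weighted average. This is only one way to do it: where the per-chunk scores come from (a learned scorer, a relevance model, and so on) is not specified here, so the `scores` argument is an assumed input, and `attention_pool` is a hypothetical helper name.

```python
import numpy as np


def attention_pool(chunk_embeddings, scores):
    """Softmax-weighted average of chunk embeddings.

    scores: one importance score per chunk (assumed to be given).
    Returns the pooled vector and the weights that were used.
    """
    emb = np.asarray(chunk_embeddings, dtype=float)  # (num_chunks, dim)
    s = np.asarray(scores, dtype=float)
    s = s - s.max()              # subtract the max for numerical stability
    weights = np.exp(s)
    weights /= weights.sum()     # weights now sum to 1
    return weights @ emb, weights


chunk_vectors = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.1, 0.6],
    [0.3, 0.5, 0.2],
])

# Equal scores reduce to plain average pooling.
uniform_vec, uniform_w = attention_pool(chunk_vectors, [0.0, 0.0, 0.0])
# A much higher score for the first chunk makes it dominate the result.
skewed_vec, skewed_w = attention_pool(chunk_vectors, [10.0, 0.0, 0.0])
print(uniform_vec, skewed_vec)
```

Average pooling is the special case where every chunk gets the same weight, which is why it is often the default when no importance scores are available.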


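🤖 A complete example

Original document:

This is a long document about artificial intelligence, divided into several parts:
Part 1 covers the architecture of ChatGPT;
Part 2 covers how the Transformer works;
Part 3 covers applications in healthcare;
...

It is split into 3 chunks, and each is fed to the model:

Chunk | Model vector (simplified)
C1 | [0.1, 0.2, 0.3]
C2 | [0.4, 0.1, 0.6]
C3 | [0.3, 0.5, 0.2]

Combining with avg_pooling:

document_embedding = avg([C1, C2, C3])
                   = [ (0.1+0.4+0.3)/3, (0.2+0.1+0.5)/3, (0.3+0.6+0.2)/3 ]
                   ≈ [0.267, 0.267, 0.367]

This final vector represents the semantic features of the whole document and can be matched against a query.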
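
Matching the pooled vector against a query is commonly done with cosine similarity. The sketch below assumes that scoring function and uses a made-up query vector; in practice the query would be embedded by the same model as the chunks.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0
    return float(a @ b / denom)


chunk_vectors = np.array([
    [0.1, 0.2, 0.3],  # C1
    [0.4, 0.1, 0.6],  # C2
    [0.3, 0.5, 0.2],  # C3
])
document_embedding = chunk_vectors.mean(axis=0)  # avg pooling

# Hypothetical query embedding, invented for this example.
query_embedding = np.array([0.3, 0.2, 0.4])
score = cosine_similarity(query_embedding, document_embedding)
print(round(score, 4))
```

With several documents, the same score computed for each document_embedding gives a ranking for the query.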


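🚫 So can the raw chunk text simply be stitched back into the original document?

No (or at least it is not recommended):

  • chunks overlap and may have lost information, so order and semantic completeness are not guaranteed;
  • the main task is retrieval, not reconstruction; what matters is whether the semantic vector can represent the whole document;
  • if you merge the text and feed it to the model again, it will very likely exceed the model's length limit.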

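✅ Summary

Question | Answer
Can multiple chunks be merged into one document? | Not reliably as text; their semantic vectors can be merged into one document representation
What does chunked_pooling_enabled=True do? | Merges the vectors of multiple chunks into one that represents the whole document
Why is it needed? | A single chunk sees only part of the content; the merged vector represents the whole article
How are they merged? | avg, max, attention, and similar methods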

Code that demonstrates chunk pooling

import numpy as np
import matplotlib.pyplot as plt

# Simulated embeddings for three chunks (each vector has 3 dimensions)
chunk_embeddings = np.array([
    [0.1, 0.2, 0.3],  # Chunk 1
    [0.4, 0.1, 0.6],  # Chunk 2
    [0.3, 0.5, 0.2],  # Chunk 3
])

# Average pooling: mean of each dimension across the chunks
avg_pooling = np.mean(chunk_embeddings, axis=0)

# Max pooling: maximum of each dimension across the chunks
max_pooling = np.max(chunk_embeddings, axis=0)

# Plot one group of five bars per embedding dimension:
# the three chunk vectors followed by the two pooled results
x = np.arange(len(avg_pooling))
bar_width = 0.15  # 5 bars * 0.15 < 1, so neighboring groups do not touch

fig, ax = plt.subplots(figsize=(10, 6))
# Explicit chunk colors so they cannot be confused with the orange/green pooled bars
ax.bar(x - bar_width*2, chunk_embeddings[0], width=bar_width, label='Chunk 1', color='tab:blue')
ax.bar(x - bar_width*1, chunk_embeddings[1], width=bar_width, label='Chunk 2', color='tab:purple')
ax.bar(x, chunk_embeddings[2], width=bar_width, label='Chunk 3', color='tab:gray')
ax.bar(x + bar_width*1, avg_pooling, width=bar_width, label='Avg Pooling', color='orange')
ax.bar(x + bar_width*2, max_pooling, width=bar_width, label='Max Pooling', color='green')

ax.set_xticks(x)  # one tick per embedding dimension, centered under its group
ax.set_xlabel('Embedding Dimension')
ax.set_ylabel('Value')
ax.set_title('Chunk Pooling Visualization')
ax.legend()
ax.grid(True)
plt.tight_layout()
plt.show()

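Result:
[Figure: grouped bar chart titled "Chunk Pooling Visualization"]

The chart has 3 groups of bars, one group per embedding dimension. Within each group:

  • the first three bars are the values of Chunk 1, Chunk 2, and Chunk 3 in that dimension;

  • the orange bar to their right is the document vector obtained by average pooling;

  • the green bar on the far right is the document vector obtained by max pooling.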

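Plain-language explanation

Suppose a document is split into three chunks, and the model turns each chunk into a vector:

Chunk 1 = [0.1, 0.2, 0.3]
Chunk 2 = [0.4, 0.1, 0.6]
Chunk 3 = [0.3, 0.5, 0.2]

Average pooling:

Add up the values in the same dimension and divide by the number of chunks. For the first dimension:

(0.1 + 0.4 + 0.3) / 3 ≈ 0.2667

Doing this for every dimension gives a new vector that serves as the overall representation of the document.

Max pooling:

Take the largest value in each dimension:

max(0.1, 0.4, 0.3) = 0.4
max(0.2, 0.1, 0.5) = 0.5
max(0.3, 0.6, 0.2) = 0.6

Put another way, the most prominent feature in each dimension stands for the whole document.

Summary

"Merging multiple chunks into one document representation vector" means using strategies such as average pooling or max pooling to combine the per-chunk vectors into a single vector, with the same number of dimensions, that can represent the whole document.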
