These are my study notes; I'll keep updating them as I hit new bugs and try more things in practice~
Loaders
1. Directories: DirectoryLoader
2. Azure Blob Storage: AzureBlobStorageContainerLoader
3. CSV files: CSVLoader
......
There are many more on the official docs; a small usage sketch follows below.
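As a quick illustration (my own sketch, not code from the official docs), these loaders all follow the same pattern: construct the loader with a source, then call load() or lazy_load() to get Document objects. The directory and file paths below are placeholders.
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.document_loaders.csv_loader import CSVLoader

# Load every .txt file under a directory (placeholder path); loader_cls=TextLoader
# keeps this to plain text files so no extra parsing dependencies are needed
dir_loader = DirectoryLoader("./my_docs", glob="**/*.txt", loader_cls=TextLoader)
dir_docs = dir_loader.load()

# Load a CSV file (placeholder path); each row becomes one Document
csv_loader = CSVLoader(file_path="./my_data.csv")
csv_docs = csv_loader.load()

print(len(dir_docs), len(csv_docs))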
Hands-on with PDFs --- see the following page:
How to load PDFs | 🦜️🔗 LangChain
1. Basic loader plus an embedding model
%pip install -qU pypdf
# PDF file loader; lazy loading yields an iterator of Document objects (one per page)
from langchain_community.document_loaders import PyPDFLoader

file_path = "./example.pdf"  # placeholder: set this to the path of your PDF

loader = PyPDFLoader(file_path)
pages = []
async for page in loader.alazy_load():
    pages.append(page)

print(f"{pages[0].metadata}\n")
print(pages[0].page_content)
# Embed with OpenAI; an OpenAI API key must be configured first
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

vector_store = InMemoryVectorStore.from_documents(pages, OpenAIEmbeddings())
docs = vector_store.similarity_search("What is LayoutParser?", k=2)
for doc in docs:
    print(f'Page {doc.metadata["page"]}: {doc.page_content[:300]}\n')
2. Fine-grained PDF processing: sometimes you need finer text segmentation (for example, splitting into individual paragraphs, titles, tables, or other structures), or you need to extract text from images.
%pip install -qU langchain-unstructured
# Configure the Unstructured API key
import getpass
import os

if "UNSTRUCTURED_API_KEY" not in os.environ:
    os.environ["UNSTRUCTURED_API_KEY"] = getpass.getpass("Unstructured API Key:")
# Unstructured loader: hi_res strategy with layout coordinates, partitioned via the API
from langchain_unstructured import UnstructuredLoader

loader = UnstructuredLoader(
    file_path=file_path,
    strategy="hi_res",
    partition_via_api=True,
    coordinates=True,
)

docs = []
for doc in loader.lazy_load():
    docs.append(doc)
Each loaded Document represents one structural element, such as a title, paragraph, or table.
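For instance, each element's category is stored in its metadata, so you can filter the loaded docs to pull out only tables or titles. This is a small sketch of my own based on the docs list above; the "Table" and "Title" category names follow Unstructured's element types, and the exact metadata keys can vary by version.
# Filter elements by category (assumes the `docs` list produced by the loader above)
tables = [doc for doc in docs if doc.metadata.get("category") == "Table"]
titles = [doc for doc in docs if doc.metadata.get("category") == "Title"]

for table in tables:
    # page_number and bounding-box coordinates are also available in the metadata
    print(table.metadata.get("page_number"), table.page_content[:200])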
# Identify and extract a table: visualize the detected layout boxes on a rendered page
import fitz
import matplotlib.patches as patches
import matplotlib.pyplot as plt
from PIL import Image
# Draw bounding boxes for the detected layout segments on a rendered PDF page
def plot_pdf_with_boxes(pdf_page, segments):
    # Render the page to an image
    pix = pdf_page.get_pixmap()
    pil_image = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

    fig, ax = plt.subplots(1, figsize=(10, 10))
    ax.imshow(pil_image)
    categories = set()
    # Different box colors for titles, images, and tables
    category_to_color = {
        "Title": "orchid",
        "Image": "forestgreen",
        "Table": "tomato",
    }
    for segment in segments:
        points = segment["coordinates"]["points"]
        layout_width = segment["coordinates"]["layout_width"]
        layout_height = segment["coordinates"]["layout_height"]
        scaled_points = [
            (x * pix.width / layout_width, y * pix.height / layout_height)
            for x, y in points
        ]
        box_color = category_to_color.get(segment["category"], "deepskyblue")
        categories.add(segment["category"])
        rect = patches.Polygon(
            scaled_points, linewidth=1, edgecolor=box_color, facecolor="none"
        )
        ax.add_patch(rect)

    # Make legend
    legend_handles = [patches.Patch(color="deepskyblue", label="Text")]
    for category in ["Title", "Image", "Table"]:
        if category in categories:
            legend_handles.append(
                patches.Patch(color=category_to_color[category], label=category)
            )
    ax.axis("off")
    ax.legend(handles=legend_handles, loc="upper right")
    plt.tight_layout()
    plt.show()
# Render one page with its image/title/table segments drawn as boxes, then print its text
def render_page(doc_list: list, page_number: int, print_text=True) -> None:
    pdf_page = fitz.open(file_path).load_page(page_number - 1)
    page_docs = [
        doc for doc in doc_list if doc.metadata.get("page_number") == page_number
    ]
    segments = [doc.metadata for doc in page_docs]
    plot_pdf_with_boxes(pdf_page, segments)
    if print_text:
        for doc in page_docs:
            print(f"{doc.page_content}\n")
I'll cover parsing each specific kind of element in detail in a later update~
Data processing for model applications: passing a PDF page image to a multimodal model
%pip install -qU PyMuPDF pillow langchain-openai
import base64
import io
import fitz
from PIL import Image
from langchain_openai import ChatOpenAI
from IPython.display import Image as IPImage
from IPython.display import display
from langchain_core.messages import HumanMessage
def pdf_page_to_base64(pdf_path: str, page_number: int):
    pdf_document = fitz.open(pdf_path)
    page = pdf_document.load_page(page_number - 1)  # input is one-indexed
    pix = page.get_pixmap()
    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

    buffer = io.BytesIO()
    img.save(buffer, format="PNG")

    return base64.b64encode(buffer.getvalue()).decode("utf-8")
base64_image = pdf_page_to_base64(file_path, 11)
display(IPImage(data=base64.b64decode(base64_image)))
llm = ChatOpenAI(model="gpt-4o-mini")
query = "What is the name of the first step in the pipeline?"
message = HumanMessage(
    content=[
        {"type": "text", "text": query},
        {
            "type": "image_url",
            # The helper above encodes the page as PNG, so declare the matching MIME type
            "image_url": {"url": f"data:image/png;base64,{base64_image}"},
        },
    ],
)
response = llm.invoke([message])
print(response.content)
That's it for now; I'll keep learning and keep this post updated~