LlamaIndex Metadata Extraction Usage Pattern

Concept

You can use LLMs (large language models) to automate metadata extraction with our metadata extractor modules.

Our metadata extractor modules include the following "feature extractors":

  • SummaryExtractor - automatically extracts a summary over a set of nodes
  • QuestionsAnsweredExtractor - extracts a set of questions that each node can answer
  • TitleExtractor - extracts a title over the context of each node
  • EntityExtractor - extracts entities (i.e. names of places, people, things) mentioned in each node's content (see the sketch after this list)
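
Of these, the EntityExtractor lives in a separate package and by default runs a local entity-recognition model rather than calling an LLM. A minimal instantiation sketch, assuming the llama-index-extractors-entity package is installed and with an illustrative prediction_threshold value:

# Requires: pip install llama-index-extractors-entity
from llama_index.extractors.entity import EntityExtractor

# Pull named entities (people, places, things) out of each node's text
# into its metadata; the threshold filters low-confidence predictions
# (0.5 here is an illustrative assumption)
entity_extractor = EntityExtractor(prediction_threshold=0.5)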

You can then chain the metadata extractors with our node parser:

Usage Pattern
Define the metadata extractors
from llama_index.core.extractors import (
    TitleExtractor,
    QuestionsAnsweredExtractor,
)
from llama_index.core.node_parser import TokenTextSplitter

# Split documents into 512-token chunks with a 128-token overlap
text_splitter = TokenTextSplitter(
    separator=" ", chunk_size=512, chunk_overlap=128
)
# Combine context from the first 5 nodes of each document to infer a title
title_extractor = TitleExtractor(nodes=5)
# Generate 3 questions that each chunk can answer
qa_extractor = QuestionsAnsweredExtractor(questions=3)
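
As an aside, the splitter is an ordinary node parser, so you can preview the raw chunks on their own before layering metadata extraction on top (assuming documents has already been loaded):

# Optional: preview the plain chunks produced by the splitter alone
raw_nodes = text_splitter.get_nodes_from_documents(documents)
print(len(raw_nodes), raw_nodes[0].text[:200])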
Run the metadata extractors

Assuming the documents are already defined, extract the nodes:

from llama_index.core.ingestion import IngestionPipeline

pipeline = IngestionPipeline(
    transformations=[text_splitter, title_extractor, qa_extractor]
)

# Run chunking and metadata extraction over the documents
nodes = pipeline.run(
    documents=documents,
    in_place=True,
    show_progress=True,
)
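
Each returned node now carries the extracted fields in its metadata dictionary. A quick way to inspect the result; the key names below are the defaults used by these extractors (document_title and questions_this_excerpt_can_answer) and would differ if you override them:

# Inspect the metadata attached by the extractors on the first node
first_node = nodes[0]
print(first_node.metadata.get("document_title"))
print(first_node.metadata.get("questions_this_excerpt_can_answer"))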

Or insert into an index:

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(
    documents, transformations=[text_splitter, title_extractor, qa_extractor]
)
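
From there the index can be queried as usual; the extracted titles and questions travel with each node and are visible to the LLM at synthesis time. A minimal sketch, assuming an OpenAI API key is configured and using an illustrative query string:

query_engine = index.as_query_engine(similarity_top_k=3)
# Illustrative query; the extra metadata helps match and ground the answer
response = query_engine.query("What does the document say about evaluation?")
print(response)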
Automated Metadata Extraction for Better Retrieval and Synthesis

In this tutorial, we show how to perform automated metadata extraction for better retrieval results. We use two extractors: a QuestionsAnsweredExtractor, which generates question/answer pairs from a piece of text, and a SummaryExtractor, which extracts summaries not only of the current text but also of adjacent texts.
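
A sketch of how those two extractors might be configured for this tutorial; the summaries argument selects which neighboring chunks to summarize, and the model choice and argument values here are assumptions rather than the notebook's exact settings:

from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    SummaryExtractor,
)
from llama_index.llms.openai import OpenAI

# Assumed LLM settings; the llama-index-llms-openai package is installed
# in the Setup section below
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

# Questions each chunk can answer, plus summaries of the chunk itself
# and its previous/next neighbors
qa_extractor = QuestionsAnsweredExtractor(questions=3, llm=llm)
summary_extractor = SummaryExtractor(summaries=["prev", "self", "next"], llm=llm)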

We show that this allows for "chunk dreaming": each individual chunk can carry more "holistic" detail, leading to higher-quality answers given the retrieved results.

Our data source is Eugene Yan's popular article "LLM Patterns": https://eugeneyan.com/writing/llm-patterns/

Setup

If you're opening this notebook on Colab, you will likely need to install LlamaIndex 🦙.

%pip install llama-index-llms-openai
%pip install llama-index-readers-web
!pip install llama-index
import nest_asyncio

nest_asyncio.apply()

import os
import openai
# OPTIONAL: setup W&B callback handling for tracing
from llama_index.core import set_global_handler

set_global_handler("wandb", run_args={
   
   "project": "llamaindex"}