145 Using PropertyGraphIndex in LlamaIndex: An In-Depth Exploration and Hands-On Guide

需要重新演唱

Last modified 2024-09-26 14:03:01

Licensed under CC 4.0 BY-SA

First published 2024-09-24 08:15:00

Article link: https://blog.youkuaiyun.com/xycxycooo/article/details/142469431

https://docs.llamaindex.ai/en/stable/module_guides/indexing/lpg_index_guide/#dynamicllmpathextractor

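In modern data processing and knowledge management, the property graph is a powerful tool for representing complex entity relationships and metadata. LlamaIndex provides PropertyGraphIndex for building and querying such graphs. This article walks through how to use PropertyGraphIndex in LlamaIndex, with code examples and explanations to help you get started quickly.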

Prerequisites

Before diving into PropertyGraphIndex, you should be familiar with the following:

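Property graph: a graph data structure in which both nodes and edges can carry properties. Nodes usually represent entities, and edges represent the relationships between them.
LlamaIndex: a framework for building LLM applications over your own data. It provides several index types, including PropertyGraphIndex.
Python basics: familiarity with Python, including classes, methods, and module imports.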
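
To make the property graph definition above concrete, here is a minimal sketch in plain Python. The `Node` and `Edge` dataclasses and the `neighbors` helper are hypothetical names chosen for illustration; they are not part of LlamaIndex.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """An entity with a label and arbitrary key/value properties."""
    id: str
    label: str
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class Edge:
    """A directed, labeled relationship that can carry its own properties."""
    source_id: str
    target_id: str
    label: str
    properties: Dict[str, str] = field(default_factory=dict)


# Hypothetical sample data
nodes = {
    "llama": Node("llama", "ANIMAL", {"color": "white"}),
    "index": Node("index", "THING"),
}
edges = [Edge("llama", "index", "HAS", {"since": "2024"})]


def neighbors(node_id: str) -> List[str]:
    """Return the ids of nodes reachable over one outgoing edge."""
    return [e.target_id for e in edges if e.source_id == node_id]


print(neighbors("llama"))  # ['index']
```

Both the node and the edge hold a properties dictionary, which is what distinguishes a property graph from a bare list of triples.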

Basic usage

Importing and creating a `PropertyGraphIndex`

First, import the PropertyGraphIndex class and create an index from your documents.

from llama_index.core import PropertyGraphIndex

# Create the index (`documents` is a list of Document objects loaded beforehand)
index = PropertyGraphIndex.from_documents(
    documents,
)

Querying with the index

Once the index is created, you can use it to retrieve nodes and run queries.

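# Create a retriever
retriever = index.as_retriever(
    include_text=True,  # include the source chunk for each matching path
    similarity_top_k=2,  # top k for vector retrieval of knowledge graph nodes
)

# Retrieve nodes
nodes = retriever.retrieve("Test")

# Create a query engine
query_engine = index.as_query_engine(
    include_text=True,  # include the source chunk for each matching path
    similarity_top_k=2,  # top k for vector retrieval of knowledge graph nodes
)

# Run a query
response = query_engine.query("Test")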

Saving and loading the index

You can save the index to disk and load it again when needed.

# Save the index
index.storage_context.persist(persist_dir="./storage")

# Load the index
from llama_index.core import StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

Loading an index from an existing graph store

If you already have a graph store and, optionally, a vector store, you can load the index directly from them.

# Load the index from an existing graph store and vector store
index = PropertyGraphIndex.from_existing(
    property_graph_store=graph_store,
    vector_store=vector_store,
    ...
)

Building the property graph

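In LlamaIndex, a property graph is built by running a series of kg_extractors over each chunk and attaching the extracted entities and relations as metadata to each LlamaIndex node. You can pass as many kg_extractors as you like, and all of them will be applied.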

index = PropertyGraphIndex.from_documents(
    documents,
    kg_extractors=[extractor1, extractor2, ...],
)

# Insert additional documents or nodes
index.insert(document)
index.insert_nodes(nodes)

If no kg_extractors are provided, SimpleLLMPathExtractor and ImplicitPathExtractor are used by default.

`SimpleLLMPathExtractor`

SimpleLLMPathExtractor uses an LLM to extract short statements and parses them into single-hop paths of the form (entity1, relation, entity2).

from llama_index.core.indices.property_graph import SimpleLLMPathExtractor

kg_extractor = SimpleLLMPathExtractor(
    llm=llm,
    max_paths_per_chunk=10,
    num_workers=4,
    show_progress=False,
)

You can also customize the prompt and the function used to parse paths.

prompt = (
    "Some text is provided below. Given the text, extract up to "
    "{max_paths_per_chunk} "
    "knowledge triples in the form of `subject,predicate,object` on each line. Avoid stopwords.\n"
)

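from typing import List, Tuple

def parse_fn(response_str: str) -> List[Tuple[str, str, str]]:
    # Keep only lines that split into exactly three non-empty comma-separated parts
    triples = []
    for line in response_str.strip().split("\n"):
        parts = [part.strip() for part in line.split(",")]
        if len(parts) == 3 and all(parts):
            triples.append((parts[0], parts[1], parts[2]))
    return triples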

kg_extractor = SimpleLLMPathExtractor(
    llm=llm,
    extract_prompt=prompt,
    parse_fn=parse_fn,
)

`ImplicitPathExtractor`

ImplicitPathExtractor extracts paths from the node.relationships attribute of each LlamaIndex node object.

from llama_index.core.indices.property_graph import ImplicitPathExtractor

kg_extractor = ImplicitPathExtractor()

`DynamicLLMPathExtractor`

DynamicLLMPathExtractor extracts paths (including entity types) guided by optional lists of allowed entity types and relation types.

from llama_index.core.indices.property_graph import DynamicLLMPathExtractor

kg_extractor = DynamicLLMPathExtractor(
    llm=llm,
    max_triplets_per_chunk=20,
    num_workers=4,
    allowed_entity_types=["POLITICIAN", "POLITICAL_PARTY"],
    allowed_relation_types=["PRESIDENT_OF", "MEMBER_OF"],
)

`SchemaLLMPathExtractor`

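SchemaLLMPathExtractor extracts paths according to a strict schema that defines the allowed entities, the allowed relations, and which entities can be connected to which relations.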

from typing import Literal
from llama_index.core.indices.property_graph import SchemaLLMPathExtractor

entities = Literal["PERSON", "PLACE", "THING"]
relations = Literal["PART_OF", "HAS", "IS_A"]
schema = {
    "PERSON": ["PART_OF", "HAS", "IS_A"],
    "PLACE": ["PART_OF", "HAS"],
    "THING": ["IS_A"],
}

kg_extractor = SchemaLLMPathExtractor(
    llm=llm,
    possible_entities=entities,
    possible_relations=relations,
    kg_validation_schema=schema,
    strict=True,  # if False, triples outside the schema are allowed
    num_workers=4,
    max_paths_per_chunk=10,
    show_progress=False,
)
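
The effect of `strict=True` can be pictured as a filter over extracted triples. The sketch below is an illustration in plain Python, not LlamaIndex's internal logic. It assumes a triple is kept only when its relation is listed in the schema for the subject's entity type, and `is_allowed` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

schema: Dict[str, List[str]] = {
    "PERSON": ["PART_OF", "HAS", "IS_A"],
    "PLACE": ["PART_OF", "HAS"],
    "THING": ["IS_A"],
}

# Each triple here is (subject_type, relation, object_type)
Triple = Tuple[str, str, str]


def is_allowed(triple: Triple) -> bool:
    """Assumed rule: the relation must be listed for the subject's entity type."""
    subject_type, relation, _object_type = triple
    return relation in schema.get(subject_type, [])


candidates: List[Triple] = [
    ("PERSON", "HAS", "THING"),
    ("THING", "HAS", "PLACE"),    # THING may only use IS_A
    ("ANIMAL", "IS_A", "THING"),  # ANIMAL is not in the schema
]
kept = [t for t in candidates if is_allowed(t)]
print(kept)  # [('PERSON', 'HAS', 'THING')]
```

With a non-strict setting, the two rejected triples would pass through instead of being dropped.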

Retrieval and querying

Creating retrievers and query engines

You can combine several node retrieval methods when creating a retriever or query engine.

# Create a retriever
retriever = index.as_retriever(sub_retrievers=[retriever1, retriever2, ...])

# Create a query engine
query_engine = index.as_query_engine(
    sub_retrievers=[retriever1, retriever2, ...]
)

If no sub-retrievers are provided, LLMSynonymRetriever and VectorContextRetriever (if embeddings are enabled) are used by default.
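
As a rough mental model of combining sub-retrievers, the sketch below runs several retrievers on the same query and merges their results. It assumes results are simply concatenated and deduplicated, which the text does not confirm for LlamaIndex itself. `keyword_retriever`, `vector_retriever`, and `combined_retrieve` are hypothetical stand-ins.

```python
from typing import Callable, List

Retriever = Callable[[str], List[str]]


def keyword_retriever(query: str) -> List[str]:
    # Stand-in for a keyword/synonym based retriever
    return ["llama -> HAS -> index"] if "llama" in query.lower() else []


def vector_retriever(query: str) -> List[str]:
    # Stand-in for a vector-similarity based retriever
    return ["llama -> HAS -> index", "alpaca -> RELATED_TO -> llama"]


def combined_retrieve(query: str, sub_retrievers: List[Retriever]) -> List[str]:
    """Run every sub-retriever and merge results, keeping first-seen order."""
    seen, merged = set(), []
    for retrieve in sub_retrievers:
        for result in retrieve(query):
            if result not in seen:
                seen.add(result)
                merged.append(result)
    return merged


print(combined_retrieve("Llama facts", [keyword_retriever, vector_retriever]))
# ['llama -> HAS -> index', 'alpaca -> RELATED_TO -> llama']
```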

`LLMSynonymRetriever`

LLMSynonymRetriever retrieves nodes based on keywords and synonyms generated by an LLM.

from llama_index.core.indices.property_graph import LLMSynonymRetriever

prompt = (
    "Given some initial query, generate synonyms or related keywords up to {max_keywords} in total, "
    "considering possible cases of capitalization, pluralization, common expressions, etc.\n"
    "Provide all synonyms/keywords separated by '^' symbols: 'keyword1^keyword2^...'\n"
    "Note, result should be in one-line, separated by '^' symbols."
    "----\n"
    "QUERY: {query_str}\n"
    "----\n"
    "KEYWORDS: "
)

def parse_fn(output: str) -> list[str]:
    matches = output.strip().split("^")
    return [x.strip().capitalize() for x in matches if x.strip()]

synonym_retriever = LLMSynonymRetriever(
    index.property_graph_store,
    llm=llm,
    include_text=False,
    synonym_prompt=prompt,
    output_parsing_fn=parse_fn,
    max_keywords=10,
    path_depth=1,
)

retriever = index.as_retriever(sub_retrievers=[synonym_retriever])
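
To see what the '^'-separated parsing does to a raw LLM reply, here is a small standalone sketch. `parse_keywords` is a hypothetical name that mirrors the parsing logic above, and the sample string is made up.

```python
from typing import List


def parse_keywords(output: str) -> List[str]:
    """Split on '^', trim whitespace, drop empty items, and capitalize each keyword."""
    return [x.strip().capitalize() for x in output.strip().split("^") if x.strip()]


raw = " llama^ alpaca ^^LLAMAS^camelid "
print(parse_keywords(raw))  # ['Llama', 'Alpaca', 'Llamas', 'Camelid']
```

Note that `capitalize()` lowercases everything after the first letter, so keyword matching against the graph only works well if entity names follow the same casing.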

`VectorContextRetriever`

VectorContextRetriever retrieves nodes by vector similarity and then fetches the paths connected to those nodes.

from llama_index.core.indices.property_graph import VectorContextRetriever

vector_retriever = VectorContextRetriever(
    index.property_graph_store,
    embed_model=embed_model,
    include_text=False,
    similarity_top_k=2,
    path_depth=1,
)

retriever = index.as_retriever(sub_retrievers=[vector_retriever])
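
The two steps described above, similarity search followed by path expansion, can be sketched with toy data in plain Python. The embeddings, triples, and the `retrieve` helper are all hypothetical and only illustrate the idea with a path depth of 1; this is not how LlamaIndex implements it.

```python
import math
from typing import Dict, List, Tuple

# Toy embeddings for graph nodes (hypothetical values)
embeddings: Dict[str, List[float]] = {
    "llama": [1.0, 0.0],
    "index": [0.0, 1.0],
    "alpaca": [0.9, 0.1],
}
# Triples stored in the graph: (source, relation, target)
triples: List[Tuple[str, str, str]] = [
    ("llama", "HAS", "index"),
    ("alpaca", "RELATED_TO", "llama"),
]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec: List[float], top_k: int = 1) -> List[Tuple[str, str, str]]:
    """Pick the top_k most similar nodes, then return the triples touching them."""
    ranked = sorted(
        embeddings, key=lambda n: cosine(query_vec, embeddings[n]), reverse=True
    )
    hits = set(ranked[:top_k])
    return [t for t in triples if t[0] in hits or t[2] in hits]


print(retrieve([0.0, 1.0], top_k=1))  # [('llama', 'HAS', 'index')]
```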

`TextToCypherRetriever`

TextToCypherRetriever generates and executes a Cypher query from the graph store schema, the query, and a text-to-Cypher prompt template.

from llama_index.core.indices.property_graph import TextToCypherRetriever

DEFAULT_RESPONSE_TEMPLATE = (
    "Generated Cypher query:\n{query}\n\n" "Cypher Response:\n{response}"
)
DEFAULT_ALLOWED_FIELDS = ["text", "label", "type"]

DEFAULT_TEXT_TO_CYPHER_TEMPLATE = index.property_graph_store.text_to_cypher_template

cypher_retriever = TextToCypherRetriever(
    index.property_graph_store,
    llm=llm,
    text_to_cypher_template=DEFAULT_TEXT_TO_CYPHER_TEMPLATE,
    response_template=DEFAULT_RESPONSE_TEMPLATE,
    cypher_validator=None,
    allowed_output_fields=DEFAULT_ALLOWED_FIELDS,
)

`CypherTemplateRetriever`

CypherTemplateRetriever is a more constrained version of TextToCypherRetriever: it takes a Cypher template and lets the LLM fill in the blanks.

from pydantic import BaseModel, Field
from llama_index.core.indices.property_graph import CypherTemplateRetriever

cypher_query = """
MATCH (c:Chunk)-[:MENTIONS]->(o)
WHERE o.name IN $names
RETURN c.text, o.name, o.label;
"""

class TemplateParams(BaseModel):
    names: list[str] = Field(
        description="A list of entity names or keywords to use for lookup in a knowledge graph."
    )

template_retriever = CypherTemplateRetriever(
    index.property_graph_store, TemplateParams, cypher_query
)

Storage

Saving and loading the index

The default property graph store, SimplePropertyGraphStore, keeps everything in memory and persists to and loads from disk.

from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core.indices import PropertyGraphIndex

# Create the index
index = PropertyGraphIndex.from_documents(documents)

# Save the index
index.storage_context.persist("./storage")

# Load the index
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
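
The in-memory-plus-disk model can be illustrated with a plain JSON round trip. This is only a conceptual sketch: the dictionary layout and the `graph.json` file name are assumptions made for the example and do not reflect the files LlamaIndex actually writes.

```python
import json
import os
import tempfile

# A tiny in-memory graph (hypothetical layout)
graph = {
    "nodes": {"llama": {"label": "ANIMAL"}, "index": {"label": "THING"}},
    "relations": [{"source": "llama", "label": "HAS", "target": "index"}],
}

persist_dir = tempfile.mkdtemp()
path = os.path.join(persist_dir, "graph.json")  # hypothetical file name

# Persist the graph to disk
with open(path, "w", encoding="utf-8") as f:
    json.dump(graph, f)

# Load it back into memory
with open(path, encoding="utf-8") as f:
    restored = json.load(f)

print(restored == graph)  # True
```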

Saving and loading with integrations

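Integrations usually save automatically. Some graph stores support vectors, while others may not. You can always combine a graph store with an external vector database.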

from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core.indices import PropertyGraphIndex
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient, AsyncQdrantClient

vector_store = QdrantVectorStore(
    "graph_collection",
    client=QdrantClient(...),
    aclient=AsyncQdrantClient(...),
)

graph_store = Neo4jPropertyGraphStore(
    username="neo4j",
    password="<password>",
    url="bolt://localhost:7687",
)

# Create the index
index = PropertyGraphIndex.from_documents(
    documents,
    property_graph_store=graph_store,
    vector_store=vector_store,
    embed_kg_nodes=True,
)

# Load the index from the existing graph store and vector store
index = PropertyGraphIndex.from_existing(
    property_graph_store=graph_store,
    vector_store=vector_store,
    embed_kg_nodes=True,
)

Using the property graph store directly

The base class for property graph stores is PropertyGraphStore. These stores are built from different types of LabeledNode objects, which are connected by Relation objects.

from llama_index.core.graph_stores import (
    SimplePropertyGraphStore,
    EntityNode,
    Relation,
)
from llama_index.core.schema import TextNode

graph_store = SimplePropertyGraphStore()

entities = [
    EntityNode(name="llama", label="ANIMAL", properties={"key": "val"}),
    EntityNode(name="index", label="THING", properties={"key": "val"}),
]

relations = [
    Relation(
        label="HAS",
        source_id=entities[0].id,
        target_id=entities[1].id,
        properties={},
    )
]

graph_store.upsert_nodes(entities)
graph_store.upsert_relations(relations)

# Optionally, we can also insert text chunks
source_chunk = TextNode(id_="source", text="My llama has an index.")

# Create a relation for each entity
source_relations = [
    Relation(
        label="HAS_SOURCE",
        source_id=entities[0].id,
        target_id="source",
    ),
    Relation(
        label="HAS_SOURCE",
        source_id=entities[1].id,
        target_id="source",
    ),
]
graph_store.upsert_llama_nodes([source_chunk])
graph_store.upsert_relations(source_relations)

Advanced customization

Subclassing extractors

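In LlamaIndex, graph extractors subclass the TransformComponent class. If you have used the ingestion pipeline before, this will look familiar, because it is the same class.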

from llama_index.core.graph_stores.types import (
    EntityNode,
    Relation,
    KG_NODES_KEY,
    KG_RELATIONS_KEY,
)
from llama_index.core.schema import BaseNode, TransformComponent

class MyGraphExtractor(TransformComponent):
    def __call__(
        self, llama_nodes: list[BaseNode], **kwargs
    ) -> list[BaseNode]:
        for llama_node in llama_nodes:
            existing_nodes = llama_node.metadata.pop(KG_NODES_KEY, [])
            existing_relations = llama_node.metadata.pop(KG_RELATIONS_KEY, [])

            existing_nodes.append(
                EntityNode(
                    name="llama", label="ANIMAL", properties={"key": "val"}
                )
            )
            existing_nodes.append(
                EntityNode(
                    name="index", label="THING", properties={"key": "val"}
                )
            )

            existing_relations.append(
                Relation(
                    label="HAS",
                    source_id="llama",
                    target_id="index",
                    properties={},
                )
            )

            llama_node.metadata[KG_NODES_KEY] = existing_nodes
            llama_node.metadata[KG_RELATIONS_KEY] = existing_relations

        return llama_nodes
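
The pop, append, and reassign pattern in the extractor above is what lets several extractors run over the same chunk without overwriting each other. Here is a standalone sketch of that idea. `ChunkNode`, `NODES_KEY`, and `make_extractor` are hypothetical stand-ins, not LlamaIndex classes.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

NODES_KEY = "nodes"  # hypothetical stand-in for KG_NODES_KEY


@dataclass
class ChunkNode:
    """Hypothetical stand-in for a LlamaIndex node with a metadata dict."""
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def make_extractor(entity: str) -> Callable[[List[ChunkNode]], List[ChunkNode]]:
    """Build an extractor that appends one entity to whatever is already there."""
    def extract(chunks: List[ChunkNode]) -> List[ChunkNode]:
        for chunk in chunks:
            existing = chunk.metadata.pop(NODES_KEY, [])
            existing.append(entity)
            chunk.metadata[NODES_KEY] = existing
        return chunks
    return extract


chunks = [ChunkNode("My llama has an index.")]
for extractor in (make_extractor("llama"), make_extractor("index")):
    chunks = extractor(chunks)

print(chunks[0].metadata[NODES_KEY])  # ['llama', 'index']
```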

Subclassing retrievers

The return type of a retriever is very flexible: it can be a string, a TextNode, a NodeWithScore, or a list of any of these.

from llama_index.core.indices.property_graph import (
    CustomPGRetriever,
    CUSTOM_RETRIEVE_TYPE,
)

class MyCustomRetriever(CustomPGRetriever):
    def init(self, my_option_1: bool = False, **kwargs) -> None:
        """Uses any kwargs passed in from class constructor."""
        self.my_option_1 = my_option_1
        # optionally do something with self.graph_store

    def custom_retrieve(self, query_str: str) -> CUSTOM_RETRIEVE_TYPE:
        # some operation with self.graph_store
        return "result"

    # optional async method
    # async def acustom_retrieve(self, query_str: str) -> str:
    #     ...

custom_retriever = MyCustomRetriever(graph_store, my_option_1=True)

retriever = index.as_retriever(sub_retrievers=[custom_retriever])