16 A Guide to Extracting Terms and Definitions with Llama Index

Article link: https://blog.youkuaiyun.com/xycxycooo/article/details/141319697

Llama Index has many well-documented use cases, such as semantic search and summarization. That doesn't mean we can't apply Llama Index to very specific use cases, though!

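In this tutorial, we walk through the design process of using Llama Index to extract terms and definitions from text, while letting users query those terms later. With Streamlit, we get a simple way to build a frontend for running and testing all of this, so we can iterate on the design quickly.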

This tutorial assumes you have Python 3.9+ and the following packages installed:

llama-index
streamlit

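At a basic level, our goal is to take the text of a document, extract terms and definitions, and then give users a way to query that knowledge base of terms and definitions. The tutorial covers features of both Llama Index and Streamlit, and hopefully offers some interesting solutions to common problems that come up along the way.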

The final version of this tutorial can be found here, and a live hosted demo is available on Huggingface Spaces.

Uploading Text

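The first step is to give users a way to enter text manually. Let's write some Streamlit code to provide that interface! Use the following code, then launch the app with streamlit run app.py.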

import streamlit as st

st.title("🦙 Llama Index Term Extractor 🦙")

document_text = st.text_area("Enter raw text")
if st.button("Extract Terms and Definitions") and document_text:
    with st.spinner("Extracting..."):
        extracted_terms = document_text  # this is a placeholder!
    st.write(extracted_terms)

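Super simple, right! But you'll notice the app doesn't do anything useful yet. To use llama_index, we also need to set up our OpenAI LLM. There are many possible LLM settings, so we can let the user figure out what works best. We should also let the user set the prompt that extracts the terms (which will also help us debug what works best).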

LLM Settings

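This step introduces some tabs to the app, splitting it into separate panes that provide different features. Let's create one tab for LLM setup and one for uploading text: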

import os
import streamlit as st

DEFAULT_TERM_STR = (
    "Make a list of terms and definitions that are defined in the context, "
    "with one pair on each line. "
    "If a term is missing its definition, use your best judgment. "
    "Write each line as follows:\nTerm: <term> Definition: <definition>"
)

st.title("🦙 Llama Index Term Extractor 🦙")

setup_tab, upload_tab = st.tabs(["Setup", "Upload/Extract Terms"])

with setup_tab:
    st.subheader("LLM Setup")
    api_key = st.text_input("Enter your OpenAI API key here", type="password")
    llm_name = st.selectbox("Which LLM?", ["gpt-3.5-turbo", "gpt-4"])
    model_temperature = st.slider(
        "LLM Temperature", min_value=0.0, max_value=1.0, step=0.1
    )
    term_extract_str = st.text_area(
        "The query to extract terms and definitions with.",
        value=DEFAULT_TERM_STR,
    )

with upload_tab:
    st.subheader("Extract and Query Definitions")
    document_text = st.text_area("Enter raw text")
    if st.button("Extract Terms and Definitions") and document_text:
        with st.spinner("Extracting..."):
            extracted_terms = document_text  # this is a placeholder!
        st.write(extracted_terms)

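Now our app has two tabs, which really helps with organization. You'll also notice I added a default prompt for extracting terms. You can change it later once you've tried extracting some terms; it's just the prompt I arrived at after some experimenting.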
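To make the line format requested by the default prompt more concrete, here is a minimal sketch of how a response in that format could be turned into a dictionary. The sample response text and the helper name parse_term_lines are assumptions made up for illustration; a real LLM may add extra lines or deviate from the format, which is why lines without both markers are skipped.

```python
# Hypothetical sample of what the LLM might return for the default prompt.
SAMPLE_RESPONSE = """Here are the terms I found:
Term: Index Definition: A data structure that organizes documents for retrieval.
Term: Query engine Definition: An interface that answers questions over an index.
This line has no markers and is skipped."""


def parse_term_lines(response_text):
    """Turn 'Term: <term> Definition: <definition>' lines into a dict."""
    terms_to_definition = {}
    for line in response_text.split("\n"):
        # Keep only lines that contain both markers.
        if "Term:" not in line or "Definition:" not in line:
            continue
        # Everything before 'Definition:' holds the term, the rest the definition.
        term_part, _, definition = line.partition("Definition:")
        term = term_part.split("Term:")[-1].strip()
        if term:
            terms_to_definition[term] = definition.strip()
    return terms_to_definition


parsed = parse_term_lines(SAMPLE_RESPONSE)
for term, definition in parsed.items():
    print(f"{term} -> {definition}")
```

Asking for one pair per line with fixed markers keeps this kind of parsing simple, and any line that does not match is dropped rather than guessed at.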

Speaking of extracting terms, it's time to add some functions to do just that!

Extracting and Storing Terms

Now that we can define LLM settings and enter text, we can try using Llama Index to extract terms from that text!

We can add the following functions to initialize our LLM and use it to extract terms from the input text.

from llama_index.core import Document, SummaryIndex, load_index_from_storage
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

def get_llm(llm_name, model_temperature, api_key, max_tokens=256):
    os.environ["OPENAI_API_KEY"] = api_key
    return OpenAI(
        temperature=model_temperature, model=llm_name, max_tokens=max_tokens
    )

def extract_terms(
    documents, term_extract_str, llm_name, model_temperature, api_key
):
    llm = get_llm(llm_name, model_temperature, api_key, max_tokens=1024)

    temp_index = SummaryIndex.from_documents(
        documents,
    )
    query_engine = temp_index.as_query_engine(
        response_mode="tree_summarize", llm=llm
    )
    terms_definitions = str(query_engine.query(term_extract_str))
    terms_definitions = [
        x
        for x in terms_definitions.split("\n")
        if x and "Term:" in x and "Definition:" in x
    ]
    # parse the text into a dict
    terms_to_definition = {
        x.split("Definition:")[0]
        .split("Term:"