LlamaIndex：a data framework for your LLM applications

Author: 郑瀚Andrew (www.cnblogs.com), 2023-12-07 22:51:00

LlamaIndex is a data framework that lets applications built on large language models (LLMs) ingest, structure, and access private or domain-specific data.

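LlamaIndex consists of the following main modules:

Data connectors: ingest your private data from its native source and format, such as APIs, PDFs, and SQL databases.
Data indexes: structure and store your data in intermediate representations that are easy and efficient for LLMs to consume.
Engines: provide natural-language access to your data. For example:
- Query engines are powerful retrieval interfaces for knowledge-augmented output.
- Chat engines are conversational interfaces for multi-message, back-and-forth interaction with your data.
Data agents: LLM-powered knowledge workers augmented by tools, ranging from simple helper functions to API integrations.
Application integrations: tie LlamaIndex back into the rest of your ecosystem, for example LangChain, Flask, Docker, or ChatGPT.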

References:

https://github.com/run-llama/llama_index

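Large language models (LLMs) offer a natural-language interface between people and data. Widely available models are pre-trained on huge amounts of public data such as Wikipedia, mailing lists, textbooks, and source code. However, they are not trained on your data, which may be private or specific to the problem you are trying to solve. That data may sit behind APIs, in SQL databases, or be trapped in PDFs and slide decks.

LlamaIndex solves this by connecting to these data sources and adding your data to what the LLM already knows. This is commonly called Retrieval-Augmented Generation (RAG). RAG lets you use LLMs to query your data, transform it, and generate new insights. You can ask questions about your data, create chatbots, build semi-autonomous agents, and more.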

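RAG has five key stages, which will be part of any larger application you build:

Loading: getting your data from where it lives (text files, PDFs, another website, a database, or an API) into your pipeline. LlamaHub provides hundreds of connectors to choose from.
Indexing: creating a data structure that allows the data to be queried. For LLMs this almost always means creating vector embeddings (numerical representations of the meaning of your data), along with other metadata strategies that make it easy to find contextually relevant data accurately.
Storing: once your data is indexed, you will almost always want to store the index and its metadata so that you do not have to re-index.
Querying: for any given indexing strategy there are many ways to use LLMs and LlamaIndex data structures to query, including sub-queries, multi-step queries, and hybrid strategies.
Evaluation: a critical step in any pipeline is checking how effective it is relative to other strategies, or after you make a change. Evaluation provides objective measures of how accurate, faithful, and fast your responses to queries are.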
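As a rough illustration of the evaluation stage, the sketch below computes a retrieval hit rate in plain Python. It is not LlamaIndex's evaluation module; `hit_rate`, `lookup`, and the chunk ids are hypothetical names made up for this example.

```python
def hit_rate(retrieve, labelled_queries: dict[str, str], top_k: int = 2) -> float:
    """Fraction of queries whose expected chunk id appears in the top_k results."""
    hits = sum(
        expected in retrieve(query)[:top_k]
        for query, expected in labelled_queries.items()
    )
    return hits / len(labelled_queries)


# Hypothetical retriever: a fixed ranking of chunk ids per query.
fake_rankings = {"q1": ["c1", "c2", "c3"], "q2": ["c9", "c4", "c5"]}


def lookup(query: str) -> list[str]:
    return fake_rankings[query]


# q1 finds its expected chunk in the top 2, q2 does not.
score = hit_rate(lookup, {"q1": "c2", "q2": "c5"}, top_k=2)
```

Tracking a number like this before and after a change (a different chunk size, a different embedding model) is one simple way to compare strategies.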

0x1：Loading stage

1、Nodes and Documents

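A Document is a container around any data source, for example a PDF file, the output of an API, or data retrieved from a database.

A Node is the atomic unit of data in LlamaIndex and represents a "chunk" of a source Document. Nodes carry metadata that relates them to the document they belong to and to other nodes.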
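The relationship between documents and nodes can be sketched in plain Python. This is a simplified model under stated assumptions, not LlamaIndex's own `Document` or node classes: the fixed-size character split and the field names `ref_doc_id`, `prev_id`, and `next_id` are chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Document:
    doc_id: str
    text: str


@dataclass
class Node:
    node_id: str
    text: str
    ref_doc_id: str                 # link back to the source document
    prev_id: Optional[str] = None   # link to the previous chunk, if any
    next_id: Optional[str] = None   # link to the next chunk, if any


def split_into_nodes(doc: Document, chunk_size: int = 40) -> list[Node]:
    """Cut the document text into fixed-size character chunks and link them."""
    chunks = [doc.text[i:i + chunk_size] for i in range(0, len(doc.text), chunk_size)]
    nodes = [Node(f"{doc.doc_id}-{n}", chunk, doc.doc_id) for n, chunk in enumerate(chunks)]
    for prev, nxt in zip(nodes, nodes[1:]):
        prev.next_id = nxt.node_id
        nxt.prev_id = prev.node_id
    return nodes


doc = Document("essay", "x" * 100)
nodes = split_into_nodes(doc, chunk_size=40)  # chunks of 40, 40, and 20 characters
```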

2、Connectors

A data connector (often called a Reader) ingests data from different sources and formats into Documents and Nodes.
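To make the idea of a reader concrete, here is a minimal standard-library sketch that turns the `.txt` files in a directory into document-like dictionaries. The function `load_text_files` and the dictionary layout are assumptions for this example; LlamaIndex's own readers return richer objects.

```python
import tempfile
from pathlib import Path


def load_text_files(directory: str) -> list[dict]:
    """Read every .txt file in `directory` into a document-like dict."""
    docs = []
    for path in sorted(Path(directory).glob("*.txt")):
        docs.append({
            "text": path.read_text(encoding="utf-8"),
            "metadata": {"file_name": path.name},  # keep track of where the text came from
        })
    return docs


# Demo on a throwaway directory so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first file", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second file", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")
    docs = load_text_files(tmp)
```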

0x2：Querying Stage

1、Retrievers

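A retriever defines how to efficiently fetch relevant context from an index for a given query. Your retrieval strategy is key to both the relevance of the retrieved data and the efficiency of retrieval.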
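A retriever can be pictured as "score every chunk against the query, keep the best k". The sketch below uses Jaccard word overlap as the score purely for illustration; it is an assumption of this example, whereas a vector index would compare embeddings instead.

```python
def tokenize(text: str) -> set[str]:
    return set(text.lower().split())


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Score each chunk by Jaccard word overlap with the query; return the best top_k."""
    q = tokenize(query)
    scored = []
    for chunk in chunks:
        c = tokenize(chunk)
        union = q | c
        score = len(q & c) / len(union) if union else 0.0
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


chunks = [
    "dogs are loyal pets",
    "cats sleep most of the day",
    "training dogs takes patience",
]
results = retrieve("training dogs", chunks, top_k=2)
```

Here the third chunk shares both query words and ranks first, the first chunk shares one word and ranks second, and the chunk about cats is dropped.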

2、Routers

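A router decides which retriever to use to fetch relevant context from the knowledge base. More specifically, the RouterRetriever class selects one or more candidate retrievers to execute a query. It uses a selector to choose the best option based on each candidate's metadata and the query.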
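The routing step can be sketched as "compare the query with each candidate's description and pick the best match". The keyword-overlap selector below is a deliberately crude stand-in chosen for this example; it is not how RouterRetriever's selectors are implemented, and the candidate names are hypothetical.

```python
from typing import Callable

Retriever = Callable[[str], list[str]]


def keyword_overlap(query: str, description: str) -> int:
    """Number of words the query shares with a candidate's description."""
    return len(set(query.lower().split()) & set(description.lower().split()))


def route(query: str, candidates: dict[str, tuple[str, Retriever]]) -> str:
    """Pick the candidate whose description shares the most words with the query."""
    return max(candidates, key=lambda name: keyword_overlap(query, candidates[name][0]))


# Each candidate is (description used for selection, retriever callable).
candidates = {
    "essays": ("essays about startups and programming", lambda q: ["essay chunk"]),
    "finance": ("quarterly revenue and finance reports", lambda q: ["finance chunk"]),
}
query = "what was the revenue last quarter"
chosen = route(query, candidates)
nodes = candidates[chosen][1](query)
```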

3、Node Postprocessors

A node postprocessor takes a set of retrieved nodes and applies transformation, filtering, or re-ranking logic to them.
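Filtering and re-ranking can be illustrated with two small functions over `(text, score)` pairs. Both functions and the sample scores are assumptions made up for this sketch rather than LlamaIndex classes.

```python
def similarity_cutoff(nodes: list[tuple[str, float]], cutoff: float) -> list[tuple[str, float]]:
    """Filter: drop nodes whose retrieval score is below the cutoff."""
    return [(text, score) for text, score in nodes if score >= cutoff]


def rerank_by_keyword(nodes: list[tuple[str, float]], keyword: str) -> list[tuple[str, float]]:
    """Re-rank: nodes mentioning the keyword first, then by descending score."""
    return sorted(nodes, key=lambda n: (keyword not in n[0].lower(), -n[1]))


retrieved = [
    ("YC funded many startups", 0.82),
    ("Lisp macros", 0.35),
    ("Leaving YC was hard", 0.61),
]
kept = similarity_cutoff(retrieved, cutoff=0.5)     # removes the 0.35 node
ranked = rerank_by_keyword(kept, keyword="leaving")  # promotes the node that mentions "leaving"
```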

4、Response Synthesizers

A response synthesizer generates a response from an LLM, using the user query and a given set of retrieved text chunks.
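In its simplest form, response synthesis means packing the retrieved chunks and the query into one prompt and calling the model once. The sketch below assumes that single-call strategy; the prompt wording and `echo_llm` are hypothetical stand-ins, not LlamaIndex's internal prompts or a real model.

```python
from typing import Callable


def synthesize(query: str, chunks: list[str], llm: Callable[[str], str]) -> str:
    """Put all retrieved chunks into one prompt and ask the LLM once."""
    context = "\n---\n".join(chunks)
    prompt = (
        "Context information is below.\n"
        f"{context}\n"
        "Given the context and no prior knowledge, answer the query.\n"
        f"Query: {query}\nAnswer: "
    )
    return llm(prompt)


def echo_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model: only reports how large the prompt was."""
    return f"(model saw {len(prompt.splitlines())} prompt lines)"


answer = synthesize("Who wrote the essay?", ["chunk one", "chunk two"], echo_llm)
```

When the chunks do not fit into one prompt, a synthesizer has to split the work across several calls, which is why the strategy is configurable in practice.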

References:

https://llamahub.ai/l/google_drive
https://docs.llamaindex.ai/en/stable/understanding/understanding.html

0x1：Installation from Pip

0x2：Local Model Setup

1、A full guide to using and configuring LLMs available

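Choosing the right large language model (LLM) is one of the first steps to consider when building any LLM application over private data.

LLMs are a core component of LlamaIndex. They can be used as standalone modules or plugged into other core LlamaIndex modules (indexes, retrievers, query engines). They are always used during the response synthesis step (that is, after retrieval). Depending on the type of index, LLMs may also be used during index construction, insertion, and query traversal.

LlamaIndex provides a unified interface for defining LLM modules, whether from OpenAI, Hugging Face, or LangChain, so you don't have to write the boilerplate for the LLM interface yourself. The interface offers:

Support for text completion and chat endpoints
Support for streaming and non-streaming endpoints
Support for synchronous and asynchronous endpoints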
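The three axes listed above (completion vs. chat, streaming vs. non-streaming, sync vs. async) can be seen side by side in a toy class. `ToyLLM` is a hypothetical model that always gives the same reply; it only mirrors the shape of such an interface and does not inherit from any LlamaIndex class.

```python
import asyncio
from typing import Iterator


class ToyLLM:
    """Hypothetical model that always produces the same reply."""

    reply = "hello world"

    def complete(self, prompt: str) -> str:
        # Text completion, non-streaming: the whole reply at once.
        return self.reply

    def stream_complete(self, prompt: str) -> Iterator[str]:
        # Streaming: yield the reply piece by piece.
        for word in self.reply.split():
            yield word

    def chat(self, messages: list[dict]) -> dict:
        # Chat: takes role/content messages and returns an assistant message.
        return {"role": "assistant", "content": self.complete(messages[-1]["content"])}

    async def acomplete(self, prompt: str) -> str:
        # Asynchronous variant of complete().
        await asyncio.sleep(0)
        return self.complete(prompt)


llm = ToyLLM()
full = llm.complete("hi")
streamed = list(llm.stream_complete("hi"))
chat_reply = llm.chat([{"role": "user", "content": "hi"}])
async_full = asyncio.run(llm.acomplete("hi"))
```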

The following snippets show how to use LLMs in llama-index.

Using an OpenAI model:

from llama_index.llms import OpenAI

# non-streaming
resp = OpenAI().complete("Paul Graham is ")
print(resp)

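Using a model hosted on Hugging Face:

# -*- coding: utf-8 -*-

import torch  # only needed if torch_dtype is enabled below
from llama_index import ServiceContext
from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import PromptTemplate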

if __name__ == "__main__":
    system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
    - StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
    - StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
    - StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
    - StableLM will refuse to participate in anything that could harm a human.
    """

    # This will wrap the default prompts that are internal to llama-index
    query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")
    llm = HuggingFaceLLM(
        context_window=4096,
        max_new_tokens=256,
        generate_kwargs={"temperature": 0.7, "do_sample": False},
        system_prompt=system_prompt,
        query_wrapper_prompt=query_wrapper_prompt,
        tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
        model_name="StabilityAI/stablelm-tuned-alpha-3b",
        device_map="auto",
        stopping_ids=[50278, 50279, 50277, 1, 0],
        tokenizer_kwargs={"max_length": 4096},
        # uncomment this if using CUDA to reduce memory usage
        # model_kwargs={"torch_dtype": torch.float16}
    )
    service_context = ServiceContext.from_defaults(
        chunk_size=1024,
        llm=llm,
    )

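To use a custom local LLM, you only need to implement the LLM class (or the CustomLLM class for a simpler interface). You are responsible for passing the text to the model and returning the newly generated tokens. The implementation can wrap a local model or even your own API.

# -*- coding: utf-8 -*-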

from typing import Optional, List, Mapping, Any

from llama_index import ServiceContext, SimpleDirectoryReader, SummaryIndex
from llama_index.callbacks import CallbackManager
from llama_index.llms import (
    CustomLLM,
    CompletionResponse,
    CompletionResponseGen,
    LLMMetadata,
)
from llama_index.llms.base import llm_completion_callback


class OurLLM(CustomLLM):
    context_window: int = 3900
    num_output: int = 256
    model_name: str = "custom"
    dummy_response: str = "My response"

    @property
    def metadata(self) -> LLMMetadata:
        """Get LLM metadata."""
        return LLMMetadata(
            context_window=self.context_window,
            num_output=self.num_output,
            model_name=self.model_name,
        )

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        return CompletionResponse(text=self.dummy_response)

    @llm_completion_callback()
    def stream_complete(
        self, prompt: str, **kwargs: Any
    ) -> CompletionResponseGen:
        response = ""
        for token in self.dummy_response:
            response += token
            yield CompletionResponse(text=response, delta=token)


# define our LLM
llm = OurLLM()

service_context = ServiceContext.from_defaults(
    llm=llm, embed_model="local:BAAI/bge-base-en-v1.5"
)

# Load your data
documents = SimpleDirectoryReader("./data").load_data()
index = SummaryIndex.from_documents(documents, service_context=service_context)

# Query and print response
query_engine = index.as_query_engine()
response = query_engine.query("<query_text>")
print(response)

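With this approach you can use any LLM, whether it runs locally or on your own server. As long as the class is implemented and returns the generated tokens, it should work.

Note that we need to use the prompt helper to customize the prompt sizes, since every model has a slightly different context length.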
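The reason prompt sizes must be tuned per model is simple arithmetic: the context window has to hold the prompt template, the retrieved text, and the reply. The sketch below uses the `context_window=3900` and `num_output=256` values from `OurLLM` above; the 100-token template size and the one-token-per-word count are assumptions for illustration, and the functions are not the LlamaIndex prompt helper itself.

```python
def context_budget(context_window: int, num_output: int, prompt_tokens: int) -> int:
    """Tokens left for retrieved text after reserving the reply and the prompt template."""
    return max(context_window - num_output - prompt_tokens, 0)


def fit_chunks(chunks: list[str], budget: int) -> list[str]:
    """Keep chunks, in order, while their combined size stays within the budget."""
    kept, used = [], 0
    for chunk in chunks:
        size = len(chunk.split())  # crude assumption: one token per whitespace-separated word
        if used + size > budget:
            break
        kept.append(chunk)
        used += size
    return kept


# 3900 - 256 - 100 = 3544 tokens remain for retrieved context.
budget = context_budget(context_window=3900, num_output=256, prompt_tokens=100)

# With a tiny budget of 5 "tokens", only the first two chunks (3 + 2 words) fit.
small = fit_chunks(["a b c", "d e", "f g h i"], budget=5)
```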

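The llm_completion_callback decorator is optional, but it provides observability via callbacks on the LLM calls.

Note that you may have to adjust the internal prompts to get good performance. Even then, you should use a sufficiently large LLM so that it can handle the complex queries LlamaIndex uses internally, so your mileage may vary.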

2、A full guide to using and configuring embedding models is available

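In LlamaIndex, embeddings represent your documents as numerical vectors.

Embedding models are pre-trained on very large text corpora without supervision. They take text as input and return a long list of numbers (a vector) that captures the semantics of the text.

At a high level, if a user asks a question about dogs, the embedding of that question will be highly similar to the embeddings of text that talks about dogs.

There are many ways to compute the similarity between embeddings (dot product, cosine similarity, and so on). By default, LlamaIndex uses cosine similarity when comparing embeddings.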
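The two similarity measures mentioned above are easy to write out. In this sketch the three-dimensional vectors are made-up toy embeddings (real models return hundreds or thousands of dimensions), chosen so that the difference between dot product and cosine similarity is visible.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


question = [1.0, 2.0, 0.0]   # made-up embedding of a question about dogs
dog_text = [2.0, 4.0, 0.0]   # same direction as the question, twice the length
car_text = [0.0, 0.0, 3.0]   # orthogonal to the question

same_direction = cosine_similarity(question, dog_text)  # close to 1: same direction
unrelated = cosine_similarity(question, car_text)       # 0: nothing in common
raw_dot = dot(question, dog_text)                       # 10: grows with vector length
```

Cosine similarity ignores vector length and looks only at direction, so `dog_text` scores as a near-perfect match even though its dot product with the question depends on how long the vectors are.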

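There are many embedding models to pick from. By default, LlamaIndex uses OpenAI's text-embedding-ada-002. llama-index also supports any embedding model offered by LangChain, and provides an easy-to-extend base class for implementing your own embeddings.

Most commonly, the embedding model is specified in the ServiceContext object and then used by a vector index. The embedding model embeds the documents during index construction, and later embeds any queries made through the query engine.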

from llama_index import ServiceContext
from llama_index.embeddings import OpenAIEmbedding

embed_model = OpenAIEmbedding()
service_context = ServiceContext.from_defaults(embed_model=embed_model)

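Once the embedding model is set in the service context, it is used both to build the index and to run queries. The input documents are split into nodes, and the embedding model generates an embedding for each node.

By default, LlamaIndex uses text-embedding-ada-002: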

from llama_index import ServiceContext, VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings import OpenAIEmbedding

embed_model = OpenAIEmbedding()
service_context = ServiceContext.from_defaults(embed_model=embed_model)

# optionally set a global service context to avoid passing it into other objects every time
from llama_index import set_global_service_context

set_global_service_context(service_context)

documents = SimpleDirectoryReader("./data").load_data()

index = VectorStoreIndex.from_documents(documents)

Then, at query time, the embedding model is used again to embed the query text.

query_engine = index.as_query_engine()

response = query_engine.query("query string")

References:

https://huggingface.co/stabilityai/stablelm-tuned-alpha-3b
https://docs.llamaindex.ai/en/stable/api_reference/llms/huggingface.html
https://github.com/run-llama/llama_index/blob/main/llama_index/prompts/default_prompts.py
https://github.com/run-llama/llama_index/blob/main/llama_index/prompts/chat_prompts.py 
https://docs.llamaindex.ai/en/stable/module_guides/models/llms/usage_custom.html
https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html
https://docs.llamaindex.ai/en/stable/module_guides/models/llms.html

0x1：Download Data

mkdir -p 'data/paul_graham/'
wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

0x2：Load documents, build the VectorStoreIndex

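This step computes an embedding vector for each chunk of the large, high-dimensional corpus and collects the vectors into a vector knowledge base.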

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM

# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate

system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model="local:BAAI/bge-large-en")

index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

0x3：Query Index

The input query is converted into an embedding vector by the embedding model. A vector similarity search then finds the closest embedded chunk nodes in the vector knowledge base.

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM

# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate

system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model="local:BAAI/bge-large-en")

index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

query_engine = index.as_query_engine()
response = query_engine.query("what is The worst thing about leaving YC?")
print(response)

0x4：Storing your index

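By default, the data you just loaded is stored in memory as a series of vector embeddings. You can save time (and requests to the model) by persisting the embeddings to disk.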

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext, StorageContext, load_index_from_storage
from llama_index.llms import HuggingFaceLLM

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate

system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model="local:BAAI/bge-large-en")

import os.path
# check if storage already exists
if not os.path.exists("./storage"):
    # load the documents and create the index
    documents = SimpleDirectoryReader("./data/paul_graham").load_data()
    index = VectorStoreIndex.from_documents(
        documents, service_context=service_context
    )
    # store it for later
    index.storage_context.persist()
else:
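    # load the existing index; pass the same service_context so the local LLM
    # and embedding model are used instead of the library defaults
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context, service_context=service_context)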

query_engine = index.as_query_engine()
response = query_engine.query("what is The worst thing about leaving YC?")
print(response)

0x5：chat with LLM with the response

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM

# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()

# setup prompts - specific to StableLM
from llama_index.prompts import PromptTemplate

system_prompt = """<|SYSTEM|># StableLM Tuned (Alpha version)
- StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
- StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
- StableLM will refuse to participate in anything that could harm a human.
"""

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=1024, llm=llm, embed_model="local:BAAI/bge-large-en")

index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

query_engine = index.as_query_engine()
response = query_engine.query("what is The worst thing about leaving YC?")
print(response)

chat_engine = index.as_chat_engine()
response = chat_engine.chat("Oh interesting, tell me more.")
print(response)

References:

https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html#modules
https://docs.llamaindex.ai/en/stable/examples/customization/llms/SimpleIndexDemo-Huggingface_stablelm.html 
https://docs.llamaindex.ai/en/stable/examples/vector_stores/SimpleIndexDemoLlama-Local.html

One of the most common uses of LLMs is answering questions over a set of documents. LlamaIndex has rich support for many forms of question answering.

References:

https://docs.llamaindex.ai/en/stable/use_cases/q_and_a.html

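Chatbots are another extremely popular use case for LLMs. Instead of a single question and answer, a chatbot can handle multiple back-and-forth queries and answers, asking for clarification or answering follow-up questions.

LlamaIndex gives you the tools to build knowledge-augmented chatbots and agents.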
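What separates a chat engine from a one-shot query engine is the conversation state. The sketch below shows one possible way to handle it, assuming earlier user turns are simply folded into the retrieval query; `ToyChatEngine`, `fake_retrieve`, and `fake_llm` are hypothetical and do not reflect how `as_chat_engine()` works internally.

```python
from typing import Callable


class ToyChatEngine:
    """Keeps the conversation and passes it to the LLM together with retrieved context."""

    def __init__(self, retrieve: Callable[[str], list[str]], llm: Callable[[str], str]):
        self.retrieve = retrieve
        self.llm = llm
        self.history: list[tuple[str, str]] = []  # (role, message) pairs

    def chat(self, message: str) -> str:
        # Fold earlier user turns into the search text so that a follow-up
        # such as "Tell me more." still retrieves something relevant.
        earlier = " ".join(text for role, text in self.history if role == "user")
        context = self.retrieve(f"{earlier} {message}".strip())
        transcript = "\n".join(f"{role}: {text}" for role, text in self.history)
        reply = self.llm(f"{transcript}\ncontext: {context}\nuser: {message}")
        self.history += [("user", message), ("assistant", reply)]
        return reply


seen_queries: list[str] = []


def fake_retrieve(query: str) -> list[str]:
    seen_queries.append(query)  # record what was searched for
    return ["chunk"]


def fake_llm(prompt: str) -> str:
    # Stand-in for a model: reports how many user turns the prompt contains.
    return f"reply #{prompt.count('user:')}"


engine = ToyChatEngine(fake_retrieve, fake_llm)
first = engine.chat("What about YC?")
second = engine.chat("Tell me more.")
```

The second call searches for "What about YC? Tell me more." rather than the bare follow-up, which is the behaviour a single-shot query engine cannot provide.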

References:

https://docs.llamaindex.ai/en/stable/use_cases/chatbots.html
https://docs.llamaindex.ai/en/stable/understanding/putting_it_all_together/chatbots/building_a_chatbot.html

Source: https://www.cnblogs.com/LittleHann/p/17879401.html