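Although large models present themselves as end-to-end systems with text in and text out, the data a model actually sees and learns from is not raw text but vectorized text, because raw text is too high-dimensional and too sparse to learn from efficiently. Vectorized text is the model's compression and summary of natural language. Early NLP textbooks have a classic example: if each word is treated as a vector, the difference between king and queen is approximately equal to the difference between man and woman, and both represent the difference in gender.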
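The king/queen example can be made concrete with a few lines of Python. This is only a sketch: the three-dimensional vectors and the meaning assigned to each dimension are invented for illustration, whereas real embeddings have hundreds or thousands of learned dimensions and the two differences would be only approximately equal.

```python
# Toy 3-dimensional "embeddings". The numbers are invented for illustration;
# the hypothetical dimensions are [royalty, maleness, personhood].
vectors = {
    "king":  [0.9, 0.9, 0.8],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.8],
    "woman": [0.1, 0.1, 0.8],
}

def subtract(a, b):
    """Element-wise difference of two vectors of equal length."""
    return [x - y for x, y in zip(a, b)]

def close(a, b, tol=1e-9):
    """True if the two vectors agree in every dimension within tol."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))

royal_gap = subtract(vectors["king"], vectors["queen"])
common_gap = subtract(vectors["man"], vectors["woman"])

print("king - queen =", royal_gap)
print("man - woman  =", common_gap)
print("same direction:", close(royal_gap, common_gap))
```

In this toy setup the only non-zero component of both differences is the "maleness" dimension, which is the sense in which the difference encodes gender.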
This key information is the embedding as processed by the human brain.
Vector search means finding, among a massive number of stored vectors, the k targets that best match the query.
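Suppose I want to find the 5 passages most relevant to "latest developments in Silicon Valley" in the 海外独角兽 text library. First, I use the OpenAI Embedding API to turn every 海外独角兽 article into vectors and store them in a vector database. Next, I compare the vector of "latest developments in Silicon Valley" against every vector in the database by semantic similarity. Finally, I rank by similarity and return the top 5 passages, which will quite likely come from what the team saw and heard on its trip to Silicon Valley last year.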
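The compare-and-rank step above can be sketched as a brute-force top-k search. The vectors here are small hand-written stand-ins (in practice they would come from an embedding model or API), and `top_k`, `store`, and the sample titles are hypothetical names chosen for this illustration. Production systems such as FAISS use approximate indexes rather than scanning every vector.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, store, k):
    """Return the k (score, text) pairs most similar to query_vector."""
    scored = ((cosine_similarity(query_vector, vec), text) for text, vec in store)
    return heapq.nlargest(k, scored)

# Hypothetical pre-computed vectors standing in for real embeddings.
store = [
    ("Notes from the team's Silicon Valley trip", [0.9, 0.1, 0.2]),
    ("A history of relational databases", [0.1, 0.9, 0.1]),
    ("Interview with a Bay Area founder", [0.8, 0.2, 0.3]),
    ("Quarterly results of a retail chain", [0.1, 0.2, 0.9]),
]
query_vector = [1.0, 0.1, 0.2]  # stand-in for the embedded query

results = top_k(query_vector, store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A full scan like this costs one similarity computation per stored vector, which is fine for thousands of passages but is the reason dedicated vector indexes exist for larger collections.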
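So vector search algorithms had in fact appeared well before this, and Facebook's open-source FAISS is a standout among them. Before large models emerged, however, the need existed only inside large tech companies and was met mainly with in-house products.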
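If an LLM needs a large amount of information or corpus as reference, dumping all of the text into the prompt is clearly uneconomical, and too much irrelevant information can also mislead the model's output. A better approach is to vectorize the corpus in advance, retrieve the passages whose embeddings are similar to the question's embedding, and send them to the GPT model together with the question. This is a typical way of integrating the OpenAI API, and at this stage a fairly flexible and economical one. Here vector search plays the role of selecting the best content for the prompt.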
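A minimal sketch of this retrieve-then-prompt flow follows. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and `build_prompt`, the prompt wording, and the sample corpus are all assumptions made for illustration rather than any particular library's API.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_prompt(question, corpus, k=2):
    """Keep only the k chunks most similar to the question, then build the prompt."""
    question_vector = embed(question)
    ranked = sorted(
        corpus,
        key=lambda chunk: cosine_similarity(question_vector, embed(chunk)),
        reverse=True,
    )
    context = "\n".join(f"- {chunk}" for chunk in ranked[:k])
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

corpus = [
    "FAISS is an open source library for vector search.",
    "MongoDB stores documents as JSON.",
    "A vector database stores embeddings and supports similarity search.",
    "Bananas are rich in potassium.",
]
question = "How does a vector database support similarity search?"

prompt = build_prompt(question, corpus, k=2)
print(prompt)
```

Only the two most relevant chunks reach the model, so the unrelated passages neither cost tokens nor get a chance to mislead the output.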
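Part of MongoDB's importance comes from the flexibility of JSON: it covers many scenarios, so a single database does the work of several kinds of database. Vector embeddings have the same potential. Multimedia data such as text, images, audio, and video may all be compressed into vectorized data by general-purpose large models in the future.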
So if we take AI application = LLM + interaction + memory + multimodality,
Once AI has such strong abilities to extract and organize information, many capabilities of traditional databases come under pressure.
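As vector search spreads, many product features that were hard to unlock with SQL and structured data become naturally achievable, and over the long term users' usage patterns are sure to shift gradually from traditional databases to LLM + Vector DB.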
The figure below shows how a vector database enhances the capabilities of AI applications.
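In other words, from the perspective of human intelligence, the vector database is short-term memory and the LLM is long-term memory, but the interaction between them is currently one-way: there is no process by which short-term memory accumulates and settles into long-term memory. Directly adjusting a large model's parameters is not very practical, so this process may need new components to fill the gap. One example is a small model fine-tuned with LoRA that helps the large model remember domain-specific knowledge; another is multiple LLMs interacting to form a collective memory, which has the effect of updating long-term memory.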
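There is also a view that once an LLM can read an unlimited number of tokens, vector databases will no longer be very necessary. In theory this is entirely feasible, but it overlooks economic cost and engineering reusability. If every call feeds the relevant corpus into the prompt without retrieval, most of that content has low information gain and low ROI, which can incur a lot of unnecessary commercial cost and wasted resources. The problem becomes even more pronounced once large models accept multimodal input. Moreover, having the model use vectorized memory and then vectorize its output and store it back into memory forms a good memory loop, which to some extent lets the LLM summarize and reuse its past experience and knowledge.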
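The cost argument can be illustrated with back-of-the-envelope arithmetic. Every number below (corpus size, chunk size, price) is hypothetical and chosen only to make the comparison concrete; actual prices and context limits vary by model and provider.

```python
# All numbers are hypothetical, chosen only to make the arithmetic concrete.
CORPUS_TOKENS = 2_000_000     # size of the whole corpus
CHUNK_TOKENS = 500            # size of one retrieved chunk
TOP_K = 5                     # chunks retrieved per question
QUESTION_TOKENS = 50          # size of the question itself
PRICE_PER_1K_TOKENS = 0.002   # assumed input price in dollars

def prompt_cost(prompt_tokens, price_per_1k=PRICE_PER_1K_TOKENS):
    """Dollar cost of sending a prompt of the given size once."""
    return prompt_tokens / 1000 * price_per_1k

# Without retrieval: the whole corpus goes into every prompt.
full_tokens = CORPUS_TOKENS + QUESTION_TOKENS
# With retrieval: only the top-k chunks go into the prompt.
retrieved_tokens = TOP_K * CHUNK_TOKENS + QUESTION_TOKENS

full_cost = prompt_cost(full_tokens)
retrieved_cost = prompt_cost(retrieved_tokens)

print(f"full corpus prompt: {full_tokens} tokens, ${full_cost:.4f} per call")
print(f"retrieved prompt:   {retrieved_tokens} tokens, ${retrieved_cost:.4f} per call")
print(f"ratio: about {full_tokens / retrieved_tokens:.0f}x")
```

Under these assumed figures, the unretrieved prompt is several hundred times larger per call, and that gap is paid again on every request.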
pip install chromadb
pip install pybind11
pip install tiktoken
pip install unstructured
pip install pdf2image
pip install pytesseract
import os
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain import OpenAI, VectorDBQA
from langchain.document_loaders import DirectoryLoader
from langchain.chains import RetrievalQA

# Load every .neat file under the folder
loader = DirectoryLoader('./webshell_data_0414', glob='**/*.neat')
# Convert the data into document objects; each file becomes one document
documents = loader.load()
# Initialize the text splitter
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=0)
# Split the loaded documents into chunks
split_docs = text_splitter.split_documents(documents)
# Initialize the embeddings object
embeddings = HuggingFaceInstructEmbeddings()
# Compute an embedding vector for each chunk and store it in the Chroma
# vector database for later similarity queries
vector_store_path = r"./vector_store"
docsearch = Chroma.from_documents(split_docs, embeddings, persist_directory=vector_store_path)
pip install InstructorEmbedding
pip install sentence_transformers
Then you can use the model like this to calculate domain-specific and task-aware embeddings:
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"
embeddings = model.encode([[instruction, sentence]])
print(embeddings)
You can further use the model to compute similarities between two groups of sentences, with customized embeddings.
from sklearn.metrics.pairwise import cosine_similarity
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
sentences_a = [
    ['Represent the Science sentence: ', 'Parton energy loss in QCD matter'],
    ['Represent the Financial statement: ', 'The Federal Reserve on Wednesday raised its benchmark interest rate.'],
]
sentences_b = [
    ['Represent the Science sentence: ', 'The Chiral Phase Transition in Dissipative Dynamics'],
    ['Represent the Financial statement: ', 'The funds rose less than 0.5 per cent on Friday'],
]
embeddings_a = model.encode(sentences_a)
embeddings_b = model.encode(sentences_b)
similarities = cosine_similarity(embeddings_a, embeddings_b)
print(similarities)
You can also use customized embeddings for information retrieval.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
query = [
    ['Represent the Wikipedia question for retrieving supporting documents: ',
     'where is the food stored in a yam plant'],
]
corpus = [
    ['Represent the Wikipedia document for retrieval: ',
     'Capitalism has been dominant in the Western world since the end of feudalism, but most feel[who?] that the term "mixed economies" more precisely describes most contemporary economies, due to their containing both private-owned and state-owned enterprises. In capitalism, prices determine the demand-supply scale. For example, higher demand for certain goods and services lead to higher prices and lower demand for certain goods lead to lower prices.'],
    ['Represent the Wikipedia document for retrieval: ',
     "The disparate impact theory is especially controversial under the Fair Housing Act because the Act regulates many activities relating to housing, insurance, and mortgage loans—and some scholars have argued that the theory's use under the Fair Housing Act, combined with extensions of the Community Reinvestment Act, contributed to rise of sub-prime lending and the crash of the U.S. housing market and ensuing global economic recession"],
    ['Represent the Wikipedia document for retrieval: ',
     'Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.'],
]
query_embeddings = model.encode(query)
corpus_embeddings = model.encode(corpus)
similarities = cosine_similarity(query_embeddings, corpus_embeddings)
print("similarities: ", similarities)
retrieved_doc_id = np.argmax(similarities)
print("retrieved_doc_id: ", retrieved_doc_id)
print("corpus[retrieved_doc_id]: ", corpus[retrieved_doc_id])
As shown, embedding similarity search finds the corpus entry closest to the query. In a prompt retrieval scenario, this can be used to improve the precision of the prompt and the richness of its context.
Use customized embeddings for clustering texts in groups.
import sklearn.cluster
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
sentences = [
    ['Represent the Medicine sentence for clustering: ', 'Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity'],
    ['Represent the Medicine sentence for clustering: ', 'Comparison of Atmospheric Neutrino Flux Calculations at Low Energies'],
    ['Represent the Medicine sentence for clustering: ', 'Fermion Bags in the Massive Gross-Neveu Model'],
    ['Represent the Medicine sentence for clustering: ', "QCD corrections to Associated t-tbar-H production at the Tevatron"],
    ['Represent the Medicine sentence for clustering: ', 'A New Analysis of the R Measurements: Resonance Parameters of the Higher, Vector States of Charmonium'],
]
embeddings = model.encode(sentences)
clustering_model = sklearn.cluster.MiniBatchKMeans(n_clusters=2)
clustering_model.fit(embeddings)
cluster_assignment = clustering_model.labels_
print(cluster_assignment)
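from langchain.vectorstores import Pinecone

# Assumes the Pinecone client is already initialized and that `index_name`,
# `split_docs`, and `embeddings` are defined earlier.
# Persist the data
docsearch = Pinecone.from_texts([t.page_content for t in split_docs], embeddings, index_name=index_name)
# Load the data from an existing index
docsearch = Pinecone.from_existing_index(index_name, embeddings)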