Engineering in Practice with LLMs and RAG: Converting Words to Vectors with Statistical Methods

In the previous section we walked through the overall RAG pipeline and saw that documents must be split into smaller chunks and converted into mathematical vectors. This section shows how to convert those document chunks into vectors and how to search for the chunks to supply as context for a question, in order to improve the quality of the LLM's response.

The first concept we need is TF-IDF, short for Term Frequency-Inverse Document Frequency. It is a statistical method for evaluating how important a word in a document is relative to a collection of documents. It requires computing two metrics. The first is term frequency, computed as follows:

TF(t, d) = (number of times word t appears in document d) / (total number of words in document d)

The second metric is inverse document frequency, computed as follows:

IDF(t) = log[(total number of documents) / (number of documents containing word t)]

If a word appears in many documents, such as "is", "the", "a", or "an", it usually carries little useful information.

We combine the two metrics to obtain TF-IDF:

TF-IDF(t, d) = TF(t, d) * IDF(t)

In the formula above, a word's importance is determined by two factors. The first is how often the word appears in the given document: the more often it appears, the more important it is. But we need to dampen the influence of common words such as "a", "an", and "the", which show up easily in any document yet contribute nothing to its key information. Moreover, such words appear in many documents, so if a word can easily appear in many documents, its importance should be reduced. Multiplying the two factors combines both effects.
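As a quick worked example, take the two documents used in the code below: d1 = ["this", "is", "a", "sample"] and d2 = ["this", "is", "another", "example", "example"]. The word "example" occurs 2 times among the 5 words of d2, so TF(example, d2) = 2/5 = 0.4; it appears in 1 of the 2 documents, so IDF(example) = log(2/1) ≈ 0.6931, giving TF-IDF(example, d2) = 0.4 * 0.6931 ≈ 0.2773. By contrast, "is" appears in both documents, so IDF(is) = log(2/2) = 0 and its TF-IDF is 0 in every document.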

Let's implement the TF-IDF algorithm in code, as follows:

from collections import Counter
import math

def compute_tf(doc):
  """
  Compute the frequency of each word in the document:
  TF(t, d) = (occurrences of word t in document d) / (total words in d)
  """
  word_count = Counter(doc)
  total_terms = len(doc)
  return {term: count / total_terms for term, count in word_count.items()}

def compute_idf(all_docs):
  """
  IDF(t) = log[(total number of documents) / (number of documents containing word t)]
  """
  total_docs = len(all_docs)
  idf = {}
  for doc in all_docs:
    for term in set(doc):
      # count the number of documents containing the given term
      idf[term] = idf.get(term, 0) + 1
  return {term: math.log(total_docs / count) for term, count in idf.items()}

def compute_tf_idf(all_docs):
  """
  TF-IDF(t, d) = TF(t, d) * IDF(t)
  """
  tf_list = [compute_tf(doc) for doc in all_docs]
  idf = compute_idf(all_docs)
  tf_idf_list = []
  for tf in tf_list:
    tf_idf = {term: tf[term] * idf.get(term, 0) for term in tf}
    tf_idf_list.append(tf_idf)

  return tf_idf_list

all_docs = [
    ["this", "is", "a", "sample"],
    ["this", "is", "another", "example", "example"]
]

tf_idf_scores = compute_tf_idf(all_docs)

for i, doc_scores in enumerate(tf_idf_scores):
  print(f"document: {i+1} TF-IDF scores:")
  for term, score in doc_scores.items():
    print(f" {term}: {score:.4f}")

Running the code above, we get the following output:

document: 1 TF-IDF scores:
 this: 0.0000
 is: 0.0000
 a: 0.1733
 sample: 0.1733
document: 2 TF-IDF scores:
 this: 0.0000
 is: 0.0000
 another: 0.1386
 example: 0.2773
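As expected, "this" and "is" score 0 in both documents because they appear in every document, so their IDF is log(2/2) = 0. "example" gets the highest score: it occurs twice in document 2 and nowhere else.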

Given the TF-IDF score of each word, let's look at how to convert a word into a vector. Assume the documents are now the following three (this is the all_docs the rest of the section works with):

all_docs = [
    ["this", "is", "a", "sample", "document"],
    ["this", "document", "is", "another", "example"],
    ["one", "more", "sample", "document", "example"]
]

# 1. Deduplicate the words across all documents and sort them, so that each
# word gets a fixed index; also compute the IDF value of each word
idf = compute_idf(all_docs)
print(f"idf : {idf}")

vocab = sorted(idf.keys())
print(f"vocab: {vocab}")

Running the code above produces:

idf : {'a': 1.0986122886681098, 'document': 0.0, 'this': 0.4054651081081644, 'sample': 0.4054651081081644, 'is': 0.4054651081081644, 'example': 0.4054651081081644, 'another': 1.0986122886681098, 'more': 1.0986122886681098, 'one': 1.0986122886681098}
vocab: ['a', 'another', 'document', 'example', 'is', 'more', 'one', 'sample', 'this']

Each word now has an index: 'a' has index 0, 'another' has index 1, and so on. Note that 'document' appears in all three documents, so its IDF is log(3/3) = 0.

2. To convert a document into a vector, take each word from the vocabulary in order: if the word appears in the document, place the word's TF-IDF value at the corresponding index of the vector; if it does not, set that position to 0:

def compute_tf_idf_vector(doc, idf):
  tf = compute_tf(doc)
  # map each word of the document to its TF-IDF value
  vec = {term: tf[term] * idf.get(term, 0) for term in tf}
  print(f"doc: {doc} and word to tf-idf value: {vec}")
  return vec

def doc2vec(doc, idf):
  vec = compute_tf_idf_vector(doc, idf)
  # lay the TF-IDF values out in vocabulary order; words absent from
  # the document get 0
  vector = [vec.get(term, 0) for term in vocab]
  return vector

vectors = []
for i, doc in enumerate(all_docs):
  print(f"the {i}th doc is : {doc}")
  vec = doc2vec(doc, idf)
  print(f"vector of document {i} is : {vec}")
  vectors.append(vec)

Running the code above, we get the following output:

the 0th doc is : ['this', 'is', 'a', 'sample', 'document']
doc: ['this', 'is', 'a', 'sample', 'document'] and word to tf-idf value: {'this': 0.08109302162163289, 'is': 0.08109302162163289, 'a': 0.21972245773362198, 'sample': 0.08109302162163289, 'document': 0.0}
vector of document 0 is : [0.21972245773362198, 0, 0.0, 0, 0.08109302162163289, 0, 0, 0.08109302162163289, 0.08109302162163289]
the 1th doc is : ['this', 'document', 'is', 'another', 'example']
doc: ['this', 'document', 'is', 'another', 'example'] and word to tf-idf value: {'this': 0.08109302162163289, 'document': 0.0, 'is': 0.08109302162163289, 'another': 0.21972245773362198, 'example': 0.08109302162163289}
vector of document 1 is : [0, 0.21972245773362198, 0.0, 0.08109302162163289, 0.08109302162163289, 0, 0, 0, 0.08109302162163289]
the 2th doc is : ['one', 'more', 'sample', 'document', 'example']
doc: ['one', 'more', 'sample', 'document', 'example'] and word to tf-idf value: {'one': 0.21972245773362198, 'more': 0.21972245773362198, 'sample': 0.08109302162163289, 'document': 0.0, 'example': 0.08109302162163289}
vector of document 2 is : [0, 0, 0.0, 0.08109302162163289, 0, 0.21972245773362198, 0.21972245773362198, 0.08109302162163289, 0]

3. For a given word, find its index in the vocabulary, then iterate over every document vector, take the value at that index, and combine all the values into the vector for that word:

def get_word_vec(word):
  word_index = vocab.index(word)
  print(f"word index: {word_index}")
  # collect the word's TF-IDF weight in every document vector
  word_vec = []
  for vector in vectors:
    val = vector[word_index]
    word_vec.append(val)
  return word_vec

word = "sample"
vec = get_word_vec(word)
print(f"vector for word: {word} is {vec}")

Running the code above, we get:

word index: 7
vector for word: sample is [0.08109302162163289, 0, 0.08109302162163289]

Each component of a word's vector is its TF-IDF weight in one document, so words that tend to appear in the same documents get similar vectors. We can now measure how close two such vectors are with cosine similarity, defined as follows:

Cosine Similarity = (A . B) / (||A|| * ||B||)

The dot product of two vectors (x1, y1) and (x2, y2) is x1*x2 + y1*y2, and ||A|| is the norm of A: if A is (x, y), its norm is sqrt(x^2 + y^2). Consider the following code:

# compute cosine similarity
import math

def dot_product(v1, v2):
  return sum(x * y for x, y in zip(v1, v2))

def norm(vec):
  return math.sqrt(sum(x ** 2 for x in vec))

def cosine_similarity(v1, v2):
  dot = dot_product(v1, v2)
  norm1 = norm(v1)
  norm2 = norm(v2)
  # guard against zero vectors, which have no defined direction
  if norm1 == 0 or norm2 == 0:
    return 0.0
  return dot / (norm1 * norm2)

word1 = "sample"
word2 = "example"

v1 = get_word_vec(word1)
v2 = get_word_vec(word2)

print(f"v1: {v1}")
print(f"v2: {v2}")

sim = cosine_similarity(v1, v2)
print(f"cosine similarity is: {sim}")

Running the code above, we get the following output:

word index: 7
word index: 3
v1: [0.08109302162163289, 0, 0.08109302162163289]
v2: [0, 0.08109302162163289, 0.08109302162163289]
cosine similarity is: 0.5000000000000001
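The similarity of 0.5 is easy to verify by hand: the two word vectors share a single non-zero component (both words appear in the third document), so the dot product is 0.0811^2, while each norm is 0.0811 * sqrt(2), giving 0.0811^2 / (2 * 0.0811^2) = 0.5.

This is exactly the machinery RAG retrieval needs: treat the user's question as a tiny document, convert it to a vector, and rank the documents by cosine similarity to find the best context chunk. Below is a minimal sketch reusing the doc2vec, cosine_similarity, idf, and vectors defined above; the rank_documents helper and the query are illustrative additions, not part of the original example:

# A minimal retrieval sketch reusing doc2vec, cosine_similarity, idf and
# vectors defined above; rank_documents and the query are illustrative.
def rank_documents(query_tokens, idf, doc_vectors):
  # turn the query into a TF-IDF vector over the same vocabulary
  query_vec = doc2vec(query_tokens, idf)
  # score every document vector against the query vector
  scores = [cosine_similarity(query_vec, doc_vec) for doc_vec in doc_vectors]
  # return document indices sorted from most to least similar
  ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
  return ranking, scores

query = ["sample", "document"]
ranking, scores = rank_documents(query, idf, vectors)
print(f"scores: {scores}")
print(f"documents ranked by similarity to the query: {ranking}")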

In practice we don't need to do all of this ourselves; we can use the same algorithm directly through the sklearn library, as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

all_docs = [
    "this is a sample document",
    "this document is another example",
    "one more sample document example"
]

vectorizer = TfidfVectorizer()
# convert each document to a vector
doc_vecs = vectorizer.fit_transform(all_docs)
"""
The result is a sparse matrix: (0, 7)  0.5268 means the cell at row 0 and
column 7 has value 0.5268; any cell not mentioned is 0.
"""
print(f"document vecs matrix with compression: {doc_vecs}")
doc_dense_vecs = doc_vecs.todense()
print(f"dense vecs: {doc_dense_vecs}")

vocab = vectorizer.get_feature_names_out()
print(f"vocab: {vocab}")

for i, doc_vec in enumerate(doc_dense_vecs):
  print(f"doc {i+1} with TF-IDF scores: {doc_vec}")

word = "example"
word_index = vectorizer.vocabulary_.get(word)
print(f"index for word:{word} is :{word_index}")
if word_index is not None:
  print(f"\n vector for word:{word}:")
  print(doc_dense_vecs[:, word_index])

from sklearn.metrics.pairwise import cosine_similarity
# extract the word's column and transpose it into a row vector
word_example_vec = np.asarray(doc_dense_vecs[:, word_index]).T
print(f"vec for example after flatten: {word_example_vec}")
word = "sample"
word_index = vectorizer.vocabulary_.get(word)
word_sample_vec = np.asarray(doc_dense_vecs[:, word_index]).T
print(f"vec for sample after flatten: {word_sample_vec}")
similarity = cosine_similarity(word_example_vec, word_sample_vec)
print(f"similarity is: {similarity}")

In the code above, the vectorizer does what we previously did by hand: it computes the TF-IDF of each word, builds the vocabulary, and turns each document into a TF-IDF vector, all in the single call vectorizer.fit_transform(all_docs). The vocabulary we built ourselves earlier is available through vectorizer.get_feature_names_out().

doc_dense_vecs holds the resulting vector for each document. As before, we can look up a word's index in the vocabulary, read the value at that index from each document vector, and finally compute the cosine similarity with cosine_similarity from sklearn.metrics.pairwise. Note that the numbers differ from our manual results: TfidfVectorizer's default tokenizer drops single-character words such as "a", and by default it smooths the IDF term and L2-normalizes each document vector. Running the code above, we get:

document vecs matrix with compression:   (0, 7)    0.5268201732399633
  (0, 3)    0.5268201732399633
  (0, 6)    0.5268201732399633
  (0, 1)    0.40912286076708654
  (1, 7)    0.43306684852870914
  (1, 3)    0.43306684852870914
  (1, 1)    0.33631504064053513
  (1, 0)    0.5694308628404254
  (1, 2)    0.43306684852870914
  (2, 6)    0.4061917781433946
  (2, 1)    0.3154441510317797
  (2, 2)    0.4061917781433946
  (2, 5)    0.5340933749435833
  (2, 4)    0.5340933749435833
dense vecs: [[0.         0.40912286 0.         0.52682017 0.         0.
  0.52682017 0.52682017]
 [0.56943086 0.33631504 0.43306685 0.43306685 0.         0.
  0.         0.43306685]
 [0.         0.31544415 0.40619178 0.         0.53409337 0.53409337
  0.40619178 0.        ]]
vocab: ['another' 'document' 'example' 'is' 'more' 'one' 'sample' 'this']
doc 1 with TF-IDF scores: [[0.         0.40912286 0.         0.52682017 0.         0.
  0.52682017 0.52682017]]
doc 2 with TF-IDF scores: [[0.56943086 0.33631504 0.43306685 0.43306685 0.         0.
  0.         0.43306685]]
doc 3 with TF-IDF scores: [[0.         0.31544415 0.40619178 0.         0.53409337 0.53409337
  0.40619178 0.        ]]
index for word:example is :2

 vector for word:example:
[[0.        ]
 [0.43306685]
 [0.40619178]]
vec for example after flatten: [[0.         0.43306685 0.40619178]]
vec for sample after flatten: [[0.52682017 0.         0.40619178]]
similarity is: [[0.41772158]]
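The same retrieval step works with sklearn directly: transform the question with the already-fitted vectorizer (so it reuses the learned vocabulary and IDF weights) and rank the documents by cosine similarity against the document matrix. Below is a minimal sketch, assuming the vectorizer, doc_vecs, and all_docs from the code above; the query string is an illustrative addition:

# A minimal retrieval sketch, assuming the fitted vectorizer, doc_vecs and
# all_docs from the code above; the query string is illustrative.
query = "another sample document"
# use transform (not fit_transform) so the query is mapped with the
# vocabulary and IDF weights learned from the documents
query_vec = vectorizer.transform([query])
# cosine similarity between the query and every document vector
sims = cosine_similarity(query_vec, doc_vecs)
print(f"similarities: {sims}")
# pick the best-matching document to feed to the LLM as context
best_doc = sims.argmax()
print(f"best matching document: {all_docs[best_doc]}")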