Task: Implement TF-IDF (Term Frequency-Inverse Document Frequency)
Your task is to implement a function that computes the TF-IDF scores for a query against a given corpus of documents.
Function Signature
Write a function compute_tf_idf(corpus, query) that takes the following inputs:
- corpus: A list of documents, where each document is a list of words.
- query: A list of words for which you want to compute the TF-IDF scores.
Output
The function should return a list of lists, one inner list per document, containing the TF-IDF scores for the query words in that document, rounded to five decimal places.
Important Considerations
- Handling Division by Zero: When implementing the Inverse Document Frequency (IDF) calculation, you must account for cases where a term does not appear in any document (df = 0). This can lead to division by zero in the standard IDF formula. Add smoothing (e.g., adding 1 to both numerator and denominator) to avoid such errors.
- Empty Corpus: Ensure your implementation gracefully handles the case of an empty corpus. If no documents are provided, your function should either raise an appropriate error or return an empty result. This keeps the program robust and predictable.
- Edge Cases:
- Query terms not present in the corpus.
- Documents with no words.
- Extremely large or small values for term frequencies or document frequencies.
By addressing these considerations, your implementation will be robust and handle real-world scenarios effectively.
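The smoothed IDF described above can be sketched as a small helper (a minimal illustration; the trailing `+ 1` additionally keeps terms that appear in every document from scoring exactly zero):

```python
import math

def smoothed_idf(num_docs, df):
    # Add 1 to numerator and denominator so df = 0 never divides by zero;
    # the + 1 outside the log keeps ubiquitous terms (df = num_docs) nonzero.
    return math.log((num_docs + 1) / (df + 1)) + 1

print(smoothed_idf(3, 0))  # term absent from all documents: log(4/1) + 1
print(smoothed_idf(3, 3))  # term in every document: log(1) + 1 = 1.0
```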
Example:
Input:
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "flew", "over", "the", "mat"],
]
query = ["cat"]
print(compute_tf_idf(corpus, query))
Output:
[[0.21461], [0.25754], [0.0]]
Reasoning:
The TF-IDF scores for the word "cat" in each document are computed and rounded to five decimal places.
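The score for the first document can be checked by hand: "cat" appears once in a 6-word document (tf = 1/6) and in 2 of the 3 documents (df = 2), so the smoothed IDF is ln((3 + 1)/(2 + 1)) + 1:

```python
import math

tf = 1 / 6                             # "cat" appears once in a 6-word document
idf = math.log((3 + 1) / (2 + 1)) + 1  # smoothed IDF: "cat" is in 2 of 3 documents
print(round(tf * idf, 5))              # 0.21461
```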
import math

def compute_tf_idf(corpus, query):
    """
    Compute TF-IDF scores for a query against a corpus of documents.

    :param corpus: List of documents, where each document is a list of words
    :param query: List of words in the query
    :return: List of lists containing TF-IDF scores for the query words in each document
    """
    # Edge case: empty corpus
    if not corpus:
        return []
    # Edge case: empty query -> one empty score list per document
    if not query:
        return [[] for _ in corpus]

    num_docs = len(corpus)
    results = []

    # Document frequency (df): number of documents containing each query term
    df = {term: sum(1 for doc in corpus if term in doc) for term in query}

    # Calculate TF-IDF for each document
    for doc in corpus:
        doc_len = len(doc)
        if doc_len == 0:
            # Empty document: every query term scores 0
            results.append([0.0] * len(query))
            continue
        doc_scores = []
        for term in query:
            # Term frequency: occurrences of the term divided by document length
            tf = doc.count(term) / doc_len
            # Smoothed inverse document frequency (avoids division by zero when df = 0)
            idf = math.log((num_docs + 1) / (df[term] + 1)) + 1
            doc_scores.append(round(tf * idf, 5))
        results.append(doc_scores)
    return results
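A quick sanity check of the edge cases listed under Important Considerations (the function is reproduced in condensed form so the snippet runs on its own):

```python
import math

def compute_tf_idf(corpus, query):
    # Condensed version of the solution above.
    if not corpus:
        return []
    n = len(corpus)
    df = {t: sum(1 for doc in corpus if t in doc) for t in query}
    results = []
    for doc in corpus:
        if not doc:
            results.append([0.0] * len(query))  # empty document
            continue
        scores = []
        for t in query:
            tf = doc.count(t) / len(doc)
            idf = math.log((n + 1) / (df[t] + 1)) + 1  # smoothed IDF
            scores.append(round(tf * idf, 5))
        results.append(scores)
    return results

print(compute_tf_idf([], ["cat"]))                  # empty corpus -> []
print(compute_tf_idf([["a", "b"], []], ["zebra"]))  # absent term / empty doc -> [[0.0], [0.0]]
```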
Test Results: 3/3 passed
Alternative NumPy solution:
import numpy as np

def compute_tf_idf(corpus, query):
    """
    Compute TF-IDF scores for a query against a corpus of documents using only NumPy.
    The output TF-IDF scores are rounded to five decimal places.
    """
    # Build a vocabulary covering both the corpus and the query terms
    vocab = sorted(set(word for document in corpus for word in document).union(query))
    word_to_index = {word: idx for idx, word in enumerate(vocab)}

    # Term-frequency matrix: one row per document, one column per vocabulary word
    tf = np.zeros((len(corpus), len(vocab)))
    for doc_idx, document in enumerate(corpus):
        for word in document:
            tf[doc_idx, word_to_index[word]] += 1
        if document:  # guard against division by zero for empty documents
            tf[doc_idx, :] /= len(document)

    # Document frequency per word, then smoothed IDF
    df = np.count_nonzero(tf > 0, axis=0)
    num_docs = len(corpus)
    idf = np.log((num_docs + 1) / (df + 1)) + 1

    tf_idf = tf * idf

    # Select the columns for the query terms and round to five decimal places
    query_indices = [word_to_index[word] for word in query]
    return np.round(tf_idf[:, query_indices], 5).tolist()
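The vectorized IDF step from the NumPy solution, applied by hand to the worked example (for "cat": df = 2 across 3 documents, per-document term frequencies 1/6, 1/5, and 0):

```python
import numpy as np

num_docs = 3
df = np.array([2])                    # "cat" appears in 2 of 3 documents
tf = np.array([[1/6], [1/5], [0.0]])  # per-document term frequency of "cat"
idf = np.log((num_docs + 1) / (df + 1)) + 1  # smoothed IDF, broadcast over rows
print(np.round(tf * idf, 5).tolist())        # [[0.21461], [0.25754], [0.0]]
```

Broadcasting multiplies the (3, 1) term-frequency matrix by the length-1 IDF vector column-wise, reproducing the expected output above.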