Implement TF-IDF (Term Frequency-Inverse Document Frequency)
Your task is to implement a function that computes the TF-IDF scores for a query against a given corpus of documents.

Function Signature

Write a function compute_tf_idf(corpus, query) that takes the following inputs:

  • corpus: A list of documents, where each document is a list of words.
  • query: A list of words for which you want to compute the TF-IDF scores.

Output

The function should return a list of lists containing the TF-IDF scores for the query words in each document, rounded to five decimal places.

Important Considerations

  1. Handling Division by Zero:
    When implementing the Inverse Document Frequency (IDF) calculation, you must account for cases where a term does not appear in any document (df = 0). This can lead to division by zero in the standard IDF formula. Add smoothing (e.g., adding 1 to both numerator and denominator) to avoid such errors.

  2. Empty Corpus:
    Ensure your implementation gracefully handles the case of an empty corpus. If no documents are provided, your function should either raise an appropriate error or return an empty result. This will ensure the program remains robust and predictable.

  3. Edge Cases:

    • Query terms not present in the corpus.
    • Documents with no words.
    • Extremely large or small values for term frequencies or document frequencies.

By addressing these considerations, your implementation will be robust and handle real-world scenarios effectively.
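As a quick illustration of point 1, one common smoothing variant (a sketch; the exact smoothing scheme is a design choice) adds 1 to both the numerator and denominator, plus 1 to the result so terms that appear in every document still receive a nonzero weight:

```python
import math

def smoothed_idf(num_docs, df):
    # Adding 1 to numerator and denominator means df = 0 never divides by zero;
    # the trailing +1 keeps the weight nonzero when df == num_docs.
    return math.log((num_docs + 1) / (df + 1)) + 1

print(smoothed_idf(3, 0))  # unseen term: ln(4/1) + 1
print(smoothed_idf(3, 3))  # term in every document: ln(4/4) + 1 = 1.0
```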

Example:

Input:

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "flew", "over", "the", "mat"]
]
query = ["cat"]

print(compute_tf_idf(corpus, query))

Output:

[[0.21461], [0.25754], [0.0]]

Reasoning:

The TF-IDF scores for the word "cat" in each document are computed and rounded to five decimal places.
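The first score can be checked by hand: "cat" appears in 2 of the 3 documents, so with the smoothed IDF used here idf = ln(4/3) + 1, and in the first document tf = 1/6:

```python
import math

num_docs, df = 3, 2                              # "cat" appears in 2 of 3 documents
idf = math.log((num_docs + 1) / (df + 1)) + 1    # ln(4/3) + 1 ≈ 1.28768
tf_doc1 = 1 / 6                                  # one occurrence among six words
print(round(tf_doc1 * idf, 5))                   # → 0.21461
```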

 

import math

def compute_tf_idf(corpus, query):
    """
    Compute TF-IDF scores for a query against a corpus of documents.
    
    :param corpus: List of documents, where each document is a list of words
    :param query: List of words in the query
    :return: List of lists containing TF-IDF scores for the query words in each document
    """
    # Edge case: empty corpus
    if not corpus:
        return []
    
    # Edge case: empty query -> one empty score list per document
    if not query:
        return [[] for _ in corpus]
    
    num_docs = len(corpus)
    results = []
    
    # Calculate document frequency (df) for each query term
    df = {}
    for term in query:
        df[term] = sum(1 for doc in corpus if term in doc)
    
    # Calculate TF-IDF for each document
    for doc in corpus:
        doc_scores = []
        doc_len = len(doc)
        
        # Skip empty documents
        if doc_len == 0:
            doc_scores = [0.0] * len(query)
        else:
            # Calculate term frequency (tf) for each query term in this document
            for term in query:
                # Term frequency: number of times term appears in document divided by document length
                tf = doc.count(term) / doc_len
                
                # Inverse document frequency with smoothing to avoid division by zero
                # Add 1 to both numerator and denominator for smoothing
                idf = math.log((num_docs + 1) / (df[term] + 1)) + 1
                
                # Calculate TF-IDF and round to 5 decimal places
                tf_idf = round(tf * idf, 5)
                doc_scores.append(tf_idf)
        
        results.append(doc_scores)
    
    return results
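To exercise the edge cases listed above (an unseen query term and an empty document), here is a compact sketch that uses the same formula as the function above, inlined so the snippet runs on its own:

```python
import math

# Compact equivalent of compute_tf_idf above, with the same smoothed IDF.
def compute_tf_idf(corpus, query):
    n = len(corpus)
    df = {t: sum(1 for d in corpus if t in d) for t in query}
    def score(doc, t):
        if not doc:                      # empty document -> all scores 0.0
            return 0.0
        tf = doc.count(t) / len(doc)
        return round(tf * (math.log((n + 1) / (df[t] + 1)) + 1), 5)
    return [[score(doc, t) for t in query] for doc in corpus]

# An unseen term and an empty document both yield 0.0 rather than an error.
print(compute_tf_idf([["a", "b"], []], ["zzz"]))  # → [[0.0], [0.0]]
```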

Test Results: 3/3

An alternative vectorized implementation using NumPy:

import numpy as np

def compute_tf_idf(corpus, query):
    """
    Compute TF-IDF scores for a query against a corpus of documents using only NumPy.
    The output TF-IDF scores retain five decimal places.
    """
    vocab = sorted(set(word for document in corpus for word in document).union(query))
    word_to_index = {word: idx for idx, word in enumerate(vocab)}

    tf = np.zeros((len(corpus), len(vocab)))

    for doc_idx, document in enumerate(corpus):
        for word in document:
            word_idx = word_to_index[word]
            tf[doc_idx, word_idx] += 1
        if document:  # guard: an empty document would otherwise divide by zero
            tf[doc_idx, :] /= len(document)

    df = np.count_nonzero(tf > 0, axis=0)

    num_docs = len(corpus)
    idf = np.log((num_docs + 1) / (df + 1)) + 1

    tf_idf = tf * idf

    query_indices = [word_to_index[word] for word in query]
    tf_idf_scores = tf_idf[:, query_indices]

    tf_idf_scores = np.round(tf_idf_scores, 5)

    return tf_idf_scores.tolist()
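A quick sanity check that the vectorized IDF line agrees with the scalar formula used in the loop version (same smoothing assumed):

```python
import math
import numpy as np

num_docs = 3
df = np.array([0, 1, 2, 3])                        # document frequencies to test
idf_vec = np.log((num_docs + 1) / (df + 1)) + 1    # vectorized, as in the function
idf_scalar = [math.log((num_docs + 1) / (d + 1)) + 1 for d in df]
print(np.allclose(idf_vec, idf_scalar))            # → True
```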
