16. An In-Depth Look at Text Analysis and Visualization Techniques

wind

Published 2025-11-07 13:26:30

License: CC 4.0 BY-SA

Column: Unlocking Text Intelligence with Python. Tags: n-gram, language model, entropy

Article link: https://blog.youkuaiyun.com/wind/article/details/154973829

1. n-gram Language Model Basics

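The n-gram language model is a core tool in natural language processing. One way to evaluate such a model is to give it an entropy method: for a given text, it takes the log probability of each word given its preceding context, averages those values over every n-gram that the NgramCounter produces from the text, and negates the result. The lower the value, the better the model predicts the text. The entropy function looks like this: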

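def entropy(self, text):
    """
    Calculate the approximate cross-entropy of the n-gram model for a
    given text, represented as a list of word strings.
    This is the negative average log probability of each word in the
    text, given its preceding context.
    """
    # Normalize each word against the model's vocabulary
    normed_text = (self._check_against_vocab(word) for word in text)
    entropy = 0.0
    processed_ngrams = 0
    for ngram in self.ngram_counter.to_ngrams(normed_text):
        # The last token is the word; the tokens before it are its context
        context, word = tuple(ngram[:-1]), ngram[-1]
        entropy += self.logscore(word, context)
        processed_ngrams += 1
    if processed_ngrams == 0:
        # Avoid an opaque ZeroDivisionError when the text yields no n-grams
        raise ValueError("text is too short to produce any n-grams")
    return -(entropy / processed_ngrams)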
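
To make the calculation concrete, here is a minimal standalone sketch of the same idea for a bigram model. It is not the article's class: the function names (`train_bigram_counts`, `logscore`, `cross_entropy`), the add-one smoothing, and the base-2 logarithm are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def train_bigram_counts(tokens):
    """Count contexts and bigrams in a token list (hypothetical helper)."""
    contexts = Counter(tokens[:-1])             # every token that has a successor
    bigrams = Counter(zip(tokens, tokens[1:]))  # (context, word) pairs
    return contexts, bigrams


def logscore(word, context, contexts, bigrams, vocab_size):
    """Base-2 log probability of word given context, with add-one smoothing
    (an assumed smoothing scheme, so unseen bigrams never get probability 0)."""
    numerator = bigrams[(context, word)] + 1
    denominator = contexts[context] + vocab_size
    return math.log2(numerator / denominator)


def cross_entropy(tokens, contexts, bigrams, vocab_size):
    """Negative average log probability over all bigrams in tokens."""
    total = 0.0
    processed = 0
    for context, word in zip(tokens, tokens[1:]):
        total += logscore(word, context, contexts, bigrams, vocab_size)
        processed += 1
    if processed == 0:
        raise ValueError("text is too short to produce any bigrams")
    return -(total / processed)


train = "the cat sat on the mat".split()
contexts, bigrams = train_bigram_counts(train)
vocab_size = len(set(train))

seen = cross_entropy(train, contexts, bigrams, vocab_size)
unseen = cross_entropy(train[::-1], contexts, bigrams, vocab_size)
print(f"entropy on training text: {seen:.3f}")
print(f"entropy on reversed text: {unseen:.3f}")
print(f"perplexity on training text: {2 ** seen:.3f}")
```

The reversed text contains only bigrams the model never saw, so its entropy comes out higher than that of the training text. Because the sketch uses base-2 logarithms, raising 2 to the entropy gives the perplexity, which is often easier to interpret.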
