Probabilistic Language Modeling (1): Overview

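This article introduces the basic concepts of language models: how to compute the probability of a sentence, what N-gram language models are and how their probabilities are estimated, and how to evaluate a language model's performance.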


1. Introduction

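The goal of a language model is to compute the probability of a sentence or sequence of words.
The formula is simple. Given a sentence $W$ consisting of $l$ words $w_1, w_2, \dots, w_l$, its probability is
$$P(W) = P(w_1, w_2, \dots, w_l) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1, w_2) \cdots p(w_l \mid w_1, w_2, \dots, w_{l-1})$$
Estimating every factor $p(w_i \mid w_1, w_2, \dots, w_{i-1})$ in this formula directly makes the parameter space far too large, which leads to severe data sparsity. The usual remedy is the Markov assumption: the next word depends only on the $k$ words before it, that is,
$$p(w_i \mid w_1, w_2, \dots, w_{i-1}) = p(w_i \mid w_{i-k}, \dots, w_{i-1})$$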
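To make the size of the parameter space concrete, here is a minimal sketch that counts how many conditional probabilities a model with a $k$-word history has to store. The function name `num_parameters` and the vocabulary size of 10,000 are assumptions chosen for illustration, not values from the article.

```python
def num_parameters(vocab_size: int, k: int) -> int:
    """Number of table entries p(w | k-word history) over a given vocabulary."""
    # There are vocab_size**k possible histories, and each history needs
    # one probability for every possible next word.
    return vocab_size ** (k + 1)


VOCAB_SIZE = 10_000  # hypothetical vocabulary size, for illustration only

for k in range(4):
    # k = 0 is a unigram model, k = 1 a bigram model, k = 2 a trigram model
    print(f"history length {k}: {num_parameters(VOCAB_SIZE, k):,} entries")
```

Each extra word of history multiplies the table size by the vocabulary size, which is why conditioning on the full history is impractical and a small $k$ is used instead.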

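- If the next word depends only on the single word before it, we get the bigram model:
$$p(w_i \mid w_1, w_2, \dots, w_{i-1}) = p(w_i \mid w_{i-1})$$
- If the next word depends on the two words before it, we get the trigram model:
$$p(w_i \mid w_1, w_2, \dots, w_{i-1}) = p(w_i \mid w_{i-2}, w_{i-1})$$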
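As a worked example, take the four-word sentence "I like green tea" (a sentence assumed here for illustration). Under the bigram and trigram assumptions, the chain-rule product shortens as follows:

$$P_{\text{bigram}}(W) = p(\text{I})\,p(\text{like} \mid \text{I})\,p(\text{green} \mid \text{like})\,p(\text{tea} \mid \text{green})$$

$$P_{\text{trigram}}(W) = p(\text{I})\,p(\text{like} \mid \text{I})\,p(\text{green} \mid \text{I}, \text{like})\,p(\text{tea} \mid \text{like}, \text{green})$$

The two products differ only from the third factor onward, because the first two words have fewer than two predecessors to condition on.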

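Models built on these simplifying assumptions are called N-gram language models (n-gram). In general, the sentence probability is computed as:
$$P(W) = \prod_{i=1}^{l+1} p(w_i \mid w_{i-n+1}^{i-1})$$
where $w_i^j$ denotes the word sequence $w_i, \dots, w_j$. The upper limit $l+1$ appears to count an end-of-sentence token as $w_{l+1}$, with positions before $w_1$ treated as start-of-sentence padding.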
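The product over $i = 1, \dots, l+1$ can be sketched as a list of (history, word) pairs, one per factor. This sketch assumes the common convention of padding the sentence with start tokens and one end token; the token strings `<s>` and `</s>` and the helper name `ngram_events` are illustrative choices, not taken from the article.

```python
def ngram_events(tokens, n):
    """Return the (history, word) pairs that an n-gram model multiplies over.

    Assumes n - 1 start tokens "<s>" before the sentence and one end
    token "</s>" after it, so a sentence of l words yields l + 1 pairs.
    """
    padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
    events = []
    for i in range(n - 1, len(padded)):
        history = tuple(padded[i - n + 1:i])  # the n - 1 words before position i
        events.append((history, padded[i]))
    return events


for history, word in ngram_events(["i", "like", "tea"], 3):
    print(history, "->", word)
```

With this padding every factor has a history of exactly $n-1$ tokens, so the first words of a sentence need no special case.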

2. Estimating N-gram Probabilities

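The first step in building and using a language model is to estimate each conditional probability $p(w_i \mid w_{i-n+1}^{i-1})$. The usual choice is the maximum likelihood estimate (MLE), which for the bigram case is
$$p(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}$$
where $\mathrm{count}(w_{i-1}, w_i)$ is the number of times the word pair $w_{i-1}, w_i$ occurs in the training corpus.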
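The count ratio above can be sketched in a few lines. The toy corpus, the `<s>` and `</s>` boundary tokens, and the function name `train_bigram_mle` are assumptions for illustration. The sketch returns 0 for unseen pairs and applies no smoothing.

```python
from collections import Counter


def train_bigram_mle(sentences):
    """Estimate p(word | prev) by relative frequency from tokenized sentences."""
    bigram_counts = Counter()
    history_counts = Counter()
    for tokens in sentences:
        # Boundary tokens are an assumed convention, not part of the formula.
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            bigram_counts[(prev, word)] += 1
            history_counts[prev] += 1

    def prob(word, prev):
        # count(prev, word) / count(prev); 0.0 if prev was never seen
        if history_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, word)] / history_counts[prev]

    return prob


# Hypothetical three-sentence corpus
corpus = [["i", "like", "tea"], ["i", "like", "coffee"], ["you", "like", "tea"]]
p = train_bigram_mle(corpus)
print(p("tea", "like"))   # "like" occurs 3 times and is followed by "tea" twice
print(p("milk", "like"))  # unseen pair, so the MLE assigns probability zero
```

The zero assigned to the unseen pair is the data sparsity problem in miniature: any sentence containing it receives probability zero under the pure maximum likelihood estimate.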

3. Performance Evaluation

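Language models are usually evaluated by perplexity, which is computed as:
$$PP(W) = 2^{H(W)}$$
$$H(W) = -\frac{1}{l}\log_2 P(W)$$
where $l$ is the length of the sentence $W$. In general, lower perplexity means a better model.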
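The two formulas can be sketched directly from a list of per-word conditional probabilities, since $\log_2 P(W)$ is the sum of their logarithms. The function name `perplexity` and the input format are assumptions for illustration. The sketch takes $l$ to be the number of probabilities passed in and requires all of them to be positive.

```python
import math


def perplexity(cond_probs):
    """Perplexity 2**H(W) from the conditional probabilities of each word."""
    l = len(cond_probs)
    # log2 P(W) is the sum of the log2 of each factor in the product
    log2_p = sum(math.log2(p) for p in cond_probs)
    entropy = -log2_p / l  # H(W) = -(1/l) * log2 P(W)
    return 2 ** entropy


# A model that assigns every word probability 1/4 is as uncertain as a
# uniform choice among 4 words, so its perplexity is 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Summing logarithms instead of multiplying the probabilities avoids floating-point underflow on long sentences.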
