[Course Notes] Lecture 2 - Stanford Natural Language Processing cs224n


Licensed under CC 4.0 BY-SA

Tags: #Stanford NLP course #word2vec #skipgram #word vectors

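This post covers Lecture 2 of the Stanford NLP course. It focuses on word2vec as a way of representing words as vectors: the notion of word meaning, the problems with one-hot representations, the skip-gram model of word2vec, and the model's objective function and its optimization. By predicting context, word2vec captures the semantics of words, and adjusting the vector representations makes it possible to compute the similarity between words.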


Lecture2_Stanford cs224d_Simple Word Vector Representations: word2vec,GloVe


1. Word meaning

A question lies ahead

Q: How do we represent usable meaning in a computer?
Common answer: Use a taxonomy like WordNet, which has hypernym (is-a) relationships and synonym sets.

WordNet is one of the best-known taxonomic resources and is popular among computational linguists, because it is free to download and provides a lot of taxonomic information about words.
[Figure: the components of WordNet]
[Figure: demo implemented in Python]

The figure above shows how to access WordNet through NLTK, one of the main Python packages for NLP.

Running the demo under Python 3.7 gives the following result.
The list of hypernyms of "panda", from nearest to most general, is:

[Synset('procyonid.n.01'),
Synset('carnivore.n.01'),
Synset('placental.n.01'),
Synset('mammal.n.01'),
Synset('vertebrate.n.01'),
Synset('chordate.n.01'),
Synset('animal.n.01'),
Synset('organism.n.01'),
Synset('living_thing.n.01'),
Synset('whole.n.02'),
Synset('object.n.01'),
Synset('physical_entity.n.01'),
Synset('entity.n.01')]
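
To make the idea of a hypernym chain concrete, here is a minimal, dependency-free sketch of what a transitive "is-a" closure computes. It does not load WordNet: the HYPERNYM dictionary and the hypernym_closure helper are hypothetical names, and the chain is simply copied from the listing above. Real WordNet synsets can have several hypernyms, so a single parent per entry is a simplifying assumption.

```python
# Toy hypernym (is-a) map. The entries are copied by hand from the
# listing above; this is NOT loaded from WordNet.
HYPERNYM = {
    "panda.n.01": "procyonid.n.01",
    "procyonid.n.01": "carnivore.n.01",
    "carnivore.n.01": "placental.n.01",
    "placental.n.01": "mammal.n.01",
    "mammal.n.01": "vertebrate.n.01",
    "vertebrate.n.01": "chordate.n.01",
    "chordate.n.01": "animal.n.01",
    "animal.n.01": "organism.n.01",
    "organism.n.01": "living_thing.n.01",
    "living_thing.n.01": "whole.n.02",
    "whole.n.02": "object.n.01",
    "object.n.01": "physical_entity.n.01",
    "physical_entity.n.01": "entity.n.01",
}


def hypernym_closure(synset, parent=HYPERNYM):
    """Follow is-a links upward until a synset with no parent is reached."""
    chain = []
    while synset in parent:
        synset = parent[synset]
        chain.append(synset)
    return chain


chain = hypernym_closure("panda.n.01")
for name in chain:
    print(name)
```

Walking the map from "panda.n.01" visits the same synsets, in the same order, as the listing above.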

discrete representation

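Although the discrete representation of words described above is a genuine linguistic resource, in practice it often works less well than people hope. One reason is that the synonyms it groups together still differ in nuance. Its main limitations are:

It misses new words (it is hard to keep up to date)
It is subjective
It requires a great deal of human labor to create and maintain
It makes accurate word similarity hard to compute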
Many rule-based and statistical NLP systems treat a word as an indivisible atomic unit. In vector-space terms, each word is a vector with a single 1 and many 0s; this is called a "one-hot" representation. It has two problems:
1. The vocabulary of a corpus is large, so the vectors become very high-dimensional.
Dimensionality:
20K (speech) – 50K (PTB) – 500K (big vocab) – 13M (Google 1T)
- "speech": the vocabulary of a speech recognizer
- PTB: the Penn Treebank
In Python it can be imported directly: from nltk.corpus import ptb
- big vocab: when building a machine translation system, we might use a 500,000-word vocabulary.
- Google 1T: Google released a roughly 1-terabyte corpus of web-crawl text.
2. Any two distinct word vectors are orthogonal (their dot product, or inner product, is zero). In other words, there is no natural notion of similarity.
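
The two problems above can be seen in a few lines of Python. This is a minimal sketch: the three-word vocabulary and the helper names one_hot and dot are made up for illustration and do not come from the lecture notes.

```python
def one_hot(word, vocab):
    """Return a list with a single 1 at the word's index and 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical tiny vocabulary; a real one has tens of thousands of entries,
# and each vector is as long as the vocabulary.
vocab = ["hotel", "motel", "panda"]
hotel = one_hot("hotel", vocab)
motel = one_hot("motel", vocab)

print(len(hotel))         # vector length equals vocabulary size
print(dot(hotel, motel))  # distinct words: inner product is 0
print(dot(hotel, hotel))  # a word with itself: inner product is 1
```

However related two words are in meaning, their one-hot vectors always have inner product 0, so the representation carries no similarity information.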

How to make neighbors represent words?

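The linguist J. R. Firth proposed that the meaning of a word can be obtained from its context. He went further and suggested that you have grasped a word's meaning only if you can place it in the right contexts.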

"You shall know a word by the company it keeps."
Wittgenstein, the 20th-century philosopher of language, expressed a similar view: "the right way to think about the meaning of words is understanding their use in text."

Word meaning is defined in terms of vectors

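We adjust the vectors of a word and of its context words so that the similarity of two words can be inferred from their two vectors, or so that a word's context can be predicted from its vector. The approach is recursive: vectors are adjusted in terms of other vectors, much as dictionary senses are defined in terms of other words.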
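
As a rough illustration of inferring similarity from two vectors, the sketch below computes cosine similarity between small dense vectors. The three-dimensional values are invented for the example, not learned by word2vec, and cosine similarity is one common choice of measure rather than something these notes prescribe.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical dense vectors; real word vectors are learned from a corpus
# and usually have a few hundred dimensions.
vectors = {
    "hotel": [0.8, 0.1, 0.3],
    "motel": [0.7, 0.2, 0.3],
    "panda": [-0.1, 0.9, -0.4],
}

sim_close = cosine(vectors["hotel"], vectors["motel"])
sim_far = cosine(vectors["hotel"], vectors["panda"])
print(sim_close, sim_far)
```

Unlike one-hot vectors, dense vectors can place related words close together, so the similarity score differs from pair to pair.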

In addition, distributed representations stand in contrast to symbolic representations (localist or one-hot representations), while "discrete representation" is close in meaning to the latter and to denotation. Take care not to confuse the words "distributed" and "discrete".