Welcome to CS224n!
Before you start, make sure you read the README.txt in the same directory as this notebook.
# All Import Statements Defined Here
# Note: Do not add to this list.
# All the dependencies you need can be installed by running .
# ----------------
import sys
# Require Python 3.5 or newer
assert sys.version_info[0] == 3
assert sys.version_info[1] >= 5
from gensim.models import KeyedVectors
from gensim.test.utils import datapath
import pprint
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]
import nltk
nltk.download('reuters')
from nltk.corpus import reuters
import numpy as np
import random
import scipy as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA
START_TOKEN = '<START>'
END_TOKEN = '<END>'
np.random.seed(0)
random.seed(0)
# ----------------
output:
[nltk_data] Downloading package reuters to
[nltk_data] C:\Users\yezhipeng\AppData\Roaming\nltk_data...
[nltk_data] Package reuters is already up-to-date!
Word Vectors
Word Vectors are often used as a fundamental component for downstream NLP tasks, e.g. question answering, text generation, translation, etc., so it is important to build some intuitions as to their strengths and weaknesses. Here, you will explore two types of word vectors: those derived from co-occurrence matrices, and those derived via word2vec.
Assignment Notes: Please make sure to save the notebook as you go along. Submission Instructions are located at the bottom of the notebook.
Note on Terminology: The terms "word vectors" and "word embeddings" are often used interchangeably. The term "embedding" refers to the fact that we are encoding aspects of a word's meaning in a lower dimensional space. As Wikipedia states, "conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension".
Part 1: Count-Based Word Vectors (10 points)
Most word vector models start from the following idea:
You shall know a word by the company it keeps (Firth, J. R. 1957:11)
Many word vector implementations are driven by the idea that similar words, i.e., (near) synonyms, will be used in similar contexts. As a result, similar words will often be spoken or written along with a shared subset of words, i.e., contexts. By examining these contexts, we can try to develop embeddings for our words. With this intuition in mind, many "old school" approaches to constructing word vectors relied on word counts. Here we elaborate upon one of those strategies, co-occurrence matrices (for more information, see here or here).
Co-Occurrence
A co-occurrence matrix counts how often things co-occur in some environment. Given some word $w_i$ occurring in the document, we consider the context window surrounding $w_i$. Supposing our fixed window size is $n$, then this is the $n$ preceding and $n$ subsequent words in that document, i.e. words $w_{i-n} \dots w_{i-1}$ and $w_{i+1} \dots w_{i+n}$. We build a co-occurrence matrix $M$, which is a symmetric word-by-word matrix in which $M_{ij}$ is the number of times $w_j$ appears inside $w_i$'s window.
Example: Co-Occurrence with Fixed Window of n=1:
Document 1: "all that glitters is not gold"
Document 2: "all is well that ends well"
* | START | all | that | glitters | is | not | gold | well | ends | END |
---|---|---|---|---|---|---|---|---|---|---|
START | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
all | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
that | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
glitters | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
is | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
not | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
gold | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
well | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
ends | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
END | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
Note: In NLP, we often add START and END tokens to represent the beginning and end of sentences, paragraphs or documents. In this case we imagine START and END tokens encapsulating each document, e.g., "START All that glitters is not gold END", and include these tokens in our co-occurrence counts.
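If you want to check the table above programmatically, here is a minimal sketch (not part of the assignment's starter code; it hard-codes the two toy documents and assumes the import cell above has been run) that builds the same window-1 co-occurrence matrix:

```python
# Toy documents wrapped in START/END tokens, matching the example above.
docs = [[START_TOKEN, "all", "that", "glitters", "is", "not", "gold", END_TOKEN],
        [START_TOKEN, "all", "is", "well", "that", "ends", "well", END_TOKEN]]

vocab = sorted({w for doc in docs for w in doc})
word2ind = {w: i for i, w in enumerate(vocab)}
window = 1
M = np.zeros((len(vocab), len(vocab)))

for doc in docs:
    for i, word in enumerate(doc):
        # Count every word within `window` positions of `word`, excluding `word` itself.
        for j in range(max(0, i - window), min(len(doc), i + window + 1)):
            if j != i:
                M[word2ind[word], word2ind[doc[j]]] += 1

# M now matches the table above, up to the ordering of the rows/columns.
```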
The rows (or columns) of this matrix provide one type of word vectors (those based on word-word co-occurrence), but the vectors will be large in general (linear in the number of distinct words in a corpus). Thus, our next step is to run dimensionality reduction. In particular, we will run SVD (Singular Value Decomposition), which is a kind of generalized PCA (Principal Components Analysis) to select the top $k$ principal components. Here's a visualization of dimensionality reduction with SVD. In this picture our co-occurrence matrix is $A$ with $n$ rows corresponding to $n$ words. We obtain a full matrix decomposition, with the singular values ordered in the diagonal $S$ matrix, and our new, shorter length-$k$ word vectors in $U_k$.
This reduced-dimensionality co-occurrence representation preserves semantic relationships between words, e.g. doctor and hospital will be closer than doctor and dog.
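As a small, optional illustration of the decomposition described above, here is how you could obtain length-$k$ word vectors from a full SVD with numpy; the names M and k are placeholders carried over from the sketch above, not assignment code:

```python
k = 2  # number of dimensions to keep
U, S, Vt = np.linalg.svd(M)   # singular values in S come back sorted in descending order
M_reduced = U[:, :k] * S[:k]  # scale the first k left-singular vectors; each row is a word vector
print(M_reduced.shape)        # (vocab_size, k)
```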
Notes: If you can barely remember what an eigenvalue is, here's a slow, friendly introduction to SVD. If you want to learn more thoroughly about PCA or SVD, feel free to check out lectures 7, 8, and 9 of CS168. These course notes provide a great high-level treatment of these general purpose algorithms. Though, for the purpose of this class, you only need to know how to extract the $k$-dimensional embeddings by utilizing pre-programmed implementations of these algorithms from the numpy, scipy, or sklearn python packages. In practice, it is challenging to apply full SVD to large corpora because of the memory needed to perform PCA or SVD. However, if you only want the top $k$ vector components for relatively small $k$ (known as Truncated SVD), then there are reasonably scalable techniques to compute those iteratively.
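For anything larger than a toy vocabulary, sklearn's TruncatedSVD (already imported in the setup cell) computes only the top components iteratively. A sketch under the same placeholder names as above:

```python
# Truncated SVD never forms the full decomposition, so it scales to large,
# sparse co-occurrence matrices.
svd = TruncatedSVD(n_components=2, n_iter=10, random_state=0)
M_reduced = svd.fit_transform(M)  # shape: (vocab_size, n_components)
```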
Plotting Co-Occurrence Word Embeddings
Here, we will be using the Reuters (business and financial news) corpus. If you haven't run the import cell at the top of this page, please run it now (click it and press SHIFT-RETURN). The corpus consists of 10,788 news documents totaling 1.3 million words. These documents span 90 categories and are split into train and test. For more details, please see https://www.nltk.org/book/ch02.html. We provide a read_corpus
function below that pulls out only articles from the "crude" (i.e. news articles about oil, gas, etc.) category. The function also adds START and END tokens to each of the documents, and lowercases words. You do not have to perform any other kind of pre-processing.
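For reference, a sketch of what such a function might look like, written against the NLTK reuters API described above (the assignment notebook provides its own version):

```python
def read_corpus(category="crude"):
    """ Read files from the specified Reuters category.
        Return a list of documents, each a list of lowercased words
        wrapped in START and END tokens. """
    files = reuters.fileids(category)
    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN]
            for f in files]
```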