For a given term frequency,
the vector length is seen to take values only in a
narrow interval. That interval initially shifts upwards
with increasing frequency. Around a frequency
of about 30, that trend reverses and the interval
shifts downwards.
...
Both forces determining the length of a word
vector are seen at work here. Low-frequency
words tend to be used consistently, so that the
more frequently such words appear, the longer
their vectors. This tendency is reflected by the upwards
trend in Fig. 3 at low frequencies. High-frequency
words, on the other hand, tend to be
used in many different contexts, the more so, the
more frequently they occur. The averaging over
an increasing number of different contexts shortens
the vectors representing such words. This tendency
is clearly reflected by the downwards trend
in Fig. 3 at high frequencies, culminating in punctuation
marks and stop words with short vectors at
the very end.
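
The shortening effect of averaging can be illustrated with a toy sketch. It is not the training procedure behind the vectors in Fig. 3; it only assumes that each use of a word contributes a unit-length context direction and that the word vector behaves like the mean of those directions. The dimension, sample size, noise level, and all function names are hypothetical choices made for illustration.

```python
import math
import random


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_vector_length(vectors):
    """Euclidean length of the component-wise mean of the given vectors."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return math.sqrt(sum(x * x for x in mean))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
DIM, N = 50, 500        # hypothetical dimension and number of contexts

# Consistent usage: every context points close to one common direction.
base = unit([rng.gauss(0, 1) for _ in range(DIM)])
consistent = [unit([b + 0.05 * rng.gauss(0, 1) for b in base]) for _ in range(N)]

# Diverse usage: every context points in an unrelated random direction.
diverse = [unit([rng.gauss(0, 1) for _ in range(DIM)]) for _ in range(N)]

len_consistent = mean_vector_length(consistent)
len_diverse = mean_vector_length(diverse)
print(f"consistent contexts: {len_consistent:.3f}")
print(f"diverse contexts:    {len_diverse:.3f}")
```

Under these assumptions the mean over consistent contexts stays close to unit length, while the mean over unrelated contexts shrinks roughly as the inverse square root of the number of contexts.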
...
Figure 3: Word vector length v versus term frequency
tf of all words in the hep-th vocabulary.
Note the logarithmic scale used on the frequency
axis. The dark symbols denote bin means with the
kth bin containing the frequencies in the interval
$[2^{k-1}, 2^k - 1]$ with k = 1, 2, 3, .... These means
are included as a guide to the eye. The horizontal
line indicates the length v = 1.37 of the mean
vector
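
The bin means described in the caption could be computed along the following lines. This is a minimal sketch, assuming the data are available as (tf, v) pairs with integer term frequency tf >= 1. The function names and the sample values are hypothetical and are not taken from the hep-th data.

```python
from collections import defaultdict


def bin_index(tf):
    """Return k such that 2**(k-1) <= tf <= 2**k - 1."""
    if tf < 1:
        raise ValueError("term frequency must be a positive integer")
    # For a positive integer, the number of binary digits is exactly k.
    return tf.bit_length()


def bin_means(points):
    """Mean vector length per frequency bin.

    points: iterable of (tf, v) pairs.
    Returns a dict mapping bin number k to the mean of v in that bin.
    """
    totals = defaultdict(lambda: [0.0, 0])  # k -> [sum of v, count]
    for tf, v in points:
        acc = totals[bin_index(tf)]
        acc[0] += v
        acc[1] += 1
    return {k: s / n for k, (s, n) in sorted(totals.items())}


# Made-up sample values, for illustration only.
sample = [(1, 1.0), (2, 2.0), (3, 4.0), (4, 3.0), (7, 5.0)]
means = bin_means(sample)
print(means)
```

Bins of this form have equal width on a base-2 logarithmic frequency axis, which matches the logarithmic scale noted in the caption.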