Mining Twitter Data with Python Part 3: Term Frequencies

This article extracts meaningful terms from Twitter data through term frequency analysis, covering stop-word removal and n-grams, with Python code examples.

http://www.kdnuggets.com/2016/06/mining-twitter-data-python-part-3.html

Part 3 of this 7-part series on mining Twitter data discusses the analysis of term frequencies for meaningful term extraction.

By Marco Bonzanini, Independent Data Science Consultant.

This is the third part in a series of articles about data mining on Twitter. After collecting data and pre-processing some text, we are ready for some basic analysis. In this article, we’ll discuss the analysis of term frequencies to extract meaningful terms from our tweets.


Counting Terms

Assuming we have collected a list of tweets (see Part 1 of the tutorial), the first exploratory analysis we can perform is a simple word count. In this way, we can observe which terms are most commonly used in the data set. In this example, I’ll use the set of my tweets, so the most frequent words should correspond to the topics I discuss (not necessarily, but bear with me for a couple of paragraphs).

We can use a custom tokeniser to split the tweets into a list of terms. The following code uses the preprocess() function described in Part 2 of the tutorial, in order to capture Twitter-specific aspects of the text, such as #hashtags, @-mentions, emoticons and URLs. In order to keep track of the frequencies while we are processing the tweets, we can use collections.Counter(), which internally is a dictionary (term: count) with some useful methods like most_common():

import json
from collections import Counter
# preprocess() is the tokeniser defined in Part 2 of the tutorial
 
fname = 'mytweets.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the terms
        terms_all = [term for term in preprocess(tweet['text'])]
        # Update the counter
        count_all.update(terms_all)
    # Print the first 5 most frequent words
    print(count_all.most_common(5))


The above code will produce some unimpressive results:

[(':', 44), ('rt', 26), ('to', 26), ('and', 25), ('on', 22)]


As you can see, the most frequent words (or should I say, tokens), are not exactly meaningful.

Removing stop-words

In every language, some words are particularly common. While their use in the language is crucial, they don’t usually convey a particular meaning, especially if taken out of context. This is the case for articles, conjunctions, some adverbs, etc., which are commonly called stop-words. In the example above, we can see three common stop-words – to, and and on. Stop-word removal is one important step that should be considered during the pre-processing stages. One can build a custom list of stop-words, or use available lists (e.g. NLTK provides a simple list of English stop-words).

Given the nature of our data and our tokenisation, we should also be careful with all the punctuation marks and with terms like RT (used for re-tweets) and via (used to mention the original author of an article or a re-tweet), which are not in the default stop-word list.

from nltk.corpus import stopwords
import string

# The stop-word corpus may require a one-off download: nltk.download('stopwords')
punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via']


We can now substitute the variable terms_all in the first example with something like:

terms_stop = [term for term in preprocess(tweet['text']) if term not in stop]
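Putting it together, the counting loop then looks like the following minimal sketch (same file name as before, with preprocess() and stop defined in the earlier snippets; count_stop is just a variable name of my choosing):

import json
from collections import Counter

fname = 'mytweets.json'
with open(fname, 'r') as f:
    count_stop = Counter()
    for line in f:
        tweet = json.loads(line)
        # Keep only the terms that are not stop-words, punctuation, 'rt' or 'via'
        terms_stop = [term for term in preprocess(tweet['text'])
                      if term not in stop]
        count_stop.update(terms_stop)
    # Print the 5 most frequent terms after stop-word removal
    print(count_stop.most_common(5))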


After counting, sorting the terms and printing the top 5, this is the result:

[('python', 11), ('@miguelmalvarez', 9), ('#python', 9), ('data', 8), ('@danielasfregola', 7)]


So apparently I mostly tweet about Python and data, and the users I re-tweet most often are @miguelmalvarez and @danielasfregola; that sounds about right.

More term filters

Besides stop-word removal, we can further customise the list of terms/tokens we are interested in. Here are a few examples that you can embed in the first fragment of code:

# Count terms only once, equivalent to Document Frequency
terms_single = set(terms_all)
# Count hashtags only
terms_hash = [term for term in preprocess(tweet['text'])
              if term.startswith('#')]
# Count terms only (no hashtags, no mentions)
terms_only = [term for term in preprocess(tweet['text'])
              if term not in stop and
              not term.startswith(('#', '@'))]
              # mind the ((double brackets)):
              # startswith() takes a tuple (not a list)
              # when we pass several prefixes
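If you want to keep separate counts for each of these filtered lists, one option is to maintain one Counter per filter inside the same loop. A minimal sketch, with counter names of my own choosing and fname, stop and preprocess() as defined earlier:

import json
from collections import Counter

count_single = Counter()  # document frequency
count_hash = Counter()    # hashtags only
count_only = Counter()    # terms only, no hashtags or mentions

with open(fname, 'r') as f:
    for line in f:
        tweet = json.loads(line)
        terms_all = [term for term in preprocess(tweet['text'])]
        # set() counts each term at most once per tweet
        count_single.update(set(terms_all))
        count_hash.update(term for term in terms_all
                          if term.startswith('#'))
        count_only.update(term for term in terms_all
                          if term not in stop and
                          not term.startswith(('#', '@')))

print(count_hash.most_common(5))
print(count_only.most_common(5))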


After counting and sorting, these are my most commonly used hashtags:

[('#python', 9), ('#scala', 6), ('#nosql', 4), ('#bigdata', 3), ('#nlp', 3)]


and these are my most commonly used terms:

[('python', 11), ('data', 8), ('summarisation', 6), ('twitter', 5), ('nice', 5)]


“nice”?

While the other frequent terms represent a clear topic, more often than not simple term frequencies don’t give us a deep explanation of what the text is about. To put things in context, let’s consider sequences of two terms (a.k.a. bigrams).

from nltk import bigrams 
 
terms_bigram = bigrams(terms_stop)


The bigrams() function from NLTK will take a list of tokens and produce a list of tuples of adjacent tokens (in recent versions of NLTK it actually returns a generator, so wrap it in list() if you want to inspect it). Notice that we could use terms_all to compute the bigrams, but we would probably end up with a lot of garbage. In case we decide to analyse longer n-grams (sequences of n tokens), it could make sense to keep the stop-words, just in case we want to capture phrases like “to be or not to be”.
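As before, the bigrams can be fed into a Counter. A minimal sketch, assuming fname, stop and preprocess() from the earlier snippets, with count_bigrams a name of my own choosing:

import json
from collections import Counter
from nltk import bigrams

count_bigrams = Counter()
with open(fname, 'r') as f:
    for line in f:
        tweet = json.loads(line)
        terms_stop = [term for term in preprocess(tweet['text'])
                      if term not in stop]
        # bigrams() yields tuples of adjacent tokens, e.g. ('nice', 'article')
        count_bigrams.update(bigrams(terms_stop))
print(count_bigrams.most_common(5))

For longer n-grams, NLTK also provides ngrams(tokens, n), which can be used in the same way.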

So after counting and sorting the bigrams, this is the result:

[(('nice', 'article'), 4), (('extractive', 'summarisation'), 4), (('summarisation', 'sentence'), 3), (('short', 'paper'), 3), (('paper', 'extractive'), 2)]


So apparently I tweet about nice articles (I wouldn't bother sharing the boring ones) and extractive summarisation (the topic of my PhD dissertation). This also sounds about right.

Summary

This article has built on top of the previous ones to discuss some of the basics of extracting interesting terms from a data set of tweets, using simple term frequencies, stop-word removal and n-grams. While these approaches are extremely simple to implement, they are quite useful for getting a bird’s-eye view of the data. We have used some components of NLTK (introduced in a previous article), so we don’t have to re-invent the wheel.

Bio: Marco Bonzanini is a Data Scientist based in London, UK. Active in the PyData community, he enjoys working in text analytics and data mining applications. He's the author of "Mastering Social Media Mining with Python" (Packt Publishing, July 2016).

Original. Reposted with permission.

