Mining Twitter Data with Python Part 3: Term Frequencies

This article extracts meaningful terms from Twitter data through term frequency analysis, covering stop-word removal and n-grams, with Python code examples.

http://www.kdnuggets.com/2016/06/mining-twitter-data-python-part-3.html

Part 3 of this 7-part series on mining Twitter data discusses the analysis of term frequencies for meaningful term extraction.

By Marco Bonzanini, Independent Data Science Consultant.

This is the third part in a series of articles about data mining on Twitter. After collecting data and pre-processing some text, we are ready for some basic analysis. In this article, we’ll discuss the analysis of term frequencies to extract meaningful terms from our tweets.


Counting Terms

Assuming we have collected a list of tweets (see Part 1 of the tutorial), the first exploratory analysis we can perform is a simple word count. In this way, we can observe which terms are most commonly used in the data set. In this example, I’ll use the set of my tweets, so the most frequent words should correspond to the topics I discuss (not necessarily, but bear with me for a couple of paragraphs).

We can use a custom tokeniser to split the tweets into a list of terms. The following code uses the preprocess() function described in Part 2 of the tutorial, in order to capture Twitter-specific aspects of the text, such as #hashtags, @-mentions, emoticons and URLs. In order to keep track of the frequencies while we are processing the tweets, we can use collections.Counter(), which internally is a dictionary (term: count) with some useful methods like most_common():

import json
from collections import Counter
# preprocess() is the tokeniser defined in Part 2 of the tutorial
 
fname = 'mytweets.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the terms
        terms_all = [term for term in preprocess(tweet['text'])]
        # Update the counter
        count_all.update(terms_all)
    # Print the first 5 most frequent words
    print(count_all.most_common(5))


The above code will produce some unimpressive results:

[(':', 44), ('rt', 26), ('to', 26), ('and', 25), ('on', 22)]


As you can see, the most frequent words (or should I say, tokens), are not exactly meaningful.

Removing stop-words

In every language, some words are particularly common. While their use in the language is crucial, they don’t usually convey a particular meaning, especially if taken out of context. This is the case for articles, conjunctions, some adverbs, etc., which are commonly called stop-words. In the example above, we can see three common stop-words – to, and and on. Stop-word removal is one important step that should be considered during the pre-processing stages. One can build a custom list of stop-words, or use available lists (e.g. NLTK provides a simple list of English stop-words).

Given the nature of our data and our tokenisation, we should also be careful with all the punctuation marks and with terms like RT (used for re-tweets) and via (used to mention the original author of an article or a re-tweet), which are not in the default stop-word list.

from nltk.corpus import stopwords
import string

# The stop-word corpus may require a one-off download: nltk.download('stopwords')
punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via']


We can now substitute the variable terms_all in the first example with something like:

terms_stop = [term for term in preprocess(tweet['text']) if term not in stop]
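Putting it together, the counting loop then looks like the following minimal sketch (same file name as before, with preprocess() and stop defined in the earlier snippets; count_stop is just a variable name of my choosing):

import json
from collections import Counter

fname = 'mytweets.json'
with open(fname, 'r') as f:
    count_stop = Counter()
    for line in f:
        tweet = json.loads(line)
        # Keep only the terms that are not stop-words, punctuation, 'rt' or 'via'
        terms_stop = [term for term in preprocess(tweet['text'])
                      if term not in stop]
        count_stop.update(terms_stop)
    # Print the 5 most frequent terms after stop-word removal
    print(count_stop.most_common(5))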


After counting, sorting the terms and printing the top 5, this is the result:

[('python', 11), ('@miguelmalvarez', 9), ('#python', 9), ('data', 8), ('@danielasfregola', 7)]


So apparently I mostly tweet about Python and data, and the users I re-tweet most often are @miguelmalvarez and @danielasfregola; that sounds about right.

More term filters

Besides stop-word removal, we can further customise the list of terms/tokens we are interested in. Here are a few examples that you can embed in the first fragment of code:

# Count terms only once, equivalent to Document Frequency
terms_single = set(terms_all)
# Count hashtags only
terms_hash = [term for term in preprocess(tweet['text'])
              if term.startswith('#')]
# Count terms only (no hashtags, no mentions)
terms_only = [term for term in preprocess(tweet['text'])
              if term not in stop and
              not term.startswith(('#', '@'))]
              # mind the ((double brackets)):
              # startswith() takes a tuple (not a list)
              # when we pass several prefixes
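If you want to keep separate counts for each of these filtered lists, one option is to maintain one Counter per filter inside the same loop. A minimal sketch, with counter names of my own choosing and fname, stop and preprocess() as defined earlier:

import json
from collections import Counter

count_single = Counter()  # document frequency
count_hash = Counter()    # hashtags only
count_only = Counter()    # terms only, no hashtags or mentions

with open(fname, 'r') as f:
    for line in f:
        tweet = json.loads(line)
        terms_all = [term for term in preprocess(tweet['text'])]
        # set() counts each term at most once per tweet
        count_single.update(set(terms_all))
        count_hash.update(term for term in terms_all
                          if term.startswith('#'))
        count_only.update(term for term in terms_all
                          if term not in stop and
                          not term.startswith(('#', '@')))

print(count_hash.most_common(5))
print(count_only.most_common(5))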


After counting and sorting, these are my most commonly used hashtags:

[('#python', 9), ('#scala', 6), ('#nosql', 4), ('#bigdata', 3), ('#nlp', 3)]


and these are my most commonly used terms:

[('python', 11), ('data', 8), ('summarisation', 6), ('twitter', 5), ('nice', 5)]


“nice”?

While the other frequent terms represent a clear topic, more often than not simple term frequencies don’t give us a deep explanation of what the text is about. To put things in context, let’s consider sequences of two terms (a.k.a. bigrams).

from nltk import bigrams 
 
terms_bigram = bigrams(terms_stop)


The bigrams() function from NLTK will take a list of tokens and produce a list of tuples of adjacent tokens (in recent versions of NLTK it actually returns a generator, so wrap it in list() if you want to inspect it). Notice that we could use terms_all to compute the bigrams, but we would probably end up with a lot of garbage. In case we decide to analyse longer n-grams (sequences of n tokens), it could make sense to keep the stop-words, just in case we want to capture phrases like “to be or not to be”.
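As before, the bigrams can be fed into a Counter. A minimal sketch, assuming fname, stop and preprocess() from the earlier snippets, with count_bigrams a name of my own choosing:

import json
from collections import Counter
from nltk import bigrams

count_bigrams = Counter()
with open(fname, 'r') as f:
    for line in f:
        tweet = json.loads(line)
        terms_stop = [term for term in preprocess(tweet['text'])
                      if term not in stop]
        # bigrams() yields tuples of adjacent tokens, e.g. ('nice', 'article')
        count_bigrams.update(bigrams(terms_stop))
print(count_bigrams.most_common(5))

For longer n-grams, NLTK also provides ngrams(tokens, n), which can be used in the same way.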

So after counting and sorting the bigrams, this is the result:

[(('nice', 'article'), 4), (('extractive', 'summarisation'), 4), (('summarisation', 'sentence'), 3), (('short', 'paper'), 3), (('paper', 'extractive'), 2)]


So apparently I tweet about nice articles (I wouldn't bother sharing the boring ones) and extractive summarisation (the topic of my PhD dissertation). This also sounds about right.

Summary

This article has built on top of the previous ones to discuss some of the basics of extracting interesting terms from a data set of tweets, using simple term frequencies, stop-word removal and n-grams. While these approaches are extremely simple to implement, they are quite useful for getting a bird’s-eye view of the data. We have used some components of NLTK (introduced in a previous article), so we don’t have to re-invent the wheel.

Bio: Marco Bonzanini is a Data Scientist based in London, UK. Active in the PyData community, he enjoys working in text analytics and data mining applications. He's the author of "Mastering Social Media Mining with Python" (Packt Publishing, July 2016).

Original. Reposted with permission.

