Tianchi NLP Beginner Competition - News Text Classification - Data Analysis

Latest recommended article published on 2023-05-07 11:06:39

Licensed under CC 4.0 BY-SA

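This article introduces the concept of hashable objects in Python and notes that the built-in immutable types, such as int, float, and tuple, are the ones that can be hashed. It shows by example how to count word frequencies with `collections.Counter` and how to sort the result by frequency in descending order with `sorted()`. It also covers how to count the number of sentences each word appears in: deduplicate the words of each text with a set, then count the result with `Counter`.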

What is hashable?

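An object is hashable if it has a hash value that never changes during its lifetime. In practice, the built-in immutable types are hashable and the built-in mutable containers are not. Besides calling the hash function, you can check whether a class supports hashing by running e.g. `dir(tuple)` and looking for the `__hash__` method. Here are some examples: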

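# x = hash(set([1, 2]))       # TypeError: set is mutable, unhashable
x = hash(frozenset([1, 2]))   # frozenset is immutable, hashable
# x = hash(([1, 2], [2, 3]))  # TypeError: tuple containing mutable lists, unhashable
x = hash((1, 2, 3))           # tuple of immutable objects, hashable
# x = hash()                  # TypeError: hash() takes exactly one argument
# x = hash({1, 2})            # TypeError: this is a set literal, unhashable
# x = hash([1, 2, 3])         # TypeError: a list is unhashable even if its items are immutable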

List of immutable types:

int, float, decimal.Decimal, complex, bool, str, tuple, range, frozenset, bytes

List of mutable types:

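list, dict, set, bytearray, and instances of user-defined classes. Note that instances of user-defined classes are still hashable by default (they hash by identity), unless the class defines `__eq__` without also defining `__hash__`.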
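
One hedged way to check hashability at runtime is to call `hash()` and catch the `TypeError`. The helper `is_hashable` and the class `Point` below are hypothetical names introduced only for illustration. The sketch also shows that a tuple is hashable only if all of its items are, and that an instance of a plain user-defined class can be hashed even though it is mutable.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds (hypothetical helper)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


class Point:
    """Plain user-defined class; instances hash by identity by default."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


results = {
    "tuple of ints": is_hashable((1, 2, 3)),
    "tuple of lists": is_hashable(([1, 2], [2, 3])),
    "frozenset": is_hashable(frozenset([1, 2])),
    "set": is_hashable({1, 2}),
    "list": is_hashable([1, 2, 3]),
    "Point instance": is_hashable(Point(0, 0)),
}
for name, ok in results.items():
    print(f"{name}: {ok}")
```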

Counter

Dict subclass for counting hashable items. Sometimes called a bag or multiset. Elements are stored as dictionary keys and their counts are stored as dictionary values.

from collections import Counter
all_lines = ' '.join(list(train_df['text']))
word_count = Counter(all_lines.split(' '))  # word_count is a dict subclass, {key: value}: key is the word, value is its count
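
To make the snippet above runnable without the competition data, here is a minimal sketch that swaps `train_df['text']` for a small hand-written list. The list `texts` and its token ids are made up for illustration; the assumption is that each text is a string of space-separated tokens.

```python
from collections import Counter

# Toy stand-in for train_df['text'] (assumed format: space-separated tokens).
texts = ["3750 648 900 3750", "648 900", "3750 900 900"]

# Join all texts into one long string, then count every token.
all_lines = " ".join(texts)
word_count = Counter(all_lines.split(" "))

print(word_count["900"])   # occurrences of the token '900' across all texts
print(word_count["9999"])  # a missing key yields 0 instead of raising KeyError
```

Because `Counter` is a dict subclass, the usual dict operations such as `len()`, `.items()`, and key lookup all work on it.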

Sorted

sorted(iterable, /, *, key=None, reverse=False) returns a new list containing all items from the iterable in ascending order.

A custom key function can be supplied to customize the sort order, and the reverse flag can be set to request the result in descending order.

`word_count.items()` returns a view (not a list) of the dictionary's (key, value) tuple pairs, e.g. dict_items([('A', 'Geeks'), ('B', 4), ('C', 'Geeks')]).
`key=lambda d: d[1]` sorts by value, and `reverse=True` orders the result from largest to smallest.

word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)

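`word_count.items()` is an iterable view, so `sorted()` can process one (key, value) tuple pair at a time. `d` stands for the pair currently being processed and `d[1]` is its value, so the call sorts by value. Note that the result is a list of tuples, no longer a `Counter`.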
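
As a small sketch of the sort described above, the following uses a hand-built `Counter` with made-up counts. It also compares the result with `Counter.most_common()`, which is a standard-library alternative and not the approach used in this article.

```python
from collections import Counter

# Made-up counts for illustration.
word_count = Counter({"3750": 3, "648": 2, "900": 4})

# Sort the (word, count) pairs by count, largest first.
by_value = sorted(word_count.items(), key=lambda d: d[1], reverse=True)

# most_common() with no argument returns all pairs, most frequent first.
top = word_count.most_common()

print(by_value)
print(by_value == top)
print(word_count.most_common(2))  # only the two most frequent words
```

If only the top few words are needed, `most_common(n)` avoids writing the key function by hand.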

How do we count the words?

from collections import Counter
all_lines = ' '.join(list(train_df['text']))
word_count = Counter(all_lines.split(' '))  # dict subclass: key is the word, value is its count
# Sort the (word, count) pairs by count (d[1]), from largest to smallest
word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)
print(len(word_count))  # number of distinct words
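
Joining every text into one string can use a lot of memory on a large corpus. A possible alternative, offered here as a sketch and not as the article's method, is to update a single `Counter` one text at a time. The list `texts` is again a made-up stand-in for `train_df['text']`.

```python
from collections import Counter

# Toy stand-in for train_df['text'] (assumed format: space-separated tokens).
texts = ["3750 648 900 3750", "648 900", "3750 900 900"]

word_count = Counter()
for text in texts:
    # update() adds the counts of this text's tokens to the running totals.
    word_count.update(text.split())

vocab_size = len(word_count)             # number of distinct words
total_tokens = sum(word_count.values())  # number of tokens overall
print(vocab_size, total_tokens)
```

Note that `split()` with no argument also skips empty strings produced by repeated spaces, whereas `split(' ')` keeps them, so the two can differ on untidy input.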

How do we count the number of sentences each word appears in?

# The set makes each word, e.g. '3750', appear only once per sentence, i.e. it records which words each sentence contains.
# We do not care how many times a word occurs within one sentence;
# we care which sentences a word occurs in, and count the number of sentences per word.
train_df['text_unique'] = train_df['text'].apply(lambda x: ' '.join(list(set(x.split(' ')))))
print(train_df['text_unique'])
all_lines = ' '.join(list(train_df['text_unique']))  # turn the Series of strings into a list and join it into one long string
word_count = Counter(all_lines.split(' '))  # each word's count now equals the number of sentences containing it
print(type(word_count))  # <class 'collections.Counter'>
word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)  # list of (word, sentence count) tuples, largest first
print(type(word_count))  # <class 'list'>
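
The same per-sentence count can be sketched without pandas by feeding the set of each text's words straight into a `Counter`. This is an illustrative variant under the assumption that each text is a space-separated string; `texts`, `doc_count`, and `doc_freq` are hypothetical names.

```python
from collections import Counter

# Toy stand-in for train_df['text'] (assumed format: space-separated tokens).
texts = ["3750 648 900 3750", "648 900", "3750 900 900"]

doc_count = Counter()
for text in texts:
    # set() keeps each word once per text, so every text adds at most 1 per word.
    doc_count.update(set(text.split()))

# Sort by the number of texts containing each word, largest first.
doc_freq = sorted(doc_count.items(), key=lambda d: d[1], reverse=True)

for word, n in doc_freq:
    print(word, n, f"{n / len(texts):.0%}")  # share of texts containing the word
```

Words that appear in nearly every text are often punctuation-like tokens, which is one reason this count is useful during data analysis.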