Dictionary Word Counter

This article presents a simple program that reads multiple lines of text from the user, counts how many times each word occurs, and lists the words in alphabetical order. Case is ignored, which makes the program handy for spotting repetitive wording when writing.

Q:
Here’s something to stop you from getting repetitive when writing essays. Write a program that reads multiple lines of plain text from the user, then prints out each different word in the input with a count of how many times that word occurs. Don’t worry about punctuation or case – the input will just be words, all in lower case. Your output should list the words in alphabetical order.

For example:

Enter line: which witch
Enter line: is which
Enter line: 
is 1
which 2
witch 1

A:

counts = {}        # word -> number of times seen so far
all_words = []     # every word entered, in input order

line = input("Enter line: ").lower()

while line:        # an empty line ends the input
  for word in line.split():
    all_words.append(word)
    counts[word] = all_words.count(word)   # recount this word so far
  line = input("Enter line: ").lower()

for word in sorted(counts):   # dictionary keys in alphabetical order
  print(word, counts[word])
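
A note on the line counts[word] = all_words.count(word): it rescans the whole word list every time a word is read, which is fine for a few lines of input but does more work than necessary. Below is a minimal sketch of the same loop that simply keeps a running count instead (an alternative, not the original answer):

counts = {}

line = input("Enter line: ").lower()
while line:
  for word in line.split():
    counts[word] = counts.get(word, 0) + 1   # start at 0, add 1 per occurrence
  line = input("Enter line: ").lower()

for word in sorted(counts):
  print(word, counts[word])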
Methods used:
  • str.lower()
  • str.split()
  • list.append()
  • list.count()
  • printing a dictionary in key order: for word in sorted(dictionary)
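
For comparison, the counting can also be done with collections.Counter from the standard library; the following is a sketch of an equivalent program, assuming the same input convention (a blank line ends the input):

from collections import Counter

counts = Counter()

line = input("Enter line: ").lower()
while line:
  counts.update(line.split())    # add this line's words to the tally
  line = input("Enter line: ").lower()

for word in sorted(counts):      # alphabetical order, as required
  print(word, counts[word])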