PTA: Word Frequency Count

AcleverPig

Last modified: 2023-04-09 21:56:01

Tags: c++, algorithms

First published: 2023-04-04 19:45:42

Article link: https://blog.youkuaiyun.com/AcleverPig/article/details/129960393

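Write a program that, given a passage of English text, counts the number of distinct words in it and reports the top 10% of words by frequency.

A "word" is a contiguous string of no more than 80 word characters; a word longer than 15 characters is truncated so that only its first 15 word characters are kept. The valid "word characters" are uppercase and lowercase letters, digits, and underscores; every other character is treated as a word separator.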
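
The word definition can be sketched as two small helpers. This is only an illustration: the names `is_word_char` and `normalize` are hypothetical and do not appear in the original solution, and the lowercasing anticipates the case-insensitive comparison required by the output format.

```cpp
#include <string>

// Hypothetical helper: true for letters, digits and underscore,
// the only characters the problem treats as part of a word.
bool is_word_char(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '_';
}

// Hypothetical helper: keep at most the first 15 characters of a raw word
// and fold ASCII uppercase letters to lowercase.
std::string normalize(const std::string &raw) {
    std::string w = raw.substr(0, 15);
    for (char &c : w) {
        if (c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
    }
    return w;
}
```

With these helpers, `Longlonglonglongword` and `longlonglonglonee` from the sample both normalize to `longlonglonglon`, so they count as the same word.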

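Input format:

The input is a non-empty passage of text that ends with the symbol #. The input is guaranteed to contain at least 10 distinct words.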

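Output format:

On the first line, print the number of distinct words in the text. Note that words are case-insensitive; for example, "PAT" and "pat" are considered the same word.

Then, in order of decreasing frequency, print the top 10% most frequent words in the format frequency:word. Ties are printed in ascending lexicographic order.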
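
The ordering rule can be sketched as follows. The function name `top_tenth` is hypothetical, and rounding the 10% down is an assumption inferred from the sample, where 23 distinct words produce 2 output lines.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (count, word) pairs for the top 10% of words,
// ordered by count descending, then by word ascending.
// Assumption: the 10% is rounded down (integer division).
std::vector<std::pair<int, std::string>> top_tenth(const std::map<std::string, int> &freq) {
    std::vector<std::pair<int, std::string>> v;
    for (const auto &kv : freq) v.emplace_back(kv.second, kv.first);
    std::sort(v.begin(), v.end(),
              [](const std::pair<int, std::string> &a, const std::pair<int, std::string> &b) {
                  if (a.first != b.first) return a.first > b.first; // higher count first
                  return a.second < b.second;                       // ties: smaller word first
              });
    v.resize(freq.size() / 10);
    return v;
}
```

In the sample, "is" and "the" both occur 4 times, and the tie-break on the word is what puts "is" on the second output line.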

Sample input:

This is a test.

The word "this" is the word with the highest frequency.

Longlonglonglongword should be cut off, so is considered as the same as longlonglonglonee.  But this_8 is different than this, and this, and this...#
this line should be ignored.

Sample output:

23
5:this
4:is

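Approach:

1. Preprocess the string: lowercase the letters and turn every separator character into a space;

2. Use a string stream to extract each word and store it with its count in a map;

3. Copy the map entries into structs and sort them.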
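
The three steps above can also be sketched as one function that takes the whole text as a string, which makes it easy to check against the sample. Note that this sketch scans character by character instead of using a string stream, so it is a variant of the approach rather than the author's code; the name `word_report` is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical function: returns the complete output for the given text.
// Assumption: the text contains a terminating '#', as the problem guarantees.
std::string word_report(const std::string &text) {
    std::map<std::string, int> freq;
    std::string cur;
    for (char c : text) {
        bool upper = c >= 'A' && c <= 'Z';
        bool word = upper || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') || c == '_';
        if (word) {
            // Step 1: fold case and keep only the first 15 characters.
            if (cur.size() < 15) cur += upper ? static_cast<char>(c - 'A' + 'a') : c;
        } else {
            // Step 2: a separator closes the current word, if there is one.
            if (!cur.empty()) {
                ++freq[cur];
                cur.clear();
            }
            if (c == '#') break; // everything after '#' is ignored
        }
    }
    // Step 3: sort by count descending, then by word ascending.
    std::vector<std::pair<std::string, int>> v(freq.begin(), freq.end());
    std::sort(v.begin(), v.end(),
              [](const std::pair<std::string, int> &a, const std::pair<std::string, int> &b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });
    std::ostringstream out;
    out << freq.size() << '\n';
    for (std::size_t i = 0; i < freq.size() / 10; ++i)
        out << v[i].second << ':' << v[i].first << '\n';
    return out.str();
}
```

Called on the sample input, `word_report` returns the three lines `23`, `5:this` and `4:is`, each followed by a newline.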

Code:

#include <bits/stdc++.h>
using namespace std;

struct node {
	string word;
	int num;
};

// Sort by frequency descending; break ties by ascending lexicographic order.
bool cmp(const node &a, const node &b) {
	if (a.num != b.num) return a.num > b.num;
	return a.word < b.word;
}

int main() {
	string s;
	map<string, int> mp;
	bool done = false;
	while (!done && getline(cin, s)) {
		// Cut the line at '#': this handles "xxxx#", a '#' in the middle of a line,
		// and empty lines (the original s[s.size()-1] was undefined for an empty line).
		size_t pos = s.find('#');
		if (pos != string::npos) {
			s.erase(pos);
			done = true;
		}
		for (size_t i = 0; i < s.size(); i++) {
			if (s[i] >= 'A' && s[i] <= 'Z') s[i] += 32; // uppercase to lowercase
			else if (!((s[i] >= 'a' && s[i] <= 'z') || s[i] == '_' || (s[i] >= '0' && s[i] <= '9')))
				s[i] = ' '; // anything outside the word-character set becomes a space
		}
		stringstream ss(s); // use a string stream to extract each word
		string w;
		while (ss >> w) {
			mp[w.substr(0, 15)]++; // words longer than 15 characters keep only the first 15
		}
	}
	vector<node> a; // a vector avoids the fixed-size array overflowing on large inputs
	for (const auto &i : mp) a.push_back({i.first, i.second});
	sort(a.begin(), a.end(), cmp); // copy into structs and sort to get the answer
	cout << mp.size() << endl; // remember to print the number of distinct words
	for (size_t i = 0; i < mp.size() / 10; i++) {
		cout << a[i].num << ":" << a[i].word << endl;
	}
}

--------------------------------------------------------------------------------------------------------------------------------

Remember to print the number of distinct words! I got Wrong Answer on many submissions because I forgot to print it.