Counting URL Frequencies

coder_oyang

License: CC 4.0 BY-SA

Category: Machine Learning Algorithms. Tags: hashmap, machine learning, Baidu, frequency counting

Article link: https://blog.youkuaiyun.com/coder_oyang/article/details/48520705

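A written-test question from Baidu's 2015 campus recruitment (machine learning)

Given a URL file in which each line is one URL, possibly with duplicates:

1. Count the frequency of each URL, and design a function that implements this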

#include <fstream>
#include <string>
#include <iostream>
#include <unordered_map>
using namespace std;
int main()
{
	string str;
	ifstream infile;
	ofstream outfile;
	unordered_map<string, int> wordCount;
	unordered_map<string, int>::iterator iter;

	infile.open("in.txt");
	// Check that the input file opened before creating the output file
	if (!infile)
	{
		cerr << "error: unable to open input file: in.txt" << endl;
		return -1;
	}
	outfile.open("out.txt");
	if (!outfile)
	{
		cerr << "error: unable to open output file: out.txt" << endl;
		return -1;
	}
	// Each whitespace-separated token is read as one URL (one per line in the input)
	while (infile >> str)
	{
		wordCount[str]++; // count occurrences of each URL
	}
	for (iter = wordCount.begin(); iter != wordCount.end(); ++iter)
	{
		cout << iter->first << ":" << iter->second << endl; // standard output
		outfile << iter->first << ":" << iter->second << endl; // write to file
	}
	infile.close();
	outfile.close();
	return 0;
}

The point being tested here is the use of a hash map.

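2. Suppose there are 1 billion URLs with an average length of 20, and the machine has 8 GB of memory. How would you handle this? Describe your approach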

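We know that memory is 8 GB, while 1 billion URLs take about 10^9 x 20 bytes = 20 GB as raw text alone. The URLs clearly cannot all fit in memory, so we handle them as follows: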

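(1) Split the URLs across several files. Suppose we split them into 30 files.

A hash function scatters the URLs into different files by mapping each string to an integer: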
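unsigned int BKDRHash(const char *str)
{
	unsigned int seed = 131; // 31 131 1313 13131 131313 etc..
	unsigned int hash = 0;

	while (*str)
	{
		// Cast to unsigned char so bytes >= 0x80 hash the same whether or not char is signed
		hash = hash * seed + (unsigned char)(*str++);
	}

	return (hash & 0x7FFFFFFF); // keep the low 31 bits
}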
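Then compute BKDRHash(str) % 30 and, based on the remainder, write the URL to the corresponding txt file. Identical URLs always produce the same hash value, so every copy of a given URL ends up in the same file.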

(2) Count the frequency of each URL within each txt file. Under our assumption this gives 30 arrays, one per file.

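(3) Finally, combine all the arrays into the overall result. No URL appears in more than one array, so no counts need to be added together across files.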