大数据高频词算法实战（一）

最新推荐文章于 2025-09-25 23:03:52 发布

最新推荐文章于 2025-09-25 23:03:52 发布 · 951 阅读

文章标签：

#大数据 #高频词 #出现次数最多的元素 #前100个最频繁元素

算法和数据结构同时被 2 个专栏收录

10 篇文章

订阅专栏

智力考场

6 篇文章

订阅专栏

本文介绍了如何使用特定算法处理3G大小的文本文件，找出其中出现频率最高的前100个字符串。通过优化算法和空间分配策略，实现了在有限内存条件下高效的数据处理。此外，文章还提出了进一步优化方案和潜在的改进方向。

一、序言

我们群里发了了一个挑战，题目大概是：2亿随即字符串，在一个txt 文本里面，找出出现频率最高的前100 个字符串，双核CPU，4G 内存，当然JVM 只开了1G。

其实类似的题目，很多公司也都有了，但是可能思想说得多，实战稍微少点，这里我抽空也写了一种通用的，凡是上诉题目都可以按方法进行处理，也做可以做其他扩展和优化。

二、设计原理

1.我用他们提供的字符串生产文件，接近3G，肯定不能全部放入内存

2.我创建了一个2亿的int 数组，但是由于想hash 分布均匀，采用hashMap的算法模式，需要额外的空间，也就是268435456，相当于2.6亿空间，要刚好1G的内存，加上其他需要的空间分配，超过了限制，我这里分配的是1200M空间，当然原理类似，这里我们没对算法做更多的优化，后面有时间再单独研究。bits = new int[2.6亿];

3.然后读取每一行数据，进行hash 计算，以及index = indexFor(int hash,int length) 确定位置。这样我们就知道每一行数据存放的位置，然后进行 bit[index] ++运算，也就是说我们就可以知道所有字符串所出现的次数，当然可能会出现hash冲突，几率很小，这种算法允许低冲突率。

4.同时我们需要建立一个数组new Entry[100],这里用于存放我们的高频词的位置，当我们遍历每一行的同时，还需要将该行字符串出现的次数保存，按出现次数保留前100.最后就是我们要的结果。

三、实现代码

/**
 * 工具类，处理每行信息记录的数据结构
 * 未解决hash冲突，暂时允许一部分重复，需要改进
 * 这是初级版本，后续在改变
 * @author Michael_Ran
 * @param <V>
 */
public class CountMap<V> {
	// 内容数组
	transient Entry<V>[] table;
	// 下标数组
	transient int[] bits;;
	// 光标移动位置，实际长度
	transient int size;
	// 实际允许长度
	transient int threshold;
	
	// 下标位置
	transient int position;
	// 容量长度
	transient static int bits_capacity = 16;

	// 初始化长度
	@SuppressWarnings("unchecked")
	public CountMap(int initTopNum,int initTotalLines) {
		int capacity = 1;
		// 初始化内容数组
		while (capacity < initTopNum)
			capacity <<= 1;
		table = new Entry[capacity];
		
		
		// 初始化下标数组
		while(bits_capacity < initTotalLines)
			bits_capacity <<= 1;
		bits = new int[bits_capacity];
		this.threshold = initTopNum;
	}

	// 判断是否存在
	boolean contains(Object value) {
		for (Entry<V> e : table) {
			if (e != null && hash(value.hashCode()) == e.hash)
				return true;
		}
		return false;
	}

	// 计算hash，这里需要进行改进
	int hash(int h) {
		h ^= (h >>> 20) ^ (h >>> 12);
		return h ^ (h >>> 7) ^ (h >>> 4);
                // ConcurrentHashMap 的算法
                // h += (h <<  15) ^ 0xffffcd7d;
                // h ^= (h >>> 10);
                // h += (h <<   3);
                // h ^= (h >>>  6);
                // h += (h <<   2) + (h << 14);
                // return h ^ (h >>> 16);
	}

	// 计算位置
	int indexFor(int h, int length) {
		return h & (length - 1);
	}
	
    // 计算容量位置
	int putBits(V value,int hash){
		int index = indexFor(hash,bits_capacity);
		return ++ bits[index];
	}
	
	
	

	// 存放
	public void put(V value) {
		int hash = hash(value.hashCode());
		int count = putBits(value,hash);
		// 未存满
		if (size < threshold) {
			if (count == 1) {
				Entry<V> newEntry = new Entry<V>(value, hash);
				table[size] = newEntry;
				if (size > 0) {
					table[size - 1].after = newEntry;
					newEntry.before = table[size - 1];
				}
				size++;
			} else {
				for (int i = 0; i < threshold; i++) {
					Entry<V> e = table[i];
					// 获取最小的元素位置,直接替换
					if (e.value.equals(value)) {
						e.count = count;
						e.hash = hash;
						return;
					}
				}
			}
		}

		// 已经存满，因此从个数大于1开始寻找
		if (count > 1) {
			if(position == threshold){
				position = 0;
			}
			for (int i = position; i < threshold; i++) {
				Entry<V> e = table[i];
				if (e.count < count) {
					replace(e, value, count, hash);
					return;
				}
			}
		}
	}
	
	public void replace(Entry<V> oldEntry,V value,int count,int hash){
		oldEntry.count = count;
		oldEntry.value = value;
		oldEntry.hash = hash;
	}

	// 获取对象
	public Entry<V> getEntry(Object value) {
		for (Entry<V> e : table) {
			if (e != null && e.equals(value))
				return e;
		}
		return null;
	}

	// 获得集合
	public Entry<V>[] getEntries() {
		return table;
	}
	
	// 获得实际长度
	public int size() {
		return size;
	}

	// 内部类
	static class Entry<V> {
		int count, hash;
		V value;
		Entry<V> before, after;

		public Entry(V value, int hash) {
			super();
			this.count = 1;
			this.value = value;
			this.hash = hash;
		}

		// 自增
		public int incrementAndGet() {
			return count++;
		}

		@Override
		public String toString() {
			return this.value + ":" + count;
		}
	}

}

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

public class T {
	// 文件路径
	static final String FILEPATH = "d:/b.txt";
	// 出现次数排名最多的数
	static int TOP_NUM = 100;
	// 文件行数，只要大于总行数就行，默认2亿
	static int TOTAL_LINES = 200000000;
	// 容器
	static CountMap<String> map = new CountMap<String>(TOP_NUM, TOTAL_LINES);	
	public static void main(String[] args) throws ClassNotFoundException, InstantiationException, IllegalAccessException {
		long begin = System.currentTimeMillis();
		// 计算
		getTopN(FILEPATH);
	    long result = System.currentTimeMillis()-begin;
        System.out.println("总时间为:"+result +"毫秒");
        
        System.out.println("-------数据结果---------");
	    for(int i = 0;i<map.size();i++){
	    	System.out.println(map.getEntries()[i]);
	    }
	}
	


	// 计算
	public static void getTopN(String filepath){
		BufferedReader bw = null;
		try {
			File f = new File(filepath);
			bw = new BufferedReader(new FileReader(f));
			String str = null;
			while ((str = bw.readLine()) != null) {
				map.put(str);
			}
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			try {
				if(bw != null){
					bw.close();
				}
			} catch (IOException e) {
				e.printStackTrace();
			}
		}
    }

}

四、优化方案

1.可以看出我们仍然需要比较大的空间，我们可以改进hash算法，不用那么大的空间

2.在重复率比较高的情况下，我们大概知道不重复数据只有N 的时候，就根据N 创建空间

3.在上面都不能实现的情况，我们可以分块空间，比如我们刚才最糟糕的情况，全都不重复，需要2亿空间，实际空间接近2.7亿，浪费很大。那么我们可以我可以开辟2个空间，第一个在获得2.7亿之前，也就是小于2亿的的空间，然后计算这部分行数的排行，同理计算第二批空间的排行，这样可能出现有一部分字符串分布不均匀和均匀问题，导致汇总数据不对，那么我们可以对每个空间进行扩大范围查询，也就是我们可以对第一个空间查询100*10 的排行，第二个空间同理，最后再汇总计算前100，可以避免很大部分误差。

4.上面是以空间换时间，我们可以采用链表结构，不创建那么大的数组，以链表的形式记录每行的hash值，以及count次数，然后每次遍历进行循环，这样链表节点必定小于不重复的数据。这里需要做链表节点的优化，最好用hash树的形式，节约查找的时间。同时也可以用hash范围空间进行查找，反正得减少每次循环的次数。

小结：

1.上面方法临时写的，很多没来得及优化，大概想了一下方法，为了减少一定比较，也没直接继承hashMap，欢迎大家吐槽！

2.关于hash 分布的问题，我也考虑过寻址以及链表的形式或者树的形式，算法这块不强，有熟悉的朋友可以介绍接晒。

3.对于现在数据，几亿数据，10G以内的文件，我相信这种方法在现在计算机内存运算来说还比较快的，虽然损耗了一部分准确率。

4.大数据很多的处理，可以先采用文件分割，然后多线程(和CPU数量有关)处理，放大结果集，然后合并的也是不错的选择。

5.不好的地方请指教，有更好的方法欢迎提供，非常感谢！后续有更好的，我也会给大家分享。

6.JDK 1.8 有个特性，好像是用堆外内存来处理OOM，也就是说暂时可以借系统内存，也可以限定范围，有兴趣的朋友可以研究研究！