Finding the Top 100 Search Terms in a Massive Query Log

Original post · last modified 2025-05-21 23:02:28 · 1.4k views

License: CC 4.0 BY-SA

Tags: #algorithms #interview

First published 2025-05-20 00:17:34

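To count a massive volume of search terms and extract the 100 most frequent ones, you can combine divide and conquer (hash partitioning) with a bounded min-heap. The steps are as follows: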

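1. Sharding and hash partitioning

Steps: Split the raw data into shards, each small enough to process in memory. Hash every term in a shard so that identical terms always land in the same intermediate file.
Implementation:
- Use a hash function (such as MD5 or MurmurHash) to map each term to one of a fixed number (R) of intermediate files, e.g. hash(word) % R.
- After a shard is processed, it has produced up to R intermediate files, each holding the terms whose hash maps to that file.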
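The partitioning rule above can be sketched as follows. This is a minimal illustration, assuming MD5 from Python's `hashlib` as the stable hash; the function name `partition_index`, the sample words, and `R = 8` are hypothetical. A stable hash matters here because Python's built-in `hash()` on strings is randomized per process by default, so separate worker processes could disagree on which file a term belongs to.

```python
import hashlib

def partition_index(word: str, R: int) -> int:
    """Map a word to one of R partitions using a process-independent hash."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce modulo R
    return int.from_bytes(digest[:8], "big") % R

R = 8  # hypothetical number of intermediate files
words = ["weather", "news", "weather", "python", "news", "weather"]
assignments = [partition_index(w, R) for w in words]
```

Every occurrence of the same term gets the same index, which is what lets a later stage total a term's count by looking at a single partition.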

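2. Count term frequencies per shard (Map phase)

Steps: Process the shards one at a time. Count term frequencies within each shard, then route each (term, count) pair to an intermediate file by hash.

Implementation: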

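import zlib
from collections import defaultdict

def map_function(input_chunk, R):
    # Pre-aggregate counts within this shard to shrink the intermediate output
    word_counts = defaultdict(int)
    for word in input_chunk:
        word_counts[word] += 1
    for word, count in word_counts.items():
        # Built-in hash() is salted per process for str, so use a stable hash
        hash_idx = zlib.crc32(word.encode("utf-8")) % R
        # write_to_intermediate_file is a placeholder for the storage layer
        write_to_intermediate_file(hash_idx, (word, count))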
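To make the map phase concrete, here is one possible file-backed version. It is a sketch under several assumptions that are not in the original: the function name `map_shard`, the `part-NNNN.tsv` file naming, the tab-separated row format, CRC32 as the stable hash, and shards being processed sequentially (concurrent appends to the same file would need locking or one file per shard). It also assumes terms contain no tab or newline characters.

```python
import os
import tempfile
import zlib
from collections import Counter

def map_shard(words, R, out_dir):
    """Count words in one shard and append (word, count) rows to partition files."""
    counts = Counter(words)
    # Group rows by partition first so each file is opened only once per shard
    rows = {}
    for word, count in counts.items():
        idx = zlib.crc32(word.encode("utf-8")) % R
        rows.setdefault(idx, []).append(f"{word}\t{count}\n")
    for idx, lines in rows.items():
        path = os.path.join(out_dir, f"part-{idx:04d}.tsv")
        with open(path, "a", encoding="utf-8") as f:
            f.writelines(lines)

# Two toy shards written into a temporary directory
out_dir = tempfile.mkdtemp()
map_shard(["news", "weather", "news"], R=4, out_dir=out_dir)
map_shard(["news", "python"], R=4, out_dir=out_dir)
```

After both calls, the rows for "news" (a count of 2 from the first shard and 1 from the second) sit in the same partition file, ready to be summed.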

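3. Merge intermediate files into global term counts (Reduce phase)

Steps: For each hash partition, read the intermediate files that every shard wrote for it and sum the counts of identical terms. Because a term always hashes to the same partition, these sums are its global totals.

Implementation: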

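from collections import defaultdict

def reduce_function(intermediate_files):
    # Call once per partition, passing that partition's files from every shard,
    # so only one partition's vocabulary is held in memory at a time.
    global_counts = defaultdict(int)
    for file in intermediate_files:
        for word, count in read_intermediate_file(file):
            global_counts[word] += count
    # Write this partition's final totals to its output file
    for word, count in global_counts.items():
        write_output_file(word, count)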
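The reason each partition can be reduced independently is that no term appears in more than one partition. The following in-memory simulation is a sketch of that property, not the file-based pipeline itself; `stable_partition`, `run_map_reduce`, the toy shards, and the use of CRC32 are all illustrative assumptions.

```python
import zlib
from collections import Counter, defaultdict

def stable_partition(word, R):
    """Process-independent partition index for a word."""
    return zlib.crc32(word.encode("utf-8")) % R

def run_map_reduce(shards, R):
    """Simulate map and reduce in memory; returns one Counter per partition."""
    # Map: per-shard counts are routed to partitions by hash
    partitions = defaultdict(list)
    for shard in shards:
        for word, count in Counter(shard).items():
            partitions[stable_partition(word, R)].append((word, count))
    # Reduce: each partition is summed on its own, without seeing the others
    results = {}
    for idx, pairs in partitions.items():
        totals = Counter()
        for word, count in pairs:
            totals[word] += count
        results[idx] = totals
    return results

shards = [["news", "weather", "news"], ["weather", "python"], ["news"]]
per_partition = run_map_reduce(shards, R=3)

# Merging the per-partition results reproduces a plain global count
merged = Counter()
for totals in per_partition.values():
    merged.update(totals)
```

Since the partitions are disjoint, the peak memory of the reduce phase is bounded by the largest single partition rather than by the full vocabulary.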

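4. Extract the top 100 with a min-heap

Steps: Iterate over all global counts while maintaining a min-heap of size 100, so the heap always holds the 100 most frequent terms seen so far.

Implementation: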

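import heapq

def get_top100(output_files):
    heap = []  # min-heap of (count, word); heap[0] holds the smallest count kept
    for file in output_files:
        for word, count in read_output_file(file):
            if len(heap) < 100:
                heapq.heappush(heap, (count, word))
            elif count > heap[0][0]:
                # Evict the current minimum and insert the new entry in one step
                heapq.heapreplace(heap, (count, word))
    # A heap is only partially ordered, so sort to get descending counts
    return sorted(heap, reverse=True)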
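The bounded-heap selection can be tried on a small in-memory input. The sketch below generalizes the size 100 to a parameter `k`; the function name `top_k` and the sample totals are made up for illustration.

```python
import heapq

def top_k(pairs, k):
    """Return the k (count, word) tuples with the largest counts, descending."""
    heap = []
    for word, count in pairs:
        if len(heap) < k:
            heapq.heappush(heap, (count, word))
        elif count > heap[0][0]:
            # New entry beats the smallest kept count, so swap it in
            heapq.heapreplace(heap, (count, word))
    return sorted(heap, reverse=True)

totals = [("news", 42), ("weather", 17), ("python", 99), ("maps", 5), ("music", 63)]
top3 = top_k(totals, 3)
# top3 == [(99, "python"), (63, "music"), (42, "news")]
```

Each term costs at most O(log k) heap work, so N distinct terms take O(N log k) time and O(k) memory. Because the comparison is strict, a term that ties with the smallest kept count is not inserted, so which of several tied terms survive depends on input order. For data that fits in memory, `heapq.nlargest` gives the same result for distinct counts.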