Excerpted from Robbins A., Beebe N. - Classic Shell Scripting - 2005
Chapter 5.
Problem:
Given a text file and an integer n, you are to print the words (and their frequencies of occurrence) whose frequencies of occurrence are among the n largest, in order of decreasing frequency. (That is: find the n most frequently occurring words in a document and display their occurrence counts.)
McIlroy’s program illustrates the power of the Unix tools approach: break a complex problem into simpler parts that you already know how to handle. To solve the
word-frequency problem, McIlroy converted the text file to a list of words, one per
line (tr does the job), mapped words to a single lettercase (tr again), sorted the list
(sort), reduced it to a list of unique words with counts (uniq), sorted that list by
descending counts (sort), and finally, printed the first several entries in the list (sed,
though head would work too).
Example 5-5. Word-frequency filter
#! /bin/sh
# Read a text stream on standard input, and output a list of
# the n (default: 25) most frequently occurring words and
# their frequency counts, in order of descending counts, on
# standard output.
#
# Usage:
# wf [n]
tr -cs A-Za-z\' '\n' |    # Replace nonletters with newlines
tr A-Z a-z |              # Map uppercase to lowercase
sort |                    # Sort the words in ascending order
uniq -c |                 # Eliminate duplicates, showing their counts
sort -k1,1nr -k2 |        # Sort by descending count, and then by ascending word
sed ${1:-25}q             # Print only the first n (default: 25) lines; see Chapter 3
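As a quick sanity check, the same pipeline can be run inline on a small sample instead of a file. This sketch reproduces the steps of Example 5-5 with the count fixed at 3 (the sample sentence here is invented for illustration):

```shell
#!/bin/sh
# Inline run of the wf pipeline from Example 5-5 on a sample sentence.
printf 'The dog chased the cat. The cat ran. The dog barked.\n' |
tr -cs A-Za-z\' '\n' |    # one word per line
tr A-Z a-z |              # fold to lowercase
sort |                    # group identical words together
uniq -c |                 # count each distinct word
sort -k1,1nr -k2 |        # most frequent first; ties in alphabetical order
sed 3q                    # keep only the top 3
```

For this input the top three lines show "the" with count 4, then "cat" and "dog" tied at 2, with the tie broken alphabetically by the secondary `-k2` sort key.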