Counting word frequency with the shell

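This article shows how to process text with a shell script and count how many times each word occurs, a simple form of text analysis. The script reads a file and uses tr, sort, uniq, and head to compute the word frequencies and print the most frequent words.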


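#!/bin/bash
# Print the n most frequent words in a text file
help() {
	echo "This shell script prints the n most frequent words in a text file"
	echo "usage: sh $0 filename n"
	echo "filename is the text file to analyze; n is the number of words to report"
	echo "sh $0 englist_statment.txt 10"
}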
	    
# Sample text, kept in the script as a no-op here-document (never executed)
: <<'EOF'

First Flight
  Mr. Johnson had never been up in an aeroplane before and he had read a lot about air accidents, so one day when a friend offered to take him for a ride in his own small plane, Mr. Johnson was very worried about accepting. Finally, however, his friend persuaded him that it was very safe, and Mr. Johnson boarded the plane.
  His friend started the engine and began to taxi onto the runway of the airport. Mr. Johnson had heard that the most dangerous part of a flight were the take-off and the landing, so he was extremely frightened and closed his eyes.
  After a minute or two he opened them again, looked out of the window of the plane, and said to his friend, Look at those people down there. They look as small as ants, dont they?
  Those are ants, answered his friend. Were still on the ground.
EOF

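# Both arguments are required; exit with a non-zero status when one is missing.
if [[ -z "$1" || -z "$2" ]]; then
	help
	exit 1
fi

if [[ -f "$1" ]]; then
	# One word per line, lowercase, count duplicates, then sort by count
	# (descending) and by word, and keep the first n lines.
	# The set is 'A-Za-z' rather than "[a-z][A-Z]" because brackets inside
	# a tr set are literal characters and would be treated as letters.
	statis=$(tr -cs 'A-Za-z' '\n' < "$1" | tr 'A-Z' 'a-z' | sort | uniq -c | sort -k1,1nr -k2,2 | head -n "$2")
	echo "$statis"
else
	help
	exit 1
fi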



[root@oracle shellscript]# sh statis_word.sh englist_statment.txt 5
     10 the
      6 and
      6 his
      5 a
      5 friend

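# If the script is called incorrectly, it prints the help text
[root@oracle shellscript]# sh statis_word.sh englist_statment.txt
This shell script prints the n most frequent words in a text file
usage: sh statis_word.sh filename n
filename is the text file to analyze; n is the number of words to report
sh statis_word.sh englist_statment.txt 10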
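The same pipeline can also be wrapped in a function. The following is only a sketch under a few assumptions that are not in the original script: the function name top_words is hypothetical, n is required to be a positive integer, and empty lines are dropped because tr emits one when the input starts with a non-letter.

```shell
# Hypothetical helper (not part of the original script): print the n most
# frequent words of a file, one "count word" pair per line.
top_words() {
    # $1 = file name, $2 = number of words to report
    [ -f "$1" ] || return 1
    # Assumption: n must be a positive integer without leading zeros.
    case $2 in
        ''|*[!0-9]*|0*) return 1 ;;
    esac
    # Split into words, lowercase, drop empty lines, count, rank, truncate.
    tr -cs 'A-Za-z' '\n' < "$1" |
        tr 'A-Z' 'a-z' |
        sed '/^$/d' |
        sort |
        uniq -c |
        sort -k1,1nr -k2,2 |
        head -n "$2"
}

# Example run on a small temporary file; the text starts with a quote
# character, which is the case that would otherwise yield an empty "word".
tmp=$(mktemp)
printf '%s\n' '"The cat and the dog. The end."' > "$tmp"
top_words "$tmp" 2
rm -f "$tmp"
```

In the example, "the" occurs three times and every other word once, so the two reported lines are "the" followed by "and", the alphabetically first of the words that tie.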




[root@oracle shellscript]# tr --help
Usage: tr [OPTION]... SET1 [SET2]
Translate, squeeze, and/or delete characters from standard input,
writing to standard output.

  -c, -C, --complement    first complement SET1
  -d, --delete            delete characters in SET1, do not translate
  -s, --squeeze-repeats   replace each input sequence of a repeated character
                            that is listed in SET1 with a single occurrence
                            of that character
  -t, --truncate-set1     first truncate SET1 to length of SET2
      --help     display this help and exit
      --version  output version information and exit

SETs are specified as strings of characters.  Most represent themselves.
Interpreted sequences are:

  \NNN            character with octal value NNN (1 to 3 octal digits)
  \\              backslash
  \a              audible BEL
  \b              backspace
  \f              form feed
  \n              new line
  \r              return
  \t              horizontal tab
  \v              vertical tab
  CHAR1-CHAR2     all characters from CHAR1 to CHAR2 in ascending order
  [CHAR*]         in SET2, copies of CHAR until length of SET1
  [CHAR*REPEAT]   REPEAT copies of CHAR, REPEAT octal if starting with 0
  [:alnum:]       all letters and digits
  [:alpha:]       all letters
  [:blank:]       all horizontal whitespace
  [:cntrl:]       all control characters
  [:digit:]       all digits
  [:graph:]       all printable characters, not including space
  [:lower:]       all lower case letters
  [:print:]       all printable characters, including space
  [:punct:]       all punctuation characters
  [:space:]       all horizontal or vertical whitespace
  [:upper:]       all upper case letters
  [:xdigit:]      all hexadecimal digits
  [=CHAR=]        all characters which are equivalent to CHAR

Translation occurs if -d is not given and both SET1 and SET2 appear.
-t may be used only when translating.  SET2 is extended to length of
SET1 by repeating its last character as necessary.  Excess characters
of SET2 are ignored.  Only [:lower:] and [:upper:] are guaranteed to
expand in ascending order; used in SET2 while translating, they may
only be used in pairs to specify case conversion.  -s uses SET1 if not
translating nor deleting; else squeezing uses SET2 and occurs after
translation or deletion.

Report bugs to <bug-coreutils@gnu.org>.



tr -cs "[A-Z][a-z]" "[\n*]"

# Testing what -c means: test0.sh is a file that contains uppercase letters, lowercase letters, and digits
[root@oracle shellscript]# more test0.sh


M C  a b 8 6

[root@oracle shellscript]# more test0.sh |tr -c "[A-Z]" "$"
$$M$C$$$$$$$$$$$

[root@oracle shellscript]# more test0.sh |tr -c "[a-z]" "$"
$$$$$$$a$b$$$$$$

[root@oracle shellscript]# more test0.sh |tr -c "[:digit:]" "$"
$$$$$$$$$$$8$6$$

The output shows that -c takes the complement of SET1: every character that is not in SET1 is replaced with the character in SET2.
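One detail is easy to miss in these tests: inside a tr set, square brackets are ordinary characters, so "[a-z]" means the letters a to z plus '[' and ']'. The short sketch below illustrates this with a made-up input string; the newline is added to SET1 only so that the line ending is left alone.

```shell
# With brackets in the set, '[' and ']' belong to SET1 and are kept.
printf 'a[b]c 1\n' | tr -c '[a-z]\n' '$'   # prints: a[b]c$$

# Without brackets, only the letters (and the newline) are kept.
printf 'a[b]c 1\n' | tr -c 'a-z\n' '$'     # prints: a$b$c$$
```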

-s squeezes each run of a repeated character into a single occurrence.
[root@oracle shellscript]# more test0.sh |tr -cs "[:digit:]" "$"
$8$6$[root@oracle shellscript]#
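As the help text above states, -s squeezes characters listed in SET1 when tr is not translating, and otherwise squeezes characters of SET2 after the translation. A small sketch of both cases, using made-up input strings:

```shell
# No translation: repeated characters listed in SET1 are squeezed,
# the two spaces are not in SET1 and stay as they are.
printf 'aaa  bbb\n' | tr -s 'a-z'          # prints: a  b

# With -c and a SET2: non-letters are first translated to '$',
# then each run of '$' is squeezed to one.
printf 'ab12  cd\n' | tr -cs 'a-z\n' '$'   # prints: ab$cd
```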


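tr -cs "[a-z][A-Z]" "\n" therefore replaces every character that is not a letter with a newline and squeezes each run of newlines into one, leaving one word per line. Because the brackets are literal characters inside a tr set, the set A-Za-z expresses the same intent without also matching [ and ].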
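The two spellings of SET2 that appear in this article, "\n" and "[\n*]", should behave the same with GNU tr: per the help text, a short SET2 is extended by repeating its last character, and [CHAR*] means copies of CHAR up to the length of SET1. A quick check, using a made-up sentence as input:

```shell
text='Look at those ants. Those are ants!'

# SET2 as a single newline; tr repeats it to match the length of SET1.
a=$(printf '%s\n' "$text" | tr -cs 'A-Za-z' '\n')

# SET2 as "[\n*]": copies of newline until the length of SET1.
b=$(printf '%s\n' "$text" | tr -cs 'A-Za-z' '[\n*]')

# Show the split (one word per line) and compare the two results.
printf '%s\n' "$a"
[ "$a" = "$b" ] && echo "both forms split the text identically"
```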

 
