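An interesting example: counting English word frequencies

Count how many times each word appears in an English document or novel. The code below uses the English edition of the novel "Les Miserables" as its example. Two points deserve attention: how to split a line into words with a regular expression, and the fact that the elements of a HashMap cannot be sorted by value directly, so the entries are first converted into a list of sortable objects: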


import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.regex.Pattern;

public class EnglishWordsStatics {
    public static final String EN_FOLDER_FILE = "C:/resources/Books/English/Les Miserables.txt";
    public static final String OUTPUT = "C:/resources/Books/English/Les Miserables - Words.txt";

    // word -> number of occurrences
    private HashMap<String, Integer> result = new HashMap<String, Integer>();
    // running count of all words seen
    private int total = 0;

    /**
     * Count the words of one English novel.
     *
     * @param file the text file to read
     * @throws IOException if the file cannot be read
     */
    public void handleOneFile(File file) throws IOException {
        if (file == null)
            throw new NullPointerException("file must not be null");

        // Delimiters: space , ? ; . ! " ' | digits : ` - ( ) [ ]
        Pattern pattern = Pattern
                .compile("[ ,?;.!\"'|[0-9]:`\\-\\(\\)\\[\\]]+");

        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.toLowerCase();
                String[] words = pattern.split(line);

                for (String word : words) {
                    // split() yields an empty first element when a line starts with a delimiter
                    if (word.length() > 0) {
                        total++;
                        if (!result.containsKey(word)) {
                            result.put(word, 1);
                        } else {
                            result.put(word, result.get(word) + 1);
                        }
                    }
                }
            }
        }
        System.out.println("Total words: " + total);
        System.out.println("Total different words: " + result.size());
    }

    /**
     * Sort the counts in descending order and write them to OUTPUT.
     *
     * @throws IOException if the output file cannot be written
     */
    public void saveResult() throws IOException {
        // A HashMap cannot be sorted by value, so copy its entries into a list of Nodes
        List<Node> list = new ArrayList<Node>();
        for (String word : result.keySet()) {
            list.add(new Node(word, result.get(word)));
        }

        Collections.sort(list);

        try (FileWriter fw = new FileWriter(new File(OUTPUT))) {
            for (Node p : list) {
                fw.write(p.getWord() + "\t" + p.getNum() + "\n");
            }
        }
        System.out.println("Done");
    }

    /**
     * @param args unused
     */
    public static void main(String[] args) throws IOException {
        EnglishWordsStatics ews = new EnglishWordsStatics();
        ews.handleOneFile(new File(EN_FOLDER_FILE));
        ews.saveResult();
    }
}

/**
 * A (word, count) pair used for sorting; orders by count, highest first.
 */
class Node implements Comparable<Node> {
    private String word;
    private int num;

    public Node() {
    }

    public Node(String word, int num) {
        this.word = word;
        this.num = num;
    }

    public String getWord() {
        return word;
    }

    public int getNum() {
        return num;
    }

    @Override
    public int compareTo(Node o) {
        // Descending by count; Integer.compare avoids the overflow risk of subtraction
        return Integer.compare(o.getNum(), num);
    }
}
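
The conversion step above wraps every map entry in a Node so that the list can be sorted. As a possible alternative (a minimal sketch, not the method used in the listing), the map's own entries can be copied into a list and sorted with a comparator on the value. The class name SortByValueSketch and the method sortByValueDesc are hypothetical, and the alphabetical tie-break is an addition; the original listing leaves the order of equal counts to the HashMap's iteration order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SortByValueSketch {

    // Returns the entries of counts ordered by value, highest first.
    // Ties are broken alphabetically by key (an assumption for determinism).
    static List<Map.Entry<String, Integer>> sortByValueDesc(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 3);
        counts.put("of", 2);
        counts.put("and", 2);

        // Prints: the 3, and 2, of 2 (one tab-separated pair per line)
        for (Map.Entry<String, Integer> e : sortByValueDesc(counts)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

This avoids the helper class at the cost of a slightly longer comparator; the Node approach keeps the ordering rule inside the type being sorted.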


The console output is as follows:
Total words: 607563
Total different words: 22882
Done


Part of the output file:

the 43538
of 21107
and 15865
a 15365
to 14663
in 11813
he 10280
was 9251
that 8413
it 7026
his 6813
had 6564
is 6504
which 5506
with 4737
on 4714
at 4292
this 4208
not 3981
i 3910
you 3768
one 3500
as 3447
for 3129
him 3118
have 2919
there 2869
her 2767
who 2676
all 2606
she 2605
by 2604
from 2568
be 2484
are 2258
an 2249
they 2236
but 2187
s 2141
man 2107
no 2058
were 1962
what 1932
said 1879
been 1601
marius 1471
when 1429
we 1407
their 1323
two 1284
jean 1275
so 1262
will 1258
me 1207
my 1206
more 1198
himself 1155
valjean 1154
them 1126
has 1122
would 1114
these 1097
then 1097
into 1058
like 1055
out 1047
did 1046
little 1034
cosette 1033
m 1005
very 976
its 969
up 965
or 955
do 952
other 940
old 939
than 930
day 869
only 837
some 830
good 830
made 823
time 795
nothing 794
those 779
your 765
if 752
without 739
could 727
de 725
rue 720
first 681
about 678
well 665
where 663
father 638
men 638
say 635
here 631
now 608
should 592
moment 591
over 585
come 582
hand 576
see 573
through 571
any 570
eyes 566
am 561
know 560
even 559
same 551
us 549
after 549
still 546
thenardier 544
great 543
just 538
thought 534
must 533
before 530
once 514
under 511
upon 509
door 508
three 499
being 493
people 491
child 490
how 489
book 487
house 487
head 482
let 480
sort 478
again 474
young 474
go 473
every 472
night 471
each 471
longer 469
javert 465
light 463
right 460
name 458
paris 458
woman 455
can 454
such 446
way 445
place 444
long 443
life 443
back 438
went 431
saint 430
seemed 424
never 421
called 420
four 417
took 416
take 400
seen 397
t 395
years 389
something 389
chapter 388
air 384
left 382
love 381
whom 381
make 380
monsieur 377
though 377
god 373
point 371
mother 368
whole 367
might 367
most 367
between 366
may 363
shall 361
does 358
voice 352
street 352
last 352
almost 351
much 350
down 348
our 348
turned 346
own 342
thing 341
having 340
towards 338
passed 336
face 336
everything 334
always 329
poor 329
soul 327
against 327
order 327
felt 322
off 320
hundred 320
bishop 320
side 318
replied 315
la 314
things 313
certain 312
word 312
away 312
gavroche 312
wall 311
another 308
behind 307
because 307
few 306
hour 306
going 306
room 306
barricade 302
taken 299
five 299
francs 297
too 297
fact 297
black 296
saw 296
fauchelevent 296
put 294
while 291
heard 291
came 290
found 290
heart 284
end 282
enjolras 282
entered 282
madeleine 281
near 281
why 280
themselves 269
madame 268
bed 267
dead 265
sometimes 265
words 265
yes 261
white 260
ah 259
evening 259
girl 253
death 252
six 252
garden 252
le 251
mind 250
itself 249
since 248
thus 248
morning 247
began 246
remained 246
open 245
also 245
gillenormand 244
nor 241
beneath 240
many 240
children 239
half 237
second 237
think 236
table 236
opened 235
set 235
don 233
get 232
terrible 231
hands 231
full 231
done 228
herself 228
large 228
become 228
world 225
anything 224
feet 224
both 223
human 223
person 222
water 219
arms 217
work 217
alone 217
sewer 214
fantine 213
far 211
whose 210
fell 210
idea 210
courfeyrac 209
o 208
police 207
twenty 204
days 204
matter 199
give 199
above 199
already 198
added 198
returned 196
window 194
exclaimed 193
thousand 193
possible 191
corner 190
france 190
earth 190
later 188
however 188
held 187
d 186
knew 186
front 186
louis 186
age 185
less 185
round 183
case 183
speak 182
sir 181
fire 181
tell 180
among 180
yet 180
clock 179
true 179
cold 178
revolution 178
grave 177
lost 176
saying 176
des 174
resumed 173
glance 173
women 173
l 172
part 172
silence 171
look 170
became 170
jondrette 170
rather 169
arm 169
manner 168
new 168
stood 167
sister 167
nevertheless 166
pass 166
iron 165
stone 165
low 164
appeared 164
caught 163
reached 162
oh 162
perhaps 162
raised 162
hair 162
convent 161
read 161
war 160
grand 160
du 159
society 159
beheld 158
fall 158
placed 157
wine 156
shadow 154
happy 154
forth 154
form 154
within 153
making 152
small 152
ground 152
turn 151
state 151
hours 151
nature 151
following 151
grandfather 151
darkness 151
coat 150
joy 149
chamber 149
presence 148
suddenly 148
find 148
myself 147
road 147
letter 147
shop 147
live 147
eye 146
fine 146
foot 146
law 145
paper 145
sight 145
napoleon 145
close 144
smile 144
closed 144
times 143
trees 143
moreover 142
th 142
walls 142
reader 142
seized 142
neither 141
gave 141
quarter 141
history 140
short 140
ancient 139
battle 139
asked 139
beginning 139
king 139
course 138
red 138
present 138
better 138
told 138
third 138
want 137
brought 137
question 137
ever 137
streets 137
piece 137
others 136
rose 136
lay 136
during 136
continued 136
given 136
looked 136
along 136
century 135
knows 134
sound 134
pocket 134
taking 134
rest 134
force 134
enter 134
money 133
direction 133
understand 132
waterloo 132
formed 132
call 131
perceived 131
necessary 130
able 129
strange 129
around 129
melancholy 129
english 129
return 128
sun 128
thou 128
year 128
seated 128
public 127
daughter 127
single 126
mysterious 126
bottom 126
filled 126
gazed 125
floor 125
dark 125
boulevard 124
ten 124
beside 124
cried 124
bourgeois 123
whether 123
die 123
cast 123
visible 123
past 122
seven 122
convict 122
country 121
impossible 121
mayor 120
cut 120
guard 120
hardly 120
appearance 120
shadows 119
laid 119
charming 119
hole 118
means 118
town 118
probably 118
gloomy 117
drew 117
broken 117
pontmercy 117
disappeared 117
blood 117
profound 117
french 117
galleys 116
morrow 116
mademoiselle 116
nearly 116
epoch 116
makes 116
except 116
doubt 115
happiness 115
sous 115
received 115
often 115
together 114
general 114
living 114
mabeuf 114
post 114
least 114
followed 113
comes 113
cannot 113
outside 113
bad 113
says 113
stones 112
leblanc 112
eight 112
paid 112
arrived 112
beautiful 112
houses 112
movement 112
lived 112
re 111
cross 111
known 111
truth 111
depths 111
step 110
hear 110
carriage 110
flowers 109
immense 109
gone 109
lighted 108
progress 108
ideas 108
bread 107
evil 107
girls 107
mouth 107
brother 106
steps 106
sword 106
quite 106
social 106
escape 106
hideous 106
liberty 105
recognized 105
carried 105
army 105
caused 105
certainly 105
pretty 104
hold 104
mingled 104
attention 104
spot 104
effect 103
combeferre 103
thirty 103
slang 103
fallen 103
coming 103
future 103
ago 103
wish 102
pay 102
heaven 102
need 102
shot 102
really 102
family 102
struck 101
passing 101
until 101
below 101
midst 101
months 101
horse 101
city 99
wife 99
conscience 99
loved 99
friends 98
line 98
teeth 98
yourself 97
duty 97
soon 97
breath 97
chair 97
served 97
bent 96
enough 96
sign 96
justice 96
unknown 96
grantaire 96
body 95
seems 95
distance 95
frightful 94
thoughts 94
remain 94
candle 94
high 94
sleep 94
although 93
hat 93
produced 93
covered 93
singular 93
fellow 93
forty 93
wind 92
eponine 92
moments 92
instant 92
simple 92
fifteen 92
further 92
fear 92
secret 92
peace 92
understood 91
insurrection 91
presented 91
bit 91
slowly 91
ll 90
walked 90
occasion 90
formidable 90
doctor 90
gaze 90
square 90
top 90
becomes 90
porter 90
allowed 90
brow 90
glass 89
rendered 89
sad 89
blind 89
husband 89
souls 89
montparnasse 89
horrible 89
windows 89
according 88
monseigneur 88
son 88
pale 88
leave 88
halted 87
enormous 87
succeeded 87
minutes 87
stranger 87
dressed 86
vague 86
ran 86
either 86
power 86
serious 86
uttered 86
tone 86
laugh 86
none 86
forest 86
obliged 86
blue 85
spring 85
sombre 85
use 85
heads 85
touched 84
existed 84
home 84
pavement 84
view 84
despair 84
petit 84
forms 84
prison 84
reply 84
june 83
sky 83
sur 83
doing 83
knees 83
middle 82
hope 82
fixed 82
colonel 82
watch 82
haste 81
killed 81
care 81
misery 81
cannon 81
noise 81
real 81
names 81
prisoner 80
eat 80
bossuet 80
several 80
letters 80
burst 80
spoke 80
youth 80
big 80
destiny 80
tree 80
crime 79
church 79
gentleman 79
fashion 79
deal 79
address 79
rich 79
lower 79
entering 79
asleep 79
vast 78
hence 78
perfectly 78
honest 78
composed 78
standing 78
concealed 78
master 78
resembled 78
service 78
whence 78
sure 77
motionless 77
gun 77
stars 77
number 77
winter 77
civilization 77
terror 77
amid 77
besides 77
magloire 77
chimney 77
honor 77
thrust 77
forced 77
thinking 77
walk 76
baron 76
et 76
chance 76
reason 76
deserted 76
gloom 76
begun 76
school 76
paces 76
neck 76
emperor 76
affair 76
seeing 76
rain 76
ideal 76
speaking 75
latter 75
traversed 75
inn 75
everywhere 75
persons 75
cry 75
lofty 75
beyond 75
march 75
wild 75
feel 75
respect 75
montfermeil 75
got 74
paused 74
holy 74
subject 74
beings 74
else 74
court 74
dawn 74
fault 74
whatever 74
priest 74
aside 74
mass 74
turning 73
peculiar 73
wrong 73
creature 73
rope 73
worthy 73
tholomyes 73
wore 73
shouted 73
race 73
drawing 72
space 72
opening 72
fifty 72
horses 72
sou 72
divine 72
gate 72
shoes 72
double 72
wounded 72
breast 72
spirit 72
free 72
recognize 72
waiting 72
walking 71
change 71
thanks 71
written 71
lines 71
soldier 71
box 71
coffin 71
stared 71
pronounced 70
play 70
account 70
listened 70
bench 70
gentle 70
passage 70
silver 70
evidently 70
memory 70
situation 70
addressed 70
dream 70
kept 70
named 70
possessed 69
key 69
erect 69
behold 69
pity 69
green 69
building 69
cap 69
fresh 69
sainte 68
run 68
bare 68
departure 68
preceding 68
cart 68
mean 68
tried 68
narrow 68
picpus 68
keep 68
soldiers 68
ill 67
obscure 67
angle 67
cloud 67
wellington 67
talking 67
ended 67
finished 67
approached 67
condemned 67
month 67
existence 67
virtue 66
story 66
distant 66
habit 66
quitted 66
attack 66
object 66
wood 66
complete 66
immediately 66
shut 66
sent 66
absolute 65
lightning 65
supreme 65
etc 65
sweet 65
dog 65
dropped 65
noticed 65
revery 65
calm 65
listen 65
believe 65
entrance 65
wrath 65
heavy 65
bore 64
obscurity 64
crowd 64
abyss 64
finally 64
ask 64
rags 64
shoulders 63
pure 63
flight 63
takes 63
goes 63
thither 63
happened 63
died 63
doors 63
emerged 63
advanced 63
twilight 63
fatal 63
gamin 63
deep 63
effort 63
horror 63
stupid 63
committed 63
demanded 63
prioress 63
possession 63
plumet 63
advance 62
sense 62
fifth 62
blow 62
instinct 62
bring 62
best 62
roof 62
daylight 62
revolt 62
purpose 62
merely 62
questions 62
linen 62
aunt 62
conscious 62
tomb 62
gold 62
note 62
attitude 62
encountered 62
field 61
descended 61
england 61
ourselves 61
talk 61
flung 61
suffering 61
action 61
faubourg 61
rise 61
yellow 61
absolutely 61
lies 61
merry 61
required 61
illuminated 61
cure 61
seem 61
self 61
exists 61
repeated 60
ear 60
across 60
mentioned 60
hall 60
falling 60
occupied 60
infinite 60
straw 60
smoke 60
straight 60
branch 60
philosophy 60
cause 60
observed 60
lips 60
pistol 59
holding 59
horizon 59
knowing 59
violent 59
former 59
maire 59
indescribable 59
hung 59
bridge 58
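
Short entries such as s, t, ll, don and re in the list above are most likely fragments of contractions and possessives: the delimiter class in the regular expression includes the apostrophe, so a word like "don't" is split in two. The sketch below reuses the same pattern to show this on one sentence; the class name SplitSketch, the method tokens and the sample sentence are hypothetical and are not taken from the novel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SplitSketch {

    // Same delimiter class as in the listing: space , ? ; . ! " ' | digits : ` - ( ) [ ]
    static final Pattern DELIMITERS =
            Pattern.compile("[ ,?;.!\"'|[0-9]:`\\-\\(\\)\\[\\]]+");

    // Lowercases the line, splits it on the delimiters and drops empty tokens.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        for (String word : DELIMITERS.split(line.toLowerCase())) {
            if (!word.isEmpty()) {
                out.add(word);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints: [don, t, go, said, marius, he, ll, wait]
        System.out.println(tokens("\"Don't go,\" said Marius; he'll wait."));
    }
}
```

If contractions should be counted as single words, one option would be to remove the apostrophe from the delimiter class, though that would change the totals reported above.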