使用集合BSTSet读取txt文本方式
注意
此部分文件名要写绝对路径
ArrayList<String> words1 = new ArrayList<>();
if(FileOperation.readFile("C:\\IdeaProjects\\Interview\\src\\com\\Set\\a-tale-of-two-cities.txt", words1)) {
System.out.println("Total words: " + words1.size());
BSTSet<String> set1 = new BSTSet<>();
for (String word : words1)
set1.add(word);
System.out.println("Total different words: " + set1.getSize());
}
FileOperation类
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Locale;
import java.util.Scanner;
// 文件相关操作
public class FileOperation {
// 读取文件名称为filename中的内容,并将其中包含的所有词语放进words中
public static boolean readFile(String filename, ArrayList<String> words){
if (filename == null || words == null){
System.out.println("filename is null or words is null");
return false;
}
// 文件读取
Scanner scanner;
try {
File file = new File(filename);
if(file.exists()){
FileInputStream fis = new FileInputStream(file);
scanner = new Scanner(new BufferedInputStream(fis), "UTF-8");
scanner.useLocale(Locale.ENGLISH);
}
else
return false;
}
catch(IOException ioe){
System.out.println("Cannot open " + filename);
return false;
}
// 简单分词
// 这个分词方式相对简陋, 没有考虑很多文本处理中的特殊问题
// 在这里只做demo展示用
if (scanner.hasNextLine()) {
String contents = scanner.useDelimiter("\\A").next();
int start = firstCharacterIndex(contents, 0);
for (int i = start + 1; i <= contents.length(); )
if (i == contents.length() || !Character.isLetter(contents.charAt(i))) {
String word = contents.substring(start, i).toLowerCase();
words.add(word);
start = firstCharacterIndex(contents, i);
i = start + 1;
} else
i++;
}
return true;
}
// 寻找字符串s中,从start的位置开始的第一个字母字符的位置
private static int firstCharacterIndex(String s, int start){
for( int i = start ; i < s.length() ; i ++ )
if( Character.isLetter(s.charAt(i)) )
return i;
return s.length();
}
}
BSTSet类
package com.Set;
import java.util.ArrayList;
public class BSTSet<E extends Comparable<E>> implements Set<E>{
private BST<E> bst ;//forget key word bst
public BSTSet(){
bst= new BST<>();
}
@Override
public void add(E e) {
bst.add(e);
}
@Override
public void remove(E e) {
bst.remove(e);
}
@Override
public boolean contains(E e) {
return bst.contains(e);
}
@Override
public int getSize() {
return bst.size();
}
@Override
public boolean isEmpty() {
return bst.isEmpty();
}
}
Main函数
public static void main(String[] args) {
System.out.println("Pride and Prejudice");
ArrayList<String> words1 = new ArrayList<>();
if(FileOperation.readFile("C:\\IdeaProjects\\Interview\\src\\com\\Set\\pride-and-prejudice.txt", words1)) {
System.out.println("Total words: " + words1.size());
BSTSet<String> set1 = new BSTSet<>();
for (String word : words1)
set1.add(word);
System.out.println("Total different words: " + set1.getSize());
}
System.out.println();
System.out.println("A Tale of Two Cities");
ArrayList<String> words2 = new ArrayList<>();
if(FileOperation.readFile("C:\\IdeaProjects\\Interview\\src\\com\\Set\\a-tale-of-two-cities.txt", words2)){
System.out.println("Total words: " + words2.size());
BSTSet<String> set2 = new BSTSet<>();
for(String word: words2)
set2.add(word);
System.out.println("Total different words: " + set2.getSize());
}
}
该程序读取两部文学作品——《傲慢与偏见》和《双城记》的txt文本,通过BSTSet数据结构去除重复单词,分别统计并输出各自文本中的总词汇数和不同词汇数。这展示了文件操作和数据结构在文本分析中的应用。
4965

被折叠的 条评论
为什么被折叠?



