(4) Text Mining (Part 1): Utility Classes for Reading and Writing Text and for Map Operations

License: CC 4.0 BY-SA

Tags: #java #text-mining #text-reading-and-writing #sort-Map-by-value

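Text mining is the process of analyzing semantically rich text in order to understand its content and meaning. It covers word segmentation, text representation, text feature selection, text classification, text clustering, automatic document summarization, and related tasks. The overall workflow is shown in the figure below:
[Figure: text mining workflow]
My project uses the Fudan University Chinese corpus and the Reuters English corpus as datasets. Both are labeled text collections stored in a two-level directory structure.
Whatever you plan to do, the first step is always to read the text, so I wrote a few utility classes to make the later steps easier.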

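1. Text Information Class: Text

This class stores a text's file path, its original category ID, the category ID assigned to it by classification or clustering, its word vector, and its length, so that any of this information can be set or retrieved conveniently later on.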


package util;

import java.util.Map;

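/**
 * Text information class holding a text's file path, category labels, word vector, etc.
 * @author Angela
 */
public class Text {

    /** Path of the text file */
    private String path;
    /** Original category ID of the text */
    private int originLabelID;
    /** Category ID assigned by classification or clustering */
    private int judegeLabelID;
    /** Term-to-weight map of the text */
    private Map<String,Double> words;
    /** Length of the text */
    private double length;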

    /**
     * @return the path
     */
    public String getPath() {
        return path;
    }

    /**
     * @param path the path to set
     */
    public void setPath(String path) {
        this.path = path;
    }

    /**
     * @return the words
     */
    public Map<String,Double> getWords() {
        return words;
    }

    /**
     * @param words the words to set
     */
    public void setWords(Map<String,Double> words) {
        this.words = words;
    }

    /**
     * @return the length
     */
    public double getLength() {
        return length;
    }

    /**
     * @param length the length to set
     */
    public void setLength(double length) {
        this.length = length;
    }

    /**
     * @return the originLabelID
     */
    public int getOriginLabelID() {
        return originLabelID;
    }

    /**
     * @param originLabelID the originLabelID to set
     */
    public void setOriginLabelID(int originLabelID) {
        this.originLabelID = originLabelID;
    }

    /**
     * @return the judegeLabelID
     */
    public int getJudegeLabelID() {
        return judegeLabelID;
    }

    /**
     * @param judegeLabelID the judegeLabelID to set
     */
    public void setJudegeLabelID(int judegeLabelID) {
        this.judegeLabelID = judegeLabelID;
    }

}
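
To show how these fields fit together, the sketch below compares `originLabelID` with `judegeLabelID` to compute a simple accuracy figure after classification. It is only an illustration under stated assumptions: the nested `Text` is a trimmed stand-in for the class above so that the example compiles on its own, and `newText`, `accuracy`, and the file paths are hypothetical names that do not come from the original project.

```java
import java.util.ArrayList;
import java.util.List;

public class TextUsageSketch {

    /** Trimmed stand-in for util.Text (only the fields this sketch needs). */
    public static class Text {
        private String path;
        private int originLabelID;
        private int judegeLabelID;

        public String getPath() { return path; }
        public void setPath(String path) { this.path = path; }
        public int getOriginLabelID() { return originLabelID; }
        public void setOriginLabelID(int id) { this.originLabelID = id; }
        public int getJudegeLabelID() { return judegeLabelID; }
        public void setJudegeLabelID(int id) { this.judegeLabelID = id; }
    }

    /** Hypothetical helper: builds a Text with its original and assigned labels. */
    public static Text newText(String path, int originLabelID, int judegeLabelID) {
        Text t = new Text();
        t.setPath(path);
        t.setOriginLabelID(originLabelID);
        t.setJudegeLabelID(judegeLabelID);
        return t;
    }

    /** Hypothetical helper: fraction of texts whose assigned label equals the original one. */
    public static double accuracy(List<Text> texts) {
        if (texts.isEmpty()) {
            return 0.0; // avoid dividing by zero on an empty collection
        }
        int correct = 0;
        for (Text t : texts) {
            if (t.getOriginLabelID() == t.getJudegeLabelID()) {
                correct++;
            }
        }
        return (double) correct / texts.size();
    }

    public static void main(String[] args) {
        List<Text> texts = new ArrayList<>();
        texts.add(newText("corpus/sports/001.txt", 0, 0));  // classified correctly
        texts.add(newText("corpus/sports/002.txt", 0, 1));  // misclassified
        texts.add(newText("corpus/economy/001.txt", 1, 1)); // classified correctly
        System.out.println("accuracy = " + accuracy(texts));
    }
}
```

A direct comparison like this only makes sense for classification. With clustering, the assigned cluster IDs would first have to be mapped to categories before they can be compared with the original labels.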

2. Map Operations Class: MapUtil

Many places in the project need to sort, print, or truncate a Map, so these operations are factored out into a class of their own.
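
Before the class itself, here is a rough sketch of how sorting and truncation can combine, for example to keep only the highest-weighted terms of a text. It assumes Java 8 streams and is not the author's implementation: the class name `MapTopNSketch`, the method `topN`, and the sample weights are hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class MapTopNSketch {

    /**
     * Returns the n entries with the largest values, in descending order of value.
     * A LinkedHashMap is used so that the sorted order survives in the result.
     */
    public static <K, V extends Comparable<? super V>> Map<K, V> topN(Map<K, V> map, int n) {
        Map<K, V> result = new LinkedHashMap<>();
        map.entrySet().stream()
                .sorted(Map.Entry.<K, V>comparingByValue().reversed()) // largest value first
                .limit(Math.max(0, n))                                 // treat negative n as 0
                .forEachOrdered(e -> result.put(e.getKey(), e.getValue()));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new HashMap<>();
        weights.put("market", 0.42);
        weights.put("the", 0.01);
        weights.put("stock", 0.37);
        System.out.println(topN(weights, 2)); // prints {market=0.42, stock=0.37}
    }
}
```

The order of entries with equal values is not guaranteed here, because a `HashMap` has no defined iteration order to begin with.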


package util;

import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.Set;

/**
 * Map utility class: sorting, printing, and truncation
 * @author Angela
 */
public class MapUtil {

    /** Sorts a Map by value in ascending order **/
    public static <K, V extends Comparable<? super V>>
            Map<K, V> asc( Map<K, V> map){
        // Copy map.entrySet() into a list
        LinkedList<Map.Entry<K, V>> list =
                new LinkedList<Map.Entry<K, V>>( map.entrySet() );
        // Sort the list with a comparator
        Collections.sort( list, new Comparator<Map.Entry<K, V>>() {
            // Ascending order by value
            @Override
            public int compare( Map.Entry<K, V> o1, Map.Entry<K, V> o2 ){
                return (o1.getValue()).compareTo( o2.getValue() );
            }
        } );
        // LinkedHashMap preserves the sorted insertion order
        Map<K, V> result = new LinkedHashMap<K, V>();
        for (M