Hefei University of Technology Data Mining Lab: Classification Task

Original article · First published 2022-12-26 13:35:48 · Last modified 2024-12-20 14:09:39 · CC 4.0 BY-SA license

Tags: #data mining

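This article describes a Chinese text classification experiment. It focuses on preprocessing with jieba word segmentation and stop-word removal, and on classification with random forest and decision tree models. It also examines how model parameters affect classification performance.

(The full source code is at the end of the article.)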

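1. Experiment requirements

1.1 Objectives
1) Understand the classification task.
2) Test students' understanding of the data preprocessing steps and reinforce why preprocessing matters.
3) Base models may call existing packages, so that students become familiar with the basic data mining workflow.
4) Learn to evaluate a model along multiple dimensions and to discuss the effect of its parameters.

1.2 Dataset
1) The news text classification dataset is in Chinese and needs some preprocessing, including word segmentation and stop-word removal; the image dataset may be processed as appropriate.
2) Other problems in the data may be handled at your own discretion.
Data note: split the data into train and test sets yourself, typically at a 7:3 ratio.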
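The 7:3 split can be done per class so that every category keeps roughly the same proportion in both sets. The following is only a minimal sketch under that assumption; `stratified_split`, the seed, and the toy data are hypothetical names and values, not part of the original experiment.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, train_ratio=0.7, seed=0):
    """Split (sample, label) pairs so each class keeps roughly train_ratio in train."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        cut = int(round(len(items) * train_ratio))
        train.extend((item, label) for item in items[:cut])
        test.extend((item, label) for item in items[cut:])
    return train, test


# Hypothetical toy data: 10 "sports" and 10 "finance" documents.
docs = [f"doc{i}" for i in range(20)]
tags = ["sports"] * 10 + ["finance"] * 10
train_set, test_set = stratified_split(docs, tags)
print(len(train_set), len(test_set))  # 14 6
```

With sklearn installed, `train_test_split(..., test_size=0.3, stratify=labels)` from `sklearn.model_selection` gives a comparable split.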

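1.3 Environment
Development environment: Python 3.7 (jieba, pandas, numpy, sklearn, matplotlib.pyplot)

1.4 Method requirements
1) Include preprocessing steps tailored to the data, such as stop-word removal and dimensionality reduction.
2) In principle the model is unrestricted: decision tree, NB, NN, SVM, and random forest are all acceptable, and other methods may be used as well.
3) Text may be represented with BOW, topic models, word vectors, and so on; the image dataset may use feature representations such as LBP, HOG, and SURF.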
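As one alternative to raw counts, word frequencies can be reweighted with TF-IDF so that words appearing in almost every document count for less. The sketch below uses the textbook formula count × log(N / df); library implementations such as sklearn's `TfidfVectorizer` apply smoothing and normalization, so their numbers will differ. The function name and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocab):
    """Textbook TF-IDF: term count * log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears at least once.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            counts[word] * math.log(n_docs / df[word]) if df[word] else 0.0
            for word in vocab
        ])
    return vectors


# Hypothetical, already-segmented documents.
docs = [["股票", "上涨", "股票"], ["比赛", "上涨"], ["比赛", "球队"]]
vocab = ["股票", "上涨", "比赛", "球队"]
for vec in tfidf_vectors(docs, vocab):
    print([round(v, 3) for v in vec])
```

In the first document, "股票" occurs twice and appears in only one of three documents, so it gets the largest weight, while "上涨" appears in two documents and is weighted down.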

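1.5 Result requirements
1) Implement one or more basic classification models and compute evaluation metrics such as accuracy and recall.
2) For the key parameters of a model (for example, the stop-splitting condition of a decision tree or the number of layers in an NN), try values over different ranges and discuss the best range.
3) Compare and analyze how different feature representations affect the results.
4) If two or more models are applied to the same data, compare their results to assess how well each model suits the classification task on this dataset.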
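To make the two metrics concrete, here is a small hand-written sketch of accuracy and macro-averaged recall. It is meant only to illustrate the definitions; the experiment itself uses sklearn's `accuracy_score` and `recall_score`. The functions and labels below are hypothetical, and this version averages only over classes that occur in `y_true`.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall, over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(t == cls for t in y_true)
        hits = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))      # 4 of 6 predictions are correct
print(macro_recall(y_true, y_pred))  # mean of 3/4 and 1/2 = 0.625
```

Macro averaging treats every class equally, so a small class that is classified poorly pulls the score down more than it would pull down accuracy.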

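2. Experiment

2.1 Data preprocessing
Filter out tokens that appear in the stop-word list:
import jieba

def pre_treating(para):
    words = jieba.cut(str(para))  # word segmentation
    words = [word for word in words if len(word) > 1]  # drop single-character tokens
    words = [word for word in words if word not in stopWords]  # drop stop words
    return words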
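The effect of the two filters is easier to see on a concrete token list. In this sketch the sentence is segmented by hand as a stand-in for the output of `jieba.cut`, and the stop-word set is a hypothetical three-word example rather than the real stop-word file.

```python
def filter_tokens(tokens, stop_words):
    """Keep tokens longer than one character that are not stop words."""
    return [tok for tok in tokens if len(tok) > 1 and tok not in stop_words]


# Hypothetical stop words and a hand-segmented sentence (stand-in for jieba.cut output).
stop_words = {"我们", "已经", "一个"}
tokens = ["我们", "已经", "完成", "了", "一个", "文本", "分类", "实验"]
print(filter_tokens(tokens, stop_words))  # ['完成', '文本', '分类', '实验']
```

Keeping the stop words in a `set` rather than a list makes each membership test constant time, which matters when the filter runs over every token of every document.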

Apply the filter to both the training and test sets so that only potentially useful tokens remain:
for name in classname:
    data[name]['words'] = data[name]['content'].apply(pre_treating)
    test[name]['words'] = test[name]['content'].apply(pre_treating)  # segment the test text itself

Add the class label as a column:
for i, name in enumerate(classname):
    data[name]['flag'] = i
    test[name]['flag'] = i

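2.2 Word frequency statistics
Concatenate the per-class tables:
import pandas as pd

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
result = pd.concat([data[name] for name in classname])
testdata = pd.concat([test[name] for name in classname])  # built from test, not data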

Collect the 800 most frequent words:
topWordNum = 800
items = result['words'].values.tolist()
words = []
for item in items:
    words.extend(item)  # flatten all token lists into one list
wordCount = pd.Series(words).value_counts()[0:topWordNum]
wordCount = wordCount.index.values.tolist()  # keep only the words, ordered by frequency
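The same vocabulary can be built without pandas. The sketch below uses `collections.Counter` from the standard library; `top_words` and the sample documents are hypothetical and only illustrate the idea on a tiny input.

```python
from collections import Counter


def top_words(token_lists, k):
    """Return the k most frequent tokens across all documents, most frequent first."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return [word for word, _ in counts.most_common(k)]


# Hypothetical segmented documents.
token_lists = [["股票", "上涨", "股票"], ["比赛", "股票"], ["比赛", "球队"]]
print(top_words(token_lists, 2))  # ['股票', '比赛']
```

Note that the vocabulary here, as in the experiment, should be built from the training set only, so that no information from the test set leaks into the features.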

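2.3 Converting words to vectors
Convert each word list into a vector whose entries are the number of times each high-frequency word occurs:
def wordsToVec(words):
    # one entry per high-frequency word, in the order of wordCount
    vec = map(lambda word: words.count(word), wordCount)
    vec = list(vec)
    return vec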
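Calling `words.count` once per vocabulary word rescans the document 800 times. A possible alternative, sketched here under the assumption that the vocabulary is passed in explicitly, counts the document once and then looks up each word. The name `words_to_vec` and the sample vocabulary are hypothetical.

```python
from collections import Counter


def words_to_vec(words, vocab):
    """Count each vocabulary word with a single pass over the document."""
    counts = Counter(words)  # a word missing from the document yields 0
    return [counts[word] for word in vocab]


vocab = ["股票", "比赛", "球队"]  # hypothetical high-frequency words
print(words_to_vec(["比赛", "球队", "比赛", "天气"], vocab))  # [0, 2, 1]
```

The output is identical to the `count`-based version; only the amount of work per document changes.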

Add the vectors to the combined tables and drop the columns that are no longer needed:
result['vec'] = result['words'].apply(wordsToVec)
result = result.drop(['content', 'channelName'], axis=1)

testdata['vec'] = testdata['words'].apply(wordsToVec)
testdata = testdata.drop(['content', 'channelName'], axis=1)

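2.4 Random forest
Convert the labels and vectors into x and y values:
xTrain = result['vec'].tolist()
yTrain = result['flag'].tolist()
xTest = testdata['vec'].tolist()
yTest = testdata['flag'].tolist()

Random forest:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score

def get_rf_ascore(my_para):
    # my_para is the number of trees (n_estimators)
    clf = RandomForestClassifier(n_estimators=my_para)
    clf.fit(xTrain, yTrain)

    y_pre = clf.predict(xTest)
    y_test = np.array(yTest)
    y_pre = np.array(y_pre)

    score = accuracy_score(y_test, y_pre)  # accuracy on the test set
    return score

def get_rf_rscore(my_para):
    # trains a new forest with the same number of trees
    clf = RandomForestClassifier(n_estimators=my_para)
    clf.fit(xTrain, yTrain)

    y_pre = clf.predict(xTest)
    y_test = np.array(yTest)
    y_pre = np.array(y_pre)

    score = recall_score(y_test, y_pre, average='macro')  # macro-averaged recall
    return score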
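Because the two functions above each train their own forest without a fixed seed, the accuracy and the recall for a given `n_estimators` come from two different random models. One possible variant, sketched below, fits once with a fixed `random_state` and returns both metrics. The function `rf_scores`, the seed, and the synthetic data from `make_classification` are assumptions for illustration; in the experiment the inputs would be `xTrain`, `yTrain`, `xTest`, and `yTest`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split


def rf_scores(x_train, y_train, x_test, y_test, n_estimators, seed=0):
    """Fit one forest and return (accuracy, macro recall) for that same model."""
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    return (accuracy_score(y_test, y_pred),
            recall_score(y_test, y_pred, average="macro"))


# Synthetic stand-in data, not the news dataset used in the article.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)
acc, rec = rf_scores(x_tr, y_tr, x_te, y_te, n_estimators=10)
print(f"accuracy={acc:.3f} macro_recall={rec:.3f}")
```

Fitting once also halves the training time of a parameter sweep, and the fixed seed makes the resulting curves reproducible from run to run.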

2.5 Decision tree
The decision tree is implemented with the sklearn package; the class for classification trees is DecisionTreeClassifier. The code is as follows:
from sklearn import tree

def get_dt_ascore(my_para):
    # my_para is the maximum tree depth (max_depth)
    clf = tree.DecisionTreeClassifier(max_depth=my_para)
    clf.fit(xTrain, yTrain)

    y_pre = clf.predict(xTest)
    y_test = np.array(yTest)
    y_pre = np.array(y_pre)

    score = accuracy_score(y_test, y_pre)  # accuracy on the test set
    return score

def get_dt_rscore(my_para):
    clf = tree.DecisionTreeClassifier(max_depth=my_para)
    clf.fit(xTrain, yTrain)

    y_pre = clf.predict(xTest)
    y_test = np.array(yTest)
    y_pre = np.array(y_pre)

    score = recall_score(y_test, y_pre, average='macro')  # macro-averaged recall
    return score

2.6 Random forest results
import matplotlib.pyplot as plt

rfaScore = []
rfrScore = []
estimator = np.arange(1, 20, 1)  # n_estimators values 1 to 19

for i in estimator:
    temp_ascore = get_rf_ascore(i)
    temp_rscore = get_rf_rscore(i)
    rfaScore.append(temp_ascore)
    rfrScore.append(temp_rscore)

plt.plot(estimator, rfaScore, color='red')    # accuracy
plt.plot(estimator, rfrScore, color='green')  # recall
plt.xlabel('estimator')
plt.ylabel('score')
plt.show()

2.7 Decision tree results
dtaScore = []
dtrScore = []
testDepth = np.arange(1, 100, 1)  # max_depth values 1 to 99

for i in testDepth:
    temp_ascore = get_dt_ascore(i)
    temp_rscore = get_dt_rscore(i)
    dtaScore.append(temp_ascore)
    dtrScore.append(temp_rscore)

plt.plot(testDepth, dtaScore, color='yellow')  # accuracy
plt.plot(testDepth, dtrScore, color='blue')    # recall
plt.ylabel('score')
plt.xlabel('testDepth')
plt.show()

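3. Analysis and conclusions

3.1 Analysis
[Figure: random forest accuracy and recall versus n_estimators]
The figure shows the scores as the number of trees, n_estimators, is varied over the 1 to 20 range (the sweep in the code, np.arange(1, 20, 1), stops at 19). Within this range the scores rise as the number of trees increases: both the accuracy and the recall of the predictions improve. The trend beyond 20 trees would need additional experiments to verify. (Red is accuracy, green is recall.)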
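[Figure: decision tree accuracy and recall versus max_depth]
The figure shows the scores as the maximum tree depth, max_depth, is varied over the 1 to 100 range (the sweep in the code, np.arange(1, 100, 1), stops at 99). Within this range the scores rise as the maximum depth increases, and the accuracy of the predictions improves. The trend for depths greater than 100 would need additional experiments to verify. (Yellow is accuracy, blue is recall.)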
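To turn a sweep like the ones above into a statement about the best parameter range, one simple approach is to report every value whose score lies within a small tolerance of the maximum. The sketch below assumes that approach; `best_range`, the tolerance, and the score list are hypothetical, and the numbers are made-up placeholders rather than results from this experiment.

```python
def best_range(params, scores, tolerance=0.01):
    """Return (best_param, params whose score is within tolerance of the best)."""
    best_score = max(scores)
    best_param = params[scores.index(best_score)]
    near_best = [p for p, s in zip(params, scores) if best_score - s <= tolerance]
    return best_param, near_best


# Made-up placeholder scores for illustration only, not results from this experiment.
depths = [1, 5, 10, 20, 40, 80]
scores = [0.30, 0.55, 0.70, 0.78, 0.785, 0.78]
print(best_range(depths, scores))  # (40, [20, 40, 80])
```

With the real lists, `best_range(list(testDepth), dtaScore)` would give the corresponding range for the decision tree. A plateau of near-best values suggests choosing the smallest one, since a shallower tree is cheaper and less prone to overfitting.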
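3.2 Conclusions
Through this experiment I became familiar with several models in the sklearn package, which are widely used in academic research. By consulting material online, I learned how to preprocess Chinese text, in particular how to filter stop words. I also deepened my understanding of the random forest algorithm. Implementing training and prediction with sklearn gave me a better grasp of both data mining and Python, and the experiment once again showed me how convenient Python is.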

Source code

import pandas as pd
import re
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score

# Read the training and test sets (read_excel takes no encoding argument in current pandas)
data = pd.read_excel("train.xlsx")
test = pd.read_excel("test.xlsx")

# Drop samples that have no label or no title
index = data['channelName'].notnull()
data = data[index]
index = data['title'].notnull()
data = data[index]
index = test['channelName'].notnull()
test = test[index]
#print(news)

# Remove punctuation and whitespace; the hyphen is escaped so that "+-=" is not read as a character range
re_obj = re.compile(r"[~`!@#$%^&*()_+\-=|';:/.,?><·！￥…（）—“”‘’：；、。，？《》{}【】\s]+")
def get_stopword():
    s = set()
    with open('中文停用词表.txt', encoding = 'utf-8') as f:
        for line in f:
            s.add(line.strip())
    return s
stopword = get_stopword()

def remove_stopword(words):
    return [word for word in words if word not in stopword]
def Data_preprocessing(text):
    text = re_obj.sub("", text)
    text = jieba.lcut(text)
    text = remove_stopword(text)
    return " ".join(text)
    
data['title'] = data['title'].apply(Data_preprocessing)