18. NLP-Based Heterogeneous Ensemble Text Classification and Movie Review Sentiment Analysis

root9

License: CC 4.0 BY-SA

Column: Ensemble Machine Learning in Practice. Tags: NLP, text classification, sentiment analysis

Article link: https://blog.youkuaiyun.com/root9/article/details/152550989

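1. Text Classification Model Building Workflow

In the text classification task, we build models with several algorithms and train each of them on both count data and TF-IDF data. The models are built in the following order:
| No. | Model | Data type |
| ---- | ---- | ---- |
| 1 | Naive Bayes | Count data |
| 2 | Naive Bayes | TF-IDF data |
| 3 | Support vector machine (SVM) with RBF kernel | Count data |
| 4 | Support vector machine (SVM) with RBF kernel | TF-IDF data |
| 5 | Random forest | Count data |
| 6 | Random forest | TF-IDF data |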

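The steps are as follows:
1. Data preparation:
   - Import the required libraries and define a function for plotting the confusion matrix.
   - Read the dataset using UTF-8 encoding.
   - Check the proportion of spam and ham (legitimate) emails in the dataset.
   - Use CountVectorizer and TfidfVectorizer to convert the text into count vectors and TF-IDF vectors, respectively.
2. Model training and evaluation:
   - Naive Bayes:
     - Build a Naive Bayes model on the count data, check its metrics with classification_report(), and call plot_confusion_matrix() to draw the confusion matrix.
     - Build a Naive Bayes model on the TF-IDF data and evaluate it in the same way.
   - Support vector machine:
     - Train an SVM with an RBF kernel on the count data, evaluate it with classification_report, and draw the confusion matrix. This step also shows how to use GridSearchCV to find the best parameters.
     - Repeat the steps above on the TF-IDF data.
   - Random forest:
     - Train a random forest on the count data with a grid search, using gini and entropy as values of the criterion hyperparameter together with several values for min_samples_split, max_depth, and min_samples_leaf. Evaluate the model.
     - Train a random forest on the TF-IDF data, use predict_proba() to obtain class probabilities for the test data, and plot the ROC curve annotated with the AUC score so that the models can be compared.
   - Average the probabilities produced by the count-data and TF-IDF models and plot the ROC curve of the ensemble result.
   - Plot the test accuracy of every model built on the count data and on the TF-IDF data.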

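Part of the code is shown below:

import numpy as np
from scipy.stats import mode
from sklearn.metrics import accuracy_score

# Max voting: take the per-sample mode of the stacked predictions as the final prediction
# predicted_array is assumed to have shape (n_models, n_samples)
predicted_array = np.ravel(mode(predicted_array, axis=0).mode)
print(predicted_array)
print("The accuracy for test")
print(accuracy_score(Y_test, predicted_array))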
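
As a small worked illustration of max voting, the sketch below stacks the label predictions of three hypothetical classifiers and takes the per-sample mode. The arrays and the names `preds_a`, `preds_b`, `preds_c`, and `y_true` are made up for this example and do not come from the article's data. `np.ravel` is used so that the result is one-dimensional whichever SciPy version is installed.

```python
import numpy as np
from scipy.stats import mode
from sklearn.metrics import accuracy_score

# Hypothetical label predictions from three classifiers on four test samples
preds_a = [1, 0, 1, 1]
preds_b = [1, 1, 0, 1]
preds_c = [0, 0, 1, 1]
y_true = [1, 0, 0, 1]

# Shape (n_models, n_samples); the vote is taken down each column
stacked = np.array([preds_a, preds_b, preds_c])
ensemble_pred = np.ravel(mode(stacked, axis=0).mode)

print(ensemble_pred)                          # majority label per sample
print(accuracy_score(y_true, ensemble_pred))  # share of samples the vote got right
```

With an odd number of binary classifiers there are no ties, which is one reason three base learners per data type is a convenient choice for hard voting.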

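2. Movie Review Sentiment Analysis Case Study

Sentiment analysis is a widely studied area of natural language processing (NLP). Using movie review data from the Internet Movie Database (IMDb) as an example, we classify each review as positive or negative.

2.1 Preparing the Dataset

We have 1,000 positive reviews and 1,000 negative reviews. Each review is stored in its own .txt file, and the files are split into one folder for positive reviews and one for negative reviews (pos and neg in the paths below). The dataset is prepared as follows: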
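1. Import the required libraries:

import os
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # np and plt are needed for the pie chart in step 9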

2. Set the working folder:

os.chdir("/.../Chapter 11/CS - IMDB Classification")
os.getcwd()

3. Read the positive reviews:

path="/.../Chapter 11/CS - IMDB Classification/txt_sentoken/pos/*.txt"
files = glob.glob(path)
text_pos = []
for p in files:
    # The with block closes each file automatically
    with open(p, "r") as file_read:
        text_pos.append(file_read.read())
# Every row gets the same label
df_pos = pd.DataFrame({'text':text_pos,'label':'positive'})
df_pos.head()

4. Read the negative reviews:

path="/Users/Dippies/CODE PACKT - EML/Chapter 11/CS - IMDB Classification/txt_sentoken/neg/*.txt"
files = glob.glob(path)
text_neg = []
for n in files:
    # The with block closes each file automatically
    with open(n, "r") as file_read:
        text_neg.append(file_read.read())
# Every row gets the same label
df_neg = pd.DataFrame({'text':text_neg,'label':'negative'})
df_neg.head()
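
The two reading steps differ only in the folder and the label, so they could be folded into one helper. The following is a minimal sketch under that assumption: `load_reviews` is a hypothetical name, the UTF-8 encoding is an assumption about the files, and the demo runs on a temporary folder rather than on the IMDb data.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_reviews(folder, label, encoding="utf-8"):
    """Read every .txt file in `folder` into a DataFrame with a constant label."""
    texts = [p.read_text(encoding=encoding) for p in sorted(Path(folder).glob("*.txt"))]
    return pd.DataFrame({"text": texts, "label": [label] * len(texts)})

# Demo on a throwaway folder so the sketch runs without the review files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("a fine film", encoding="utf-8")
    (Path(tmp) / "b.txt").write_text("a dull film", encoding="utf-8")
    df_demo = load_reviews(tmp, "positive")

print(df_demo)
```

Sorting the file names makes the row order reproducible, which `glob.glob` alone does not guarantee.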

5. Merge the DataFrames:

df_moviereviews=pd.concat([df_pos, df_neg])

6. Shuffle the data:

from sklearn.utils import shuffle
df_moviereviews=shuffle(df_moviereviews)
df_moviereviews.head(10)

7. Check the dimensions of the data:

df_moviereviews.shape

8. Save the data to a CSV file:

df_moviereviews.to_csv("/.../Chapter 11/CS - IMDB Classification/Data_IMDB.csv")

9. Check the ratio of positive to negative reviews:

# Count each label once, then derive the shares (label-based [0]/[1] lookups fail on string labels in recent pandas)
label_counts = df_moviereviews["label"].value_counts()
label_share = np.round(label_counts / label_counts.sum(), 2)
label_counts.plot(kind='pie')
plt.tight_layout(pad=1,rect=(0, 0, 0.7, 1))
plt.text(x=-0.9,y=0.1, s=str(label_share.iloc[0]))
plt.text(x=0.4,y=-0.3, s=str(label_share.iloc[1]))
plt.title("% Share of the Positive and Negative reviews in the dataset")

10. Replace the labels with 1 (positive) and 0 (negative):

df_moviereviews.loc[df_moviereviews["label"]=='positive',"label"]=1
df_moviereviews.loc[df_moviereviews["label"]=='negative',"label"]=0

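11. Preprocess the text:

import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def process_text(text):
    # Remove punctuation
    nopunc = ''.join(char for char in text if char not in string.punctuation)
    # Lowercase and drop stop words
    clean_words = [word.lower() for word in nopunc.split() if word.lower() not in stop_words]
    # Lemmatize each remaining word and rejoin into one string
    clean_words = [lemmatizer.lemmatize(lem) for lem in clean_words]
    return " ".join(clean_words)

df_moviereviews['text'] = df_moviereviews['text'].apply(process_text)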
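
To see what this cleaning does to a single sentence without downloading the NLTK corpora, here is a simplified sketch. It swaps in scikit-learn's built-in `ENGLISH_STOP_WORDS` list and skips lemmatization, so it is not the article's `process_text`; the name `process_text_simple` and the sample sentence are made up for illustration.

```python
import string

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def process_text_simple(text):
    """Strip punctuation, lowercase, and drop stop words (no lemmatization)."""
    nopunc = text.translate(str.maketrans("", "", string.punctuation))
    words = [w.lower() for w in nopunc.split() if w.lower() not in ENGLISH_STOP_WORDS]
    return " ".join(words)

sample = "The story was weak, but the actors were brilliant!"
cleaned = process_text_simple(sample)
print(cleaned)
```

The two stop-word lists are not identical, so the NLTK-based function and this sketch can keep or drop slightly different words.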

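The flowchart below summarizes the dataset preparation:

graph LR
    A[Import libraries] --> B[Set working folder]
    B --> C[Read positive reviews]
    B --> D[Read negative reviews]
    C --> E[Merge DataFrames]
    D --> E
    E --> F[Shuffle data]
    F --> G[Check dimensions]
    G --> H[Save to CSV]
    H --> I[Check positive/negative ratio]
    I --> J[Replace labels]
    J --> K[Preprocess text]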

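These steps complete the dataset preparation for the movie review sentiment analysis and lay the groundwork for model training and evaluation. The next section builds base learners on this dataset and evaluates the ensemble result.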

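3. Model Training and Evaluation for Movie Review Sentiment Analysis

With the dataset prepared, we build several base learners on the movie review data and evaluate the ensemble result. The steps are as follows:

3.1 Import the Required Libraries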

import os
import numpy as np
import pandas as pd
import itertools
import warnings
import string
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score as auc
from sklearn.metrics import roc_curve
from sklearn.metrics import accuracy_score
from scipy.stats import mode
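
The model code in section 3.5 calls `plot_confusion_matrix()` and passes `target_names`, neither of which is defined in this article. The following is one possible sketch of such a helper, written here as an assumption rather than the author's original function; the class names in `target_names` assume that label 0 means negative and label 1 means positive, as set during dataset preparation.

```python
import itertools

import matplotlib.pyplot as plt
import numpy as np

# Assumed class names: label 0 = negative, label 1 = positive
target_names = ["negative", "positive"]

def plot_confusion_matrix(cm, classes, normalize=False,
                          title="Confusion matrix", cmap=plt.cm.Blues):
    """Draw a confusion matrix as an annotated heat map on the current axes."""
    cm = np.asarray(cm)
    if normalize:
        # Turn counts into per-row (per true class) fractions
        cm = cm.astype(float) / cm.sum(axis=1, keepdims=True)
    plt.imshow(cm, interpolation="nearest", cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = ".2f" if normalize else "d"
    thresh = cm.max() / 2.0
    # Write each cell value, switching text colour for readability on dark cells
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), ha="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.tight_layout()
    return plt.gca()
```

The function draws on the current figure and leaves `plt.figure()` and `plt.show()` to the caller, which matches how it is called in the model blocks.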

3.2 Separate the Target and Predictor Variables

X = df_moviereviews.loc[:,'text']
Y = df_moviereviews.loc[:,'label']
Y = Y.astype('int')

3.3 Split the Data into Training and Test Sets

X_train,X_test,y_train,y_test = train_test_split(X, Y, test_size=.3, random_state=1)

3.4 Transform the Text with CountVectorizer and TfidfVectorizer

# Convert the text into count vectors with CountVectorizer
count_vectorizer = CountVectorizer()
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)

# Convert the text into TF-IDF vectors with TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf_train = tfidf.fit_transform(X_train)
tfidf_test = tfidf.transform(X_test)
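
To make the difference between the two representations concrete, the sketch below runs both vectorizers on a toy corpus. The documents and the `*_demo` names are invented for this example. It also shows why the test set goes through `transform()` only: the vocabulary is learned from the training text, and words not seen there are dropped.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

train_docs = ["great great movie", "terrible movie"]  # toy training corpus
test_docs = ["great unseen movie"]                    # "unseen" is not in the training vocabulary

count_vec = CountVectorizer()
count_train_demo = count_vec.fit_transform(train_docs)  # learns the vocabulary
count_test_demo = count_vec.transform(test_docs)        # reuses it; unknown words are dropped

tfidf_vec = TfidfVectorizer()
tfidf_train_demo = tfidf_vec.fit_transform(train_docs)

print(sorted(count_vec.vocabulary_))         # vocabulary terms in column order
print(count_train_demo.toarray())            # raw term counts per document
print(tfidf_train_demo.toarray().round(3))   # reweighted counts, each row scaled to unit length
```

Count vectors keep raw frequencies, while TF-IDF down-weights terms such as "movie" that appear in every document.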

3.5 Train the Base Learners

We train a random forest, a Naive Bayes model, and a support vector classifier on both the count data and the TF-IDF data.

Random forest (count data)

# Parameter grid for the grid search
rf_params = {"criterion":["gini","entropy"],
             "min_samples_split":[2,3],
             "max_depth":[None,2,3],
             "min_samples_leaf":[1,5],
             "max_leaf_nodes":[None],
             "oob_score":[True]}
# Create the random forest classifier
rf = RandomForestClassifier()
warnings.filterwarnings("ignore")
# Run the grid search with 5-fold cross-validation
rf_count = GridSearchCV(rf, rf_params, cv=5)
rf_count.fit(count_train, y_train)
# Predict classes and class probabilities
rf_count_predicted_values = rf_count.predict(count_test)
rf_count_probabilities = rf_count.predict_proba(count_test)
rf_count_train_accuracy = rf_count.score(count_train, y_train)
rf_count_test_accuracy = rf_count.score(count_test, y_test)
print('The accuracy for the training data is {}'.format(rf_count_train_accuracy))
print('The accuracy for the testing data is {}'.format(rf_count_test_accuracy))
# Evaluate the model
# (plot_confusion_matrix and target_names are assumed to be defined beforehand)
print(classification_report(y_test, rf_count_predicted_values))
cm = confusion_matrix(y_test, rf_count_predicted_values)
plt.figure()
plot_confusion_matrix(cm, classes=target_names,normalize=False)
plt.show()

Random forest (TF-IDF data)

# Parameter grid for the grid search
rf_params = {"criterion":["gini","entropy"],"min_samples_split":[2,3],"max_depth":[None,2,3],"min_samples_leaf":[1,5],"max_leaf_nodes":[None],"oob_score":[True]}
# Create the random forest classifier
rf = RandomForestClassifier()
warnings.filterwarnings("ignore")
# Run the grid search with 5-fold cross-validation
rf_tfidf = GridSearchCV(rf, rf_params, cv=5)
rf_tfidf.fit(tfidf_train, y_train)
# Predict classes and class probabilities
rf_tfidf_predicted_values = rf_tfidf.predict(tfidf_test)
rf_tfidf_probabilities = rf_tfidf.predict_proba(tfidf_test)
# Named rf_tfidf_* so that section 3.7 can look up the test accuracy
rf_tfidf_train_accuracy = rf_tfidf.score(tfidf_train, y_train)
rf_tfidf_test_accuracy = rf_tfidf.score(tfidf_test, y_test)
print('The accuracy for the training data is {}'.format(rf_tfidf_train_accuracy))
print('The accuracy for the testing data is {}'.format(rf_tfidf_test_accuracy))
# Evaluate the model
print(classification_report(y_test, rf_tfidf_predicted_values))
cm = confusion_matrix(y_test, rf_tfidf_predicted_values)
plt.figure()
plot_confusion_matrix(cm, classes=target_names,normalize=False)
plt.show()
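
Both random forest searches go through GridSearchCV, which keeps the winning settings on the fitted object. The sketch below shows how they might be inspected. It uses a small synthetic dataset from `make_classification` and a reduced grid so that it runs quickly; the data, the grid, and the `*_demo` names are assumptions for illustration, not the article's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the vectorized reviews
X_demo, y_demo = make_classification(n_samples=120, n_features=10, random_state=1)

param_grid = {"criterion": ["gini", "entropy"], "max_depth": [None, 3]}
search = GridSearchCV(RandomForestClassifier(n_estimators=20, random_state=1),
                      param_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)        # winning combination from the grid
print(search.best_score_)         # its mean cross-validated accuracy
best_rf = search.best_estimator_  # refit on all of X_demo with the best parameters
```

Calling `predict()` or `predict_proba()` on the fitted search object, as the article's code does, delegates to this refitted best estimator.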

Naive Bayes (count data)

nb_count = MultinomialNB()
nb_count.fit(count_train, y_train)
nb_count_predicted_values = nb_count.predict(count_test)
nb_count_probabilities = nb_count.predict_proba(count_test)
# Named nb_count_* so that section 3.7 can look up the test accuracy
nb_count_train_accuracy = nb_count.score(count_train, y_train)
nb_count_test_accuracy = nb_count.score(count_test, y_test)
print('The accuracy for the training data is {}'.format(nb_count_train_accuracy))
print('The accuracy for the testing data is {}'.format(nb_count_test_accuracy))
# Evaluate the model
print(classification_report(y_test, nb_count_predicted_values))
cm = confusion_matrix(y_test, nb_count_predicted_values)
plt.figure()
plot_confusion_matrix(cm, classes=target_names,normalize=False)
plt.show()

Naive Bayes (TF-IDF data)

nb_tfidf = MultinomialNB()
nb_tfidf.fit(tfidf_train, y_train)
nb_tfidf_predicted_values = nb_tfidf.predict(tfidf_test)
nb_tfidf_probabilities = nb_tfidf.predict_proba(tfidf_test)
# Named nb_tfidf_* so that section 3.7 can look up the test accuracy
nb_tfidf_train_accuracy = nb_tfidf.score(tfidf_train, y_train)
nb_tfidf_test_accuracy = nb_tfidf.score(tfidf_test, y_test)
print('The accuracy for the training data is {}'.format(nb_tfidf_train_accuracy))
print('The accuracy for the testing data is {}'.format(nb_tfidf_test_accuracy))
# Evaluate the model
print(classification_report(y_test, nb_tfidf_predicted_values))
cm = confusion_matrix(y_test, nb_tfidf_predicted_values)
plt.figure()
plot_confusion_matrix(cm, classes=target_names,normalize=False)
plt.show()

Support vector classifier (count data)

# Linear-kernel SVC; probability=True enables predict_proba
svc_count = SVC(kernel='linear',probability=True)
svc_params = {'C':[0.001, 0.01, 0.1, 1, 10]}
svc_gcv_count = GridSearchCV(svc_count, svc_params, cv=5)
svc_gcv_count.fit(count_train, y_train)
svc_count_predicted_values = svc_gcv_count.predict(count_test)
svc_count_probabilities = svc_gcv_count.predict_proba(count_test)
svc_count_train_accuracy = svc_gcv_count.score(count_train, y_train)
svc_count_test_accuracy = svc_gcv_count.score(count_test, y_test)
print('The accuracy for the training data is {}'.format(svc_count_train_accuracy))
print('The accuracy for the testing data is {}'.format(svc_count_test_accuracy))
# Evaluate the model
print(classification_report(y_test, svc_count_predicted_values))
cm = confusion_matrix(y_test, svc_count_predicted_values)
plt.figure()
plot_confusion_matrix(cm, classes=target_names,normalize=False)
plt.show()

Support vector classifier (TF-IDF data)

# Linear-kernel SVC; probability=True enables predict_proba
svc_tfidf = SVC(kernel='linear',probability=True)
svc_params = {'C':[0.001, 0.01, 0.1, 1, 10]}
svc_gcv_tfidf = GridSearchCV(svc_tfidf, svc_params, cv=5)
svc_gcv_tfidf.fit(tfidf_train, y_train)
svc_tfidf_predicted_values = svc_gcv_tfidf.predict(tfidf_test)
svc_tfidf_probabilities = svc_gcv_tfidf.predict_proba(tfidf_test)
svc_tfidf_train_accuracy = svc_gcv_tfidf.score(tfidf_train, y_train)
svc_tfidf_test_accuracy = svc_gcv_tfidf.score(tfidf_test, y_test)
print('The accuracy for the training data is {}'.format(svc_tfidf_train_accuracy))
print('The accuracy for the testing data is {}'.format(svc_tfidf_test_accuracy))
# Evaluate the model
print(classification_report(y_test, svc_tfidf_predicted_values))
cm = confusion_matrix(y_test, svc_tfidf_predicted_values)
plt.figure()
plot_confusion_matrix(cm, classes=target_names,normalize=False)
plt.show()

3.6 Plot the ROC Curves

# Random forest on count data, as an example
fpr, tpr, thresholds = roc_curve(y_test, rf_count_probabilities[:,1])
roc_auc = auc(y_test, rf_count_probabilities[:,1])
plt.title('ROC Random Forest Count Data')
plt.plot(fpr, tpr, 'b',label='AUC = %0.3f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.0])
plt.ylim([-0.1,1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
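
The same plotting code applies to each of the six models, so it could be wrapped in a small helper. The sketch below is one way to do that; `plot_roc` is a hypothetical name, and the toy labels and scores are invented so that the AUC can be checked by hand: three of the four (positive, negative) pairs are ranked correctly, giving 0.75.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

def plot_roc(y_true, positive_scores, title):
    """Plot one ROC curve with its AUC in the legend and return the AUC."""
    fpr, tpr, _ = roc_curve(y_true, positive_scores)
    roc_auc = roc_auc_score(y_true, positive_scores)
    plt.figure()
    plt.title(title)
    plt.plot(fpr, tpr, "b", label="AUC = %0.3f" % roc_auc)
    plt.plot([0, 1], [0, 1], "r--")  # chance line
    plt.legend(loc="lower right")
    plt.ylabel("True Positive Rate")
    plt.xlabel("False Positive Rate")
    return roc_auc

# Toy check with hand-verifiable scores
auc_demo = plot_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8], "ROC demo")
print(auc_demo)
```

For the models in this article, the second argument would be the positive-class column of a probability array, for example `nb_tfidf_probabilities[:,1]`.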

3.7 Plot the Ensemble ROC Curve and Compute the Ensemble Accuracy

# Max-voting ensemble on the predicted labels
# (an ensemble ROC curve would instead need the averaged class probabilities)
# Stack the count-data predictions, shape (n_models, n_samples)
predicted_values_count = np.array([rf_count_predicted_values, 
                                   nb_count_predicted_values, 
                                   svc_count_predicted_values])
# Stack the TF-IDF predictions
predicted_values_tfidf = np.array([rf_tfidf_predicted_values, 
                                   nb_tfidf_predicted_values, 
                                   svc_tfidf_predicted_values])
# Majority vote per sample; np.ravel gives a 1-D result on old and new SciPy versions
ensemble_count = np.ravel(mode(predicted_values_count, axis=0).mode)
ensemble_tfidf = np.ravel(mode(predicted_values_tfidf, axis=0).mode)

# Plot the test accuracy of each model and of the ensemble
# (separate names so the tfidf vectorizer is not overwritten)
count_acc = np.array([rf_count_test_accuracy,
                      nb_count_test_accuracy,
                      svc_count_test_accuracy,
                      accuracy_score(y_test, ensemble_count)])
tfidf_acc = np.array([rf_tfidf_test_accuracy,
                      nb_tfidf_test_accuracy,
                      svc_tfidf_test_accuracy,
                      accuracy_score(y_test, ensemble_tfidf)])
label_list = ["Random Forest", "Naive_Bayes", "SVM_Linear", "Ensemble"]
plt.plot(count_acc)
plt.plot(tfidf_acc)
plt.xticks([0,1,2,3],label_list)
for i in range(4):
    plt.text(x=i,y=(count_acc[i]+0.001), s=np.round(count_acc[i],4))
for i in range(4):
    plt.text(x=i,y=tfidf_acc[i]-0.003, s=np.round(tfidf_acc[i],4))
plt.legend(["Count","TFIDF"])
plt.title("Test accuracy")
plt.tight_layout(pad=1,rect=(0, 0, 2.5, 2))
plt.show()
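
Section 1 describes averaging the probabilities of the count-data and TF-IDF models before plotting the ensemble ROC curve. A minimal sketch of that averaging (soft voting) is shown below. The probability matrices and labels are invented for illustration; in the article's setting they would be arrays such as `rf_count_probabilities` and `rf_tfidf_probabilities`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical predict_proba outputs of two models, columns = [P(class 0), P(class 1)]
proba_model_1 = np.array([[0.9, 0.1], [0.4, 0.6], [0.3, 0.7], [0.8, 0.2]])
proba_model_2 = np.array([[0.7, 0.3], [0.6, 0.4], [0.1, 0.9], [0.6, 0.4]])
y_true_demo = np.array([0, 1, 1, 0])

# Soft voting: element-wise mean of the probability matrices
ensemble_proba = np.mean([proba_model_1, proba_model_2], axis=0)
ensemble_auc = roc_auc_score(y_true_demo, ensemble_proba[:, 1])

print(ensemble_proba[:, 1])  # averaged positive-class probability per sample
print(ensemble_auc)          # AUC of the averaged scores
```

The averaged positive-class column can then be passed to `roc_curve` in the same way as a single model's probabilities. Unlike max voting, this keeps a continuous score, which is what an ROC curve needs.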

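Summary

The steps above cover both the text classification task and the movie review sentiment analysis. For text classification, we built models with several algorithms on count data and TF-IDF data and evaluated their performance. For the movie reviews, we started from dataset preparation, went through preprocessing, model training, and evaluation, and obtained performance metrics for each individual model and for the ensemble. Comparing accuracy, ROC curves, and AUC scores across the models makes it possible to choose the one best suited to the sentiment classification task.

The table below summarizes the full movie review sentiment analysis workflow:
| Step | Operation |
| ---- | ---- |
| 1 | Import the required libraries |
| 2 | Separate the target and predictor variables |
| 3 | Split the data into training and test sets |
| 4 | Transform the text with CountVectorizer and TfidfVectorizer |
| 5 | Train the random forest, Naive Bayes, and support vector classifier models (count data and TF-IDF data) |
| 6 | Evaluate the models (accuracy, classification report, confusion matrix) |
| 7 | Plot the ROC curves |
| 8 | Compute the ensemble predictions and plot the ensemble ROC curve and the test accuracy |

These steps and the accompanying code provide a practical way to carry out text classification and movie review sentiment analysis.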