[Co-created Case] Building a spaCy NER Model with In-depth EDA: Twitter Sentiment Phrase Extraction

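This case was contributed by developer 翟羽佳 (Zhai Yujia), Tianjin Normal University collaborative education project.

I. Overview

1. Case introduction

Social media has become a major platform where users worldwide express emotions and opinions, and Twitter, a typical example, generates massive amounts of text every day. Sentiment analysis, an important branch of natural language processing, plays a key role in areas such as public-opinion monitoring and brand-reputation analysis. Traditional sentiment analysis mostly relies on rules or simple bag-of-words models, which struggle to capture complex semantics and domain-specific sentiment expressions. Tweets contain many colloquial, informal phrases; extracting these sentiment phrases precisely and building an efficient sentiment-analysis model on top of them is a key challenge in making social-media sentiment analysis more accurate and practical.

Data science value

Through hands-on practice, you build a customized NER model with spaCy and work on improving sentiment-analysis accuracy through data augmentation and model tuning. The model quickly extracts key sentiment information from large volumes of Twitter text, which supports public-opinion monitoring and market-trend analysis and helps with business decisions and public-event analysis.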

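2. Intended audience

Enterprises
Individual developers
University students

3. Case duration

This case is expected to take about 60 minutes in total.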

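4. Case workflow

Notes:

Configure the AI Notebook and the runtime environment;
Download the files from OBS;
Edit and run the code.

5. Resource overview

This case is expected to cost 0 yuan.

Resource name	Specification	Unit price (yuan)	Duration (minutes)
Developer Space AI Notebook	NPU basic | 1 * NPU 910B | 8v CPU | 24GB | euler2.9-py310-torch2.1.0-cann8.0-openmind0.9.1-notebook	0	60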

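II. Developer Space AI Notebook and runtime environment configuration

1. Developer Space AI Notebook configuration

This case uses the Developer Space AI Notebook to write and run the code. The Huawei Developer Space Notebook is a one-stop cloud development tool for developers, mainly used for AI development, data analysis, model training, and similar scenarios.

Go directly to the Developer Space workbench.

In the workbench, find AI Notebook and click "Go Now" (立即前往).

On the AI Notebook page, select the NPU environment and click "Start Now" (立即启动).

After a short wait, click "View Notebook" (查看Notebook) to open the Notebook home page.

The Notebook is now open.

2. Runtime environment configuration

Once the Notebook is open, click Python 3 under the Notebook (笔记) section to create a file for writing code.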

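Install the third-party libraries (palettable is included because the donut charts in the code import it):

pip install numpy

pip install pandas

pip install matplotlib

pip install seaborn

pip install plotly

pip install pillow

pip install wordcloud

pip install nltk

pip install tqdm

pip install spacy

pip install palettable

Note: if installation fails, you can install from a mirror inside China, using the current package names:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple numpy pandas matplotlib seaborn plotly pillow wordcloud nltk tqdm spacy palettable

The system installs the packages and their dependencies automatically. When installation finishes, it lists all successfully installed libraries, as shown in the figure below:

Once installation succeeds, running the script takes you through the full Python data-processing workflow and gives you the basis for migrating a local model to Huawei Cloud for production-grade deployment.

Note: after the packages are installed, restart the kernel manually so the environment picks them up: click Kernel > Restart Kernel (内核 > 重启内核).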
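
After the kernel restarts, a quick check can confirm that the packages are visible to the interpreter. The following is a minimal sketch and not part of the original case: the helper name `installed_versions` and the package list are assumptions for illustration, and it relies only on the standard-library `importlib.metadata`.

```python
from importlib import metadata

# Distribution names as passed to pip in this case (assumed list for illustration)
PACKAGES = ["numpy", "pandas", "matplotlib", "seaborn", "plotly",
            "pillow", "wordcloud", "nltk", "tqdm", "spacy"]

def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

if __name__ == "__main__":
    # Print one line per package so missing ones are easy to spot
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name:12s} {version or 'NOT INSTALLED'}")
```

Any package reported as not installed can be reinstalled with the pip commands above before moving on.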

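III. Download the files from OBS

1. Download the required files from OBS

To make the project easier to run, the files have been uploaded to OBS in advance and can be downloaded directly in the Notebook through the share link.

The archive contains all the files used in the following steps:

OBS link: https://case-aac4.obs.cn-north-4.myhuaweicloud.com/data%26models.zip%20

Download command (the URL is quoted, and -O saves the archive under a plain file name):

!wget -O data_models.zip "https://case-aac4.obs.cn-north-4.myhuaweicloud.com/data%26models.zip%20"

Unzip the archive and rename the extracted folder:

!unzip data_models.zip

!mv 'data&models' data_models

Note: & is a shell metacharacter, so a literal & must be quoted or escaped with a backslash. The folder is renamed here to avoid that ambiguity in later commands.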
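
Python string literals need no shell quoting, so the same unzip-and-rename step can also be done from a notebook cell without worrying about `&`. The following is a minimal sketch, assuming the archive unpacks to a single top-level folder; the function name `extract_and_rename` is hypothetical, and the archive and folder names are passed in by the caller.

```python
import shutil
import zipfile
from pathlib import Path

def extract_and_rename(archive, extracted_name, target_name, workdir="."):
    """Extract `archive` into `workdir`, then rename the extracted folder.

    All names are supplied by the caller; nothing here is specific
    to the OBS archive used in this case.
    """
    workdir = Path(workdir)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(workdir)
    source = workdir / extracted_name
    target = workdir / target_name
    if not source.is_dir():
        raise FileNotFoundError(f"expected folder {source} after extraction")
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    shutil.move(str(source), str(target))
    return target

# Example call (folder names taken from the shell commands in this section):
# extract_and_rename("<downloaded archive>", "data&models", "data_models")
```

The function refuses to overwrite an existing target folder, so running it twice raises an error instead of silently mixing two extractions.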

IV. Edit and run the code

1. Copy the code into the Notebook

The complete code is shown below; copy it directly into the Notebook:

# Import the required libraries
import re  # regular expressions
import string  # string constants such as punctuation
import numpy as np  # numerical computing
import random  # random number generation
import pandas as pd  # data processing and analysis
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # statistical plotting
from plotly import graph_objs as go  # interactive charts
import plotly.express as px  # high-level interactive charts
import plotly.figure_factory as ff  # composite charts such as distplots
from collections import Counter  # counting
from PIL import Image  # image handling
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator  # word clouds
import nltk  # natural language processing
from nltk.corpus import stopwords  # stop-word lists
from tqdm import tqdm  # progress bars
import os  # file and directory operations
import spacy  # NLP pipeline and NER training
from spacy.util import compounding, minibatch  # batching helpers for training
import warnings  # warning control
from pathlib import Path  # filesystem paths
from spacy.training import Example  # training examples (spaCy v3)
warnings.filterwarnings("ignore")  # suppress warnings

# Walk the data directory and print every file path
# (used to check where the data files are located)
for dirname, _, filenames in os.walk('data_models'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Random colour generator
# (produces random colours for charts and word clouds)
def random_colours(number_of_colors):
    colors = []
    # Generate the requested number of random colours
    for i in range(number_of_colors):
        # Pick six random hexadecimal digits to form a colour code
        colors.append("#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)]))
    return colors
# Read the training set from the local path
train = pd.read_csv(r"data_models/train.csv")
# Read the test set from the local path
test = pd.read_csv(r"data_models/test.csv")
# Read the sample submission file from the local path
ss = pd.read_csv(r"data_models/sample_submission.csv")
# Print the shape (rows, columns) of the training set
print(train.shape)
# Print the shape (rows, columns) of the test set
print(test.shape)
# Drop rows with missing values
train.dropna(inplace=True)
# Group the training set by sentiment, count the tweets in each class, and sort in descending order
temp = train.groupby('sentiment').count()['text'].reset_index().sort_values(by='text', ascending=False)
# Add a background gradient (Purples colour map) to the temp DataFrame
temp.style.background_gradient(cmap='Purples')
# Create a 12x6 inch Matplotlib figure
plt.figure(figsize=(12, 6))
# Draw a Seaborn count plot with one bar per sentiment class on the x axis
sns.countplot(x='sentiment', data=train)
# Save the image locally; you can change the path and file name
save_path = 'sentiment_countplot.png'
plt.savefig(save_path, dpi=300, bbox_inches='tight')
# Define the Jaccard similarity function
# Jaccard similarity measures how similar two sets are; it ranges from 0 to 1, and values closer to 1 mean more similar sets
def jaccard(str1, str2):
    # Lowercase each string, split it on whitespace, and turn the words into a set
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    # Intersection of the two sets
    c = a.intersection(b)
    # Jaccard score: size of the intersection divided by size of the union
    return float(len(c)) / (len(a) + len(b) - len(c))
# List that collects the Jaccard results
results_jaccard = []
# Iterate over every row of the training set
for ind, row in train.iterrows():
    # The text column is the first sentence
    sentence1 = row.text
    # The selected_text column is the second sentence
    sentence2 = row.selected_text
    # Compute the Jaccard score of the two sentences
    jaccard_score = jaccard(sentence1, sentence2)
    # Store both sentences and their score
    results_jaccard.append([sentence1, sentence2, jaccard_score])
# Convert the results to a DataFrame
# (named jaccard_df so that it does not overwrite the jaccard function)
jaccard_df = pd.DataFrame(results_jaccard, columns=["text", "selected_text", "jaccard_score"])
# Outer-join the Jaccard scores onto the training set
train = train.merge(jaccard_df, how='outer')
# Count the words in each selected_text and store the result in the new column Num_words_ST
train['Num_words_ST'] = train['selected_text'].apply(lambda x: len(str(x).split()))
# Count the words in each text and store the result in the new column Num_word_text
train['Num_word_text'] = train['text'].apply(lambda x: len(str(x).split()))
# Store the difference between the two word counts in the new column difference_in_words
train['difference_in_words'] = train['Num_word_text'] - train['Num_words_ST']
# Collect Num_words_ST and Num_word_text for the distribution histogram
hist_data = [train['Num_words_ST'], train['Num_word_text']]
# Labels for the two data sets in the histogram
group_labels = ['Selected_Text', 'Text']
# Create the distribution histogram with Plotly's create_distplot, without the curve
fig = ff.create_distplot(hist_data, group_labels, show_curve=False)
# Set the title
fig.update_layout(title_text='Distribution of Number Of words')
# Set the figure size and background colour
fig.update_layout(
    autosize=False,
    width=900,
    height=700,
    paper_bgcolor="LightSteelBlue",
)
# Show the histogram
fig.show()
# Save the chart locally; you can change the path and file name
# (a Plotly figure is saved with write_html; plt.savefig only handles Matplotlib figures)
save_path = 'sentiment_distplot.html'
fig.write_html(save_path)
# Set the figure size
plt.figure(figsize=(12, 6))
# Plot the kernel density estimate of Num_words_ST in red with a filled area, and set the title
# (fill=True replaces the deprecated shade=True)
p1 = sns.kdeplot(train['Num_words_ST'], fill=True, color="r").set_title('Kernel Distribution of Number Of words')
# Plot the kernel density estimate of Num_word_text in blue with a filled area
p1 = sns.kdeplot(train['Num_word_text'], fill=True, color="b")
# Save the image locally; you can change the path and file name
save_path = 'word_kdeplot.png'
plt.savefig(save_path, dpi=300, bbox_inches='tight')
# Set the figure size
plt.figure(figsize=(12, 6))
# For positive tweets, plot the kernel density estimate of difference_in_words in blue, and set the title
p1 = sns.kdeplot(train[train['sentiment'] == 'positive']['difference_in_words'], fill=True, color="b").set_title('Kernel Distribution of Difference in Number Of words')
# For negative tweets, plot the kernel density estimate of difference_in_words in red
p2 = sns.kdeplot(train[train['sentiment'] == 'negative']['difference_in_words'], fill=True, color="r")
# Save the image locally; you can change the path and file name
save_path = 'word_difference_kdeplot.png'
plt.savefig(save_path, dpi=300, bbox_inches='tight')
# Set the figure size
plt.figure(figsize=(12, 6))
# For neutral tweets, plot a histogram of difference_in_words without the density curve
# (histplot replaces the deprecated distplot)
sns.histplot(train[train['sentiment'] == 'neutral']['difference_in_words'], kde=False)
# Save the image locally; you can change the path and file name
save_path = 'difference_in_words_distplot.png'
plt.savefig(save_path, dpi=300, bbox_inches='tight')
# Set the figure size
plt.figure(figsize=(12, 6))
# For positive tweets, plot the kernel density estimate of jaccard_score in blue, and set the title
p1 = sns.kdeplot(train[train['sentiment'] == 'positive']['jaccard_score'], fill=True, color="b").set_title('KDE of Jaccard Scores across different Sentiments')
# For negative tweets, plot the kernel density estimate of jaccard_score in red
p2 = sns.kdeplot(train[train['sentiment'] == 'negative']['jaccard_score'], fill=True, color="r")
# Add a legend for the positive and negative curves
plt.legend(labels=['positive', 'negative'])
# Save the image locally; you can change the path and file name
save_path = 'jaccard_score_kdeplot.png'
plt.savefig(save_path, dpi=300, bbox_inches='tight')
# Set the figure size
plt.figure(figsize=(12, 6))
# For neutral tweets, plot a histogram of jaccard_score without the density curve
sns.histplot(train[train['sentiment'] == 'neutral']['jaccard_score'], kde=False)
# Save the image locally; you can change the path and file name
save_path = 'jaccard_score_distplot.png'
plt.savefig(save_path, dpi=300, bbox_inches='tight')
# Select the rows of train whose Num_word_text is at most 2 and store them in k
k = train[train['Num_word_text'] <= 2]
# Keep only the sentiment and jaccard_score columns
jaccard_scores = k[['sentiment', 'jaccard_score']]
# Group by sentiment and take the mean jaccard_score of each group
result = jaccard_scores.groupby('sentiment').mean()['jaccard_score']
# Print the mean Jaccard score of each sentiment class
print(result)
# Select the positive rows of k for further inspection
k[k['sentiment'] == 'positive']
def clean_text(text):
    # Convert the text to lowercase
    text = str(text).lower()
    # Remove text inside square brackets (for example [...])
    text = re.sub(r'\[.*?\]', '', text)
    # Remove links (URLs starting with http, https, or www)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove HTML tags (for example <br />)
    text = re.sub(r'<.*?>+', '', text)
    # Remove punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Remove newline characters
    text = re.sub(r'\n', '', text)
    # Remove words that contain digits
    text = re.sub(r'\w*\d\w*', '', text)
    # Return the cleaned text
    return text
# Clean the text column of train
train['text'] = train['text'].apply(lambda x: clean_text(x))
# Clean the selected_text column of train
train['selected_text'] = train['selected_text'].apply(lambda x: clean_text(x))
# Split each selected_text value into a list of words (on whitespace)
train['temp_list'] = train['selected_text'].apply(lambda x: str(x).split())
# Flatten all sub-lists into one sequence of words and count them
top = Counter([item for sublist in train['temp_list'] for item in sublist])
# Convert the 20 most common words and their counts to a DataFrame
temp = pd.DataFrame(top.most_common(20))
# Name the two columns
temp.columns = ['Common_words', 'count']
# Add a background gradient in blue tones
temp.style.background_gradient(cmap='Blues')
# Create a horizontal Plotly bar chart of the most common words and their counts
fig = px.bar(temp, x="count", y="Common_words", title='Common Words in Selected Text', orientation='h',
             width=700, height=700, color='Common_words')
fig.show()
# Save the chart locally; you can change the path and file name
# (a Plotly figure is saved with write_html; plt.savefig only handles Matplotlib figures)
save_path = 'Common_Words_barplot.html'
fig.write_html(save_path)
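# Define a function that removes English stop words from a list of words
# (the NLTK stop-word list is downloaded and loaded once, not on every call)
nltk.download('stopwords', quiet=True)
ENGLISH_STOPWORDS = set(stopwords.words('english'))
def remove_stopword(x):
    return [y for y in x if y not in ENGLISH_STOPWORDS]
# Remove all English stop words from the temp_list column
train['temp_list'] = train['temp_list'].apply(lambda x: remove_stopword(x))
# Count the words again after stop-word removal
top = Counter([item for sublist in train['temp_list'] for item in sublist])
# Build a new DataFrame with the 20 most common words and their counts
temp = pd.DataFrame(top.most_common(20))
# Drop the first row (the top token is treated as stop-word-like noise)
temp = temp.iloc[1:, :]
# Name the two columns
temp.columns = ['Common_words', 'count']
# Add a background gradient in purple tones
temp.style.background_gradient(cmap='Purples')
# Create a Plotly treemap of the most common words and their counts
fig = px.treemap(temp, path=['Common_words'], values='count', title='Tree of Most Common Words')
fig.show()
# Save the chart locally; you can change the path and file name
# (a Plotly figure is saved with write_html; plt.savefig only handles Matplotlib figures)
save_path = 'Common_Words_treemap.html'
fig.write_html(save_path)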
# Split each text value into a list of words (on whitespace)
train['temp_list1'] = train['text'].apply(lambda x: str(x).split())
# Remove all English stop words from the temp_list1 column
train['temp_list1'] = train['temp_list1'].apply(lambda x: remove_stopword(x))
# Flatten all sub-lists into one sequence of words and count them
top = Counter([item for sublist in train['temp_list1'] for item in sublist])
# Convert the 25 most common words and their counts to a DataFrame
temp = pd.DataFrame(top.most_common(25))
# Drop the first row (the top token is treated as stop-word-like noise)
temp = temp.iloc[1:, :]
# Name the two columns
temp.columns = ['Common_words', 'count']
# Add a background gradient in blue tones
temp.style.background_gradient(cmap='Blues')
# Create a horizontal bar chart with Plotly
fig = px.bar(
    temp,          # data source: the DataFrame of words and counts
    x="count",     # x axis: word count
    y="Common_words",  # y axis: the most common words
    title='Common Words in Text',  # chart title
    orientation='h',  # horizontal bars
    width=700,        # chart width in pixels
    height=700,       # chart height in pixels
    color='Common_words'  # colour the bars by word
)
# Show the chart
fig.show()
# Save the chart locally under its own file name so the earlier bar chart is not overwritten
# (a Plotly figure is saved with write_html; plt.savefig only handles Matplotlib figures)
save_path = 'Common_Words_Text_barplot.html'
fig.write_html(save_path)
# Select the tweets whose sentiment is positive
Positive_sent = train[train['sentiment'] == 'positive']
# Select the tweets whose sentiment is negative
Negative_sent = train[train['sentiment'] == 'negative']
# Select the tweets whose sentiment is neutral
Neutral_sent = train[train['sentiment'] == 'neutral']
# ******************* Most common positive words *******************
# Flatten all sub-lists into one sequence of words and count them
top = Counter([item for sublist in Positive_sent['temp_list'] for item in sublist])
# Convert the 20 most common positive words and their counts to a DataFrame
temp_positive = pd.DataFrame(top.most_common(20))
# Name the two columns
temp_positive.columns = ['Common_words', 'count']
# Add a background gradient in green tones
temp_positive.style.background_gradient(cmap='Greens')
# Create a horizontal Plotly bar chart of the most common positive words
fig = px.bar(
    temp_positive,  # data source: common positive words and counts
    x="count",  # x axis: word count
    y="Common_words",  # y axis: the most common words
    title='Most Common Positive Words',  # chart title
    orientation='h',  # horizontal bars
    width=700,  # chart width in pixels
    height=700,  # chart height in pixels
    color='Common_words'  # colour the bars by word
)
fig.show()  # show the chart
# Save the chart locally; you can change the path and file name
# (a Plotly figure is saved with write_html; plt.savefig only handles Matplotlib figures)
save_path = 'Positive_Words_barplot.html'
fig.write_html(save_path)
# ******************* Most common negative words *******************
# Flatten all sub-lists into one sequence of words and count them
top = Counter([item for sublist in Negative_sent['temp_list'] for item in sublist])
# Convert the 20 most common negative words and their counts to a DataFrame
temp_negative = pd.DataFrame(top.most_common(20))
# Drop the first row (the top token is treated as stop-word-like noise)
temp_negative = temp_negative.iloc[1:, :]
# Name the two columns
temp_negative.columns = ['Common_words', 'count']
# Add a background gradient in red tones
temp_negative.style.background_gradient(cmap='Reds')
# Create a Plotly treemap of the most common negative words
fig = px.treemap(
    temp_negative,  # data source: common negative words and counts
    path=['Common_words'],  # hierarchy of the treemap
    values='count',  # values: word counts
    title='Tree Of Most Common Negative Words'  # chart title
)
fig.show()  # show the chart
# Save the chart locally; you can change the path and file name
# (a Plotly figure is saved with write_html; plt.savefig only handles Matplotlib figures)
save_path = 'Negative_Words_treemap.html'
fig.write_html(save_path)
# ******************* Most common neutral words *******************
# Flatten all sub-lists into one sequence of words and count them
top = Counter([item for sublist in Neutral_sent['temp_list'] for item in sublist])
# Convert the 20 most common neutral words and their counts to a DataFrame
temp_neutral = pd.DataFrame(top.most_common(20))
# Drop the first row (the top token is treated as stop-word-like noise)
temp_neutral = temp_neutral.iloc[1:, :]
# Name the two columns
temp_neutral.columns = ['Common_words', 'count']
# Add a background gradient in red tones
temp_neutral.style.background_gradient(cmap='Reds')
# Create a horizontal Plotly bar chart of the most common neutral words
fig = px.bar(
    temp_neutral,  # data source: common neutral words and counts
    x="count",  # x axis: word count
    y="Common_words",  # y axis: the most common words
    title='Most Common Neutral Words',  # chart title
    orientation='h',  # horizontal bars
    width=700,  # chart width in pixels
    height=700,  # chart height in pixels
    color='Common_words'  # colour the bars by word
)
fig.show()  # show the chart
# Save the chart locally; you can change the path and file name
# (a Plotly figure is saved with write_html; plt.savefig only handles Matplotlib figures)
save_path = 'Neutral_Words_barplot.html'
fig.write_html(save_path)
# Create a Plotly treemap of the most common neutral words
fig = px.treemap(
    temp_neutral,  # data source: common neutral words and counts
    path=['Common_words'],  # hierarchy of the treemap
    values='count',  # values: word counts
    title='Tree Of Most Common Neutral Words'  # chart title
)
fig.show()  # show the chart
# Save the chart locally; you can change the path and file name
save_path = 'Neutral_Words_treemap.html'
fig.write_html(save_path)
# Build one list with every word from the sub-lists in train['temp_list1']
raw_text = [word for word_list in train['temp_list1'] for word in word_list]
def words_unique(sentiment, numwords, raw_words):
    # Collect every word that appears in tweets of the other sentiment classes
    # (a set removes duplicates and makes membership tests fast)
    allother = set()
    for item in train[train.sentiment != sentiment]['temp_list1']:
        allother.update(item)
    # Keep the words that never appear outside the current sentiment class
    specificnonly = {x for x in raw_words if x not in allother}
    # Count all words in tweets of the current sentiment class
    mycounter = Counter()
    for item in train[train.sentiment == sentiment]['temp_list1']:
        for word in item:
            mycounter[word] += 1
    # Drop the words that are not unique to the current sentiment class
    for word in list(mycounter):
        if word not in specificnonly:
            del mycounter[word]
    # Convert the result to a DataFrame with the numwords most frequent words
    Unique_words = pd.DataFrame(mycounter.most_common(numwords), columns=['words', 'count'])
    return Unique_words
# Get the top 20 words unique to the 'positive' class
Unique_Positive = words_unique('positive', 20, raw_text)
print("The top 20 unique words in Positive Tweets are:")
Unique_Positive.style.background_gradient(cmap='Greens')  # add a background gradient
# Create a Plotly treemap of the words unique to positive tweets
fig = px.treemap(Unique_Positive, path=['words'], values='count', title='Tree Of Unique Positive Words')
fig.show()
# Save the chart locally; you can change the path and file name
# (a Plotly figure is saved with write_html; plt.savefig only handles Matplotlib figures)
save_path = 'Unique_Positive_Words_treemap.html'
fig.write_html(save_path)
# Create a Matplotlib donut chart of the words unique to positive tweets
from palettable.colorbrewer.qualitative import Pastel1_7  # colour palette used by the donut charts
plt.figure(figsize=(16, 10))
my_circle = plt.Circle((0, 0), 0.7, color='white')  # white circle that forms the hole of the donut
plt.pie(Unique_Positive['count'], labels=Unique_Positive.words, colors=Pastel1_7.hex_colors)  # draw the pie
p = plt.gcf()
p.gca().add_artist(my_circle)  # add the white circle to the chart
plt.title('DoNut Plot Of Unique Positive Words')  # add the title
# Save the image before plt.show(), which would otherwise leave an empty figure to save
save_path = 'Unique_Positive_Words_colorbrewer.png'
plt.savefig(save_path, dpi=300, bbox_inches='tight')
plt.show()
# Get the top 10 words unique to the 'negative' class
Unique_Negative = words_unique('negative', 10, raw_text)
print("The top 10 unique words in Negative Tweets are:")
Unique_Negative.style.background_gradient(cmap='Reds')  # add a background gradient
# Create a Matplotlib donut chart of the words unique to negative tweets
plt.figure(figsize=(16, 10))
my_circle = plt.Circle((0, 0), 0.7, color='white')  # white circle that forms the hole of the donut
plt.rcParams['text.color'] = 'black'  # set the text colour to black
plt.pie(Unique_Negative['count'], labels=Unique_Negative.words, colors=Pastel1_7.hex_colors)  # draw the pie
p = plt.gcf()
p.gca().add_artist(my_circle)  # add the white circle to the chart
plt.title('DoNut Plot Of Unique Negative Words')  # add the title
# Save the image before plt.show(), which would otherwise leave an empty figure to save
save_path = 'Unique_Negative_Words_colorbrewer.png'
plt.savefig(save_path, dpi=300, bbox_inches='tight')
plt.show()
# Get the top 10 words unique to the 'neutral' class
Unique_Neutral = words_unique('neutral', 10, raw_text)
print("The top 10 unique words in Neutral Tweets are:")
Unique_Neutral.style.background_gradient(cmap='Oranges')  # add a background gradient
# Create a Matplotlib donut chart of the words unique to neutral tweets
plt.figure(figsize=(16, 10))
my_circle = plt.Circle((0, 0), 0.7, color='white')  # white circle that forms the hole of the donut
plt.pie(Unique_Neutral['count'], labels=Unique_Neutral.words, colors=Pastel1_7.hex_colors)  # draw the pie
p = plt.gcf()
p.gca().add_artist(my_circle)  # add the white circle to the chart
plt.title('DoNut Plot Of Unique Neutral Words')  # add the title
# Save the image before plt.show(), which would otherwise leave an empty figure to save
save_path = 'Unique_Neutral_Words_colorbrewer.png'
plt.savefig(save_path, dpi=300, bbox_inches='tight')
plt.show()
def plot_wordcloud(text, mask=None, max_words=200, max_font_size=100, figure_size=(24.0, 16.0), color='white',
                   title=None, title_size=40, image_color=False):
    # Start from the default stop words
    stopwords = set(STOPWORDS)
    # Add custom stop words (such as 'u' and 'im')
    more_stopwords = {'u', "im"}
    stopwords = stopwords.union(more_stopwords)
    # Create the WordCloud object
    wordcloud = WordCloud(
        background_color=color,  # background colour
        stopwords=stopwords,    # custom stop-word set
        max_words=max_words,    # maximum number of words shown
        max_font_size=max_font_size,  # maximum font size
        random_state=42,        # fixed seed for reproducible layout
        width=400,              # width of the word cloud
        height=200,             # height of the word cloud
    )
    # Generate the word cloud from all tweets joined into one string
    # (str() on a pandas Series would only give its truncated printed preview)
    wordcloud.generate(' '.join(str(t) for t in text))
    # Set the figure size
    plt.figure(figsize=figure_size)
    plt.imshow(wordcloud, interpolation='bilinear')  # draw the word cloud on the figure
    plt.title(title, fontdict={'size': title_size, 'color': 'black', 'verticalalignment': 'bottom'})  # add the title
    plt.axis('off')  # hide the axes
    plt.tight_layout()  # adjust the layout
# Generate the word cloud for neutral tweets
plot_wordcloud(Neutral_sent.text, color='white', max_font_size=100, title_size=30, title="WordCloud of Neutral Tweets")
# Save the image locally; you can change the path and file name
save_path = 'Neutral_Words_wordcloud.png'
plt.savefig(save_path, dpi=300, bbox_inches='tight')
# Generate the word cloud for positive tweets
plot_wordcloud(Positive_sent.text, title="Word Cloud Of Positive tweets", title_size=30)
# Save the image locally; you can change the path and file name
save_path = 'Positive_Words_wordcloud.png'
plt.savefig(save_path, dpi=300, bbox_inches='tight')
# Generate the word cloud for negative tweets
plot_wordcloud(Negative_sent.text, title="Word Cloud of Negative Tweets", color='white', title_size=30)
# Save the image locally; you can change the path and file name
save_path = 'Negative_Words_wordcloud.png'
plt.savefig(save_path, dpi=300, bbox_inches='tight')
# Read the training data, the test data, and the sample submission
df_train = pd.read_csv(r"data_models/train.csv")  # training data
df_test = pd.read_csv(r"data_models/test.csv")     # test data
df_submission = pd.read_csv(r"data_models/sample_submission.csv")  # sample submission
# Add the column Num_words_text with the number of words in each text
df_train['Num_words_text'] = df_train['text'].apply(lambda x: len(str(x).split()))
# Drop texts with fewer than 3 words to reduce noise
df_train = df_train[df_train['Num_words_text'] >= 3]
# Define a function that saves a model
def save_model(output_dir, nlp, new_model_name):
    if output_dir is not None:
        # Resolve the output folder under the current user's Desktop
        output_dir = os.path.expanduser(f"~/Desktop/{output_dir}")
        # Create the path if it does not exist
        os.makedirs(output_dir, exist_ok=True)
        # Set the model name in the metadata
        nlp.meta["name"] = new_model_name
        # Save the model to the given path
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)
# Define the training function
# Note: this definition rebinds the name `train`, which referred to the training
# DataFrame earlier in the listing; the code from here on uses df_train instead
def train(train_data, output_dir, n_iter=3, model=None):
    # Load the existing model if a model path is given
    if model is not None:
        nlp = spacy.load(model)  # load the existing spaCy model
        print("Loaded model '%s'" % model)  # print the loaded model name
    else:
        # Otherwise create a new blank English model
        nlp = spacy.blank("en")  # blank English pipeline
        print("Created blank 'en' model")  # report the created model

    # Check whether the pipeline already has an NER component
    if "ner" not in nlp.pipe_names:
        # If not, add one at the end of the pipeline
        nlp.add_pipe("ner", last=True)
    # Get the NER component
    ner = nlp.get_pipe("ner")

    # Add every entity label found in the training data to the NER component
    for _, annotations in train_data:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # Initialize the optimizer, depending on whether an existing model was given
    if model is None:
        optimizer = nlp.initialize()  # start fresh (spaCy v3 name for begin_training)
    else:
        optimizer = nlp.resume_training()  # continue the earlier training

    # Training loop
    for itn in range(n_iter):
        losses = {}  # loss dictionary for this iteration
        random.shuffle(train_data)  # shuffle the training data
        # Convert the training data to Example objects
        examples = []
        for text, annotations in train_data:
            doc = nlp.make_doc(text)  # tokenize the text into a Doc
            example = Example.from_dict(doc, annotations)  # build the Example
            examples.append(example)
        # Train in mini-batches whose size grows from 4 to 32
        batches = spacy.util.minibatch(examples, size=spacy.util.compounding(4.0, 32.0, 1.001))
        for batch in batches:
            nlp.update(batch, sgd=optimizer, drop=0.35, losses=losses)  # update the model
        print("Losses at iteration", itn, ":", losses)  # print the losses of this iteration

    # Save the trained model
    if output_dir is not None:
        output_dir = Path(output_dir)  # convert to a Path object
        output_dir.mkdir(parents=True, exist_ok=True)  # create the path if it does not exist
        nlp.to_disk(output_dir)  # save the model to the given path
        print("Saved model to", output_dir)  # print where the model was saved
# Define a function that returns the model output path
def get_model_out_path(sentiment):
    model_out_path = None
    if sentiment == 'positive':
        # Output path of the positive-sentiment model
        model_out_path = r"data_models/model_pos"
    elif sentiment == 'negative':
        # Output path of the negative-sentiment model
        model_out_path = r"data_models/model_neg"
    return model_out_path
# Define a function that builds the training data
def get_training_data(sentiment):
    # List that collects the training examples
    train_data = []
    # Iterate over every row of the training set
    for index, row in df_train.iterrows():
        # Keep only the rows of the requested sentiment class
        if row.sentiment == sentiment:
            # The selected phrase
            selected_text = row.selected_text
            # The original tweet
            text = row.text
            # Start offset of the selected phrase inside the tweet
            start = text.find(selected_text)
            # Skip rows whose selected phrase cannot be found in the tweet
            if start == -1:
                continue
            # End offset of the selected phrase inside the tweet
            end = start + len(selected_text)
            # Add the tweet and its entity span to the training data
            train_data.append((text, {"entities": [[start, end, 'selected_text']]}))
    # Return the training data
    return train_data
# Train the positive-sentiment model: build its training data and output path first
sentiment = 'positive'
train_data = get_training_data(sentiment)  # training data for positive tweets
model_path = get_model_out_path(sentiment)  # output path of the positive model
# Train for 3 iterations, starting from a blank model
train(train_data, model_path, n_iter=3, model=None)
# Train the negative-sentiment model: build its training data and output path first
sentiment = 'negative'
train_data = get_training_data(sentiment)  # training data for negative tweets
model_path = get_model_out_path(sentiment)  # output path of the negative model
# Train for 3 iterations, starting from a blank model
train(train_data, model_path, n_iter=3, model=None)
# Define the entity prediction function
def predict_entities(text, model):
    # Run the model on the input text
    doc = model(text)
    ent_array = []
    # Iterate over the predicted entities
    for ent in doc.ents:
        # Start and end offsets of the entity in the original text
        start = text.find(ent.text)
        end = start + len(ent.text)
        # Store the offsets and label of the entity, skipping duplicates
        new_int = [start, end, ent.label_]
        if new_int not in ent_array:
            ent_array.append([start, end, ent.label_])
    # Return the first predicted span if there is one, otherwise the original text
    selected_text = text[ent_array[0][0]: ent_array[0][1]] if len(ent_array) > 0 else text
    return selected_text
# List that collects the selected text of each test row
selected_texts = []
# Base path of the models (the folder that get_model_out_path saves into)
MODELS_BASE_PATH = r"data_models"
# Check that the model path is set
if MODELS_BASE_PATH is not None:
    print("Loading Models from ", MODELS_BASE_PATH)  # print where the models are loaded from
    # Load the positive-sentiment model
    model_pos = spacy.load(MODELS_BASE_PATH + '/model_pos')
    # Load the negative-sentiment model
    model_neg = spacy.load(MODELS_BASE_PATH + '/model_neg')
    # Iterate over the test set
    for index, row in df_test.iterrows():
        text = row.text  # text of the current row
        # Neutral tweets and tweets with at most 2 words keep the original text
        if row.sentiment == 'neutral' or len(text.split()) <= 2:
            selected_texts.append(text)
        # Positive tweets
        elif row.sentiment == 'positive':
            # Predict with the positive model and store the result
            selected_texts.append(predict_entities(text, model_pos))
        # Negative tweets
        else:
            # Predict with the negative model and store the result
            selected_texts.append(predict_entities(text, model_neg))
# Add the selected text to the test set
df_test['selected_text'] = selected_texts
# Copy the selected text of the test set into the submission
df_submission['selected_text'] = df_test['selected_text']
# Save the submission file locally
df_submission.to_csv("submission.csv", index=False)
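
The listing writes `submission.csv` without measuring how good the extracted phrases are. One way to get a rough estimate is to hold out part of the training data and average the word-level Jaccard score between the true and the predicted `selected_text`. The sketch below is an illustration under assumptions, not part of the original case: `mean_jaccard` is a hypothetical helper, the empty-string guard is an addition, and the toy phrases are made up.

```python
def jaccard(str1, str2):
    """Word-level Jaccard similarity between two strings (0.0 to 1.0)."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a & b
    union_size = len(a) + len(b) - len(c)
    # Guard added for illustration: two empty strings would otherwise divide by zero
    if union_size == 0:
        return 1.0
    return len(c) / union_size

def mean_jaccard(true_texts, predicted_texts):
    """Average Jaccard score over paired true and predicted phrases."""
    if len(true_texts) != len(predicted_texts):
        raise ValueError("inputs must have the same length")
    if not true_texts:
        raise ValueError("inputs must not be empty")
    scores = [jaccard(t, p) for t, p in zip(true_texts, predicted_texts)]
    return sum(scores) / len(scores)

# Toy data, made up for illustration
truth = ["love this", "so sad", "great day"]
preds = ["I love this phone", "so sad", "day"]
print(mean_jaccard(truth, preds))
```

With the models from this case, `preds` would come from calling `predict_entities` on held-out rows of `df_train` and `truth` from their `selected_text` column.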