AutoGluon MultiModalPredictor: A Quick-Start Guide to Text Processing

Project: autogluon (AutoGluon: AutoML for Image, Text, Time Series, and Tabular Data). Project address: https://gitcode.com/gh_mirrors/au/autogluon

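AutoGluon is an automated machine learning (AutoML) toolkit. Its MultiModalPredictor component handles multimodal data that mixes text, image, numerical, and categorical features. This guide focuses on using MultiModalPredictor for text-only natural language processing (NLP) tasks.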

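Environment Setup

Before starting, install the required library and set up the environment. The leading ! on the pip line is Jupyter notebook syntax; in a terminal, run the same command without it.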

!pip install autogluon.multimodal

import numpy as np
import warnings
import matplotlib.pyplot as plt

warnings.filterwarnings('ignore')
np.random.seed(123)

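Sentiment Analysis

Sentiment analysis is a classic NLP task: deciding whether a piece of text expresses positive or negative sentiment. The demonstration uses the Stanford Sentiment Treebank (SST) dataset.

Loading and Exploring the Data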

from autogluon.core.utils.loaders import load_pd
train_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/train.parquet')
test_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/dev.parquet')
subsample_size = 1000  # subsample the training data for a quick demo
train_data = train_data.sample(n=subsample_size, random_state=0)
train_data.head(10)

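The data is stored as a table with a text column (sentence) and a label column (label), where 0 denotes negative sentiment and 1 denotes positive sentiment.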
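Before training, it can help to check how the two labels are distributed. The sketch below uses a handful of made-up rows that mirror the sentence/label layout described above; the sentences are illustrative and are not taken from SST. On the real DataFrame, train_data['label'].value_counts() gives the same kind of summary.

```python
from collections import Counter

# Toy rows mirroring the SST layout: one text column and a 0/1 label column.
# These sentences are invented for illustration only.
rows = [
    {"sentence": "a warm and funny film", "label": 1},
    {"sentence": "flat, lifeless and overlong", "label": 0},
    {"sentence": "an engaging, well-acted drama", "label": 1},
    {"sentence": "the plot never comes together", "label": 0},
    {"sentence": "a delight from start to finish", "label": 1},
]

# Count how many rows carry each label, then convert counts to shares.
label_counts = Counter(row["label"] for row in rows)
total = sum(label_counts.values())
label_share = {label: count / total for label, count in label_counts.items()}

print(label_counts)
print(label_share)
```

A heavily skewed split would be a reason to look at a metric such as F1 alongside accuracy.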

Model Training

from autogluon.multimodal import MultiModalPredictor
import uuid

model_path = f"./tmp/{uuid.uuid4().hex}-automm_sst"
predictor = MultiModalPredictor(label='label', eval_metric='acc', path=model_path)
predictor.fit(train_data, time_limit=180)

Key parameters:

label: name of the label column
eval_metric: evaluation metric (accuracy here)
path: directory where the model is saved
time_limit: training time limit in seconds
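The training code builds its path from uuid.uuid4().hex, presumably so that every run writes to its own directory instead of reusing an earlier one. The helper below is a small sketch of that naming pattern; make_model_path is a hypothetical name introduced here, not part of AutoGluon.

```python
import uuid
from pathlib import Path


def make_model_path(task_name, base_dir="./tmp"):
    """Return a fresh path of the form <base_dir>/<32 hex chars>-automm_<task_name>.

    Hypothetical helper: it only builds the path and does not create the directory.
    """
    return Path(base_dir) / f"{uuid.uuid4().hex}-automm_{task_name}"


path_a = make_model_path("sst")
path_b = make_model_path("sst")

print(path_a)
print(path_a != path_b)  # two calls yield two different directories
```

The resulting value can be passed as the path argument, for example path=str(path_a).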

Model Evaluation and Prediction

# Evaluate model performance on the test set
test_score = predictor.evaluate(test_data)
print(test_score)

# Evaluate with several metrics at once
test_score = predictor.evaluate(test_data, metrics=['acc', 'f1'])
print(test_score)

# Predict on individual sentences
sentence1 = "it's a charming and often affecting journey."
sentence2 = "It's slow, very, very, very slow."
predictions = predictor.predict({'sentence': [sentence1, sentence2]})
print('"Sentence":', sentence1, '"Predicted Sentiment":', predictions[0])
print('"Sentence":', sentence2, '"Predicted Sentiment":', predictions[1])

# Get the predicted class probabilities
probs = predictor.predict_proba({'sentence': [sentence1, sentence2]})
print('"Sentence":', sentence1, '"Predicted Class-Probabilities":', probs[0])
print('"Sentence":', sentence2, '"Predicted Class-Probabilities":', probs[1])
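To make the 'acc' and 'f1' numbers easier to read, here is a library-free sketch of what those two metrics compute for a binary task, and of how a class-probability row typically maps to a predicted label (the index of the largest probability). The probabilities and labels are placeholder values chosen for the arithmetic, not model output, and the helper functions are illustrative rather than AutoGluon's own implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Placeholder rows of [P(label=0), P(label=1)]; not produced by a model.
probs_demo = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.6], [0.7, 0.3], [0.55, 0.45]]
y_true = [1, 0, 0, 0, 1]

# Predicted label = index of the largest probability in each row.
y_pred = [max(range(len(row)), key=row.__getitem__) for row in probs_demo]

acc_value = accuracy(y_true, y_pred)
f1_value = f1_binary(y_true, y_pred)
print(y_pred, acc_value, f1_value)
```

Here three of five predictions are correct, and the positive class has one true positive, one false positive, and one false negative, so precision and recall are both 0.5.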

Saving and Loading the Model

# Load the model that fit() saved under model_path
loaded_predictor = MultiModalPredictor.load(model_path)
loaded_predictor.predict_proba({'sentence': [sentence1, sentence2]})

# Save to a new path
new_model_path = f"./tmp/{uuid.uuid4().hex}-automm_sst"
loaded_predictor.save(new_model_path)

Feature Extraction and Visualization

MultiModalPredictor can extract embedding representations of text:

embeddings = predictor.extract_embedding(test_data)
print(embeddings.shape)

# Visualize the embeddings in 2D with t-SNE
from sklearn.manifold import TSNE
X_embedded = TSNE(n_components=2, random_state=123).fit_transform(embeddings)
labels = test_data['label'].to_numpy()
for val, color in [(0, 'red'), (1, 'blue')]:
    mask = labels == val  # boolean mask selecting the rows with this label
    plt.scatter(X_embedded[mask, 0], X_embedded[mask, 1], c=color, label=f'label={val}')
plt.legend(loc='best')
plt.show()
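Besides plotting, one common way to use embedding vectors is to compare them directly with cosine similarity. The sketch below is library-free and uses short placeholder vectors standing in for rows of the embeddings array; real embeddings have many more dimensions, and the values here are chosen only to make the result easy to check.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Placeholder 4-dimensional vectors; not real model embeddings.
emb_a = [1.0, 2.0, 0.0, 1.0]
emb_b = [2.0, 4.0, 0.0, 2.0]  # same direction as emb_a
emb_c = [0.0, 0.0, 3.0, 0.0]  # orthogonal to emb_a

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
print(sim_ab, sim_ac)
```

Vectors pointing the same way score close to 1 and orthogonal vectors score 0, regardless of their lengths.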

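Sentence Similarity

Sentence similarity is a regression task: the model predicts a score for how semantically similar two sentences are.

Data Preparation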

sts_train_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sts/train.parquet')[['sentence1', 'sentence2', 'score']]
sts_test_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sts/dev.parquet')[['sentence1', 'sentence2', 'score']]
sts_train_data.head(10)

print('Min score=', min(sts_train_data['score']), ', Max score=', max(sts_train_data['score']))

Model Training and Evaluation

sts_model_path = f"./tmp/{uuid.uuid4().hex}-automm_sts"
predictor_sts = MultiModalPredictor(label='score', path=sts_model_path)
predictor_sts.fit(sts_train_data, time_limit=60)

# Evaluate with multiple metrics
test_score = predictor_sts.evaluate(sts_test_data, metrics=['rmse', 'pearsonr', 'spearmanr'])
print('RMSE = {:.2f}'.format(test_score['rmse']))
print('PEARSONR = {:.4f}'.format(test_score['pearsonr']))
print('SPEARMANR = {:.4f}'.format(test_score['spearmanr']))
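The three metrics answer different questions: RMSE measures how far predictions are from the gold scores, Pearson measures linear agreement, and Spearman measures agreement in ranking only. The library-free sketch below illustrates the definitions; the gold and predicted scores are placeholder values, and the functions are illustrative rather than the implementations AutoGluon uses.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson(x, y):
    """Pearson correlation coefficient (assumes neither input is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder scores; not taken from the STS data or from a trained model.
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.0, 1.5, 2.0, 4.0, 5.0]

rmse_value = rmse(gold, pred)
pearson_value = pearson(gold, pred)
spearman_value = spearman(gold, pred)
print(rmse_value, pearson_value, spearman_value)
```

In this example the predictions preserve the gold ordering exactly, so Spearman is 1 even though RMSE is nonzero and Pearson is below 1.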

Similarity Prediction Example

sentences = ['The child is riding a horse.',
             'The young boy is riding a horse.',
             'The young man is riding a horse.',
             'The young man is riding a bicycle.']

score1 = predictor_sts.predict({'sentence1': [sentences[0]], 'sentence2': [sentences[1]]}, as_pandas=False)
score2 = predictor_sts.predict({'sentence1': [sentences[0]], 'sentence2': [sentences[2]]}, as_pandas=False)
score3 = predictor_sts.predict({'sentence1': [sentences[0]], 'sentence2': [sentences[3]]}, as_pandas=False)
print(score1, score2, score3)
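The three calls above each score a single pair. Since predict accepts lists of sentences, the pairs could also be assembled into one input and scored in a single call. The sketch below only builds that input; build_pair_batch is a hypothetical helper name, and the predict call is left as a comment because it needs the trained predictor_sts.

```python
def build_pair_batch(anchor, candidates):
    """Pair one anchor sentence with each candidate, in the column layout
    used by the STS data (sentence1, sentence2). Hypothetical helper."""
    return {
        "sentence1": [anchor] * len(candidates),
        "sentence2": list(candidates),
    }


sentences = [
    "The child is riding a horse.",
    "The young boy is riding a horse.",
    "The young man is riding a horse.",
    "The young man is riding a bicycle.",
]

batch = build_pair_batch(sentences[0], sentences[1:])
print(batch)

# With the trained predictor from above, all three pairs could then be scored at once:
# scores = predictor_sts.predict(batch, as_pandas=False)
```

Each position in the two lists forms one sentence pair, so the returned scores would line up with sentences[1:] in order.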