Machine Learning: Sentiment Analysis

License: CC 4.0 BY-SA

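Notes on "Python Machine Learning", Chapter 8: Applying Machine Learning to Sentiment Analysis

Source code (git): https://github.com/xuman-Amy/sentimental-analysis

Project description: given 50,000 movie reviews collected from the Internet Movie Database (IMDb), predict whether each review is positive or negative.

(1) Clean and prepare the text data

(2) Build feature vectors from the dataset

(3) Train a model to separate positive from negative reviews

(4) Handle large datasets with out-of-core learning

(5) Infer topics from the document collection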

【1. Preparing the data】

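About the data: the dataset contains 50,000 movie reviews, each labeled positive or negative. Positive means the movie was rated with more than six stars on IMDb; negative means it was rated with fewer than five stars.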

【Loading the data】

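import pandas as pd

# Raw string (r"...") so the backslashes in the Windows path are not read as escape sequences.
df = pd.read_csv(r"G:\Machine Learning\python machine learning\python machine learning code\code\ch08\movie_data.csv")
# Preview the first five rows.
df.head()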
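
A hard-coded absolute path only works on one machine. The sketch below wraps the load in a small helper that takes any path and fails early if the expected columns are absent. The helper name `load_reviews` and the column names `review` and `sentiment` are assumptions for illustration; the text above does not list the columns of movie_data.csv.

```python
from pathlib import Path

import pandas as pd

# Assumed column names; adjust them to match the actual CSV header.
EXPECTED_COLUMNS = ("review", "sentiment")


def load_reviews(csv_path, expected_columns=EXPECTED_COLUMNS):
    """Read the review CSV and raise if an expected column is missing."""
    csv_path = Path(csv_path)
    df = pd.read_csv(csv_path, encoding="utf-8")
    missing = [col for col in expected_columns if col not in df.columns]
    if missing:
        raise ValueError(f"{csv_path} is missing columns: {missing}")
    return df
```

Calling `load_reviews("movie_data.csv")` from the directory that holds the file would then return the same DataFrame as the `pd.read_csv` call above.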

【bag-of-words】

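The bag-of-words model converts text into numerical feature vectors.

The basic idea behind bag-of-words:

(1) Build a vocabulary of unique tokens (for example, words) from the entire set of documents.

(2) Construct a feature vector for each document; the vector holds the number of times each vocabulary word occurs in that document.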
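
The two steps above can be sketched in plain Python before turning to a library. This is a minimal illustration, not how sklearn implements it: the function names are hypothetical, and the tokenizer (lowercase, then split on runs of word characters) is an assumption chosen for simplicity.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def build_vocabulary(docs):
    """Step 1: map every unique token across all documents to an index."""
    unique_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(unique_tokens)}


def to_count_vector(doc, vocabulary):
    """Step 2: count how often each vocabulary token occurs in one document."""
    counts = Counter(tokenize(doc))
    ordered_tokens = sorted(vocabulary, key=vocabulary.get)
    return [counts[tok] for tok in ordered_tokens]


docs = ["The sun is shining", "The weather is sweet"]
vocabulary = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)  # token -> column index
print(vectors)     # one count vector per document
```

Each vector has one entry per vocabulary word, so most entries are zero once the vocabulary grows; that is why such vectors are usually called sparse.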

【Bag-of-words with sklearn】

Transforming words into feature vectors

The CountVectorizer class from sklearn builds the bag-of-words model:

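# bag-of-words
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
# Three short example documents.
doc = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
# Learn the vocabulary and turn each document into a sparse count vector.
bag = count.fit_transform(doc)
# Show the learned word -> column index mapping.
print(count.vocabulary_)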

CountVectorizer stores the vocabulary in a Python dictionary, vocabulary_, which maps each unique word to an integer index.
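
To see what those integer indices refer to, it can help to print the count vectors themselves. The sketch below repeats the three example sentences so that it runs on its own; the variable name `words_by_index` is introduced here for illustration. Each index in vocabulary_ is a column position in the matrix returned by fit_transform, and with default settings CountVectorizer lowercases the text and ignores single-character tokens.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two'])

count = CountVectorizer()
bag = count.fit_transform(docs)

# List the words in column order, i.e. sorted by their vocabulary_ index.
words_by_index = sorted(count.vocabulary_, key=count.vocabulary_.get)
print(words_by_index)

# fit_transform returns a sparse matrix; toarray() gives a dense view.
# Row i is document i, column j is the count of words_by_index[j].
print(bag.toarray())
```

For example, 'is' has index 1, and the second column of the third row is 3 because 'is' appears three times in the last sentence.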