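Implementation approach
1. Crawl the stock comments and the corresponding market data.
2. Segment the comments into words with the jieba library.
3. Extract features from the segmented text.
4. Using the already-segmented positive and negative corpora, evaluate SVM, KNN, logistic regression, decision tree, naive Bayes, random forest, AdaBoost, and other classifiers.
5. SVM gives the best result, so train an SVM on the prepared samples to obtain the model.
6. Use the model to classify each comment as positive or negative.
7. From the classification of all comments, compute the index BI = ln((1 + pos) / (1 + neg)) to express the effect of the comments on each day, where pos and neg are the numbers of positive and negative comments on that day.
8. Scale the BI index up by a factor of 10 to evaluate the effect of sentiment on the stock.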
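Steps 7 and 8 can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `bullish_index` and `daily_bi`, the `(date, label)` input format, and the label convention (1 = positive, 0 = negative) are hypothetical, and the factor of 10 is applied here as a plain multiplication, which may differ from how the original code scales the index.

```python
import math
from collections import defaultdict


def bullish_index(pos, neg):
    """BI = ln((1 + pos) / (1 + neg)); 0 when both counts are equal."""
    return math.log((1 + pos) / (1 + neg))


def daily_bi(labeled_comments, scale=10):
    """Aggregate classified comments per day and return the scaled BI.

    labeled_comments: iterable of (date, label) pairs, where label is
    1 for a positive comment and 0 for a negative one (assumed convention).
    scale: multiplier applied to BI (step 8 uses 10).
    """
    counts = defaultdict(lambda: [0, 0])  # date -> [pos, neg]
    for date, label in labeled_comments:
        if label == 1:
            counts[date][0] += 1
        else:
            counts[date][1] += 1
    return {
        date: scale * bullish_index(pos, neg)
        for date, (pos, neg) in sorted(counts.items())
    }
```

A day with more positive than negative comments yields a positive value, a day dominated by negative comments yields a negative one, and the `1 +` terms keep the logarithm defined when one of the counts is zero.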
import os
from time import time
import pandas as pd
import numpy as np
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.utils.extmath import density
from sklearn import svm
from sklearn import naive_bayes
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.utils import shuffle
np.random.seed(42)
comment_file = './data/stock_comments_seg.csv'
data_path = './data'
pos_corpus = 'positive.txt'
neg_corpus = 'negative.txt'
K_Best_Features = 3000
def load_dataset():
    pos_file = os.path.join(data_path, pos_corpus)
    neg_file = os.path.join(data_path, neg_corpus)
    pos_sents = []
    with open(pos_file, 'r', encoding=
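Steps 3 to 6 of the approach might look roughly like the sketch below. It is not the author's code: the toy sentences are made-up data, `build_pipeline` and `candidates` are hypothetical names, and the corpora are assumed to be already segmented into space-separated tokens. The sketch also uses `k="all"` because the toy vocabulary is far smaller than the `K_Best_Features = 3000` setting in the listing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical, already-segmented toy corpus (tokens separated by spaces).
pos_sents = ["股价 大涨 看好", "业绩 增长 买入", "利好 消息 上涨", "强势 反弹 看多"] * 5
neg_sents = ["股价 大跌 看空", "业绩 下滑 卖出", "利空 消息 下跌", "弱势 跳水 割肉"] * 5
texts = pos_sents + neg_sents
labels = [1] * len(pos_sents) + [0] * len(neg_sents)  # 1 = positive, 0 = negative


def build_pipeline(clf, k="all"):
    """TF-IDF features, chi-squared feature selection, then the classifier."""
    return make_pipeline(
        # Treat every whitespace-separated token as one term.
        TfidfVectorizer(token_pattern=r"(?u)\S+"),
        SelectKBest(chi2, k=k),
        clf,
    )


# A subset of the classifiers named in step 4.
candidates = {
    "svm": SVC(kernel="linear"),
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Step 4: compare the candidates with 5-fold cross-validated accuracy.
scores = {
    name: cross_val_score(build_pipeline(clf), texts, labels, cv=5).mean()
    for name, clf in candidates.items()
}

# Steps 5 and 6: fit the chosen model on all samples and classify a new comment.
model = build_pipeline(SVC(kernel="linear")).fit(texts, labels)
prediction = model.predict(["利好 消息 大涨"])[0]
```

Wrapping the vectorizer and feature selector in the pipeline means they are refit inside each cross-validation fold, so the scores are not inflated by vocabulary learned from the held-out comments.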
