Text classification automatically assigns a document to one of a set of predefined category labels based on its content or topic. It is the foundation of many applications: spam filtering, public-opinion monitoring, sentiment analysis, automatic news categorization, organizing the knowledge base of a customer-service chatbot, and so on. This post walks through the complete text-classification workflow on the labeled Sogou news corpus, using the scikit-learn machine-learning library for Python. The code for this post is available on GitHub.
1. Corpus Preprocessing
First, load all the Python libraries we will need:
import os
import shutil
import re
import jieba  # Chinese word segmentation
import numpy as np
from wordcloud import WordCloud  # word-cloud visualization
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import classification_report, confusion_matrix
Next, give the Sogou news labels readable names. A label such as C000008 is a subdirectory of the corpus, and the news category each label corresponds to can be found online. For readability, we define a mapping dictionary (shown below, after the extraction sketch) that keeps the original numeric code in each category name. Search online for the 搜狗分类新闻.20061127.zip corpus, download it, and extract it into the CN_Corpus directory; after extraction the directory structure is:
CN_Corpus
└─SogouC.reduced
└─Reduced
├─C000008
├─C000010
├─C000013
├─C000014
├─C000016
├─C000020
├─C000022
├─C000023
└─C000024
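If you prefer to script the extraction, here is a minimal sketch using Python's standard zipfile module. The archive path and the archive's internal layout are assumptions; adjust them to match your download.
import zipfile

# Assumed local path to the downloaded archive (adjust as needed).
archive_path = '搜狗分类新闻.20061127.zip'

# Extract into CN_Corpus; the archive is assumed to contain the
# SogouC.reduced/Reduced folder hierarchy shown above.
with zipfile.ZipFile(archive_path) as zf:
    zf.extractall('CN_Corpus')
With the corpus in place, define the label-to-category mapping: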
category_labels = {
    'C000008': '_08_Finance',
    'C000010': '_10_IT',
    'C000013': '_13_Health',
    'C000014': '_14_Sports',
    'C000016': '_16_Travel',
    'C000020': '_20_Education',
    'C000022': '_22_Recruit',
    'C000023': '_23_Culture',
    'C000024': '_24_Military'
}
Now split the corpus, using the first 80% of each category's documents as training data and the remaining 20% as test data. After the split, the data directory looks like this:
data
├─test
│ ├─_08_Finance
│ ├─_10_IT
│ ├─_13_Health
│ ├─_14_Sports
│ ├─_16_Travel
│ ├─_20_Education
│ ├─_22_Recruit
│ ├─_23_Culture
│ └─_24_Military
└─train
├─_08_Finance
├─_10_IT
├─_13_Health
├─_14_Sports
├─_16_Travel
├─_20_Education
├─_22_Recruit
├─_23_Culture
└─_24_Military
def split_corpus():
    # original data directory
    original_dataset_dir = './CN_Corpus/SogouC.reduced/Reduced'
    base_dir = 'data/'
    if os.path.exists(base_dir):
        print('`data` seems to exist already.')
        return
    # make new folders
    os.mkdir(base_dir)
    train_dir = os.path.join(base_dir, 'train')
    os.mkdir(train_dir)
    test_dir = os.path.join(base_dir, 'test')
    os.mkdir(test_dir)
    # split corpus
    for cate in os.listdir(original_dataset_dir):
        cate_dir = os.path.join(original_dataset_dir, cate)
        file_list = os.listdir(cate_dir)
        print("cate: {}, len: {}".format(cate, len(file_list)))
        # train data: the first 1500 files of each category
        fnames = file_list[:1500]
        dst_dir = os.path.join(train_dir, category_labels[cate])
        os.mkdir(dst_dir)
        for fname in fnames:
            shutil.copy(os.path.join(cate_dir, fname),
                        os.path.join(dst_dir, fname))
        # test data: the remaining files
        fnames = file_list[1500:]
        dst_dir = os.path.join(test_dir, category_labels[cate])
        os.mkdir(dst_dir)
        for fname in fnames:
            shutil.copy(os.path.join(cate_dir, fname),
                        os.path.join(dst_dir, fname))
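Run the split once; the guard at the top of the function makes it safe to re-run, since it returns early if data/ already exists:
split_corpus()
As it copies, the function prints each category name and its file count, which is a quick sanity check that all nine categories were picked up.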