While working through a spam-classification exercise with sklearn, I ran the following code:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
news = fetch_20newsgroups(subset='all')  # 'all' loads the full dataset; 'train'/'test' load a single split
print(len(news.data))       # total number of articles
print(news.target_names)    # label names, 20 classes in total
print(len(news.target))     # one integer label per article
Execution then stalls at the following download message:
Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)
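Before patching anything, it can help to confirm where scikit-learn will cache the archive, since a manually downloaded copy can later be dropped there. A minimal check using the public get_data_home helper (the printed path varies per machine; ~/scikit_learn_data is only the default):

from sklearn.datasets import get_data_home
print(get_data_home())  # cache directory for downloaded datasets, defaults to ~/scikit_learn_data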
Solution:
Locate and edit the following file under your Python installation directory:
<drive letter>:\Program Files (x86)\python\Lib\site-packages\sklearn\datasets\twenty_newsgroups.py
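If your sklearn lives elsewhere (a virtualenv, conda, another drive), you can locate the file from Python instead of guessing the path. A small sketch; the twenty_newsgroups module name is an assumption that holds for the older sklearn releases this fix targets (the pasted file below still imports sklearn.externals.six):

import sklearn.datasets.twenty_newsgroups as tng
print(tng.__file__)  # absolute path of the twenty_newsgroups.py to replace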
Then replace the entire contents of that file with the following:
"""Caching loader for the 20 newsgroups text classification dataset
The description of the dataset is available on the official website at:
http://people.csail.mit.edu/jrennie/20Newsgroups/
Quoting the introduction:
The 20 Newsgroups data set is a collection of approximately 20,000
newsgroup documents, partitioned (nearly) evenly across 20 different
newsgroups. To the best of my knowledge, it was originally collected
by Ken Lang, probably for his Newsweeder: Learning to filter netnews
paper, though he does not explicitly mention this collection. The 20
newsgroups collection has become a popular data set for experiments
in text applications of machine learning techniques, such as text
classification and text clustering.
This dataset loader will download the recommended "by date" variant of the
dataset, which features a point-in-time split between the train and
test sets. The compressed archive is around 14 MB; once uncompressed,
the train set is 52 MB and the test set is 34 MB.
The data is downloaded, extracted and cached in the '~/scikit_learn_data'
folder.
The `fetch_20newsgroups` function will not vectorize the data into numpy
arrays but the dataset lists the filenames of the posts and their categories
as target labels.
The `fetch_20newsgroups_vectorized` function will in addition do a simple
tf-idf vectorization step.
"""
# Copyright (c) 2011 Olivier Grisel <olivier.grisel@ensta.org>
# License: BSD 3 clause
import os
import logging
import tarfile
import pickle
import shutil
import re
import codecs
import numpy as np
import scipy.sparse as sp
from .base import get_data_home
from .base import Bunch
from .base import load_files
from .base import _pkl_filepath
from ..utils import check_random_state
from ..feature_extraction.text import CountVectorizer
from ..preprocessing import normalize
from ..externals import joblib, six
if six.PY3:
    from urllib.request import urlopen
else:
    from urllib2 import urlopen
logger = logging.getLogger(__name__)
URL