Fixing the dataset download failure when practicing spam classification with sklearn

When practicing spam classification with sklearn, running the following code:

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

news = fetch_20newsgroups(subset='all')  # 'all' loads the full dataset (both the train and test splits)
print(len(news.data))       # total number of posts
print(news.target_names)    # label names, 20 categories in total
print(len(news.target))

produces the following messages, and the dataset cannot be downloaded:

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)

 

Solution:

In the Python installation directory, locate the file twenty_newsgroups.py under the following path and open it for editing (if your installation lives elsewhere, see the snippet below for finding the path from the interpreter):

<drive>:\Program Files (x86)\python\Lib\site-packages\sklearn\datasets\twenty_newsgroups.py
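
If Python is not installed at the path above, the datasets directory can be printed from Python itself (a quick check, nothing more):

import os
import sklearn.datasets

# Print the directory that contains twenty_newsgroups.py
print(os.path.dirname(sklearn.datasets.__file__))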

Then replace the entire contents of that file with the following:


"""Caching loader for the 20 newsgroups text classification dataset
 
 
The description of the dataset is available on the official website at:
 
   http://people.csail.mit.edu/jrennie/20Newsgroups/
 
Quoting the introduction:
 
   The 20 Newsgroups data set is a collection of approximately 20,000
   newsgroup documents, partitioned (nearly) evenly across 20 different
   newsgroups. To the best of my knowledge, it was originally collected
   by Ken Lang, probably for his Newsweeder: Learning to filter netnews
   paper, though he does not explicitly mention this collection. The 20
   newsgroups collection has become a popular data set for experiments
   in text applications of machine learning techniques, such as text
   classification and text clustering.
 
This dataset loader will download the recommended "by date" variant of the
dataset, which features a point-in-time split between the train and
test sets. The compressed dataset size is around 14 MB. Once
uncompressed the train set is 52 MB and the test set is 34 MB.
 
The data is downloaded, extracted and cached in the '~/scikit_learn_data'
folder.
 
The `fetch_20newsgroups` function will not vectorize the data into numpy
arrays but the dataset lists the filenames of the posts and their categories
as target labels.
 
The `fetch_20newsgroups_vectorized` function will in addition do a simple
tf-idf vectorization step.
 
"""
# Copyright (c) 2011 Olivier Grisel <olivier.grisel@ensta.org>
# License: BSD 3 clause
 
import os
import logging
import tarfile
import pickle
import shutil
import re
import codecs
 
import numpy as np
import scipy.sparse as sp
 
from .base import get_data_home
from .base import Bunch
from .base import load_files
from .base import _pkl_filepath
from ..utils import check_random_state
from ..feature_extraction.text import CountVectorizer
from ..preprocessing import normalize
from ..externals import joblib, six
 
if six.PY3:
    from urllib.request import urlopen
else:
    from urllib2 import urlopen
 
 
logger = logging.getLogger(__name__)
 
 
URL = ("https://ndownloader.figshare.com/files/5975967")
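
The usual form of this fix changes only the download helper: instead of fetching the archive from figshare, it unpacks a copy of 20news-bydate.tar.gz that you have downloaded by hand. Below is a minimal sketch of such a helper, modeled on the sklearn 0.19-era download_20newsgroups with the urlopen call removed; the constant names follow that module, and the convention that the archive was placed in the target directory beforehand is an assumption:

ARCHIVE_NAME = "20news-bydate.tar.gz"
CACHE_NAME = "20news-bydate.pkl"
TRAIN_FOLDER = "20news-bydate-train"
TEST_FOLDER = "20news-bydate-test"


def download_20newsgroups(target_dir, cache_dir):
    """Build the 20 newsgroups cache from a manually downloaded archive."""
    archive_path = os.path.join(target_dir, ARCHIVE_NAME)
    train_path = os.path.join(target_dir, TRAIN_FOLDER)
    test_path = os.path.join(target_dir, TEST_FOLDER)

    if not os.path.exists(target_dir):
        os.makedirs(target_dir)

    # Assumption: 20news-bydate.tar.gz was downloaded by hand from the URL
    # above and copied to archive_path, so no network access happens here.
    logger.debug("Decompressing %s", archive_path)
    tarfile.open(archive_path, "r:gz").extractall(path=target_dir)
    os.remove(archive_path)

    # Cache the parsed train/test folders as a zipped pickle so that later
    # calls to fetch_20newsgroups read the cache instead of the raw files.
    cache = dict(train=load_files(train_path, encoding='latin1'),
                 test=load_files(test_path, encoding='latin1'))
    compressed_content = codecs.encode(pickle.dumps(cache), 'zlib_codec')
    cache_path = _pkl_filepath(cache_dir, CACHE_NAME)
    with open(cache_path, 'wb') as f:
        f.write(compressed_content)

    shutil.rmtree(target_dir)
    return cache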
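After saving the modified file, download 20news-bydate.tar.gz manually from https://ndownloader.figshare.com/files/5975967 (a browser or any download tool will do), place it where the helper expects it (by default under the ~/scikit_learn_data cache folder mentioned in the docstring above), and re-run the snippet at the top. fetch_20newsgroups(subset='all') then builds its cache from the local archive: len(news.data) prints 18846 posts and news.target_names lists the 20 category names.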