While working through a spam-classification exercise with sklearn, I ran the following code:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
news = fetch_20newsgroups(subset='all')  # 'all' loads the full dataset; 'train'/'test' load a single split
print(len(news.data))       # total number of articles
print(news.target_names)    # label names, 20 classes in total
print(len(news.target))     # one integer label per article
Execution then stalls at the following download message:
Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)
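Before patching anything, it can help to confirm where scikit-learn will cache the archive, since a manually downloaded copy can later be dropped there. A minimal check using the public get_data_home helper (the printed path varies per machine; ~/scikit_learn_data is only the default):

from sklearn.datasets import get_data_home
print(get_data_home())  # cache directory for downloaded datasets, defaults to ~/scikit_learn_data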
Solution:
Locate and edit the following file under your Python installation directory:
<drive letter>:\Program Files (x86)\python\Lib\site-packages\sklearn\datasets\twenty_newsgroups.py
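If your sklearn lives elsewhere (a virtualenv, conda, another drive), you can locate the file from Python instead of guessing the path. A small sketch; the twenty_newsgroups module name is an assumption that holds for the older sklearn releases this fix targets (the pasted file below still imports sklearn.externals.six):

import sklearn.datasets.twenty_newsgroups as tng
print(tng.__file__)  # absolute path of the twenty_newsgroups.py to replace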
Then replace the entire contents of that file with the following:
"""Caching loader for the 20 newsgroups text classification dataset
The description of the dataset is available on the official website at:
http://people.csail.mit.edu/jrennie/20Newsgroups/
Quoting the introduction:
The 20 Newsgroups data set is a collection of approximately 20,000
newsgroup documents, partitioned (nearly) evenly across 20 different
newsgroups. To the best of my knowledge, it was originally collected
by Ken Lang, probably for his Newsweeder: Learning to filter netnews
paper, though he does not explicitly mention this collection. The 20
newsgroups collection has become a popular data set for experiments
in text applications of machine learning techniques, such as text
classification and text clustering.
This dataset loader will download the recommended "by date" variant of the
dataset, which features a point-in-time split between the train and
test sets. The compressed archive is around 14 MB; once uncompressed,
the train set is 52 MB and the test set is 34 MB.
The data is downloaded, extracted and cached in the '~/scikit_learn_data'
folder.
The `fetch_20newsgroups` function will not vectorize the data into numpy
arrays but the dataset lists the filenames of the posts and their categories
as target labels.
The `fetch_20newsgroups_vectorized` function will in addition do a simple
tf-idf vectorization step.
"""
# Copyright (c) 2011 Olivier Grisel <olivier.grisel@ensta.org>
# License: BSD 3 clause
import os
import logging
import tarfile
import pickle
import shutil
import re
import codecs
import numpy as np
import scipy.sparse as sp
from .base import get_data_home
from .base import Bunch
from .base import load_files
from .base import _pkl_filepath
from ..utils import check_random_state
from ..feature_extraction.text import CountVectorizer
from ..preprocessing import normalize
from ..externals import joblib, six
if six.PY3:
    from urllib.request import urlopen
else:
    from urllib2 import urlopen
logger = logging.getLogger(__name__)
URL