Newspaper: A Python Library for Scraping News Sites

Latest recommended article published on 2024-11-19 15:49:39

软件质量保障 (Software Quality Assurance)

License: CC 4.0 BY-SA

Tags: web scraping, python, programming languages

Article link: https://blog.youkuaiyun.com/csd11311/article/details/125972207

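newspaper is a Python library dedicated to scraping and analyzing news content. It simplifies the scraping workflow so that newcomers can get started quickly. The examples cover getting news URLs, extracting categories, downloading and parsing articles, text analysis with nlp, and multi-threaded downloads, plus a few extras such as fetching trending search terms and popular news sources.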

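newspaper is a Python scraping framework aimed mainly at extracting and analyzing news content, which makes it a good fit for news sites. It is simple to learn and friendly even to beginners with no scraping background. For basic use you also do not have to deal with HTTP headers, IP proxies, page parsing, or the structure of the page source yourself.

We use https://www.wired.com/ as the example site throughout.

Building a news source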

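import newspaper
from newspaper import Article
from newspaper import fulltext

url = 'https://www.wired.com/'
# Build a Source for the site; memoize_articles=False disables the article cache
paper = newspaper.build(url, language="en", memoize_articles=False)
print(paper)

Printing the source object gives: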

<newspaper.source.Source object at 0x7fe82c98c1d0>

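By default, newspaper caches every article it has previously extracted and leaves out any article it has already seen. Pass memoize_articles=False to opt out of this behaviour.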
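To make the caching behaviour concrete, here is a minimal stand-alone sketch of the idea: remember which URLs were seen on earlier runs and return only the new ones. It only illustrates the concept. The names `filter_unseen` and `seen` and the example.com URLs are hypothetical, and newspaper's own cache is not implemented this way.

```python
def filter_unseen(urls, seen):
    """Return the URLs not yet in `seen`, in order, and record them as seen."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


# Hypothetical URLs; `seen` plays the role of the cache kept between runs.
seen = set()
first_run = filter_unseen(["https://example.com/a", "https://example.com/b"], seen)
second_run = filter_unseen(["https://example.com/b", "https://example.com/c"], seen)
print(first_run)   # both URLs are new
print(second_run)  # only the URL that was not seen on the first run
```

With memoization switched off, every run would behave like the first one and return the full list again.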

Extracting news URLs

Extract the article URLs found on the site's pages:

import newspaper
from newspaper import Article
from newspaper import fulltext
url = 'https://www.wired.com/'
paper = newspaper.build(url, language="en", memoize_articles=False)
for article in paper.articles:
    print(article.url)

This prints the URL of each article found on the site, one per line.
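The list of article URLs usually mixes several kinds of pages, so a small filter can be handy. The sketch below keeps only unique URLs whose path starts with a given prefix such as `/story/` (the prefix used by the Wired article later in this post). `urls_under` is a hypothetical helper, and the sample objects are offline stand-ins with made-up URLs; with newspaper you would pass `paper.articles` instead.

```python
from types import SimpleNamespace
from urllib.parse import urlparse


def urls_under(articles, path_prefix):
    """Return unique article URLs whose path starts with `path_prefix`."""
    seen = set()
    result = []
    for article in articles:
        url = article.url
        if urlparse(url).path.startswith(path_prefix) and url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Stand-in objects so the sketch runs without network access.
sample = [
    SimpleNamespace(url="https://www.wired.com/story/example-one/"),
    SimpleNamespace(url="https://www.wired.com/gallery/example-two/"),
    SimpleNamespace(url="https://www.wired.com/story/example-one/"),
]
print(urls_under(sample, "/story/"))
```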

Extracting news categories

newspaper can also extract the category pages of a site:

for category in paper.category_urls():
    print(category)
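The category URLs are plain strings, so turning them into short section labels takes only the standard library. This is a rough sketch: `section_name` is a hypothetical helper and the sample URLs are invented for illustration, not actual output from Wired.

```python
from urllib.parse import urlparse


def section_name(category_url):
    """Return the last path segment, or the host name when the path is empty."""
    parsed = urlparse(category_url)
    segments = [part for part in parsed.path.split("/") if part]
    return segments[-1] if segments else parsed.netloc


# Invented examples of what category URLs might look like.
sample_categories = [
    "https://www.wired.com",
    "https://www.wired.com/category/science",
    "https://www.wired.com/category/business/",
]
for category_url in sample_categories:
    print(section_name(category_url))
```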

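Extracting article content: Article

An Article object is an abstraction of a single news article. For example, the news Source would be Wired, while an Article is one Wired story under that site. From an Article you can extract the title, authors, images, body text, and more.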

from newspaper import Article

article = Article('https://www.wired.com/story/preterm-babies-lonely-terror-of-a-pandemic-nicu/')
article.download()  # fetch the HTML
article.parse()     # extract title, authors, text, and so on
print("title=", article.title)
print("authors=", article.authors)
print("publish_date=", article.publish_date)
print("top_image=", article.top_image)
print("movies=", article.movies)
print("text=", article.text)
# summary stays empty until article.nlp() has been called
print("summary=", article.summary)
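If the extracted fields need to be stored rather than printed, they can be copied into a plain dictionary. The following is a minimal sketch, assuming `publish_date` is either a `datetime` or `None`. `article_to_record` is a hypothetical helper, and the `fake` object is an offline stand-in for a parsed Article with invented values.

```python
import json
from datetime import datetime
from types import SimpleNamespace


def article_to_record(article):
    """Copy commonly used Article fields into a JSON-friendly dict."""
    published = article.publish_date
    return {
        "url": article.url,
        "title": article.title,
        "authors": list(article.authors),
        # Assumption: publish_date is a datetime, or None when no date was found.
        "publish_date": published.isoformat() if published else None,
        "top_image": article.top_image,
        "text": article.text,
    }


# Stand-in for a parsed Article so the sketch runs offline.
fake = SimpleNamespace(
    url="https://example.com/story",
    title="Example title",
    authors=["A. Writer"],
    publish_date=datetime(2020, 5, 1),
    top_image="https://example.com/img.jpg",
    text="Body text.",
)
record = article_to_record(fake)
print(json.dumps(record, indent=2))
```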

Downloading and parsing

Taking one of the articles collected from the source as an example:

first_article = paper.articles[0]
first_article.download()
first_article.parse()
print(first_article.title)
print(first_article.publish_date)
print(first_article.authors)
print(first_article.top_image)
# summary is empty here because nlp() has not been called yet
print(first_article.summary)
print(first_article.movies)
print(first_article.text)
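Indexing `paper.articles[0]` directly fails when the list is empty, and an individual download or parse can fail too. The sketch below walks the list until one article succeeds. `first_parsed` and `FakeArticle` are hypothetical names; the broad `except Exception` is a simplification, and narrowing it to the library's own exception type would be preferable in real code.

```python
def first_parsed(articles, max_tries=5):
    """Return the first article that downloads and parses, else None."""
    for article in articles[:max_tries]:
        try:
            article.download()
            article.parse()
        except Exception as exc:  # simplification: catch the library's own error type instead
            print("skipping", getattr(article, "url", "?"), "->", exc)
            continue
        return article
    return None


class FakeArticle:
    """Offline stand-in that mimics download() and parse()."""

    def __init__(self, url, ok):
        self.url, self.ok, self.title = url, ok, None

    def download(self):
        if not self.ok:
            raise RuntimeError("download failed")

    def parse(self):
        self.title = "parsed " + self.url


candidates = [
    FakeArticle("https://example.com/a", ok=False),
    FakeArticle("https://example.com/b", ok=True),
]
found = first_parsed(candidates)
print(found.title)
```

With a real source the call would be `first_parsed(paper.articles)`.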

Parsing HTML

You can also fetch an article's HTML yourself with the requests library and let newspaper extract the text from it:

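import requests
from newspaper import fulltext

html = requests.get('https://www.wired.com/story/preterm-babies-lonely-terror-of-a-pandemic-nicu/').text
print('raw HTML -->', html)
# fulltext() extracts the article body from an HTML string
text = fulltext(html, language='en')
print('extracted text -->', text)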

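Combining with NLP

Calling nlp() on an article that has already been downloaded and parsed extracts natural-language features from its text, namely a summary and a list of keywords. The method depends on NLTK tokenizer data, which may need to be installed first.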

second_article = paper.articles[1]
second_article.download()
second_article.parse()
# nlp() must come after download() and parse()
second_article.nlp()
print(second_article.summary)
print(second_article.keywords)
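For intuition about what a keyword list is, here is a rough stand-alone sketch of the general idea behind simple keyword extractors: drop common stopwords and count what is left. It is an illustration only, not newspaper's actual scoring. `top_keywords`, the tiny `STOPWORDS` set, and the sample text are all assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "for", "on"}


def top_keywords(text, limit=3):
    """Return the `limit` most frequent non-stopword words in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return [word for word, _ in counts.most_common(limit)]


sample = (
    "The hospital expanded the neonatal unit. Nurses in the neonatal unit "
    "care for babies, and the babies stay in the unit for weeks."
)
print(top_keywords(sample))
```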

Multi-threaded downloads

When you need to collect news from several sources, you can download them in parallel with multiple threads:

import newspaper
from newspaper import news_pool

lr_paper = newspaper.build('https://lifehacker.com/', language="en")
wd_paper = newspaper.build('https://www.wired.com/', language="en")
ct_paper = newspaper.build('https://www.cnet.com/news/', language="en")
papers = [lr_paper, wd_paper, ct_paper]
# 3 sources * 2 threads per source = 6 threads in total
news_pool.set(papers, threads_per_source=2)
news_pool.join()
# After join(), the articles have been downloaded, so .html is available
print(lr_paper.articles[0].html)
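The pool only downloads the articles, so each one still needs a parse() call before its title or text can be read. The following is a minimal sketch of that follow-up step. `parse_downloaded` and its `per_source` limit are hypothetical, and the fake objects are offline stand-ins; with newspaper you would pass the `papers` list after `news_pool.join()`.

```python
from types import SimpleNamespace


def parse_downloaded(papers, per_source=3):
    """Parse up to `per_source` downloaded articles per source.

    Returns a dict mapping each source URL to the list of parsed titles.
    """
    titles = {}
    for paper in papers:
        parsed = []
        for article in paper.articles[:per_source]:
            if not article.html:  # download failed or returned nothing
                continue
            article.parse()
            parsed.append(article.title)
        titles[paper.url] = parsed
    return titles


class FakeArticle:
    """Offline stand-in for a downloaded article."""

    def __init__(self, html, title):
        self.html, self._title, self.title = html, title, ""

    def parse(self):
        self.title = self._title


fake_paper = SimpleNamespace(
    url="https://example.com/",
    articles=[FakeArticle("<html></html>", "First"), FakeArticle("", "Never parsed")],
)
result = parse_downloaded([fake_paper])
print(result)
```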

Other features

hot() returns a list of the terms currently trending on Google.

popular_urls() returns a list of popular news source URLs.

import newspaper
print(newspaper.hot())
print(newspaper.popular_urls())
