Newspaper3k: A Detailed Guide to the Python Library for News Article Scraping and Content Extraction

Project: newspaper (news, full-text, and article metadata extraction in Python 3). Repository: https://gitcode.com/gh_mirrors/ne/newspaper

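Overview

Newspaper3k is a Python 3 library for scraping news sites and extracting article content. Its API is inspired by the simplicity of the requests library, and it uses lxml for fast parsing. It automatically identifies the structure of a news article, extracts key fields such as the title, body text, authors, and publication date, and can run basic natural language processing (NLP) on the result.

Core Features

Newspaper3k offers the following features for extracting news content:

Article content extraction: automatically identifies and extracts the article body, dropping ads, navigation, and other unrelated content
Metadata extraction: retrieves authors, publication date, images, videos, and similar information
Natural language processing: keyword extraction and automatic summarization
Multi-language support: works with more than 10 languages, including Chinese
Site-wide crawling: can build a crawler for an entire news site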

Quick Start

Installation

pip3 install newspaper3k

Note: Python 3 users must install newspaper3k, not newspaper.

Basic Usage Example

from newspaper import Article

# Create the article object
url = 'http://example.com/news-article.html'
article = Article(url)

# Download the page, then parse it
article.download()
article.parse()

# Read the extracted fields
print("Title:", article.title)
print("Authors:", article.authors)
print("Publish date:", article.publish_date)
print("Text:", article.text[:200])  # print the first 200 characters
print("Top image:", article.top_image)
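
When the extracted fields need to be stored or passed on rather than printed, it can help to gather them into a plain dictionary. The following is a minimal sketch under stated assumptions: `article_to_dict` and `preview_chars` are hypothetical names introduced here, not part of Newspaper3k, and the `SimpleNamespace` object is only a stand-in so the sketch runs without network access. It assumes `publish_date` is either a `datetime` or `None`.

```python
from datetime import datetime
from types import SimpleNamespace


def article_to_dict(article, preview_chars=200):
    """Collect the fields of an article-like object into a plain dict.

    Hypothetical helper: it only reads the attribute names used in the
    basic usage example (title, authors, publish_date, text, top_image).
    """
    date = getattr(article, "publish_date", None)
    text = getattr(article, "text", "") or ""
    return {
        "title": getattr(article, "title", "") or "",
        "authors": list(getattr(article, "authors", []) or []),
        # The date may be missing when none could be detected on the page.
        "publish_date": date.isoformat() if isinstance(date, datetime) else None,
        "preview": text[:preview_chars],
        "top_image": getattr(article, "top_image", "") or "",
    }


# Stand-in for a parsed article so the sketch runs offline.
sample = SimpleNamespace(
    title="Example headline",
    authors=["A. Writer"],
    publish_date=datetime(2024, 1, 2),
    text="Body text " * 50,
    top_image="http://example.com/img.png",
)
print(article_to_dict(sample, preview_chars=20))
```

With a real article, the same helper would be called after `article.parse()`, for example `article_to_dict(article)`.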

Natural Language Processing

# Run NLP analysis (call download() and parse() first)
article.nlp()

print("Keywords:", article.keywords)
print("Summary:", article.summary)
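
To give a feel for what keyword extraction produces, here is a toy, frequency-based sketch that uses only the standard library. It is an illustration, not Newspaper3k's implementation: `top_keywords` and the tiny `STOPWORDS` set are made up for this example, and it only handles plain English text.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are far longer).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for"}


def top_keywords(text, n=3):
    """Return the n most frequent non-stopword words in text (toy example)."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    # most_common() orders by count; ties keep first-seen order.
    return [word for word, _ in counts.most_common(n)]


text = "The storm hit the coast. The storm closed schools and the coast road."
print(top_keywords(text, n=2))
```

A real extractor also has to deal with word forms, language-specific stopwords, and scoring, which is why it usually makes sense to rely on `article.nlp()` rather than a hand-rolled counter.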

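Advanced Features

Site-Wide Crawling

Newspaper3k can build a source object for an entire news site and list its articles and categories: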

import newspaper

# Build the news source
cnn_paper = newspaper.build('http://cnn.com')

# Print every article URL
for article in cnn_paper.articles:
    print(article.url)

# Print every category URL
for category in cnn_paper.category_urls():
    print(category)
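
Listing article URLs does not fetch the article bodies; each article still has to be downloaded and parsed individually, as in the basic usage example. A hedged sketch of such a loop follows. `fetch_articles`, `limit`, and `delay` are hypothetical names introduced here, and the function only assumes that each object has a `url` attribute plus `download()` and `parse()` methods.

```python
import time


def fetch_articles(articles, limit=5, delay=1.0):
    """Download and parse up to `limit` article-like objects, skipping failures.

    Hypothetical helper, not part of Newspaper3k.
    """
    parsed = []
    for article in list(articles)[:limit]:
        try:
            article.download()
            article.parse()
        except Exception as exc:  # broad on purpose: network and parse errors alike
            print(f"skipped {article.url}: {exc}")
            continue
        parsed.append(article)
        time.sleep(delay)  # pause between requests to avoid hammering the site
    return parsed
```

With the source built above, a call might look like `fetch_articles(cnn_paper.articles, limit=10)`. Capping the number of articles and pausing between requests keeps a first crawl small and polite; both values are arbitrary starting points.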

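Multi-Language Support

Newspaper3k supports multiple languages, including Chinese:

from newspaper import Article

# Processing a Chinese article
url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_city_news.shtml'
article = Article(url, language='zh')  # specify Chinese

article.download()
article.parse()

print(article.title)  # print the Chinese title
print(article.text[:150])  # print the first 150 Chinese characters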

Installation Notes

Ubuntu/Debian

sudo apt-get install python3-pip python3-dev libxml2-dev libxslt-dev libjpeg-dev zlib1g-dev libpng-dev
pip3 install newspaper3k
curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3

macOS

brew install libxml2 libxslt libtiff libjpeg webp little-cms2
pip3 install newspaper3k
curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3