Python Third-Party Libraries | newspaper Tutorial | Hands-On newspaper3k | A Complete Guide to Scraping and Processing News Articles with Python's newspaper Library

H-大叔

Last modified: 2024-11-19 15:52:35

Categories: Python Crawler Handbook, Python Third-Party Libraries. Tags: python, newspaper library, newspaper tutorial, newspaper3k

First published: 2024-11-19 15:49:39

Article link: https://blog.youkuaiyun.com/HRG520JN/article/details/143886618

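I. Introduction

newspaper is a popular open-source Python library for scraping news articles from websites. It offers a simple and effective way to extract article text, images, author information, and other metadata, and it supports multiple languages.

Main features:

Article extraction: extracts the full article from a given URL, including the title, body text, authors, and publish date.
Multi-language support: although designed primarily for English, it can also extract articles in many other languages.
Image extraction: collects the URLs of the images in an article; saving the files locally takes a separate HTTP client such as requests (see Section IV).
Video extraction: collects the URLs of videos embedded in an article.
Keyword and summary generation: automatically generates a keyword list and a summary for an article.
Configurable user agent: the User-Agent header is set through a Config object; rotating between several user agents has to be done in your own code.
Parallel processing: article processing can be parallelized with Python threads or processes to speed up scraping.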

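II. Basic usage of the `newspaper` library

1. Installing the `newspaper` library

newspaper can be installed directly with pip:

pip install newspaper3k

Note: pip install newspaper installs the Python 2.x version. On Python 3.x, be sure to install newspaper3k.

In practice, also check that the lxml and lxml-html-clean packages are installed, and install them separately if they are missing.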

pip install lxml
pip install lxml-html-clean

2. Basic article extraction

The following simple example shows how to use newspaper to extract a news article from a web page:

from newspaper import Article

url = 'https://example.com/some-news-article'
article = Article(url)

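# Download and parse the article
article.download()
article.parse()

# Print the article title and authors
print('Title:', article.title)
print('Authors:', article.authors)

# Extract keywords and a summary
# (nlp() relies on NLTK tokenizer data such as 'punkt' being installed)
article.nlp()
print('Keywords:', article.keywords)
print('Summary:', article.summary)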

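III. Advanced usage

For more complex scenarios, such as batch-processing multiple URLs or customizing crawler behavior, newspaper provides further configuration options and API methods, for example setting a different user agent, adjusting the download timeout, or customizing how the HTML is parsed.

1. Batch-processing multiple URLs

Storing URLs in a list

First, store the URLs to process in a list, then loop over the list and process each URL in turn.

Example code

The following example shows how to batch-process multiple URLs: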

from newspaper import Article

# Define a list containing several URLs
urls = [
    'https://example.com/article1',
    'https://example.com/article2',
    'https://example.com/article3',
]

def process_article(url):
    # Create the Article object
    article = Article(url)

    # Download and parse the article
    article.download()
    article.parse()

    # Extract keywords and a summary
    article.nlp()

    # Return the article's details
    return {
        'title': article.title,
        'authors': article.authors,
        'publish_date': article.publish_date,
        'text': article.text,
        'keywords': article.keywords,
        'summary': article.summary,
    }

# Collect the results
results = []

for url in urls:
    try:
        result = process_article(url)
        results.append(result)
        print(f"Processed {url}")
    except Exception as e:
        print(f"Failed to process {url}: {e}")

# Print all results
for result in results:
    print(result)

2. Customizing crawler behavior

2.1 Setting the user agent

To reduce the risk of being blocked by the target site, you can set a different User-Agent header.

from newspaper import Config, Article

url = 'https://example.com/some-news-article'

# Build a custom Config object
config = Config()
config.browser_user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'

# Create the Article with the custom Config object
article = Article(url, config=config)

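2.2 Adjusting the download timeout

When the network connection is unstable, you can set a longer download timeout.

config.request_timeout = 30  # Set the request timeout to 30 seconds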

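2.3 Custom parsing rules

If the default parsing rules do not meet your needs, you can pre-process the HTML yourself before parsing.

from lxml_html_clean import clean_html  # provided by the lxml-html-clean package

# Custom HTML cleaning function
def custom_clean_html(html):
    # Add your own HTML cleaning logic here
    cleaned_html = clean_html(html)
    return cleaned_html

# Download first so that article.html is populated, then parse the cleaned HTML
article = Article(url, config=config)
article.download()
article.set_html(custom_clean_html(article.html))
article.parse()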

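2.4 Parallel processing

To improve throughput, you can use multiple threads or processes to handle several URLs in parallel.

Using multiple threads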

import threading
from newspaper import Article

def process_article(url):
    article = Article(url)
    article.download()
    article.parse()
    article.nlp()
    print(f"Processed {url}")

# Create the list of threads
threads = []

for url in urls:
    thread = threading.Thread(target=process_article, args=(url,))
    threads.append(thread)
    thread.start()

# Wait for all threads to finish
for thread in threads:
    thread.join()
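
The loop above starts one thread per URL, which can grow without bound for a long list. A minimal sketch of a bounded alternative follows, using the standard library's ThreadPoolExecutor. The helper name `process_all` and the stand-in `fake_worker` are hypothetical and exist only for illustration; in practice the worker would be a function like `process_article` that returns the extracted data.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_all(urls, worker, max_workers=4):
    """Run worker(url) for every URL using at most max_workers threads.

    Returns two dicts keyed by URL: successful results and raised exceptions.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the URL it is processing
        futures = {executor.submit(worker, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Keep the failure so the caller can retry or log it later
                errors[url] = exc
    return results, errors

# Stand-in worker (hypothetical) so the sketch runs without network access
def fake_worker(url):
    if url.endswith('bad'):
        raise ValueError('cannot process')
    return url.upper()

results, errors = process_all(
    ['https://example.com/a', 'https://example.com/bad'],
    fake_worker,
)
print(results)
print(errors)
```

Collecting failures per URL, instead of printing inside the worker, keeps one bad article from hiding the results of the others.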

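Using multiple processes

import multiprocessing
from newspaper import Article

def process_article(url):
    article = Article(url)
    article.download()
    article.parse()
    article.nlp()
    print(f"Processed {url}")

if __name__ == '__main__':
    # The guard is needed on platforms that spawn worker processes (e.g. Windows)
    # Create a pool of worker processes
    with multiprocessing.Pool(processes=4) as pool:
        pool.map(process_article, urls)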

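As shown above, newspaper handles single URLs and, with a little extra code, batches of URLs, and it provides configuration options for customizing crawler behavior.

IV. newspaper in practice

1. Downloading images and videos

Example code: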

import os
import requests
from newspaper import Article

url = 'https://example.com/some-news-article'
article = Article(url)

# Download and parse the article
article.download()
article.parse()

# Make sure the output directories exist
os.makedirs('images', exist_ok=True)
os.makedirs('videos', exist_ok=True)

# Download images
for image_url in article.images:
    response = requests.get(image_url)
    if response.status_code == 200:
        with open(f'images/{image_url.split("/")[-1]}', 'wb') as f:
            f.write(response.content)

# Download videos (article.movies often holds embed URLs, which may not point at a media file)
for video_url in article.movies:
    response = requests.get(video_url)
    if response.status_code == 200:
        with open(f'videos/{video_url.split("/")[-1]}', 'wb') as f:
            f.write(response.content)

print("Finished downloading images and videos.")
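
Taking the last path segment with `split("/")[-1]` keeps any query string in the file name and yields an empty name for URLs that end in a slash. A small sketch of a more careful approach follows, using only the standard library. The function name `filename_from_url` and the fallback name `download.bin` are assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlparse, unquote

def filename_from_url(url, default='download.bin'):
    """Derive a local file name from a URL, ignoring the query string."""
    # urlparse separates the path from '?query' and '#fragment'
    path = urlparse(url).path
    # Decode percent-escapes, then keep only the last path segment
    name = posixpath.basename(unquote(path))
    # Fall back to a fixed name when the URL has no file component
    return name or default

print(filename_from_url('https://example.com/img/photo.jpg?w=800&h=600'))
print(filename_from_url('https://example.com/a/my%20pic.png'))
print(filename_from_url('https://example.com/'))
```

The result can be joined with the target directory, for example `os.path.join('images', filename_from_url(image_url))`.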

2. Keyword and summary generation

Example code:

from newspaper import Article

url = 'https://example.com/some-news-article'
article = Article(url)

# Download and parse the article, then run the NLP step
article.download()
article.parse()
article.nlp()

# Print the keywords and summary
print('Keywords:', article.keywords)
print('Summary:', article.summary)

3. Handling articles in other languages

Example code:

from newspaper import Article

url = 'https://example.com/some-news-article-in-spanish'
article = Article(url, language='es')  # Set the language to Spanish

# Download and parse the article
article.download()
article.parse()
article.nlp()

# Print the article details
print('Title:', article.title)
print('Authors:', article.authors)
print('Publish Date:', article.publish_date)
print('Text:', article.text)
print('Keywords:', article.keywords)
print('Summary:', article.summary)

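4. Saving articles to a database

Example code:

Suppose we use an SQLite database to store the article information.

import sqlite3
from newspaper import Article

# Connect to the SQLite database
conn = sqlite3.connect('articles.db')
cursor = conn.cursor()

# Create the table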
cursor.execute('''
CREATE TABLE IF NOT EXISTS articles (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT,
    authors TEXT,
    publish_date TEXT,
    text TEXT,
    keywords TEXT,
    summary TEXT
)
''')

def save_article_to_db(article):
    cursor.execute('''
    INSERT INTO articles (title, authors, publish_date, text, keywords, summary)
    VALUES (?, ?, ?, ?, ?, ?)
    ''', (
        article.title,
        ', '.join(article.authors),
        str(article.publish_date),
        article.text,
        ', '.join(article.keywords),
        article.summary
    ))
    conn.commit()

url = 'https://example.com/some-news-article'
article = Article(url)

# Download and parse the article
article.download()
article.parse()
article.nlp()

# Save the article to the database
save_article_to_db(article)

# Close the database connection
conn.close()

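5. Processing RSS feeds

Example code:

newspaper's Article class does not consume an RSS feed directly, but it can be combined with the feedparser library.

import feedparser
from newspaper import Article

# Parse the RSS feed
feed = feedparser.parse('https://example.com/rss.xml')

# Loop over every entry in the feed
for entry in feed.entries:
    url = entry.link
    article = Article(url)

    # Download and parse the article
    article.download()
    article.parse()
    article.nlp()

    # Print the article details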
    print('Title:', article.title)
    print('Authors:', article.authors)
    print('Publish Date:', article.publish_date)
    print('Text:', article.text)
    print('Keywords:', article.keywords)
    print('Summary:', article.summary)
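
The only thing the loop above needs from the feed is the list of entry links. For a plain RSS 2.0 document, that step can also be sketched with the standard library alone. This is a swapped-in technique for illustration, not how feedparser works internally; the function name `rss_item_links` and the sample feed are assumptions, and real feeds (Atom, namespaced elements) are better left to feedparser.

```python
import xml.etree.ElementTree as ET

# Tiny inline RSS 2.0 document so the sketch runs without network access
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>https://example.com/</link>
    <item><title>First</title><link>https://example.com/article1</link></item>
    <item><title>Second</title><link>https://example.com/article2</link></item>
    <item><title>No link</title></item>
  </channel>
</rss>"""

def rss_item_links(xml_text):
    """Return the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    links = []
    for item in root.iter('item'):
        link = item.findtext('link')
        # Skip items that have no usable link
        if link and link.strip():
            links.append(link.strip())
    return links

links = rss_item_links(SAMPLE_RSS)
print(links)
```

Each returned link can then be passed to `Article(url)` exactly as in the loop above.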

6. Refreshing articles on a schedule

Example code:

Use the schedule library to run the scraping job at regular intervals.

import schedule
import time
from newspaper import Article

def fetch_and_process_articles():
    urls = [
        'https://example.com/article1',
        'https://example.com/article2',
        'https://example.com/article3',
    ]

    for url in urls:
        article = Article(url)
        article.download()
        article.parse()
        article.nlp()
        
        print('Title:', article.title)
        print('Authors:', article.authors)
        print('Publish Date:', article.publish_date)
        print('Text:', article.text)
        print('Keywords:', article.keywords)
        print('Summary:', article.summary)

# Run every day at 08:00
schedule.every().day.at("08:00").do(fetch_and_process_articles)

while True:
    schedule.run_pending()
    time.sleep(1)

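The examples above show how powerful and flexible newspaper is when working with news articles. Whether you need to process a single URL, handle multiple languages, download images and videos, run batch jobs, or refresh articles on a schedule, newspaper offers a concise and efficient solution.

V. Common problems and solutions

When using newspaper, keep the following common problems and caveats in mind so that your application runs reliably and efficiently:

1. Network request failures

Problem:

The target site may be unreachable because of network problems or server maintenance.
Requests may time out or return an error status code.

Solution:

Set a request timeout.
Catch exceptions and retry.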

from newspaper import Article
from newspaper.article import ArticleException

url = 'https://example.com/some-news-article'
article = Article(url)

try:
    article.download()
    article.parse()
    article.nlp()
except ArticleException as e:
    # newspaper3k records download errors internally and raises ArticleException from parse()
    print(f"Request failed: {e}")
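
The "catch exceptions and retry" advice can be sketched as a small generic helper with exponential backoff. Everything here is illustrative: the name `retry`, its parameters, and the `flaky` stand-in function are hypothetical, and the `sleep` argument exists only so the demo can run without actually waiting.

```python
import time

def retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or the attempts run out.

    Waits base_delay * 2**n seconds after the n-th failed attempt.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:
            last_error = exc
            # Do not sleep after the final attempt
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error

# Stand-in for a download that fails twice and then succeeds
calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError('temporary failure')
    return 'ok'

delays = []  # record the requested delays instead of sleeping
result = retry(flaky, attempts=3, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

With newspaper, `func` would typically be a small function that creates an Article, calls `download()` and `parse()`, and returns the article, so that a failed attempt starts from a fresh object.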

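2. Blocked user agent

Problem:

The target site may detect a large number of requests from the same IP address and block that IP.

Solution:

Use different user agents.
Leave a reasonable interval between requests.

from newspaper import Config, Article

url = 'https://example.com/some-news-article'

config = Config()
config.browser_user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
config.request_timeout = 30  # Set the request timeout to 30 seconds

article = Article(url, config=config)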
article.download()
article.parse()
article.nlp()
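
Using different user agents means choosing one per request in your own code. A minimal sketch of round-robin selection follows; the helper name `next_user_agent` is hypothetical and the user-agent strings in the list are illustrative placeholders, not recommendations.

```python
import itertools

# Illustrative User-Agent strings; replace them with values suited to your crawler
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0',
]

# cycle() repeats the list endlessly in order
_ua_cycle = itertools.cycle(USER_AGENTS)

def next_user_agent():
    """Return the next User-Agent string in round-robin order."""
    return next(_ua_cycle)

picked = [next_user_agent() for _ in range(4)]
for user_agent in picked:
    print(user_agent)
```

Before creating each Article, the chosen value would be assigned with `config.browser_user_agent = next_user_agent()`.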

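3. Inaccurate article parsing

Problem:

The default parsing rules may not fully fit the structure of some websites.

Solution:

Customize the parsing rules.
Parse the page manually with a tool such as BeautifulSoup.

from newspaper import Article
from bs4 import BeautifulSoup
import requests

url = 'https://example.com/some-news-article'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Extract the article content manually (the tag and class names below depend on the target site)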
title = soup.find('h1').get_text()
text = soup.find('div', class_='article-body').get_text()

print('Title:', title)
print('Text:', text)

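4. Large-scale scraping

Problem:

Large-scale scraping can put heavy load on the target site's servers and may get your crawler blocked.

Solution:

Keep the request rate reasonable.
Use a pool of proxy IPs to spread out the request sources.
Respect the site's robots.txt rules.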

import time
from newspaper import Article

urls = [
    'https://example.com/article1',
    'https://example.com/article2',
    'https://example.com/article3',
]

for url in urls:
    article = Article(url)
    article.download()
    article.parse()
    article.nlp()
    
    print('Title:', article.title)
    print('Text:', article.text)
    
    time.sleep(1)  # Pause for 1 second after each request
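
Respecting robots.txt can be checked with the standard library's urllib.robotparser before any article is downloaded. The sketch below parses an inline robots.txt so that it runs offline; the rules, the bot name `my-news-bot`, and the helper `build_robots_checker` are assumptions for illustration. For a live site, the parser's `set_url()` and `read()` methods would fetch the real file instead.

```python
from urllib.robotparser import RobotFileParser

# Inline robots.txt content (illustrative) so the sketch runs offline
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

def build_robots_checker(robots_text, user_agent='*'):
    """Return a function that tells whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed

allowed = build_robots_checker(ROBOTS_TXT, user_agent='my-news-bot')

urls = [
    'https://example.com/article1',
    'https://example.com/private/draft',
]

# Keep only the URLs the rules permit
to_fetch = [url for url in urls if allowed(url)]
print(to_fetch)
```

Filtering the URL list first keeps the download loop itself unchanged.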

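5. Image and video download failures

Problem:

An image or video URL may be invalid or access-restricted.

Solution:

Catch exceptions raised during the download.
Check that the URL is valid.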

import requests
from newspaper import Article

url = 'https://example.com/some-news-article'
article = Article(url)
article.download()
article.parse()

for image_url in article.images:
    try:
        response = requests.get(image_url)
        if response.status_code == 200:
            with open(f'images/{image_url.split("/")[-1]}', 'wb') as f:
                f.write(response.content)
    except requests.exceptions.RequestException as e:
        print(f"Failed to download {image_url}: {e}")

for video_url in article.movies:
    try:
        response = requests.get(video_url)
        if response.status_code == 200:
            with open(f'videos/{video_url.split("/")[-1]}', 'wb') as f:
                f.write(response.content)
    except requests.exceptions.RequestException as e:
        print(f"Failed to download {video_url}: {e}")

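6. Inaccurate keywords and summaries

Problem:

The automatically generated keywords and summary may not be accurate enough.

Solution:

Use other natural language processing tools (such as NLTK or spaCy) for finer-grained processing.
Adjust newspaper's NLP-related settings.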

from newspaper import Article

url = 'https://example.com/some-news-article'
article = Article(url)
article.download()
article.parse()
article.nlp()

print('Keywords:', article.keywords)
print('Summary:', article.summary)
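
When the built-in keywords are not good enough, `article.text` can be post-processed with any other method. As a deliberately simple stand-in for a real NLP toolkit, the sketch below ranks words by frequency after dropping a few stop words. The function name `top_keywords`, the tiny stop-word list, and the sample text are assumptions for illustration; this is not newspaper's own keyword algorithm.

```python
import re
from collections import Counter

# A very small English stop-word list, for illustration only
STOPWORDS = {'the', 'a', 'an', 'and', 'of', 'to', 'in', 'on', 'is', 'are', 'for', 'with'}

def top_keywords(text, n=5, stopwords=STOPWORDS):
    """Return the n most frequent words that are not stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    # Ignore stop words and very short tokens
    counts = Counter(w for w in words if w not in stopwords and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]

sample = ("The council approved the budget. The budget funds schools, "
          "and schools welcomed the budget.")
keywords = top_keywords(sample, n=2)
print(keywords)  # ['budget', 'schools']
```

In a real pipeline, `sample` would be replaced by `article.text`, and the frequency count by a tool such as NLTK or spaCy.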

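7. Storing and managing large amounts of data

Problem:

Storing and managing the data can become complicated at large volumes.

Solution:

Store the data in a database (such as MySQL, PostgreSQL, or MongoDB); the example below uses SQLite for simplicity.
Back up the data regularly.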

import sqlite3
from newspaper import Article

conn = sqlite3.connect('articles.db')
cursor = conn.cursor()

cursor.execute('''
CREATE TABLE IF NOT EXISTS articles (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT,
    authors TEXT,
    publish_date TEXT,
    text TEXT,
    keywords TEXT,
    summary TEXT
)
''')

def save_article_to_db(article):
    cursor.execute('''
    INSERT INTO articles (title, authors, publish_date, text, keywords, summary)
    VALUES (?, ?, ?, ?, ?, ?)
    ''', (
        article.title,
        ', '.join(article.authors),
        str(article.publish_date),
        article.text,
        ', '.join(article.keywords),
        article.summary
    ))
    conn.commit()

url = 'https://example.com/some-news-article'
article = Article(url)
article.download()
article.parse()
article.nlp()

save_article_to_db(article)

conn.close()
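
When the same article is scraped repeatedly, for instance by a scheduled job, the table above accumulates duplicate rows. One possible remedy, sketched below, is to store the article URL in a UNIQUE column and insert with INSERT OR IGNORE. The `url` column and the helper `save_once` are assumptions that are not part of the schema above, and the in-memory database is used only so the sketch leaves no file behind.

```python
import sqlite3

# In-memory database for the sketch; use a file path in real code
conn = sqlite3.connect(':memory:')
conn.execute('''
CREATE TABLE IF NOT EXISTS articles (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT UNIQUE,   -- assumed extra column used for de-duplication
    title TEXT
)
''')

def save_once(conn, url, title):
    """Insert the article unless its URL is already stored.

    Returns True if a new row was written, False if it was a duplicate.
    """
    cursor = conn.execute(
        'INSERT OR IGNORE INTO articles (url, title) VALUES (?, ?)',
        (url, title),
    )
    conn.commit()
    return cursor.rowcount == 1

first = save_once(conn, 'https://example.com/article1', 'First title')
second = save_once(conn, 'https://example.com/article1', 'First title')
count = conn.execute('SELECT COUNT(*) FROM articles').fetchone()[0]
print(first, second, count)

conn.close()
```

The return value makes it easy to log how many new articles each run actually stored.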