Web Scraper Development Example & Project Source Code

Latest recommended article published on 2024-07-08 17:14:04

清风明月9987321

Tags: web scraping

Article link: https://blog.youkuaiyun.com/qq_36585997/article/details/140219075

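Below is a simple web scraper written in Python, together with the project source code. The project shows how to use the requests and BeautifulSoup libraries to download a web page and extract specific information from it.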

Project Overview

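The project shows how to write a simple web scraper that collects news headlines and their links. We use a news site (BBC News, in this example) and scrape the headlines and links from its front page.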

Directory Structure

news_scraper/
├── scraper.py
├── requirements.txt
└── README.md

1. Preparation

Install dependencies

Create a requirements.txt file with the following content:

requests
beautifulsoup4

Install the dependencies with the following command:

pip install -r requirements.txt
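
After installing, it can help to confirm that both packages are visible to the interpreter that will run the scraper. The following is a minimal sketch using the standard library's importlib.metadata (Python 3.8 or later); the helper names installed_version and check_requirements are made up for this illustration and are not part of the project.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


def check_requirements(package_names=("requests", "beautifulsoup4")):
    """Map each distribution name from requirements.txt to its version (or None)."""
    return {name: installed_version(name) for name in package_names}


if __name__ == "__main__":
    # Prints one line per dependency, flagging any that pip did not install.
    for name, found in check_requirements().items():
        print(f"{name}: {found or 'NOT INSTALLED'}")
```

Note that the distribution is called beautifulsoup4 even though the import name is bs4, which is why the check uses the name from requirements.txt.
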

2. Scraper Code

scraper.py

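from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.bbc.com"


def fetch_news(url):
    """Return the page HTML as text, or None if the request fails."""
    # Browser-like User-Agent; some sites reject the default one sent by requests.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    try:
        # A timeout keeps the script from hanging on an unresponsive server.
        response = requests.get(url, headers=headers, timeout=10)
    except requests.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    if response.status_code == 200:
        return response.text
    print(f"Failed to retrieve content: {response.status_code}")
    return None


def parse_news(html_content):
    """Extract headline titles and links from the page HTML."""
    soup = BeautifulSoup(html_content, 'html.parser')
    news_list = []
    # This class name depends on the site's markup and may need updating.
    for item in soup.find_all('a', class_='gs-c-promo-heading'):
        title = item.get_text(strip=True)
        link = item.get('href')
        if not title or not link:
            continue  # skip anchors with no text or no href
        # Turn relative links such as "/news/..." into absolute URLs.
        news_list.append({'title': title, 'link': urljoin(BASE_URL, link)})
    return news_list


def save_news(news_list, filename='news.txt'):
    """Write each title and link to a text file, separated by blank lines."""
    # Explicit UTF-8 avoids encoding errors where the platform default differs.
    with open(filename, 'w', encoding='utf-8') as file:
        for news in news_list:
            file.write(f"{news['title']}\n{news['link']}\n\n")


def main():
    url = f"{BASE_URL}/news"
    html_content = fetch_news(url)
    if html_content:
        news_list = parse_news(html_content)
        save_news(news_list)
        print(f"Saved {len(news_list)} news articles to news.txt")


if __name__ == "__main__":
    main()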

3. Code Walkthrough

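fetch_news(url):
- Sends an HTTP GET request with the requests library to download the page.
- Sets a User-Agent header that mimics a browser, which reduces the chance that the target site rejects the request.
- Returns the page's HTML as text, or None if the response status is not 200.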
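
Because fetch_news signals failure by returning None, a caller can retry transient failures without changing the function itself. The sketch below is one possible extension, not part of the original project: fetch_with_retries and its parameters are hypothetical names, and the stand-in flaky_fetch replaces a real network call so the example runs offline.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay_seconds=1.0):
    """Call fetch(url) until it returns something other than None.

    `fetch` is any callable with the same contract as fetch_news:
    HTML text on success, None on failure.
    """
    for attempt in range(1, attempts + 1):
        html = fetch(url)
        if html is not None:
            return html
        if attempt < attempts:
            # Wait a little longer after each failed attempt.
            time.sleep(delay_seconds * attempt)
    return None


# Demonstration with a stand-in fetcher that fails twice, then succeeds.
# No network access is involved.
calls = []


def flaky_fetch(url):
    calls.append(url)
    return "<html>ok</html>" if len(calls) >= 3 else None


result = fetch_with_retries(flaky_fetch, "https://www.bbc.com/news", delay_seconds=0)
```

In the real script the call would look like fetch_with_retries(fetch_news, url), with a non-zero delay so the site is not hit in a tight loop.
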
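parse_news(html_content):
- Parses the HTML with BeautifulSoup.
- Finds every news headline and link, assuming each one sits in an <a> tag with the class gs-c-promo-heading. This class name depends on the site's markup, so the selector must be updated if the page layout changes.
- Builds a list of dictionaries, each holding a title and a link.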
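
To see what this parsing step produces without contacting the site, the same logic can be run on a small inline HTML fragment. This is a minimal sketch: SAMPLE_HTML is invented for the illustration (it is not real BBC markup), and extract_headlines is a hypothetical stand-in for parse_news that resolves links with urljoin from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Invented markup that mimics the structure the scraper assumes.
SAMPLE_HTML = """
<div>
  <a class="gs-c-promo-heading" href="/news/world-123"><h3>Sample headline one</h3></a>
  <a class="gs-c-promo-heading" href="https://example.com/story"><h3>Sample headline two</h3></a>
  <a class="other-link" href="/about">About</a>
</div>
"""


def extract_headlines(html, base_url="https://www.bbc.com"):
    """Return a list of {'title', 'link'} dicts for anchors with the headline class."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for anchor in soup.find_all("a", class_="gs-c-promo-heading"):
        href = anchor.get("href")
        if not href:
            continue  # an anchor without href has no link to record
        results.append({
            "title": anchor.get_text(strip=True),
            # urljoin leaves absolute URLs alone and resolves relative ones.
            "link": urljoin(base_url, href),
        })
    return results


headlines = extract_headlines(SAMPLE_HTML)
```

The anchor with class other-link is ignored, the relative link gains the site prefix, and the absolute link is kept unchanged.
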
save_news(news_list, filename='news.txt'):
- Writes the headlines and links to a text file, one title and one link per entry, with a blank line between entries.
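
If the results are meant for a spreadsheet or further processing, a structured format may be more convenient than plain text. The sketch below shows one alternative using the standard library's csv module; save_news_csv is a hypothetical variant, not a function from the project, and the sample record is invented.

```python
import csv
import os
import tempfile


def save_news_csv(news_list, filename="news.csv"):
    """Write one row per article with 'title' and 'link' columns."""
    # newline="" lets the csv module control line endings;
    # utf-8 keeps non-ASCII headlines intact on every platform.
    with open(filename, "w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=["title", "link"])
        writer.writeheader()
        writer.writerows(news_list)


# Round-trip demonstration in a temporary directory.
sample = [{"title": "Sample headline", "link": "https://www.bbc.com/news/sample"}]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "news.csv")
    save_news_csv(sample, path)
    with open(path, newline="", encoding="utf-8") as file:
        rows = list(csv.DictReader(file))
```

Because the list already holds dictionaries with title and link keys, it can be passed to DictWriter without any conversion.
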
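main():
- Defines the URL of the news site to scrape.
- Fetches and parses the page to extract the headlines and links.
- Saves the extracted headlines and links to a file.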
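
The URL and output file are hard-coded in main(). One way to make them configurable is a small command-line interface, sketched below with argparse. The option names --url and --output and the helpers build_parser and run are assumptions made for this example; run takes the three project functions as arguments so the flow can be exercised without network access.

```python
import argparse


def build_parser():
    """Command-line options; defaults mirror the values hard-coded in main()."""
    parser = argparse.ArgumentParser(description="Scrape news headlines and links.")
    parser.add_argument("--url", default="https://www.bbc.com/news", help="page to scrape")
    parser.add_argument("--output", default="news.txt", help="file to write results to")
    return parser


def run(args, fetch, parse, save):
    """Same flow as main(): fetch, parse, save. Returns the number of saved items."""
    html_content = fetch(args.url)
    if not html_content:
        return 0
    news_list = parse(html_content)
    save(news_list, args.output)
    return len(news_list)


# Parsing an explicit argument list keeps the demonstration independent of sys.argv.
args = build_parser().parse_args(["--output", "headlines.txt"])
```

In the actual script this would be called as run(build_parser().parse_args(), fetch_news, parse_news, save_news).
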