Python Web Scraping for Beginners: 7 Small Python Scraper Examples (with Source Code)

Latest recommended article published on 2025-05-08 15:10:51

小尤笔记

Article tags: python, web scraping, programming language, Python basics

Article link: https://blog.youkuaiyun.com/2301_78096295/article/details/144830108

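A Python web scraper (also called a crawler) is a program written in Python that automatically visits web pages and collects and extracts the data you need. Python is a popular choice for scraper development because it offers mature libraries for HTTP requests (such as requests) and HTML parsing (such as BeautifulSoup and lxml), along with support for asynchronous and concurrent execution (for example through the asyncio or threading modules).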
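Below is a simplified development workflow for a Python scraper:

Define the goal:
- Decide which website and what kind of data to scrape.
- Follow the site's robots.txt rules and the applicable laws and regulations.
Send requests:
- Use a library such as requests to send HTTP requests to the target site.
- Handle possible exceptions such as network errors and timeouts.
Parse the page:
- Use a library such as BeautifulSoup, lxml, or pyquery to parse the page's HTML.
- Extract the data you need, such as text, links, and images.
Process the data:
- Clean, transform, and store the extracted data.
- The pandas library can be used for data processing and analysis.
Store the data:
- Save the data to local files (for example in CSV or JSON format) or to a database.
Optimize and debug:
- Use logging (for example the logging module) to track how the program runs.
- Optimize the code for efficiency and reliability.
- Debug the code to resolve errors and problems as they appear.
Respect legal and ethical limits:
- Make sure your scraping complies with the target site's terms of service and with laws and regulations.
- Avoid putting excessive load on the target site or otherwise harming it.
Consider a framework:
- For more complex scraping tasks, consider a framework such as Scrapy to simplify development and management.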
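The robots.txt step in the workflow above can be checked in code before any page is requested. The following is a minimal sketch using the standard library's urllib.robotparser. The robots.txt content and the user agent name my-tutorial-bot are made up for illustration; a real scraper would load the file from the target site instead of parsing an in-memory string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the sketch runs offline.
# A real scraper would fetch https://<site>/robots.txt, for example with
# RobotFileParser.set_url() followed by read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "my-tutorial-bot"  # assumed name, not taken from the article

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url: str) -> bool:
    """Return True if the parsed rules allow USER_AGENT to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


print(is_allowed("https://www.example.com/index.html"))
print(is_allowed("https://www.example.com/private/data"))
# Crawl-delay, when present, suggests how long to wait between requests
print(parser.crawl_delay(USER_AGENT))
```

Calling is_allowed() before each request, and sleeping for the crawl delay when one is set, is one simple way to keep the load on the target site low.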

Below is a simple Python scraper example that fetches a page's title and all of its links:

import requests
from bs4 import BeautifulSoup

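# Target URL
url = 'https://www.example.com'

# Send an HTTP GET request (the timeout keeps the script from hanging indefinitely)
response = requests.get(url, timeout=10)

# Check whether the request succeeded
if response.status_code == 200:
    # Parse the page content
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract the title (soup.title is None when the page has no <title> tag)
    title = soup.title.string if soup.title else None
    print(f'Title: {title}')

    # Extract all links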
    links = [a.get('href') for a in soup.find_all('a', href=True)]
    print('Links:')
    for link in links:
        print(link)
else:
    print(f'Failed to retrieve the webpage. Status code: {response.status_code}')

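Note that this example is for teaching purposes only and is not suitable for production use. In real applications you need to consider more factors, such as handling dynamically loaded content (which may require a tool such as Selenium), dealing with anti-scraping measures (such as CAPTCHAs and IP blocking), and data cleaning and storage.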
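To make the data cleaning and storage point more concrete, here is a small sketch that uses only the standard library. The record layout (title and url keys), the sample records, and the file names pages.json and pages.csv are assumptions chosen for illustration; they do not come from the examples in this article.

```python
import csv
import json
import tempfile
from pathlib import Path


def clean_records(records):
    """Collapse whitespace in titles and drop records with empty or duplicate URLs."""
    seen = set()
    cleaned = []
    for record in records:
        url = (record.get("url") or "").strip()
        if not url or url in seen:
            continue
        seen.add(url)
        title = " ".join((record.get("title") or "").split())
        cleaned.append({"title": title, "url": url})
    return cleaned


def save_json(records, path):
    """Write the records as a JSON array, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def save_csv(records, path):
    """Write the records as CSV with a header row."""
    # newline="" is what the csv module documentation recommends for output files
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "url"])
        writer.writeheader()
        writer.writerows(records)


# Hypothetical scraped records with messy whitespace, a duplicate, and an empty URL
raw = [
    {"title": "  Example \n Domain ", "url": "https://www.example.com/"},
    {"title": "Example Domain", "url": "https://www.example.com/"},
    {"title": "No link", "url": ""},
]

records = clean_records(raw)
out_dir = Path(tempfile.mkdtemp())  # temporary directory, so nothing is overwritten
save_json(records, out_dir / "pages.json")
save_csv(records, out_dir / "pages.csv")
print(records)
```

For larger volumes or more involved cleaning, pandas or a database is usually a better fit than hand-written loops like these.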

Below are seven simple Python scraper examples, each with source code, to help you get started with Python web scraping.

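Case 1: Fetching a web page

Goal: fetch the HTML content of a web page and print it.

Tools: the requests library

import requests

url = 'https://www.example.com'
response = requests.get(url, timeout=10)  # the timeout avoids waiting forever on a stalled connection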

if response.status_code == 200:
    print(response.text)
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
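Case 1 does not handle the network errors and timeouts that the workflow mentions. One possible way to do so is to retry with a growing delay, sketched below. The helper name fetch_with_retries, its parameters, and the simulated flaky_fetch function are all made up for this illustration; with the requests library you would typically pass requests.exceptions.RequestException as retry_on.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fetch(url); if it raises one of retry_on, wait and try again.

    fetch is any callable that returns the page text or raises an exception.
    The delay doubles after each failed attempt (exponential backoff).
    The last failure is re-raised so the caller can decide what to do.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except retry_on as exc:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)


# Simulated fetch function that fails twice before succeeding (no network needed)
attempts = []


def flaky_fetch(url):
    attempts.append(url)
    if len(attempts) < 3:
        raise ConnectionError("simulated network error")
    return "<html>ok</html>"


# sleep is replaced with a no-op here so the demonstration finishes instantly
html = fetch_with_retries(flaky_fetch, "https://www.example.com", sleep=lambda seconds: None)
print(html)
```

Passing the fetch function in as a parameter keeps the retry logic independent of the HTTP library, which also makes it easy to exercise without a network connection.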

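Case 2: Parsing page content (with BeautifulSoup)

Goal: fetch a web page and parse specific content from it, such as the title.

Tools: the requests and BeautifulSoup libraries

import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com'
response = requests.get(url, timeout=10)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # soup.title is None when the page has no <title> tag
    title = soup.title.string if soup.title else None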
    print(f"Title: {title}")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

Case 3: Extracting all links from a page

Goal: fetch a web page and extract every link in it.

Tools: the requests and BeautifulSoup libraries

import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com'
response = requests.get(url, timeout=10)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    links = [a.get('href') for a in soup.find_all('a', href=True)]
    print(links)
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
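The href values collected in case 3 are often relative paths, page anchors, or non-HTTP links such as mailto:. The sketch below shows one way to turn them into a clean list of absolute URLs with urllib.parse from the standard library. The function name normalize_links, the base URL, and the sample href list are assumptions for illustration.

```python
from urllib.parse import urldefrag, urljoin


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, drop fragments and non-HTTP links,
    and remove duplicates while keeping the original order."""
    seen = set()
    result = []
    for href in hrefs:
        # urljoin resolves relative paths; urldefrag strips the "#..." part
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        if not absolute.startswith(("http://", "https://")):
            continue  # skip mailto:, javascript:, tel: and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical href values as they might appear in a page
hrefs = [
    "/about",
    "news/today.html",
    "https://www.iana.org/domains/example",
    "#top",
    "mailto:someone@example.com",
    "/about#team",
]

links = normalize_links("https://www.example.com/docs/index.html", hrefs)
for link in links:
    print(link)
```

The base URL should be the address of the page the links were found on, since relative paths such as news/today.html are resolved against that page's directory.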

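Case 4: Extracting image links from a page

Goal: fetch a web page and extract the links of all images in it.

Tools: the requests and BeautifulSoup libraries

import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com'
response = requests.get(url, timeout=10)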

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    image_links = [img.get('src') for img in soup.find_all('img', src=True)]
    print(image_links)
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

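Case 5: Extracting table data from a page

Goal: fetch a web page and extract the table data in it.

Tools: the requests and pandas libraries (pandas.read_html also needs an HTML parser such as lxml to be installed)

import io

import requests
import pandas as pd

url = 'https://www.example.com/table'
response = requests.get(url, timeout=10)

if response.status_code == 200:
    # Wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
    tables = pd.read_html(io.StringIO(response.text))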
    for i, table in enumerate(tables):
        print(f"Table {i+1}:\n{table}\n")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

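Case 6: Scraping paginated content

Goal: scrape the content of every page of a paginated site.

Tools: the requests and BeautifulSoup libraries

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.example.com/page/'
num_pages = 5  # assume there are 5 pages

for page_num in range(1, num_pages + 1):
    url = f"{base_url}{page_num}"
    response = requests.get(url, timeout=10)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Assume we want the <h2> headings; get_text() also works when a heading contains nested tags
        titles = [h.get_text(strip=True) for h in soup.find_all('h2')]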
        print(f"Page {page_num} Titles: {titles}")
    else:
        print(f"Failed to retrieve page {page_num}. Status code: {response.status_code}")
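Case 6 assumes the number of pages is known in advance and sends requests back to back. A possible variation is sketched below: it stops at the first missing page and pauses between requests to limit the load on the site. The function crawl_pages, its fetch_page parameter, and the FAKE_SITE dictionary standing in for a real website are hypothetical and exist only for this illustration.

```python
import time


def crawl_pages(fetch_page, base_url, max_pages=50, delay=1.0, sleep=time.sleep):
    """Fetch base_url + page number until a page is missing or max_pages is reached.

    fetch_page(url) is a caller-supplied function that returns the page's HTML,
    or None when the page does not exist (for example on an HTTP 404).
    Returns a dict mapping page numbers to HTML.
    """
    pages = {}
    for page_num in range(1, max_pages + 1):
        html = fetch_page(f"{base_url}{page_num}")
        if html is None:
            break  # no more pages
        pages[page_num] = html
        sleep(delay)  # pause between requests to stay polite
    return pages


# Stand-in for a real site: three pages, anything else is "not found"
FAKE_SITE = {
    f"https://www.example.com/page/{n}": f"<h2>Post {n}</h2>" for n in (1, 2, 3)
}

pages = crawl_pages(FAKE_SITE.get, "https://www.example.com/page/", delay=0)
print(sorted(pages))
```

With requests, fetch_page could be a small function that returns response.text on status 200 and None otherwise; max_pages acts as a safety limit in case the site never returns a missing page.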

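Case 7: Scraping a page with the Scrapy framework

Goal: scrape page content using the Scrapy framework.

Tools: the Scrapy framework

First, install Scrapy:

pip install scrapy

Then create a Scrapy project and generate a spider:

scrapy startproject myproject
cd myproject
scrapy genspider example example.com

Write the spider code in myproject/spiders/example.py: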

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com']

    def parse(self, response):
        title = response.css('title::text').get()
        yield {'title': title}

        # Assume we also want to extract every link on the page
        for link in response.css('a::attr(href)').getall():
            yield {'link': link}