Python Web Scraping Examples: From Basics to Practice

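Python is one of the most popular languages for web scraping, with a rich ecosystem of libraries and frameworks. The examples below show how to implement scrapers in Python, from basic to more advanced scenarios.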

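Basic Scraper: Fetching Static Pages

Use the requests and BeautifulSoup libraries to fetch and parse static pages. This approach suits sites that have no anti-scraping measures.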

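import requests
from bs4 import BeautifulSoup

url = 'https://books.toscrape.com/'
# Set a timeout so a stalled connection cannot hang the script,
# and fail fast on HTTP error statuses.
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# Each book on the page is an <article class="product_pod"> element.
for book in soup.select('article.product_pod'):
    title = book.h3.a['title']  # the full title is in the link's title attribute
    price = book.select_one('p.price_color').text
    print(f"Title: {title}, Price: {price}")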

Key points:

requests.get() fetches the page content
BeautifulSoup parses the HTML structure
CSS selectors locate the elements
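
The price extracted above is still a string, so it cannot be sorted or summed directly. As a minimal sketch, assuming price strings look like '£51.77' (a currency symbol followed by a decimal number), a helper along these lines could convert them. The name parse_price and the sample values are illustrative and not part of the original example.

```python
import re
from decimal import Decimal


def parse_price(text: str) -> Decimal:
    """Extract the first decimal number from a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no numeric price found in {text!r}")
    return Decimal(match.group())


# Sample strings standing in for values scraped from the page.
prices = [parse_price(p) for p in ["£51.77", "£53.74", "£50.10"]]
print(sum(prices))
```

Decimal is used instead of float so that currency amounts are not affected by binary rounding.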

Scraping Dynamic Content: Selenium in Practice

For pages rendered by JavaScript, use selenium to drive a real browser.

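from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get('https://quotes.toscrape.com/js/')

    # Wait up to 10 seconds for the JavaScript-rendered quotes to appear.
    quotes = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, 'quote'))
    )
    for quote in quotes:
        text = quote.find_element(By.CLASS_NAME, 'text').text
        author = quote.find_element(By.CLASS_NAME, 'author').text
        print(f"{author}: {text}")
finally:
    # Always close the browser, even if scraping raised an exception.
    driver.quit()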

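Technical notes:

A driver matching the browser is required (Selenium 4.6 and later can download it automatically through Selenium Manager)
Several element-location strategies are available (ID, class name, XPath, and others)
The browser instance must be closed explicitly with driver.quit()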

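Countering Anti-Scraping Measures: Request Headers and Proxies

A standard approach for dealing with basic anti-scraping mechanisms.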

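import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36',
    'Accept-Language': 'en-US,en;q=0.9'
}

# Placeholder values: replace user, pass, proxy_ip, and port with real ones.
# The 'https' key selects the proxy used for HTTPS targets; the proxy itself
# is usually reached over plain HTTP, hence the http:// scheme in both URLs.
proxies = {
    'http': 'http://user:pass@proxy_ip:port',
    'https': 'http://user:pass@proxy_ip:port'
}

response = requests.get('https://httpbin.org/headers',
                        headers=headers,
                        proxies=proxies,
                        timeout=10)
print(response.json())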

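Notes:

Rotate the User-Agent to reduce the chance of being blocked
Use high-anonymity proxies to hide the real IP address
Set a reasonable timeout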
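
The User-Agent rotation mentioned above can be sketched with the standard library alone. The pool below is a hypothetical sample and next_headers is an illustrative name; a real project would maintain its own list of current, realistic strings.

```python
import itertools

# Hypothetical pool of User-Agent strings (assumption, not from the original).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

# cycle() walks the pool in order and starts over when it reaches the end.
_ua_cycle = itertools.cycle(USER_AGENTS)


def next_headers() -> dict:
    """Return request headers carrying the next User-Agent in the pool."""
    return {
        "User-Agent": next(_ua_cycle),
        "Accept-Language": "en-US,en;q=0.9",
    }


for _ in range(4):
    print(next_headers()["User-Agent"])
```

Each request would then pass headers=next_headers() to requests.get(). Round-robin rotation is only one option; random.choice(USER_AGENTS) would work as well if a predictable order is undesirable.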

Data Storage Options

Persist the scraped results to different storage backends.

# CSV storage
import csv
# newline='' prevents blank rows on Windows; utf-8 keeps non-ASCII titles intact.
with open('books.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Price'])  # header row
    writer.writerows([('Book1', '$20'), ('Book2', '$15')])

# MongoDB storage (assumes a MongoDB server listening on localhost:27017)
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
db = client['scraped_data']
collection = db['books']
collection.insert_one({'title': 'Python Cookbook', 'price': '$40'})

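Choosing a storage backend:

Use CSV or JSON for small datasets
Consider MySQL for large volumes of structured data
MongoDB is a good fit for unstructured data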
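
JSON is recommended above for small datasets but is not shown in the storage example. The following is a minimal sketch using the JSON Lines layout (one object per line), which is one possible choice rather than the only one. The helper names, the sample records, and the temporary output path are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_json_lines(records, path):
    """Write one JSON object per line; ensure_ascii=False keeps non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_json_lines(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Sample records standing in for scraped data.
books = [{"title": "Book1", "price": "$20"}, {"title": "Book2", "price": "$15"}]
out_path = Path(tempfile.gettempdir()) / "books.jsonl"
save_json_lines(books, out_path)
print(load_json_lines(out_path))
```

Writing one record per line means a long-running scraper can also open the file in append mode and add results as they arrive, without rewriting the whole file.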

Advanced Technique: Asynchronous Scraping

Use aiohttp to improve scraping throughput.

import aiohttp
import asyncio

async def fetch(session, url):
    # Download one page and return its body as text.
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = [f'https://example.com/page/{i}' for i in range(1, 6)]
    # Share one session so connections are reused across requests.
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        # Run all requests concurrently; results keep the order of urls.
        results = await asyncio.gather(*tasks)
        for html in results:
            print(len(html))

asyncio.run(main())
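
Launching every request at once can overload the target site or trigger blocking. One common way to cap concurrency is asyncio.Semaphore. The sketch below replaces the network call with a simulated fake_fetch so it runs offline; MAX_CONCURRENCY, fake_fetch, bounded_fetch, and crawl are illustrative names, and in a real scraper the aiohttp-based fetch above would take the place of fake_fetch.

```python
import asyncio

MAX_CONCURRENCY = 2  # assumed limit; tune for the target site


async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request: wait briefly, then return dummy HTML."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def bounded_fetch(semaphore: asyncio.Semaphore, url: str) -> str:
    # At most MAX_CONCURRENCY coroutines can be inside this block at once.
    async with semaphore:
        return await fake_fetch(url)


async def crawl(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(semaphore, u) for u in urls))


urls = [f"https://example.com/page/{i}" for i in range(1, 6)]
pages = asyncio.run(crawl(urls))
print([len(p) for p in pages])
```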

Performance optimization: