Python Web Scraping Basics
A Python web scraper is an automated program that fetches data from the internet. Commonly used libraries include requests, BeautifulSoup, Scrapy, and selenium.
Install the basic libraries:
pip install requests beautifulsoup4 scrapy selenium
Example of sending an HTTP request:
import requests
response = requests.get('https://example.com')
print(response.text)
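In practice it is safer to set a timeout and check the status code before using the body; a minimal sketch (the URL is only a placeholder):
import requests

try:
    response = requests.get('https://example.com', timeout=10)
    response.raise_for_status()  # raise an exception for 4xx/5xx responses
    print(response.status_code, len(response.text))
except requests.RequestException as exc:
    print('request failed:', exc)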
Parsing HTML content:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
title = soup.find('title').text
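Beyond a single tag, find_all collects every match; for example, gathering all link targets on the page:
# collect the href attribute of every anchor tag on the page
links = [a.get('href') for a in soup.find_all('a', href=True)]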
Data Extraction Techniques
XPath selectors can be used to extract data precisely:
from lxml import html
tree = html.fromstring(response.content)
elements = tree.xpath('//div[@class="content"]/text()')
Regular expressions can match complex text patterns:
import re
emails = re.findall(r'[\w\.-]+@[\w\.-]+', response.text)
Handling Dynamic Content
For pages rendered by JavaScript, use selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://example.com')
dynamic_content = driver.find_element(By.ID, 'dynamic').text
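JavaScript content often appears only after the initial page load, so an explicit wait is usually more reliable than reading the element immediately; a sketch reusing the 'dynamic' id from above:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the element to be present before reading it
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'dynamic'))
)
print(element.text)
driver.quit()  # close the browser when done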
Data Storage
Save scraped data to a CSV file:
import csv

with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'content'])
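Each scraped record can then be appended as a row; a sketch continuing the snippet above, with hypothetical (title, content) tuples as placeholder data:
rows = [('Example Domain', 'This domain is for use in examples.')]  # placeholder data
with open('data.csv', 'a', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)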
Store data in an SQLite database:
import sqlite3
conn = sqlite3.connect('data.db')
cursor = conn.cursor()
cursor.execute('CREATE TABLE IF NOT EXISTS pages (url TEXT, content TEXT)')
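Inserting records and committing completes the round trip; parameterized queries also keep quoting safe. A sketch continuing the connection above, with placeholder values:
cursor.execute('INSERT INTO pages (url, content) VALUES (?, ?)',
               ('https://example.com', '<html>...</html>'))  # placeholder values
conn.commit()
conn.close()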
Dealing with Anti-Scraping Measures
Set request headers to imitate a browser:
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Accept-Language': 'en-US'
}
requests.get(url, headers=headers)
Use a proxy IP:
proxies = {'http': 'http://10.10.1.10:3128'}
requests.get(url, proxies=proxies)
Distributed Crawler Architecture
The Scrapy framework is commonly used as the basis for a distributed crawler:
import scrapy

class MySpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield {'title': response.css('title::text').get()}
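Inside parse, response.follow can queue further pages discovered on the current one, which is how a crawl fans out; a fragment meant to live inside the parse method above, assuming a hypothetical 'next' link selector:
# follow every "next page" link and parse it with the same callback
for href in response.css('a.next::attr(href)').getall():
    yield response.follow(href, callback=self.parse)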
Use Redis to implement a distributed task queue:
import redis
r = redis.Redis()
r.lpush('task_queue', 'http://example.com/page1')
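On the worker side, each crawler process pops URLs from the same queue, which is what spreads the workload across machines; a minimal sketch using the 'task_queue' key from above:
import redis
import requests

r = redis.Redis()
while True:
    _, url = r.brpop('task_queue')  # blocks until a URL is queued
    response = requests.get(url.decode('utf-8'), timeout=10)
    # ... parse and store response.text here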
Crawler Performance Optimization
Asynchronous requests improve efficiency:
import asyncio
import aiohttp

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()
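The real speedup comes from issuing many requests concurrently; a sketch driving the fetch coroutine above with asyncio.gather and placeholder URLs:
async def fetch_all(urls):
    # run all fetches concurrently instead of one after another
    return await asyncio.gather(*(fetch(u) for u in urls))

pages = asyncio.run(fetch_all(['https://example.com', 'https://example.org']))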
Memory management: stream items through an iterator and process them in batches instead of building the whole list in memory:
from itertools import islice

items = iter(range(1000000))  # an iterator avoids materializing every item at once
for batch in iter(lambda: list(islice(items, 1000)), []):
    process_batch(batch)  # handle 1,000 items per batch
Legal and Compliance Considerations
Respect the robots.txt protocol:
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()
can_fetch = rp.can_fetch('*', '/private/')
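robots.txt may also declare a Crawl-delay, which RobotFileParser exposes; a sketch continuing the rp object above:
delay = rp.crawl_delay('*')  # None when the site does not declare one
print('declared crawl delay:', delay)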
Set a crawl delay between requests:
import time
time.sleep(2)  # 2-second delay