Python Web Scraping Basics
A Python web scraper is an automated program that fetches data from the internet. Commonly used libraries include requests, BeautifulSoup, Scrapy, and selenium.
Install the basic libraries:
pip install requests beautifulsoup4 scrapy selenium
Example of sending an HTTP request:
import requests
response = requests.get('https://example.com')
print(response.text)
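In practice it is safer to set a timeout and check the status code before using the body; a minimal sketch (the URL is only a placeholder):
import requests

try:
    response = requests.get('https://example.com', timeout=10)
    response.raise_for_status()  # raise an exception for 4xx/5xx responses
    print(response.status_code, len(response.text))
except requests.RequestException as exc:
    print('request failed:', exc)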
Parsing HTML content:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
title = soup.find('title').text
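Beyond a single tag, find_all collects every match; for example, gathering all link targets on the page:
# collect the href attribute of every anchor tag on the page
links = [a.get('href') for a in soup.find_all('a', href=True)]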
Data Extraction Techniques
XPath selectors can be used to extract data precisely:
from lxml import html
tree = html.fromstring(response.content)
elements = tree.xpath('//div[@class="content"]/text()')
Regular expressions can match complex text patterns:
import re
emails = re.findall(r'[\w\.-]+@[\w\.-]+', response.text)
Handling Dynamic Content
For pages rendered by JavaScript, use selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://example.com')
dynamic_content = driver.find_element(By.ID, 'dynamic').text
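JavaScript content often appears only after the initial page load, so an explicit wait is usually more reliable than reading the element immediately; a sketch reusing the 'dynamic' id from above:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the element to be present before reading it
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'dynamic'))
)
print(element.text)
driver.quit()  # close the browser when done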
Data Storage
Save scraped data to a CSV file:
import csv

with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'content'])
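Each scraped record can then be appended as a row; a sketch continuing the snippet above, with hypothetical (title, content) tuples as placeholder data:
rows = [('Example Domain', 'This domain is for use in examples.')]  # placeholder data
with open('data.csv', 'a', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)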
Store data in an SQLite database:
import sqlite3
conn = sqlite3.connect('data.db')
cursor = conn.cursor()
cursor.execute('CREATE TABLE IF NOT EXISTS pages (url TEXT, content TEXT)')
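Inserting records and committing completes the round trip; parameterized queries also keep quoting safe. A sketch continuing the connection above, with placeholder values:
cursor.execute('INSERT INTO pages (url, content) VALUES (?, ?)',
               ('https://example.com', '<html>...</html>'))  # placeholder values
conn.commit()
conn.close()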
Dealing with Anti-Scraping Measures
Set request headers to imitate a browser:
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Accept-Language': 'en-US'
}
requests.get(url, headers=headers)
Use a proxy IP:
proxies = {'http': 'http://10.10.1.10:3128'}
requests.get(url, proxies=proxies)
Distributed Crawler Architecture
The Scrapy framework is commonly used as the basis for a distributed crawler:
import scrapy

class MySpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield {'title': response.css('title::text').get()}
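Inside parse, response.follow can queue further pages discovered on the current one, which is how a crawl fans out; a fragment meant to live inside the parse method above, assuming a hypothetical 'next' link selector:
# follow every "next page" link and parse it with the same callback
for href in response.css('a.next::attr(href)').getall():
    yield response.follow(href, callback=self.parse)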
Use Redis to implement a distributed task queue:
import redis
r = redis.Redis()
r.lpush('task_queue', 'http://example.com/page1')
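On the worker side, each crawler process pops URLs from the same queue, which is what spreads the workload across machines; a minimal sketch using the 'task_queue' key from above:
import redis
import requests

r = redis.Redis()
while True:
    _, url = r.brpop('task_queue')  # blocks until a URL is queued
    response = requests.get(url.decode('utf-8'), timeout=10)
    # ... parse and store response.text here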
Crawler Performance Optimization
Asynchronous requests improve efficiency:
import asyncio
import aiohttp

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()
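The real speedup comes from issuing many requests concurrently; a sketch driving the fetch coroutine above with asyncio.gather and placeholder URLs:
async def fetch_all(urls):
    # run all fetches concurrently instead of one after another
    return await asyncio.gather(*(fetch(u) for u in urls))

pages = asyncio.run(fetch_all(['https://example.com', 'https://example.org']))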
Memory management: stream items through an iterator and process them in batches instead of building the whole list in memory:
from itertools import islice

items = iter(range(1000000))  # an iterator avoids materializing every item at once
for batch in iter(lambda: list(islice(items, 1000)), []):
    process_batch(batch)  # handle 1,000 items per batch
Legal and Compliance Considerations
Respect the robots.txt protocol:
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()
can_fetch = rp.can_fetch('*', '/private/')
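robots.txt may also declare a Crawl-delay, which RobotFileParser exposes; a sketch continuing the rp object above:
delay = rp.crawl_delay('*')  # None when the site does not declare one
print('declared crawl delay:', delay)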
Set a crawl delay between requests:
import time
time.sleep(2)  # 2-second delay