Table of Contents
- Overview of Web Scraping
- Python Environment Setup
- HTTP Protocol Basics
- Web Page Parsing Techniques
- Data Storage Options
- Advanced Scraping Techniques
- Data Cleaning and Processing
- Hands-On Projects
- Ethical and Legal Considerations
- Summary and Resources
1. Overview of Web Scraping <a name="数据采集概述"></a>
1.1 What Is Web Scraping
Web scraping is the process of extracting information from websites with automated programs. Compared with manual copy-and-paste, automated scraping can efficiently collect large amounts of structured data to feed data analysis, market research, and machine learning.
1.2 Why Use Python for Web Scraping
Python has become the language of choice for web scraping for the following reasons:
| Feature | Description |
|---|---|
| Rich library ecosystem | Requests, BeautifulSoup, Scrapy, Selenium, and more |
| Simple, approachable syntax | Highly readable code, gentle learning curve |
| Powerful data processing | Pandas, NumPy, and other libraries simplify downstream processing |
| Cross-platform compatibility | Runs on Windows, macOS, and Linux |
| Community support | A huge developer community and abundant learning resources |
1.3 Legal and Ethical Considerations
Before scraping, you must understand the relevant legal and ethical guidelines (a small code sketch follows the table):
| Consideration | Description |
|---|---|
| robots.txt | Follow the rules in the site's robots.txt file |
| Terms of service | Respect the website's terms of use |
| Request rate | Throttle your requests so you do not burden the site |
| Data usage | Be clear about how the data will be used; respect copyright and privacy |
| Identification | Use an appropriate User-Agent to identify your crawler |
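As a minimal sketch of the last three rows (the URL, crawler name, and contact address below are placeholders, not part of the original guide), the snippet identifies itself with a descriptive User-Agent and pauses between requests; robots.txt handling is covered in detail in Section 9:
```python
import time
import requests

# Hypothetical target URL and crawler identity, used only for illustration
URL = 'https://example.com/page'
HEADERS = {'User-Agent': 'MyResearchBot/0.1 (contact: you@example.com)'}

for page in range(1, 4):
    response = requests.get(URL, params={'page': page}, headers=HEADERS, timeout=10)
    print(page, response.status_code)
    time.sleep(2)  # polite delay so the server is not overloaded
```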
2. Python Environment Setup <a name="Python-环境配置"></a>
2.1 Installing and Configuring Python
Web scraping requires Python plus a few libraries. The recommended environment is:
| Component | Version | Notes |
|---|---|---|
| Python | 3.8+ | Use the latest stable release |
| pip | Latest | Python package manager |
| virtualenv | Latest | Creates isolated Python environments |
Installation steps:
- Download and install Python from the official Python website.
- Verify the installation by running `python --version` in a terminal/CMD.
- Upgrade pip: `pip install --upgrade pip`
- Install virtualenv: `pip install virtualenv`
2.2 Creating a Virtual Environment
Using a virtual environment avoids package conflicts:
```bash
# Create a virtual environment
python -m venv scraping_env

# Activate the virtual environment (Windows)
scraping_env\Scripts\activate

# Activate the virtual environment (macOS/Linux)
source scraping_env/bin/activate
```
2.3 Installing the Required Libraries
The core libraries for web scraping are listed below (a one-line install command follows the table):
| Library | Purpose | Install command |
|---|---|---|
| requests | Sending HTTP requests | pip install requests |
| BeautifulSoup4 | HTML parsing | pip install beautifulsoup4 |
| lxml | Fast XML/HTML parsing | pip install lxml |
| selenium | Browser automation | pip install selenium |
| scrapy | Crawling framework | pip install scrapy |
| pandas | Data processing and analysis | pip install pandas |
| numpy | Numerical computing | pip install numpy |
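For convenience, and assuming pip is available on your PATH, all of the packages in the table can also be installed in one go:
```bash
pip install requests beautifulsoup4 lxml selenium scrapy pandas numpy
```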
2.4 Recommended Development Tools
| Tool type | Recommendation | Highlights |
|---|---|---|
| IDE | PyCharm | Full-featured Python-specific IDE |
| Text editor | VS Code | Lightweight, rich plugin ecosystem |
| Browser tooling | Chrome DevTools | Inspect page structure, debug crawlers |
| API testing | Postman | Test API endpoints |
3. HTTP Protocol Basics <a name="HTTP-协议基础"></a>
3.1 HTTP Requests and Responses
HTTP (HyperText Transfer Protocol) is the foundation of web scraping, so understanding how it works is essential:
| Component | Description |
|---|---|
| Request methods | GET, POST, PUT, DELETE, etc. |
| Status codes | 200 (OK), 404 (Not Found), 500 (Server Error), etc. |
| Request headers | User-Agent, Cookie, Referer, etc. |
| Response headers | Content-Type, Set-Cookie, etc. |
| Request body | Data sent with POST requests |
3.2 Common HTTP Status Codes
These are the status codes you will meet most often while scraping (a short handling sketch follows the table):
| Status code | Meaning | Typical scenario |
|---|---|---|
| 200 | OK | Request succeeded |
| 301 | Moved Permanently | Permanent redirect |
| 302 | Found | Temporary redirect |
| 400 | Bad Request | Malformed request |
| 403 | Forbidden | Access denied |
| 404 | Not Found | Page does not exist |
| 500 | Internal Server Error | Server-side failure |
| 503 | Service Unavailable | Service temporarily unavailable |
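As a rough sketch of how a scraper might act on these codes (the URL and retry policy here are illustrative assumptions, not prescribed by the guide): requests follows 301/302 redirects automatically, 403/404 are usually not worth retrying, and 5xx errors often deserve a back-off and retry.
```python
import time
import requests

def fetch_with_status_handling(url, retries=3):
    """Fetch a URL and react to common status codes (illustrative sketch)."""
    for attempt in range(retries):
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            if response.history:  # requests followed one or more 301/302 redirects
                print(f"Redirected {len(response.history)} time(s) to {response.url}")
            return response
        elif response.status_code in (403, 404):
            print(f"Giving up: server returned {response.status_code}")
            return None
        elif response.status_code in (500, 503):
            wait = 2 ** attempt
            print(f"Server error {response.status_code}, retrying in {wait}s")
            time.sleep(wait)
    return None
```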
3.3 Sending HTTP Requests with the Requests Library
Requests is the most widely used HTTP library in Python and is simple to use:
```python
import requests

# Send a GET request
response = requests.get('https://api.example.com/data')

# Check whether the request succeeded
if response.status_code == 200:
    print('Request succeeded!')
    print(response.text)  # response body
else:
    print(f'Request failed, status code: {response.status_code}')

# Send a GET request with query parameters
params = {'key1': 'value1', 'key2': 'value2'}
response = requests.get('https://api.example.com/data', params=params)

# Send a POST request
data = {'username': 'user', 'password': 'pass'}
response = requests.post('https://api.example.com/login', data=data)

# Set request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'application/json'
}
response = requests.get('https://api.example.com/data', headers=headers)
```
3.4 Handling Cookies and Sessions
```python
import requests

# Create a session object to persist cookies
session = requests.Session()

# Log in first to obtain cookies
login_data = {'username': 'user', 'password': 'pass'}
session.post('https://example.com/login', data=login_data)

# Access an authenticated page using the stored cookies
response = session.get('https://example.com/dashboard')
print(response.text)

# Handle cookies manually
response = requests.get('https://example.com')
cookies = response.cookies

# Send a follow-up request with the captured cookies
response = requests.get('https://example.com/protected', cookies=cookies)
```
3.5 Handling Exceptions and Timeouts
```python
import requests
from requests.exceptions import RequestException, Timeout, ConnectionError, HTTPError

try:
    # Set timeouts (connect timeout, read timeout)
    response = requests.get('https://example.com', timeout=(3.05, 27))
    # Raise an exception for HTTP error status codes
    response.raise_for_status()
    print(response.text)
except Timeout:
    print('Request timed out')
except ConnectionError:
    print('Connection error')
except HTTPError as e:
    print(f'HTTP error: {e}')
except RequestException as e:
    # Catch-all for any other requests error; it must come after the specific handlers
    print(f'Request error: {e}')
```
4. Web Page Parsing Techniques <a name="网页解析技术"></a>
4.1 Basic HTML Structure
Understanding HTML structure is a prerequisite for parsing web pages:
```html
<!DOCTYPE html>
<html>
<head>
    <title>Page title</title>
</head>
<body>
    <div id="content">
        <h1 class="title">Main heading</h1>
        <p class="text">Paragraph text</p>
        <ul>
            <li>List item 1</li>
            <li>List item 2</li>
        </ul>
        <a href="https://example.com">Link</a>
    </div>
</body>
</html>
```
4.2 Parsing HTML with BeautifulSoup
BeautifulSoup is the most popular HTML parsing library in Python:
```python
from bs4 import BeautifulSoup
import requests

# Fetch the page content
response = requests.get('https://example.com')
html_content = response.text

# Create a BeautifulSoup object
soup = BeautifulSoup(html_content, 'lxml')  # or use 'html.parser'

# Find elements by tag name
title = soup.title            # the <title> tag
title_text = soup.title.text  # text of the <title> tag

# Find elements with CSS selectors
first_paragraph = soup.select_one('p')  # first <p> tag
all_paragraphs = soup.select('p')       # all <p> tags

# Find elements by attribute
div_with_id = soup.find('div', id='content')               # the div with id="content"
elements_with_class = soup.find_all('div', class_='item')  # all divs with class="item"

# Extract attribute values
link = soup.find('a')
href = link['href']  # value of the href attribute

# Navigate the document tree
parent = link.parent              # parent element
children = div_with_id.children   # child elements
siblings = link.next_siblings     # following sibling elements
```
4.3 XPath and lxml
The lxml library provides XPath support, which suits more complex parsing needs:
```python
from lxml import html
import requests

# Fetch the page content
response = requests.get('https://example.com')
html_content = response.text

# Build the HTML tree
tree = html.fromstring(html_content)

# Select elements with XPath
# All h1 tags
h1_elements = tree.xpath('//h1')

# Elements whose class is "title"
title_elements = tree.xpath('//*[@class="title"]')

# Elements containing specific text
specific_text = tree.xpath('//p[contains(text(), "specific text")]')

# Extract attributes
links = tree.xpath('//a/@href')  # href attribute of every link

# A more complex XPath example:
# all p tags under the div with id="content"
paragraphs = tree.xpath('//div[@id="content"]//p')
for p in paragraphs:
    print(p.text_content())  # text content of the element
```
4.4 Regular Expressions in Web Scraping
Regular expressions are well suited to extracting text that follows a specific pattern:
```python
import re

text = "Phone: 123-456-7890, Email: example@email.com"

# Extract phone numbers
phone_pattern = r'\d{3}-\d{3}-\d{4}'
phones = re.findall(phone_pattern, text)
print(phones)  # ['123-456-7890']

# Extract email addresses
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
emails = re.findall(email_pattern, text)
print(emails)  # ['example@email.com']

# Replace text
anonymized_text = re.sub(phone_pattern, 'XXX-XXX-XXXX', text)
print(anonymized_text)  # "Phone: XXX-XXX-XXXX, Email: example@email.com"

# Split text
sentence = "data1,data2;data3|data4"
split_result = re.split(r'[,;|]', sentence)
print(split_result)  # ['data1', 'data2', 'data3', 'data4']
```
4.5 Parsing Strategies Compared
| Method | Strengths | Weaknesses | Best for |
|---|---|---|---|
| BeautifulSoup | Easy to use, tolerant of malformed HTML | Relatively slow | Simple pages, rapid development |
| lxml + XPath | Fast, very expressive | Steeper learning curve | Complex pages, high-performance needs |
| Regular expressions | Flexible, powerful pattern matching | Hard to read and maintain | Extracting text with a fixed pattern |
5. Data Storage Options <a name="数据存储方案"></a>
5.1 File Storage
Storing data in CSV files
```python
import csv
import requests
from bs4 import BeautifulSoup

# Scrape the data
response = requests.get('https://example.com/books')
soup = BeautifulSoup(response.text, 'lxml')

books = []
for item in soup.select('.book-item'):
    title = item.select_one('.title').text.strip()
    author = item.select_one('.author').text.strip()
    price = item.select_one('.price').text.strip()
    books.append([title, author, price])

# Write to a CSV file
with open('books.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Author', 'Price'])  # header row
    writer.writerows(books)                        # data rows

# Read the CSV file back
with open('books.csv', 'r', encoding='utf-8') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
```
Storing data in JSON files
```python
import json
import requests
from bs4 import BeautifulSoup

# Scrape the data
response = requests.get('https://example.com/books')
soup = BeautifulSoup(response.text, 'lxml')

books = []
for item in soup.select('.book-item'):
    book = {
        'title': item.select_one('.title').text.strip(),
        'author': item.select_one('.author').text.strip(),
        'price': item.select_one('.price').text.strip()
    }
    books.append(book)

# Write to a JSON file
with open('books.json', 'w', encoding='utf-8') as file:
    json.dump(books, file, ensure_ascii=False, indent=2)

# Read the JSON file back
with open('books.json', 'r', encoding='utf-8') as file:
    books_data = json.load(file)
    for book in books_data:
        print(book['title'], book['author'])
```
5.2 Database Storage
SQLite
```python
import sqlite3
import requests
from bs4 import BeautifulSoup

# Open the database connection
conn = sqlite3.connect('books.db')
cursor = conn.cursor()

# Create the table
cursor.execute('''
CREATE TABLE IF NOT EXISTS books (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    author TEXT NOT NULL,
    price REAL NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
''')

# Scrape the data
response = requests.get('https://example.com/books')
soup = BeautifulSoup(response.text, 'lxml')

for item in soup.select('.book-item'):
    title = item.select_one('.title').text.strip()
    author = item.select_one('.author').text.strip()
    price = float(item.select_one('.price').text.strip().replace('¥', ''))
    # Insert a row
    cursor.execute('INSERT INTO books (title, author, price) VALUES (?, ?, ?)',
                   (title, author, price))

# Commit the transaction and close the connection
conn.commit()
conn.close()
```
MySQL
```python
import mysql.connector
from mysql.connector import Error
import requests
from bs4 import BeautifulSoup

connection = None
try:
    # Open the database connection
    connection = mysql.connector.connect(
        host='localhost',
        database='web_scraping',
        user='username',
        password='password'
    )
    if connection.is_connected():
        cursor = connection.cursor()
        # Create the table
        cursor.execute('''
        CREATE TABLE IF NOT EXISTS books (
            id INT AUTO_INCREMENT PRIMARY KEY,
            title VARCHAR(255) NOT NULL,
            author VARCHAR(255) NOT NULL,
            price DECIMAL(10, 2) NOT NULL,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
        ''')
        # Scrape the data
        response = requests.get('https://example.com/books')
        soup = BeautifulSoup(response.text, 'lxml')
        for item in soup.select('.book-item'):
            title = item.select_one('.title').text.strip()
            author = item.select_one('.author').text.strip()
            price = float(item.select_one('.price').text.strip().replace('¥', ''))
            # Insert a row
            cursor.execute('INSERT INTO books (title, author, price) VALUES (%s, %s, %s)',
                           (title, author, price))
        connection.commit()
except Error as e:
    print(f"Database error: {e}")
finally:
    # Guard against the case where connect() itself failed
    if connection is not None and connection.is_connected():
        cursor.close()
        connection.close()
```
5.3 NoSQL Databases
MongoDB
```python
from pymongo import MongoClient
import requests
from bs4 import BeautifulSoup

# Connect to MongoDB
client = MongoClient('mongodb://localhost:27017/')
db = client['web_scraping']
collection = db['books']

# Scrape the data
response = requests.get('https://example.com/books')
soup = BeautifulSoup(response.text, 'lxml')

books = []
for item in soup.select('.book-item'):
    book = {
        'title': item.select_one('.title').text.strip(),
        'author': item.select_one('.author').text.strip(),
        # Store the price as a number so that range queries compare correctly
        'price': float(item.select_one('.price').text.strip().replace('¥', ''))
    }
    books.append(book)

# Bulk-insert the documents
if books:
    result = collection.insert_many(books)
    print(f"Inserted {len(result.inserted_ids)} documents")

# Query the data: books priced above 50
for book in collection.find({'price': {'$gt': 50}}):
    print(book)

# Close the connection
client.close()
```
5.4 Storage Options Compared
| Storage | Strengths | Weaknesses | Best for |
|---|---|---|---|
| CSV files | Simple, universal, easy to inspect | Poor fit for complex data structures | Small projects, data exchange |
| JSON files | Preserves structure, readable | Inefficient for large files | Configuration data, simple structures |
| SQLite | Serverless, lightweight | Limited concurrency | Desktop apps, small projects |
| MySQL | Feature-rich, good performance | Requires a separate server | Medium to large projects, web apps |
| MongoDB | Flexible schema, easy to scale | Higher memory usage | Unstructured data, rapid iteration |
6. Advanced Scraping Techniques <a name="高级采集技术"></a>
6.1 Handling JavaScript-Rendered Pages
Many modern sites load content dynamically with JavaScript, which calls for browser automation tools:
Using Selenium
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

# Configure Chrome options
chrome_options = Options()
chrome_options.add_argument('--headless')  # headless mode
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')

# Initialize the browser driver
driver = webdriver.Chrome(options=chrome_options)

try:
    # Open the page
    driver.get('https://example.com/dynamic-content')

    # Wait for a specific element to finish loading
    wait = WebDriverWait(driver, 10)
    element = wait.until(
        EC.presence_of_element_located((By.CLASS_NAME, "dynamic-content"))
    )

    # Interaction: click a button
    button = driver.find_element(By.ID, 'load-more')
    button.click()

    # Wait for the new content to load
    time.sleep(2)

    # Grab the page source
    page_source = driver.page_source

    # Parse it with BeautifulSoup
    soup = BeautifulSoup(page_source, 'lxml')

    # Extract the data
    items = soup.select('.item')
    for item in items:
        print(item.text)
finally:
    # Close the browser
    driver.quit()
```
Using Requests-HTML
```python
from requests_html import HTMLSession

session = HTMLSession()

# Render the JavaScript
response = session.get('https://example.com/dynamic-content')
response.html.render(sleep=2, timeout=20)

# Extract the data
items = response.html.find('.item')
for item in items:
    print(item.text)

# Close the session
session.close()
```
6.2 Handling Pagination and Infinite Scroll
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time

def scrape_paginated_content():
    driver = webdriver.Chrome()
    all_data = []
    try:
        driver.get('https://example.com/paginated-data')
        page_number = 1
        while True:
            print(f"Scraping page {page_number}...")
            # Wait for the content to load
            wait = WebDriverWait(driver, 10)
            wait.until(EC.presence_of_element_located((By.CLASS_NAME, "item")))
            # Parse the current page
            soup = BeautifulSoup(driver.page_source, 'lxml')
            items = soup.select('.item')
            for item in items:
                # Extract the data and append it to all_data
                data = extract_item_data(item)
                all_data.append(data)
            # Check whether there is a next page
            next_button = driver.find_elements(By.CSS_SELECTOR, '.next-page')
            if not next_button or 'disabled' in next_button[0].get_attribute('class'):
                break
            # Click through to the next page
            next_button[0].click()
            page_number += 1
            time.sleep(2)  # wait for the page to load
    finally:
        driver.quit()
    return all_data

def extract_item_data(item):
    # Implement the item-specific extraction logic here
    title = item.select_one('.title').text.strip()
    price = item.select_one('.price').text.strip()
    return {'title': title, 'price': price}
```
6.3 Using Proxies and Rotating User-Agents
```python
import requests
from fake_useragent import UserAgent
import random
import time

# Proxy pool
proxies = [
    'http://proxy1.com:8080',
    'http://proxy2.com:8080',
    'http://proxy3.com:8080',
]

# Create a User-Agent generator
ua = UserAgent()

def get_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            # Pick a random proxy and User-Agent
            proxy = {'http': random.choice(proxies)}
            headers = {'User-Agent': ua.random}
            response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # exponential backoff
            else:
                raise

# Usage example
try:
    response = get_with_retry('https://example.com')
    print("Request succeeded")
except Exception as e:
    print(f"All attempts failed: {e}")
```
6.4 Asynchronous Scraping
Use asyncio and aiohttp to improve scraping throughput:
```python
import aiohttp
import asyncio
from bs4 import BeautifulSoup
import time

async def fetch_page(session, url):
    try:
        async with session.get(url) as response:
            return await response.text()
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return None

async def parse_page(content):
    if not content:
        return []
    soup = BeautifulSoup(content, 'lxml')
    items = soup.select('.item')
    data = []
    for item in items:
        title = item.select_one('.title').text.strip()
        price = item.select_one('.price').text.strip()
        data.append({'title': title, 'price': price})
    return data

async def scrape_urls(urls):
    connector = aiohttp.TCPConnector(limit=10)  # cap the number of concurrent connections
    timeout = aiohttp.ClientTimeout(total=30)
    async with aiohttp.ClientSession(connector=connector, timeout=timeout) as session:
        tasks = []
        for url in urls:
            task = asyncio.create_task(fetch_page(session, url))
            tasks.append(task)
        contents = await asyncio.gather(*tasks)
        parsing_tasks = []
        for content in contents:
            parsing_tasks.append(asyncio.create_task(parse_page(content)))
        results = await asyncio.gather(*parsing_tasks)
        # Merge all the results
        all_data = []
        for result in results:
            all_data.extend(result)
        return all_data

# Usage example
urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
    # ... more URLs
]

start_time = time.time()
results = asyncio.run(scrape_urls(urls))
end_time = time.time()
print(f"Scraped {len(results)} records in {end_time - start_time:.2f} seconds")
```
7. Data Cleaning and Processing <a name="数据清洗与处理"></a>
7.1 Data Cleaning Techniques
Scraped data usually needs cleaning and preprocessing:
```python
import pandas as pd
import numpy as np
import re
from datetime import datetime

# Sample data
data = [
    {'title': ' Python Programming Basics ', 'price': '¥99.00', 'date': '2023-01-15'},
    {'title': 'Practical Data Science', 'price': '150 CNY', 'date': '2023/02/20'},
    {'title': 'Machine Learning', 'price': '200', 'date': 'invalid date'},
    {'title': 'Web Development', 'price': '¥120.50', 'date': '2023-03-10'},
]

# Create a DataFrame
df = pd.DataFrame(data)

# Clean the titles: strip leading/trailing whitespace
df['title'] = df['title'].str.strip()

# Clean the prices: extract the numeric part
def clean_price(price):
    if isinstance(price, str):
        # Extract digits and the decimal point
        numbers = re.findall(r'\d+\.?\d*', price)
        if numbers:
            return float(numbers[0])
    return np.nan

df['price_clean'] = df['price'].apply(clean_price)

# Clean the dates
def clean_date(date_str):
    try:
        # Try several date formats
        for fmt in ('%Y-%m-%d', '%Y/%m/%d', '%d-%m-%Y', '%d/%m/%Y'):
            try:
                return datetime.strptime(date_str, fmt).date()
            except ValueError:
                continue
        return np.nan
    except Exception:
        return np.nan

df['date_clean'] = df['date'].apply(clean_date)

print("Original data:")
print(df[['title', 'price', 'date']])
print("\nCleaned data:")
print(df[['title', 'price_clean', 'date_clean']])
```
7.2 Data Transformation and Standardization
```python
# Continuing with the DataFrame from above

# Inspect missing values
print("Missing value counts:")
print(df.isnull().sum())

# Fill in missing values
df['price_clean'] = df['price_clean'].fillna(df['price_clean'].median())
df['date_clean'] = df['date_clean'].fillna(pd.Timestamp('today').date())

# Convert data types
df['price_clean'] = df['price_clean'].astype(float)

# Create new features
df['price_category'] = pd.cut(df['price_clean'],
                              bins=[0, 100, 150, 200, np.inf],
                              labels=['cheap', 'moderate', 'pricey', 'expensive'])

# String operations
df['title_length'] = df['title'].str.len()
df['has_python'] = df['title'].str.contains('Python', case=False)

print("\nTransformed data:")
print(df)
```
7.3 Deduplication and Validation
```python
# Deduplicate the data
print(f"Rows before deduplication: {len(df)}")

# Deduplicate by title
df_deduplicated = df.drop_duplicates(subset=['title'])
print(f"Rows after deduplication: {len(df_deduplicated)}")

# Validate the data
def validate_row(row):
    errors = []
    # Check that the price is plausible
    if row['price_clean'] <= 0 or row['price_clean'] > 1000:
        errors.append(f"Price {row['price_clean']} looks implausible")
    # Check that the date is not in the future
    if pd.notna(row['date_clean']) and row['date_clean'] > pd.Timestamp('today').date():
        errors.append(f"Date {row['date_clean']} is in the future")
    return errors if errors else None

# Apply the validation
df['validation_errors'] = df.apply(validate_row, axis=1)

# Show the rows that failed validation
invalid_data = df[df['validation_errors'].notna()]
print(f"Found {len(invalid_data)} invalid rows")
for index, row in invalid_data.iterrows():
    print(f"Row {index}: {row['validation_errors']}")
```
8. Hands-On Projects <a name="实战项目"></a>
8.1 E-Commerce Price Monitor
```python
import re
import requests
from bs4 import BeautifulSoup
import smtplib
from email.mime.text import MIMEText
import time
import schedule

class PriceMonitor:
    def __init__(self, url, target_price, email_settings):
        self.url = url
        self.target_price = target_price
        self.email_settings = email_settings
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }

    def get_current_price(self):
        try:
            response = requests.get(self.url, headers=self.headers, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'lxml')
            # Adjust the selectors to match the actual site structure
            price_element = soup.select_one('.product-price, .price, [itemprop="price"]')
            if price_element:
                price_text = price_element.get_text().strip()
                # Extract the numeric part
                match = re.search(r'\d+\.?\d*', price_text)
                if match:
                    return float(match.group())
            return None
        except Exception as e:
            print(f"Error fetching the price: {e}")
            return None

    def send_email_alert(self, current_price):
        msg = MIMEText(f"""
        Price alert!
        Product link: {self.url}
        Current price: ¥{current_price}
        Target price: ¥{self.target_price}
        The current price is at or below your target price!
        """)
        msg['Subject'] = 'Price alert: the product price has dropped!'
        msg['From'] = self.email_settings['from_email']
        msg['To'] = self.email_settings['to_email']
        try:
            with smtplib.SMTP(self.email_settings['smtp_server'], self.email_settings['smtp_port']) as server:
                server.starttls()
                server.login(self.email_settings['username'], self.email_settings['password'])
                server.send_message(msg)
            print("Alert email sent")
        except Exception as e:
            print(f"Error sending the email: {e}")

    def check_price(self):
        print(f"Checking the price... {time.strftime('%Y-%m-%d %H:%M:%S')}")
        current_price = self.get_current_price()
        if current_price is not None:
            print(f"Current price: ¥{current_price}")
            if current_price <= self.target_price:
                self.send_email_alert(current_price)
                return True
        return False

    def run_monitor(self, check_interval_hours=1):
        print(f"Starting the monitor, checking every {check_interval_hours} hour(s)")
        schedule.every(check_interval_hours).hours.do(self.check_price)
        # Run one check immediately
        self.check_price()
        while True:
            schedule.run_pending()
            time.sleep(60)  # poll the schedule once a minute

# Usage example
if __name__ == "__main__":
    # Configuration
    product_url = "https://example.com/product/123"
    target_price = 100.0
    email_settings = {
        'smtp_server': 'smtp.gmail.com',
        'smtp_port': 587,
        'username': 'your_email@gmail.com',
        'password': 'your_password',
        'from_email': 'your_email@gmail.com',
        'to_email': 'recipient@example.com'
    }
    monitor = PriceMonitor(product_url, target_price, email_settings)
    monitor.run_monitor(check_interval_hours=2)
```
8.2 News Content Aggregator
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
from datetime import datetime
from urllib.parse import urljoin
import time

class NewsAggregator:
    def __init__(self):
        self.sources = {
            'source1': {
                'url': 'https://news-source1.com/latest',
                'article_selector': '.article',
                'title_selector': '.title',
                'summary_selector': '.summary',
                'date_selector': '.publish-date',
                'link_selector': 'a[href]'
            },
            'source2': {
                'url': 'https://news-source2.com/news',
                'article_selector': '.news-item',
                'title_selector': 'h2',
                'summary_selector': '.description',
                'date_selector': '.time',
                'link_selector': 'a'
            }
            # More news sources can be added here
        }
        self.articles = []

    def scrape_source(self, source_name, source_config):
        try:
            print(f"Scraping {source_name}...")
            response = requests.get(source_config['url'], timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'lxml')
            article_elements = soup.select(source_config['article_selector'])
            for article in article_elements:
                try:
                    title_elem = article.select_one(source_config['title_selector'])
                    summary_elem = article.select_one(source_config['summary_selector'])
                    date_elem = article.select_one(source_config['date_selector'])
                    link_elem = article.select_one(source_config['link_selector'])
                    if title_elem and link_elem:
                        article_data = {
                            'source': source_name,
                            'title': title_elem.get_text().strip(),
                            'summary': summary_elem.get_text().strip() if summary_elem else '',
                            'date': date_elem.get_text().strip() if date_elem else '',
                            'link': link_elem['href'] if link_elem and 'href' in link_elem.attrs else '',
                            'scraped_at': datetime.now().isoformat()
                        }
                        # Make sure the link is an absolute URL
                        if article_data['link']:
                            article_data['link'] = urljoin(source_config['url'], article_data['link'])
                        self.articles.append(article_data)
                except Exception as e:
                    print(f"Error processing an article: {e}")
                    continue
        except Exception as e:
            print(f"Error scraping {source_name}: {e}")

    def scrape_all_sources(self):
        print("Scraping all news sources...")
        self.articles = []
        for source_name, source_config in self.sources.items():
            self.scrape_source(source_name, source_config)
            time.sleep(1)  # polite delay
        print(f"Done; collected {len(self.articles)} articles")

    def save_to_json(self, filename):
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(self.articles, f, ensure_ascii=False, indent=2)
        print(f"Data saved to {filename}")

    def save_to_csv(self, filename):
        df = pd.DataFrame(self.articles)
        df.to_csv(filename, index=False, encoding='utf-8')
        print(f"Data saved to {filename}")

    def analyze_articles(self):
        df = pd.DataFrame(self.articles)
        if df.empty:
            print("No data to analyze")
            return
        print("\n=== Analysis ===")
        print(f"Total articles: {len(df)}")
        print("\nArticles per source:")
        print(df['source'].value_counts())
        # Date analysis (if date information is available)
        if 'date' in df.columns and not df['date'].empty:
            # Date parsing and analysis logic could go here
            pass
        return df

# Usage example
if __name__ == "__main__":
    aggregator = NewsAggregator()
    aggregator.scrape_all_sources()
    if aggregator.articles:
        aggregator.save_to_json('news_articles.json')
        aggregator.save_to_csv('news_articles.csv')
        df = aggregator.analyze_articles()
        print("\nFirst 5 articles:")
        print(df[['source', 'title', 'date']].head())
```
9. Ethical and Legal Considerations <a name="道德与法律考量"></a>
9.1 Key Principles of Lawful Scraping
| Principle | Description | Practical advice |
|---|---|---|
| Respect robots.txt | Follow the site's crawler policy | Check and obey the target site's robots.txt |
| Throttle requests | Avoid burdening the site | Add delays and limit concurrent requests |
| Identify your crawler | Be honest about who is crawling | Use an appropriate User-Agent |
| Respect copyright | Do not infringe on content copyright | Collect only the data you need and credit the source |
| Protect privacy | Do not harvest personal information | Avoid collecting emails, phone numbers, and other sensitive data |
9.2 Honoring robots.txt: An Example
```python
import requests
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse
import time

def check_robots_permission(url, user_agent='*'):
    """Check whether the given URL may be crawled."""
    try:
        # Parse the URL to get the base URL
        parsed_url = urlparse(url)
        base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"
        # Fetch robots.txt
        robots_url = f"{base_url}/robots.txt"
        rp = RobotFileParser()
        rp.set_url(robots_url)
        rp.read()
        # Check permission
        return rp.can_fetch(user_agent, url)
    except Exception as e:
        print(f"Error checking robots.txt: {e}")
        return False

def respectful_crawler(url, user_agent='MyCrawler/1.0'):
    """A polite crawler that honors robots.txt."""
    if not check_robots_permission(url, user_agent):
        print(f"robots.txt disallows crawling: {url}")
        return None
    try:
        headers = {'User-Agent': user_agent}
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        # Polite delay
        time.sleep(1)
        return response.text
    except Exception as e:
        print(f"Error crawling {url}: {e}")
        return None

# Usage example
url = 'https://example.com/some-page'
content = respectful_crawler(url)
if content:
    print("Content fetched successfully")
    # Process the content...
```
10. Summary and Resources <a name="总结与资源"></a>
10.1 Best Practices Summary
| Area | Best practice |
|---|---|
| Code structure | Modular design; each function has a single responsibility |
| Error handling | Comprehensive exception capture and handling |
| Performance | Asynchronous requests and sensible caching |
| Maintainability | Clear comments and configuration-file-driven settings |
| Playing by the rules | Respect robots.txt and throttle request rates |
10.2 Recommended Learning Resources
| Resource type | Recommendations |
|---|---|
| Official documentation | Requests, BeautifulSoup, Scrapy, Selenium |
| Online courses | Web scraping courses on Coursera, Udemy, and imooc |
| Books | 《Python网络数据采集》 (Web Scraping with Python), 《用Python写网络爬虫》 |
| Communities | Stack Overflow, GitHub, related topics on Zhihu |
| Tools | Postman, Chrome DevTools, Scrapy Cloud |
10.3 Common Problems and Solutions
| Problem | Solution |
|---|---|
| IP bans | Use a proxy pool and lower the request rate |
| Dynamic content | Use Selenium or Requests-HTML |
| CAPTCHAs | Use a CAPTCHA-solving service or handle them manually |
| Login authentication | Keep cookies with a session and handle tokens |
| Anti-scraping measures | Rotate User-Agents and mimic human behavior |
With this guide you should now have the full range of Python web scraping skills, from the basics to advanced techniques. Remember that the technology is only a tool; how you use it is what matters. Always follow ethical and legal guidelines and scrape responsibly.
Happy Scraping!