1. Parsing and extraction (BeautifulSoup, json)
- When the data is hidden in the page's HTML source (BeautifulSoup)

Manually override the response encoding when it is detected incorrectly: response.encoding = 'xxx'
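A minimal sketch of this case, assuming a hypothetical page URL and a hypothetical class name for the target elements (both are placeholders, not from the original notes):

import requests
from bs4 import BeautifulSoup

# the URL and the class name below are placeholders for illustration
response = requests.get('https://example.com/movies')
response.encoding = 'utf-8'        # manually override the encoding if it was detected wrongly
soup = BeautifulSoup(response.text, 'html.parser')
for tag in soup.find_all('div', class_='title'):   # hypothetical class name
    print(tag.text.strip())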
- When the data is hidden in XHR requests (json)
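A minimal sketch for this case, assuming a hypothetical XHR endpoint that returns JSON; the URL, query parameters and field names are invented for illustration:

import requests

# hypothetical XHR endpoint; the URL and the keys below are placeholders
res = requests.get('https://example.com/api/comments', params={'page': 1})
json_data = res.json()                  # parse the JSON body into Python objects
for comment in json_data['comments']:   # assumed key in the response
    print(comment['content'])           # assumed key for the comment text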

- Summary

2. More powerful requests (get, post, cookies)
- requests.get(), the params parameter: lets us send parameters along with the request, e.g. which page do I want? What keyword am I searching for? How many items do I want?
- requests.get(), the headers parameter: the request headers.
- GET shows its parameters in plain text in the URL; POST carries them in the request body, where they are not displayed.
- requests.post(), the data parameter: used very much like params.
- cookies: they let the server "remember" you.
- Example code
import requests

# step 1: log in with a POST request and keep the cookies the server returns
url_1 = 'https://…'
headers = {'user-agent': ''}
data = {}
login_in = requests.post(url_1, headers=headers, data=data)
cookies = login_in.cookies

# step 2: send the cookies back with later GET requests so the server "remembers" us
url_2 = 'https://…'
params = {}
response = requests.get(url_2, headers=headers, params=params, cookies=cookies)
3. Storage (csv, openpyxl)
- csv

import csv
# open the file for writing; newline='' prevents extra blank rows on Windows
csv_file = open('demo.csv', 'w', newline='')
writer = csv.writer(csv_file)
writer.writerow(['Movie', 'Douban rating'])
csv_file.close()

import csv
# open the file for reading and print every row
csv_file = open('demo.csv', 'r', newline='')
reader = csv.reader(csv_file)
for row in reader:
    print(row)
csv_file.close()
- Excel files (openpyxl)

import openpyxl
# create a new workbook; the active sheet is created automatically
wb = openpyxl.Workbook()
sheet = wb.active
sheet.title = 'new title'
sheet['A1'] = 'Marvel Universe'
rows = [['Captain America', 'Iron Man', 'Spider-Man', 'Thor'],
        ['are', 'classic', 'Marvel Universe', 'characters']]
# append() writes each list as a new row
for i in rows:
    sheet.append(i)
print(rows)
wb.save('Marvel.xlsx')

import openpyxl
# open an existing workbook and read from it
wb = openpyxl.load_workbook('Marvel.xlsx')
sheet = wb['new title']
sheetname = wb.sheetnames
print(sheetname)
A1_value = sheet['A1'].value
print(A1_value)
4. More crawlers (coroutines/gevent, queue)
from gevent import monkey
# patch the standard library first so that network I/O becomes cooperative
monkey.patch_all()
import gevent, time, requests
from gevent.queue import Queue

start = time.time()
url_list = ['https://www.baidu.com/',
            'https://www.sina.com.cn/',
            'http://www.sohu.com/',
            'https://www.qq.com/',
            'https://www.163.com/',
            'http://www.iqiyi.com/',
            'https://www.tmall.com/',
            'http://www.ifeng.com/']

# put every URL into a shared task queue
work = Queue()
for url in url_list:
    work.put_nowait(url)

def crawler():
    # each coroutine keeps pulling URLs until the queue is empty
    while not work.empty():
        url = work.get_nowait()
        r = requests.get(url)
        print(url, work.qsize(), r.status_code)

# spawn 2 coroutines that share the same queue
tasks_list = []
for x in range(2):
    task = gevent.spawn(crawler)
    tasks_list.append(task)
gevent.joinall(tasks_list)

end = time.time()
print(end - start)
5. An even more powerful crawler (the Scrapy framework)
- Scrapy architecture

- How Scrapy works

- Scrapy usage
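As a rough illustration of the usage, a minimal spider sketch, assuming a project already created with scrapy startproject; the spider name, start URL and CSS selector are placeholders:

import scrapy

class DemoSpider(scrapy.Spider):
    # the name, start URL and selector below are placeholders for illustration
    name = 'demo'
    start_urls = ['https://example.com/']

    def parse(self, response):
        # parse() is called once for every downloaded response
        for title in response.css('h2::text').getall():
            yield {'title': title}

It would then be run from the project directory with scrapy crawl demo (or scrapy crawl demo -o titles.csv to save the yielded items).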

6. Giving the crawler wings (selenium, email via smtplib+email, scheduling via schedule)
- selenium
Methods for extracting data:

Object conversion process:

Get the HTML source as a string: html_source_string = driver.page_source
Methods for automating the browser:
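A minimal sketch tying these pieces together, assuming a local Chrome/ChromeDriver setup and Selenium 4 style locators; the URL and the element locator are placeholders:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# the URL and the locator below are placeholders for illustration
driver = webdriver.Chrome()
driver.get('https://example.com/')
time.sleep(2)                               # crude wait for the page to render

html_source = driver.page_source            # the page source as a string
box = driver.find_element(By.CSS_SELECTOR, 'input#search')   # hypothetical search box
box.send_keys('crawler')                    # type into the box
box.submit()                                # submit the form

driver.quit()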

- Email
Workflow:

Example code:
import smtplib
from email.mime.text import MIMEText
from email.header import Header

# connect to the QQ Mail SMTP server
mailhost = 'smtp.qq.com'
qqmail = smtplib.SMTP()
qqmail.connect(mailhost, 25)

# log in with your email account and SMTP authorization code
account = input('Enter your email address: ')
password = input('Enter your password (SMTP authorization code): ')
qqmail.login(account, password)

# build the message: body, subject and encoding
receiver = input("Enter the recipient's email address: ")
content = input('Enter the email body: ')
message = MIMEText(content, 'plain', 'utf-8')
subject = input('Enter the email subject: ')
message['Subject'] = Header(subject, 'utf-8')

try:
    qqmail.sendmail(account, receiver, message.as_string())
    print('Email sent successfully')
except:
    print('Failed to send email')
qqmail.quit()
- Scheduled tasks (schedule)

import schedule
import time

def job():
    print("I'm working...")

# register the job on several different schedules
schedule.every(10).minutes.do(job)
schedule.every().hour.do(job)
schedule.every().day.at("10:30").do(job)
schedule.every().monday.do(job)
schedule.every().wednesday.at("13:15").do(job)

# loop forever, running whatever is due
while True:
    schedule.run_pending()
    time.sleep(1)
7. A roadmap for going further with crawlers
- Parsing and extraction
  Parsing libraries: xpath / lxml
  Regular expressions (the re module)
- Storage
  The MySQL and MongoDB libraries
  The SQL language
- Data analysis and visualization
  Modules and libraries: Pandas / Matplotlib / Numpy / Scikit-Learn / Scipy
- More crawlers
  Multiprocessing (the multiprocessing library)
- More powerful crawlers: frameworks
  Scrapy: simulated login, storing to a database, using HTTP proxies, distributed crawling
  The PySpider framework