1. Simple Data Scraping
1. Scraping Baidu Translate
Open Baidu Translate and use the browser's built-in developer tools to capture the XHR request that fires as you type.
From the capture, note the request URL and that the request method is POST.
The request sends form data of the shape kw: d, where d is the English word to look up.
import requests

url = "https://fanyi.baidu.com/sug"
s = input("Enter an English word: ")
dat = {"kw": s}                      # form data: kw = the word to look up
resp = requests.post(url, data=dat)  # the sug endpoint expects a POST
print(resp)                          # response status
print(resp.json())                   # suggestion list as JSON
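The useful part of the response is the suggestion list. A minimal sketch of unpacking it, assuming the JSON has the shape {"errno": 0, "data": [{"k": ..., "v": ...}]} seen in the captured response (the exact shape is an assumption and may change):

# Assumed response shape: {"errno": 0, "data": [{"k": "dog", "v": "n. 狗; ..."}]}
for entry in resp.json().get("data", []):
    print(entry["k"], "->", entry["v"])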

The form dictionaries for Sogou Translate and Youdao Translate, found the same way, are as follows:
import requests

url = {
    '百度翻译': "https://fanyi.baidu.com/sug",
    '搜狗翻译': "https://fanyi.sogou.com/reventondc/suggV3",
    '有道翻译': "https://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule"
}
formData = input("Choose a translator (百度翻译 / 搜狗翻译 / 有道翻译): ")
s = input("Enter an English word: ")
# One form-data dictionary per translator, copied from the captured requests
dat = {
    '百度翻译': {"kw": s},
    '搜狗翻译': {"from": "auto", "to": "zh-CHS", "client": "web", "text": s,
             "uuid": "8166fd5c-2cf0-4f7e-a0c5-5ed5c8fc1011",
             "pid": "sogou-dict-vr", "addSugg": "on"},
    '有道翻译': {"i": s, "from": "AUTO", "to": "AUTO", "smartresult": "dict",
             "client": "fanyideskweb", "salt": "16408453493548",
             "sign": "69884ad58e3f6bc3dcbd65cbf80607c2", "lts": "1640845349354",
             "bv": "2632875b568a3baf568a14dddf2c8f7f", "doctype": "json",
             "version": "2.1", "keyfrom": "fanyi.web", "action": "FY_BY_REALTlME"}
}
# Matching headers; the captured Youdao request also carried Referer and Cookie
header = {
    "百度翻译": {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'},
    "搜狗翻译": {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'},
    "有道翻译": {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
        'Referer': 'http://fanyi.youdao.com/',
        'Cookie': 'OUTFOX_SEARCH_USER_ID=-1154806696@10.168.8.76; OUTFOX_SEARCH_USER_ID_NCOO=1227534676.2988937; JSESSIONID=aaa7LDLdy4Wbh9ECJb_Vw; ___rl__test__cookies=1563334957868'
    }
}
resp = requests.post(url[formData], data=dat[formData], headers=header[formData])
print(resp)          # response status
print(resp.json())   # translation result as JSON
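One caveat: the Youdao salt, sign, lts, and bv values above were copied from a single captured request; the site generates them in JavaScript, so a stale set may be rejected by the server. It also helps to validate the translator name before indexing the dictionaries; a minimal sketch, to be placed just before the requests.post call:

# Hypothetical guard: bail out early on a mistyped translator name
if formData not in url:
    raise SystemExit(f"Unknown translator: {formData!r}")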

2. Scraping the Douban Rankings
1. Pass the URL query string via the params argument.
2. Replace the User-Agent in the request headers.
import requests

url = "https://movie.douban.com/j/chart/top_list"
# Query parameters copied from the captured request
param = {
    "type": "24",
    "interval_id": "100:90",
    "action": "",
    "start": 0,     # offset of the first item
    "limit": 20     # number of items per request
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36 Edg/96.0.1054.62"
}
resp = requests.get(url=url, params=param, headers=headers)
print(resp)                                 # response status
print(resp.json())                          # the scraped JSON
print(resp.request.headers['User-Agent'])   # confirm the UA actually sent
resp.close()                                # release the connection
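The returned JSON is a list of movie objects that can be looped over directly. A minimal sketch, to run before resp.close(); the field names 'rank', 'title', and 'score' are assumptions taken from one observed response and may differ:

# Assumed item fields: 'rank', 'title', 'score' (verify against a live response)
data = resp.json()
for movie in data:
    print(movie["rank"], movie["title"], movie["score"])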
2. Data Parsing
Data parsing means picking the useful pieces out of the scraped data.
Three approaches are covered:
1. re parsing
2. bs4 parsing
3. xpath parsing
3. re Parsing
re parsing uses regular expressions, a pattern syntax for matching strings.
A regex testing site: https://tool.oschina.net/regex
1) Common metacharacters
.    any character except a newline
\w   letter, digit, or underscore
\d   digit
\s   whitespace
^    start of the string
$    end of the string
a|b  either a or b
[...]  any one character from the set; [^...] any character not in the set

2) Common quantifiers
*      zero or more times
+      one or more times
?      zero or one time
{n}    exactly n times
{n,}   n or more times
{n,m}  between n and m times

3) Other syntax
(?P<name>...)  named capture group, retrieved via group("name") or groupdict()
re.S           flag that lets . also match newlines (used heavily below)

Greedy matching: .*
Lazy matching: .*?
.*? matches any number of repetitions, but uses as few as possible while still letting the overall match succeed (lazy matching), as the demo below shows.
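A quick demonstration of the difference, together with the named-group syntax the Top 250 scraper below relies on (the sample string is made up for illustration):

import re

html = '<span class="title">肖申克的救赎</span><span class="title">霸王别姬</span>'

# Greedy: .* runs to the LAST </span>, swallowing both titles in one match
print(re.findall(r'<span class="title">(.*)</span>', html))

# Lazy: .*? stops at the FIRST </span>, yielding one title per match
print(re.findall(r'<span class="title">(.*?)</span>', html))

# Named group + finditer, the same pattern the scraper below uses
for m in re.finditer(r'<span class="title">(?P<name>.*?)</span>', html):
    print(m.groupdict())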
4. re Parsing: Scraping the Douban Top 250
import requests
import re
import csv

url = 'https://movie.douban.com/top250'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 "
                  "Safari/537.36 Edg/96.0.1054.62"
}
resp = requests.get(url, headers=headers)
print(resp)
# Save the page source
pageContent = resp.text
resp.close()  # close the connection
# Parse the data: precompile the regex (re.S lets . match newlines)
obj = re.compile(r'<li>.*?<div class.*?>.*?<span class="title">'
                 r'(?P<name>.*?)</span>.*?<p class="">.*?<br>.*?'
                 r'(?P<year>\d.*?) / .*?<div class="star">.*?'
                 r'<span class="rating_num" property="v:average">'
                 r'(?P<score>.*?)</span>.*?<span>'
                 r'(?P<people>.*?)评价</span>', re.S)
# Start matching
res = obj.finditer(pageContent)
# Save to CSV; newline='' avoids blank rows, utf-8 handles the Chinese titles
f = open("data.csv", mode='w', newline='', encoding='utf-8')
csvwriter = csv.writer(f)
for it in res:
    # print(it.group("name", "year", 'score', 'people'))
    dic = it.groupdict()
    csvwriter.writerow(dic.values())
f.close()  # note the parentheses: f.close alone does nothing
print("Over!")

Improved version: scrape all 250 entries
import requests
import re
import csv

def getMovie(starter):
    url = 'https://movie.douban.com/top250'
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 "
                      "Safari/537.36 Edg/96.0.1054.62"
    }
    params = {
        "start": starter,   # offset of the first movie on this page
        "filter": ""
    }
    resp = requests.get(url, params=params, headers=headers)
    print(resp)
    # Save the page source
    pageContent = resp.text
    resp.close()  # close the connection
    return pageContent

# Parse the data: precompile the regex (re.S lets . match newlines)
obj = re.compile(r'<li>.*?<div class.*?>.*?<span class="title">'
                 r'(?P<name>.*?)</span>.*?<p class="">.*?<br>.*?'
                 r'(?P<year>\d{4}).*? / .*?<div class="star">.*?'
                 r'<span class="rating_num" property="v:average">'
                 r'(?P<score>.*?)</span>.*?<span>'
                 r'(?P<people>.*?)评价</span>', re.S)
# Save to CSV; open the file once for all ten pages
f = open("data.csv", mode='w', newline='', encoding='utf-8')
csvwriter = csv.writer(f)
# Start matching: 10 pages of 25 movies each
for starter in range(0, 250, 25):
    res = obj.finditer(getMovie(starter))
    for it in res:
        # print(it.group("name", "year", 'score', 'people'))
        dic = it.groupdict()
        # print(dic.values())
        csvwriter.writerow(dic.values())
    print(starter + 25, "Over!")
f.close()
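To sanity-check the run, the CSV can be read back and the rows counted; a minimal sketch:

import csv

with open("data.csv", newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))
print(len(rows), "rows written")  # expect 250 if every page parsed cleanly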

5. Summary
Most of this scraping material is new to me, so absorbing it will take some time. Tomorrow there is one more re-parsing case study; we'll see whether I can finish it together with bs4.