Complete Python 3 crawler templates:
Every one of these templates depends on capturing traffic with Fiddler to analyze the request headers. The form_data format of a POST request differs from site to site, so you have to experiment, using Fiddler's WebForms tab to work out which key-value pairs the request actually carries, before you can build form_data.
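As a minimal sketch of that workflow (the field names kw and token below are hypothetical stand-ins for whatever Fiddler's WebForms tab actually shows), you can urlencode the captured key-value pairs and compare the printed string against the raw request body in Fiddler:
import urllib.parse
# Hypothetical key-value pairs copied from Fiddler's WebForms tab.
captured_fields = {
    'kw': 'python',  # hypothetical search keyword field
    'token': '',     # hypothetical hidden token, often empty
}
# urlencode turns the dict into 'kw=python&token='; compare this
# string with the raw POST body shown in Fiddler to confirm the keys.
body = urllib.parse.urlencode(captured_fields)
print(body)
# urlopen's data argument needs bytes, so encode before sending.
form_data = body.encode('utf-8')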
GET request with a full walkthrough. The formatting is already cleaned up, so you can copy and paste it directly:
#-*- coding:utf-8 -*-
import urllib.request
import urllib.parse
# Define which website you want to crawl.
get_url = 'http://www.example.com/'  # example; must be the full URL
# Build the headers. headers must be a dict, and a dict cannot hold
# duplicate keys, so keep exactly one active 'User-Agent' entry
# (a way to pick one at random is sketched after this template).
headers = {
    # Safari UA (commented out; only the last duplicate key would survive)
    # 'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    # Chrome UA (commented out)
    # 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    # Firefox UA (the one actually sent)
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
}
# Build the request object. Usually disguising the User-Agent is enough,
# but sometimes you need to copy the whole header set, including Accept etc.
req = urllib.request.Request(url=get_url, headers=headers)
# Get the response object. The body is bytes, so decode it after reading.
response = urllib.request.urlopen(req)
# Print the result.
result = response.read().decode()
print(result)
# Save the result (or skip this step).
with open('res1.txt', 'w', encoding='utf-8') as f:
    f.write(result)
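The header comment above mentions picking a User-Agent; since a dict cannot hold duplicate keys, a common way to get that effect is to keep the UA strings in a list and choose one at random per request. A minimal sketch, reusing the example URL from the template above:
import random
import urllib.request
# Pool of UA strings; random.choice picks one each time this runs.
user_agents = [
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]
headers = {'User-Agent': random.choice(user_agents)}
req = urllib.request.Request(url='http://www.example.com/', headers=headers)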
POST request with a full walkthrough. This crawler also bypasses HTTPS certificate verification to force-crawl Douban; you can copy and paste it directly:
#-*- coding:utf-8 -*-
import urllib.request
import urllib.parse
# To crawl an https website, you must import the ssl module.
import ssl
# Define the post_url of douban
post_url = 'https://accounts.douban.com/passport/login'
# Collect the variables for form_data.
# Enter your account name (input() already returns a str in Python 3,
# so no extra str() conversion is needed).
name = input('Please enter your name: ')
# Enter your password.
passwd = input('Please enter your password: ')
# Define form_data as a dict; urlencode will later turn the dict into a str.
form_data = {
'password':passwd,
'remember':'true',
'name':name,
'ck':'',
'ticket':'',
}
# Define the headers; you can copy the whole header set from Fiddler without modification.
headers = {
'Accept': 'application/json',
'Accept-Language':'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
'X-Requested-With':'XMLHttpRequest',
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
}
# Bypass TLS/SSL certificate verification by creating an unverified context.
context = ssl._create_unverified_context()
# Build the request object.
req = urllib.request.Request(url=post_url, headers=headers)
# urlencode turns the dict into a str; encode turns the str into bytes.
form_data = urllib.parse.urlencode(form_data).encode('utf-8')
# The data you pass to urlopen must be bytes.
response = urllib.request.urlopen(req,data=form_data,context=context)
# Print the result.
print(response.read().decode())
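After a successful login you usually want to keep the session cookies for later requests. Here is a minimal sketch, assuming the post_url, headers, form_data, req, and context defined above; it wires an http.cookiejar.CookieJar into build_opener so every request made through the opener carries the login cookies (the profile URL at the end is a hypothetical example):
import http.cookiejar
import urllib.request
# A CookieJar stores the cookies the server sets during login.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar),
    urllib.request.HTTPSHandler(context=context),  # reuse the unverified context
)
# Log in through the opener so its cookies land in cookie_jar.
login_response = opener.open(req, data=form_data)
print(login_response.read().decode())
# Later requests through the same opener stay logged in.
profile = opener.open('https://www.douban.com/mine/')  # hypothetical URL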
Part of the output after crawling Douban is pasted below:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>登录豆瓣</title>
<style type="text/css">
#db-nav-sns {
position: relative;
zoom: 1;
background: #edf4ed;
}
#db-nav-sns .nav-primary {
width: 1040px;
margin: 0 auto;
overflow: hidden;
padding: 22px 0 20px;
zoom: 1;
}
.account-wrap {
width: 1040px;
margin: 20px auto 0;
overflow: hidden;
}