Complete Python 3 crawler templates:
Every one of these templates depends on capturing traffic with Fiddler to analyze the request headers. The form_data format of a POST request differs from site to site, so you have to experiment, using Fiddler's WebForms tab to work out which key-value pairs the request actually carries, before you can build form_data.
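As a minimal sketch of that workflow (the field names kw and token below are hypothetical stand-ins for whatever Fiddler's WebForms tab actually shows), you can urlencode the captured key-value pairs and compare the printed string against the raw request body in Fiddler:
import urllib.parse
# Hypothetical key-value pairs copied from Fiddler's WebForms tab.
captured_fields = {
    'kw': 'python',  # hypothetical search keyword field
    'token': '',     # hypothetical hidden token, often empty
}
# urlencode turns the dict into 'kw=python&token='; compare this
# string with the raw POST body shown in Fiddler to confirm the keys.
body = urllib.parse.urlencode(captured_fields)
print(body)
# urlopen's data argument needs bytes, so encode before sending.
form_data = body.encode('utf-8')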
GET request with a full walkthrough. The formatting is already cleaned up, so you can copy and paste it directly:
#-*- coding:utf-8 -*-
import urllib.request
import urllib.parse
# Define which website you want to crawl.
get_url = 'http://www.example.com/'  # example; must be the full URL
# Build the headers. headers must be a dict, and a dict cannot hold
# duplicate keys, so keep exactly one active 'User-Agent' entry
# (a way to pick one at random is sketched after this template).
headers = {
    # Safari UA (commented out; only the last duplicate key would survive)
    # 'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    # Chrome UA (commented out)
    # 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    # Firefox UA (the one actually sent)
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
}
# Build the request object. Usually disguising the User-Agent is enough,
# but sometimes you need to copy the whole header set, including Accept etc.
req = urllib.request.Request(url=get_url, headers=headers)
# Get the response object. The body is bytes, so decode it after reading.
response = urllib.request.urlopen(req)
# Print the result.
result = response.read().decode()
print(result)
# Save the result (or skip this step).
with open('res1.txt', 'w', encoding='utf-8') as f:
    f.write(result)
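The header comment above mentions picking a User-Agent; since a dict cannot hold duplicate keys, a common way to get that effect is to keep the UA strings in a list and choose one at random per request. A minimal sketch, reusing the example URL from the template above:
import random
import urllib.request
# Pool of UA strings; random.choice picks one each time this runs.
user_agents = [
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]
headers = {'User-Agent': random.choice(user_agents)}
req = urllib.request.Request(url='http://www.example.com/', headers=headers)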
POST request with a full walkthrough. This crawler also bypasses HTTPS certificate verification to force-crawl Douban; you can copy and paste it directly:
#-*- coding:utf-8 -*-
import urllib.request
import urllib.parse
# To crawl an https website, you must import the ssl module.
import ssl
# Define the post_url of douban
post_url = 'https://accounts.douban.com/passport/login'
# Collect the variables for form_data.
# Enter your account name (input() already returns a str in Python 3,
# so no extra str() conversion is needed).
name = input('Please enter your name: ')
# Enter your password.
passwd = input('Please enter your password: ')
# Define form_data as a dict; urlencode will later turn the dict into a str.
form_data = {
'password':passwd,
'remember':'true',
'name':name,
'ck':'',
'ticket':'',
}
# Define the headers; you can copy the whole header set from Fiddler without modification.
headers = {
'Accept': 'application/json',
'Accept-Language':'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
'X-Requested-With':'XMLHttpRequest',
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
}
# Bypass TLS/SSL certificate verification by creating an unverified context.
context = ssl._create_unverified_context()
# Build the request object.
req = urllib.request.Request(url=post_url, headers=headers)
# urlencode turns the dict into a str; encode turns the str into bytes.
form_data = urllib.parse.urlencode(form_data).encode('utf-8')
# The data you pass to urlopen must be bytes.
response = urllib.request.urlopen(req,data=form_data,context=context)
# Print the result.
print(response.read().decode())
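After a successful login you usually want to keep the session cookies for later requests. Here is a minimal sketch, assuming the post_url, headers, form_data, req, and context defined above; it wires an http.cookiejar.CookieJar into build_opener so every request made through the opener carries the login cookies (the profile URL at the end is a hypothetical example):
import http.cookiejar
import urllib.request
# A CookieJar stores the cookies the server sets during login.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar),
    urllib.request.HTTPSHandler(context=context),  # reuse the unverified context
)
# Log in through the opener so its cookies land in cookie_jar.
login_response = opener.open(req, data=form_data)
print(login_response.read().decode())
# Later requests through the same opener stay logged in.
profile = opener.open('https://www.douban.com/mine/')  # hypothetical URL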
Part of the output after crawling Douban is pasted below:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>登录豆瓣</title>
<style type="text/css">
#db-nav-sns {
position: relative;
zoom: 1;
background: #edf4ed;
}
#db-nav-sns .nav-primary {
width: 1040px;
margin: 0 auto;
overflow: hidden;
padding: 22px 0 20px;
zoom: 1;
}
.account-wrap {
width: 1040px;
margin: 20px auto 0;
overflow: hidden;
}