Python Crawler Study Notes (2): A Fully Commented PY3 Crawler Template

This post shares a fully commented Python 3 crawler template covering both GET and POST requests. It explains how to analyze request headers with Fiddler and how to build the form_data dictionary, and it includes an example of bypassing HTTPS certificate verification, suitable for crawling sites such as Douban.


The complete PY3 crawler template:
Every template here was built with the help of the packet-capture tool Fiddler to analyze request headers. POST requests use differing form_data formats, so you need to experiment and check Fiddler's WebForms tab to determine which key-value pairs are being passed before you can construct form_data.
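As a quick illustration, here is a minimal sketch of turning key-value pairs observed in Fiddler's WebForms tab into the bytes payload that urllib expects (the field names below are hypothetical placeholders, not from a real capture):

import urllib.parse

# Hypothetical key-value pairs copied from Fiddler's WebForms tab.
form_data = {
    'keyword': 'python',
    'page': '1',
}

# urlencode() turns the dict into 'keyword=python&page=1';
# encode() converts that str into the bytes urllib requires as POST data.
payload = urllib.parse.urlencode(form_data).encode()
print(payload)   # b'keyword=python&page=1'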

GET request with a detailed crawl walkthrough; the formatting is cleaned up so you can copy and paste it directly:

# -*- coding:utf-8 -*-
import urllib.request
import urllib.parse

# Define which website you want to crawl.
get_url = 'http://www.example.com/'    # must be the full URL, including the scheme!

# Build the headers. headers must be a dict; pick ONE User-Agent below.
# (A dict cannot hold duplicate keys, so listing several 'User-Agent'
# entries would silently keep only the last one.)
headers = {
    # Safari:
    # 'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    # Chrome:
    # 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    # Firefox:
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
}

# Build the Request object. Usually disguising the User-Agent is enough,
# but some sites require copying the whole headers (Accept, etc.) from Fiddler.
req = urllib.request.Request(url=get_url, headers=headers)

# Get the response object. The body is bytes, so decode it after reading.
response = urllib.request.urlopen(req)

# Check the result.
result = response.read().decode()
print(result)

# Save the result, or discard it.
with open('res1.txt', 'w', encoding='utf-8') as f:
    f.write(result)
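If the GET target takes query parameters, urllib.parse.urlencode can build the query string as well. A minimal sketch, where the endpoint and parameter names are hypothetical:

import urllib.parse

base_url = 'http://www.example.com/search'
# Hypothetical query parameters.
params = {'q': 'python', 'page': '2'}

# Produces 'http://www.example.com/search?q=python&page=2';
# urlencode() also percent-encodes non-ASCII values automatically.
get_url = base_url + '?' + urllib.parse.urlencode(params)
print(get_url)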

POST request with a detailed crawl walkthrough. This crawler bypasses HTTPS certificate verification to force-crawl Douban; you can copy and paste it directly:

# -*- coding:utf-8 -*-
import urllib.request
import urllib.parse
# To crawl an https website you must import the ssl package.
import ssl

# Define the Douban login post_url.
post_url = 'https://accounts.douban.com/passport/login'

# Collect the form_data variables.
# Please enter your account name (input() already returns str in Python 3,
# so no extra str() conversion is needed).
name = input('Pls insert your name: ')
# Please enter your password.
passwd = input('Pls insert your password: ')

# Define the form_data as a dict; urlencode() will later convert it into a str.
form_data = {
    'password': passwd,
    'remember': 'true',
    'name': name,
    'ck': '',
    'ticket': '',
}

# Define the headers; you can copy the whole block from Fiddler without modification.
headers = {
    'Accept': 'application/json',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
}

# Bypass TLS/SSL certificate verification by creating an unverified context.
context = ssl._create_unverified_context()

# Build the Request object.
req = urllib.request.Request(url=post_url, headers=headers)

# Convert the str into bytes: the data you pass to urlopen() must be bytes.
form_data = urllib.parse.urlencode(form_data).encode()

response = urllib.request.urlopen(req, data=form_data, context=context)

# Check the result.
print(response.read().decode())

A partial paste of the result after crawling Douban:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  
  <title>登录豆瓣</title>
  <style type="text/css">
    #db-nav-sns {
      position: relative;
      zoom: 1;
      background: #edf4ed; 
    }
    #db-nav-sns .nav-primary {
      width: 1040px;
      margin: 0 auto;
      overflow: hidden;
      padding: 22px 0 20px;
      zoom: 1;
    }
    .account-wrap {
      width: 1040px;
      margin: 20px auto 0;
      overflow: hidden;
    }
