Python Web Scraping: Fetching the Content of a Given URL and Adding Cookies

Article link: https://blog.youkuaiyun.com/aoxiangchina/article/details/143971820

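In the world of web scraping, Python is a popular choice among developers thanks to its rich library support and concise syntax. This tutorial shows how to use Python to fetch the content of a given URL, and how to add Cookies when needed to get past some simple anti-scraping checks.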

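Environment setup

Before starting, make sure Python is installed in your environment along with the following libraries:

requests: used to send HTTP requests.
BeautifulSoup: used to parse HTML documents (installed as the beautifulsoup4 package and imported as bs4).

If they are not installed yet, install them with the following command: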

pip install requests beautifulsoup4

Basic page fetching

First, we write a simple script that fetches the content of a given URL.

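import requests
from bs4 import BeautifulSoup

# Target page URL
url = 'http://www.rolexby.cn'

# Send a GET request (the timeout keeps the script from hanging indefinitely)
response = requests.get(url, timeout=10)

# Make sure the request succeeded
if response.status_code == 200:
    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Print the page content
    print(soup.prettify())
else:
    print('Request failed, status code:', response.status_code)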
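Printing the whole page is rarely the end goal; usually you want specific elements. The following is a minimal sketch of pulling the title and links out of parsed HTML. The `SAMPLE_HTML` string and the `extract_title_and_links` helper are assumptions made for illustration (they are not part of the original script); in practice you would pass `response.text` instead of the sample string.

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML standing in for response.text
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="/about">About</a>
    <a href="https://example.com/contact">Contact</a>
    <a>No link here</a>
  </body>
</html>
"""


def extract_title_and_links(html):
    """Return the page title (or None) and every href found in <a> tags."""
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.title.string if soup.title else None
    # href=True skips <a> tags that have no href attribute
    links = [a['href'] for a in soup.find_all('a', href=True)]
    return title, links


title, links = extract_title_and_links(SAMPLE_HTML)
print(title)
print(links)
```

Passing `href=True` to `find_all` avoids a `KeyError` on anchors that carry no `href` attribute.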

Adding Cookies to the request

Some sites inspect the Cookies in a request to identify crawlers. In that case, we need to add Cookies to the request.

import requests
from bs4 import BeautifulSoup

# Target page URL
url = 'http://www.watchwxfw.cn'

# Cookies dictionary
cookies = {
    'name': 'value',  # Replace with the real values for your case
    'another_name': 'another_value'
}

# Send a GET request that carries the Cookies
response = requests.get(url, cookies=cookies, timeout=10)

# Make sure the request succeeded
if response.status_code == 200:
    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Print the page content
    print(soup.prettify())
else:
    print('Request failed, status code:', response.status_code)
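In practice the Cookie values are often copied from a browser's developer tools or a packet capture as a single `Cookie:` header string. The following is a minimal sketch of turning such a string into the dictionary that `requests.get(..., cookies=...)` expects. The `cookie_header_to_dict` helper and the sample strings are hypothetical, introduced here only for illustration.

```python
def cookie_header_to_dict(raw_header):
    """Convert 'a=1; b=2' (the value of a Cookie header) into a dict."""
    cookies = {}
    for pair in raw_header.split(';'):
        # partition splits on the first '=' only, so values may contain '='
        name, sep, value = pair.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


# Hypothetical header value pasted from the browser
raw = 'name=value; another_name=another_value'
cookies = cookie_header_to_dict(raw)
print(cookies)
```

The resulting dictionary can be passed directly as the `cookies` argument in the script above.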

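Adding Headers to the request

Besides Cookies, some sites also check certain request header fields, such as User-Agent, to identify crawlers. We can add Headers to the request so that it looks like it comes from a regular browser.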

import requests
from bs4 import BeautifulSoup

# Target page URL
url = 'http://www.baidhub.com'

# Cookies dictionary
cookies = {
    'name': 'value',  # Replace with the real values for your case
    'another_name': 'another_value'
}

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Send a GET request that carries both the Cookies and the Headers
response = requests.get(url, cookies=cookies, headers=headers, timeout=10)

# Make sure the request succeeded
if response.status_code == 200:
    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Print the page content
    print(soup.prettify())
else:
    print('Request failed, status code:', response.status_code)
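When several pages are fetched from the same site, repeating `cookies=` and `headers=` on every call gets tedious. One possible alternative, sketched below, is a `requests.Session` that stores both once. The URL, header value, and cookie values are placeholders chosen for illustration, not values from the original scripts. The request is only prepared, not sent, so you can inspect what would go out.

```python
import requests

# Placeholder values for illustration only
url = 'https://example.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
cookies = {'name': 'value'}

# A Session keeps headers and cookies across requests
session = requests.Session()
session.headers.update(headers)
session.cookies.update(cookies)

# Build the request without sending it, to inspect the outgoing headers
prepared = session.prepare_request(requests.Request('GET', url))
print(prepared.headers['User-Agent'])
print(prepared.headers.get('Cookie'))

# To actually send it:
# response = session.send(prepared, timeout=10)
```

A session also stores any Cookies the server sets in its responses, so later requests through the same session carry them automatically.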