My Crawler Learning Notes: Selenium 4.0 and Baidu Search

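I started learning web crawling in August, but I have hardly written any study notes since. I have actually kept studying on and off; it is just that writing a CSDN article takes a fair amount of time (I always want to be as detailed as possible and to comment generously). Many times I wrote a hasty opening and then dropped it into my drafts folder.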

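Not long ago, someone asked me under this article why requests.get() could not access Baidu properly. Many online tutorials on fetching Baidu with requests.get date from 2020 and 2021, and the same code (with a UA added to the headers) no longer fetches the Baidu search page source correctly.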

import requests

url = "https://www.baidu.com/s?wd=周杰伦"
my_headers = {
    # Look up your own UA in the browser developer tools
    "User-Agent": "your UA"
}
resp = requests.get(url=url, headers=my_headers)
resp.encoding = "utf-8"
print(resp.text)

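Most tutorials just add a header carrying the UA. As of now (2022.9.10), however, the response looks like the following, which means we have been caught by anti-crawling measures.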

<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="utf-8">
    <title>百度安全验证</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <meta name="apple-mobile-web-app-capable" content="yes">
    <meta name="apple-mobile-web-app-status-bar-style" content="black">
    <meta name="viewport" content="width=device-width, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0">
    <meta name="format-detection" content="telephone=no, email=no">
    <link rel="shortcut icon" href="https://www.baidu.com/favicon.ico" type="image/x-icon">
    <link rel="icon" sizes="any" mask href="https://www.baidu.com/img/baidu.svg">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <meta http-equiv="Content-Security-Policy" content="upgrade-insecure-requests">
    <link rel="stylesheet" href="https://ppui-static-wap.cdn.bcebos.com/static/touch/css/api/mkdjump_0635445.css" />
</head>
<body>
    <div class="timeout hide">
        <div class="timeout-img"></div>
        <div class="timeout-title">网络不给力，请稍后重试</div>
        <button type="button" class="timeout-button">返回首页</button>
    </div>
    <div class="timeout-feedback hide">
        <div class="timeout-feedback-icon"></div>
        <p class="timeout-feedback-title">问题反馈</p>
    </div>

<script src="https://wappass.baidu.com/static/machine/js/api/mkd.js"></script>
<script src="https://ppui-static-wap.cdn.bcebos.com/static/touch/js/mkdjump_db105ab.js"></script>
</body>
</html>

Process finished with exit code 0
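One way to notice this situation automatically is to look at the page title instead of reading the whole response by eye. The sketch below is only an illustration: the helper names `page_title` and `is_verification_page` are my own, and it assumes the verification page keeps the title shown in the output above.

```python
import re

# Title of the verification page, taken from the response shown above.
BLOCK_TITLE = "百度安全验证"


def page_title(html: str) -> str:
    """Return the text inside the first <title> tag, or "" if there is none."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


def is_verification_page(html: str) -> bool:
    """True when the HTML is the Baidu security verification page."""
    return page_title(html) == BLOCK_TITLE
```

After a request, calling `is_verification_page(resp.text)` tells you whether you got real search results or the verification page.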

Looking through tutorials from 2022, I found a solution that does let requests.get fetch the Baidu search page source.

headers = {
    "User-Agent": "your UA",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7",
    "Connection": "keep-alive",
    # "br" is Brotli; requests can only decode it when the brotli (or brotlicffi)
    # package is installed, so drop "br" if the response text looks garbled
    "Accept-Encoding": "gzip, deflate, br",
    "Host": "www.baidu.com",
    # Replace with your own Cookie
    "Cookie": "your cookie"
}

In other words, we expand our headers. With that, the response contains the source with the information we need.
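Putting the pieces together, a minimal sketch of the full request might look like this. The function names (`build_headers`, `build_search_url`, `fetch_search_page`) are hypothetical helpers of mine, not part of requests. I also shortened the `Accept` value and left out `Accept-Encoding` on purpose, so that requests advertises only the encodings it can decode itself. Whether Baidu accepts this exact set of headers is an assumption you should verify with your own UA and Cookie.

```python
from urllib.parse import urlencode


def build_headers(user_agent: str, cookie: str) -> dict:
    """Build the expanded header set; the UA and Cookie come from your own browser."""
    return {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7",
        "Connection": "keep-alive",
        "Host": "www.baidu.com",
        "Cookie": cookie,
    }


def build_search_url(keyword: str) -> str:
    """Percent-encode the keyword into the wd query parameter."""
    return "https://www.baidu.com/s?" + urlencode({"wd": keyword})


def fetch_search_page(keyword: str, user_agent: str, cookie: str,
                      timeout: float = 10.0) -> str:
    """Fetch the search page source; needs the third-party requests package."""
    import requests  # imported here so the helpers above work without it

    resp = requests.get(
        build_search_url(keyword),
        headers=build_headers(user_agent, cookie),
        timeout=timeout,
    )
    resp.raise_for_status()
    resp.encoding = "utf-8"
    return resp.text
```

Usage would be something like `html = fetch_search_page("周杰伦", "your UA", "your cookie")`, followed by a check that the result is not the verification page.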

Selenium 4.0

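Next, a brief introduction to Selenium, which I have been learning over the past few days. For those of us writing crawlers, what it does is open a browser and perform operations the way a person would, unlike requests, which tries its best to disguise itself as a human visitor. I have also read articles saying that CAPTCHAs exist, to some extent, precisely to counter Selenium.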

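Before using Selenium, besides running pip install selenium, you also need to download the browser driver that matches your browser. You can look up the exact steps online; the one thing to watch is that the browser version and the driver version must correspond. I use the Edge browser here.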

from selenium.webdriver import Edge  # import the class for your browser; I use Edge here
from selenium.webdriver.common.by import By  # locator strategies used from Selenium 4.0 on
import time
from selenium.webdriver.common.keys import Keys

web = Edge()
# Open the page
web.get("https://www.baidu.com/")
# Locate the search box (find it in the developer tools), type the keyword, press Enter
web.find_element(By.XPATH, '//*[@id="kw"]').send_keys("周杰伦", Keys.ENTER)
# Give the results page a moment to load
time.sleep(1)
# Grab the search result element we want
el1 = web.find_element(By.XPATH, '/html/body/div[2]/div[4]/div[1]/div[3]/div')
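A fixed time.sleep(1) can be too short on a slow connection and wasted time on a fast one. As a hedged alternative, Selenium's `WebDriverWait` with `expected_conditions` polls until an element appears. In the sketch below, the function name `search_baidu` is mine, and the `content_left` id for the results container is an assumption about Baidu's page that you should confirm in the automated browser. The Selenium imports sit inside the function only so that the file can be loaded without a browser driver.

```python
# Assumed locator for the results container; check it in your automated browser.
RESULTS_XPATH = '//*[@id="content_left"]'


def search_baidu(keyword: str, timeout: float = 10.0) -> str:
    """Search Baidu in Edge and return the text of the results container."""
    from selenium.webdriver import Edge
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    web = Edge()
    try:
        web.get("https://www.baidu.com/")
        web.find_element(By.ID, "kw").send_keys(keyword, Keys.ENTER)
        # Poll until the results container exists, up to `timeout` seconds;
        # raises TimeoutException if it never shows up.
        results = WebDriverWait(web, timeout).until(
            EC.presence_of_element_located((By.XPATH, RESULTS_XPATH))
        )
        return results.text
    finally:
        # Always close the browser, even if the wait times out.
        web.quit()
```

Calling `print(search_baidu("周杰伦"))` would then replace the sleep-and-find steps above.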

Here are a few pitfalls I ran into.

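First, your XPath must be copied from the source of the browser window opened by the automation tool. On some sites, the source in a browser you open yourself differs from the source in the window opened by the automation tool.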
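If copying from the automated window is awkward, one option is to dump what the driver actually sees into a file and inspect that. This is only a sketch: `save_page_source` is a hypothetical helper of mine, and it relies solely on the driver's `page_source` attribute.

```python
from pathlib import Path


def save_page_source(driver, path: str = "automated_page.html") -> Path:
    """Write the page source seen by the automated browser to a file."""
    target = Path(path)
    target.write_text(driver.page_source, encoding="utf-8")
    return target
```

For example, `save_page_source(web)` right after the page has loaded gives you a file whose structure matches what `find_element` will search.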

Second, many online tutorials use versions older than Selenium 4.0, so their find_element methods are the old style.

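from selenium.webdriver.common.by import By  # locator strategies used from Selenium 4.0 on

# Old style
# web.find_element_by_XX(VALUE)
# XX is the locator strategy, e.g. id, xpath
# New style
# web.find_element(By.XX, VALUE)
# VALUE is the value for that strategy, e.g. the id or the XPath string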
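Since the old and new styles differ only in where the strategy name goes, old tutorial code can be converted almost mechanically. The helper below is a rough sketch of that idea; `migrate_locator_call` is a made-up name, and it works on source text only, assuming the old call sits on one line and that `By` has been imported.

```python
import re

# Matches ".find_element_by_xxx(" and ".find_elements_by_xxx(".
_LEGACY_CALL = re.compile(r"\.find_element(s?)_by_([a-z_]+)\(")


def migrate_locator_call(line: str) -> str:
    """Rewrite an old-style find_element_by_* call into the By-based form."""

    def _rewrite(match: re.Match) -> str:
        plural, strategy = match.group(1), match.group(2)
        # The By constant is the old suffix in upper case, e.g. css_selector -> CSS_SELECTOR.
        return f".find_element{plural}(By.{strategy.upper()}, "

    return _LEGACY_CALL.sub(_rewrite, line)


print(migrate_locator_call("web.find_element_by_id('kw')"))
# prints: web.find_element(By.ID, 'kw')
```

Lines that already use the new style pass through unchanged, so it is safe to run over a whole file line by line.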

My crawler studies continue, and I will keep posting notes on what I learn about Selenium when I have time. Later I may also fill in the async gap...

Thanks for reading. I hope we can exchange more crawler knowledge. Bye~