Python Web Scraping (Using Scrapy, Selenium, BeautifulSoup, and Jupyter)

一只小铁柱

Last modified: 2024-09-04 22:41:00


Tags: web scraping, python, programming languages

First published: 2024-08-20 23:06:08

Article link: https://blog.youkuaiyun.com/beautiful77moon/article/details/141370356


I. Common Python Web-Scraping Libraries

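1. Requests: sends and handles HTTP requests.

2. lxml: parses HTML and XML files.

3. BeautifulSoup: a parsing library (not a full scraping framework) for navigating HTML and XML and extracting data from them. It is typically used for small jobs, has no asynchronous support, and is usually paired with lxml as the underlying parser.

4. Scrapy: a full-featured crawling framework, suited to large and complex scraping tasks.

5. Selenium: drives a real browser the way a user would, which makes it suitable for pages rendered by JavaScript.

6. re: Python's built-in regular-expression module.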
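As a small illustration of item 6, the sketch below uses re to pull the numeric ID out of an article URL. The helper name extract_article_id and the pattern are illustrative assumptions based on the layout of this post's own link. They are not part of the original article.

```python
import re

# Pattern for the numeric ID in an article URL (assumed URL layout).
ARTICLE_ID_RE = re.compile(r"/article/details/(\d+)")


def extract_article_id(url):
    """Return the numeric article ID from an article URL, or None."""
    match = ARTICLE_ID_RE.search(url)
    return match.group(1) if match else None


article_id = extract_article_id(
    "https://blog.youkuaiyun.com/beautiful77moon/article/details/141370356"
)
print(article_id)
```

A regular expression works well for a fixed pattern like this. For nested HTML, a parser such as BeautifulSoup or lxml is usually the safer choice.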

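II. Scraping Example 1

Example description: scrape your own CSDN blog, collect the view count of each post, and draw a bar chart that shows the posts sorted from most to least viewed.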
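To make the end goal concrete, here is a minimal sketch of the final step: sorting view counts in descending order and drawing the bar chart. The function names, the output file name, and the sample numbers are all placeholders chosen for illustration. The use of matplotlib is also an assumption, since the article does not say which plotting library it uses.

```python
def sort_by_views(view_counts):
    """Return (title, views) pairs sorted from most to least viewed."""
    return sorted(view_counts.items(), key=lambda item: item[1], reverse=True)


def plot_views(sorted_items, output_path="views.png"):
    """Draw a bar chart of the already sorted (title, views) pairs."""
    # Imported here so the sorting helper works without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    titles = [title for title, _ in sorted_items]
    views = [count for _, count in sorted_items]
    fig, ax = plt.subplots()
    ax.bar(titles, views)
    ax.set_xlabel("Article")
    ax.set_ylabel("Views")
    ax.tick_params(axis="x", labelrotation=45)
    fig.tight_layout()
    fig.savefig(output_path)
    plt.close(fig)


# Placeholder numbers for illustration only, not real statistics.
sample_counts = {"post-a": 120, "post-b": 950, "post-c": 430}
ranked = sort_by_views(sample_counts)
print(ranked)
```

Calling plot_views(ranked) would then write the chart to views.png, provided matplotlib is installed.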

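1. Collect the links to all articles from the profile page

1.1 Inspect the page structure

Open your profile page, then right-click -> Inspect. This shows which tag and attribute hold each article link: the href attribute of the <a> tag inside each <article> tag is the link to that article.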
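The rule above can be tried offline before touching the real site. The sketch below applies it to a made-up HTML fragment. The fragment, the example.com URLs, and the helper name extract_article_links are assumptions for illustration. It uses Python's built-in html.parser backend, so lxml is not required for this test.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the structure described above.
SAMPLE_HTML = """
<div>
  <article><a href="https://example.com/article/1">First post</a></article>
  <article><a href="https://example.com/article/2">Second post</a></article>
  <article><span>No link here</span></article>
</div>
"""


def extract_article_links(html):
    """Return the href of the first <a> inside each <article> element."""
    soup = BeautifulSoup(html, "html.parser")  # built-in parser backend
    links = []
    for article in soup.find_all("article"):
        anchor = article.find("a", href=True)  # skip articles without a link
        if anchor is not None:
            links.append(anchor["href"])
    return links


print(extract_article_links(SAMPLE_HTML))
```

Skipping <article> elements that contain no link avoids an IndexError or KeyError if the page layout has entries without an <a> tag.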

1.2 Extract all article links from the page

from bs4 import BeautifulSoup  # pip3 install beautifulsoup4 lxml
from urllib.request import urlopen

homePage_url = "https://blog.youkuaiyun.com/beautiful77moon?type=blog"  # URL of your CSDN profile page
homePage_html = urlopen(homePage_url).read().decode('utf-8')
soup = BeautifulSoup(homePage_html, features='lxml')  # the 'lxml' parser requires the lxml package

# 1. Find all <article> tags
li_articles = soup.find_all('article')

# 2. Take the href value of the first <a> under each <article> tag
article_urls = []
for item in li_articles:
    link = item.find('a', href=True)  # None if this <article> has no link
    if link is not None:
        article_urls.append(link['href'])
        print(link['href'])

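When a page has a lot of content, it only shows everything after you scroll down to trigger further loading, so a tool is needed that simulates browser behavior and scrolls the page automatically. urllib only downloads the initial HTML and cannot do this, so it is generally not recommended for pages like this one.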

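1.3 Use Selenium to simulate a browser

1.3.1 Download the browser driver (using Edge as an example)

1. Check your browser version (click the three dots in the top-right corner of the browser -> Settings -> About Microsoft Edge).

2. Download the matching driver version from the "Microsoft Edge WebDriver | Microsoft Edge Developer" page.

3. Unzip it into a directory (this directory is needed later).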
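Before pointing Selenium at that directory, it can help to confirm the driver was actually unzipped there. The following is a minimal sketch using only the standard library. The helper name find_edge_driver is hypothetical, and the default file name msedgedriver.exe assumes a Windows download.

```python
from pathlib import Path


def find_edge_driver(directory, filename="msedgedriver.exe"):
    """Return the driver path inside `directory`, or None if it is missing."""
    candidate = Path(directory) / filename
    return candidate if candidate.is_file() else None
```

If the function returns None, the path passed to Selenium later would not point at a driver, so this check can surface that mistake early.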

1.3.2 Install the key dependencies

1. selenium, for browser automation: pip install selenium --index-url https://pypi.tuna.tsinghua.edu.cn/simple

2. beautifulsoup4, for parsing the page: pip install beautifulsoup4 --index-url https://pypi.tuna.tsinghua.edu.cn/simple

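1.3.3 Code

1. Simulate a browser and scroll the page down, as a user would with the mouse, so that more data loads.

2. Extract the URLs of all articles from the profile page and print them.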

from selenium import webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver.edge.options import Options
from bs4 import BeautifulSoup
import time

# Path to the Edge driver (raw string, so the backslashes are not read as escape sequences)
edge_driver_path = r'E:\SoftWare_work\download\edgedriver_win64\msedgedriver.exe'  # replace with your local EdgeDriver path

# Configure the Edge browser options
edge_options = Options()
edge_options.add_argument("--headless")
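The listing above stops at the browser options. As a rough sketch of the two steps described in 1.3.3, and not the author's own code, the helper below scrolls until the page height stops growing, and a second function ties it to headless Edge and BeautifulSoup. The function names, the pause length, and the round limit are assumptions. The stop condition (page height unchanged after a scroll) is one common heuristic and may need tuning for a particular site.

```python
import time


def scroll_to_bottom(driver, pause_seconds=2.0, max_rounds=20):
    """Scroll until the page height stops growing; return the rounds used."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_seconds)  # give lazy-loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_number  # nothing new was loaded
        last_height = new_height
    return max_rounds


def collect_article_urls(driver_path, home_url):
    """Open the page in headless Edge, scroll, and return article links."""
    # Imported here so scroll_to_bottom can be used without these packages.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.edge.options import Options
    from selenium.webdriver.edge.service import Service

    options = Options()
    options.add_argument("--headless")
    service = Service(executable_path=driver_path)
    driver = webdriver.Edge(service=service, options=options)
    try:
        driver.get(home_url)
        scroll_to_bottom(driver)
        soup = BeautifulSoup(driver.page_source, "lxml")
    finally:
        driver.quit()  # always close the browser, even on errors

    urls = []
    for article in soup.find_all("article"):
        anchor = article.find("a", href=True)
        if anchor is not None:
            urls.append(anchor["href"])
    return urls
```

A call such as collect_article_urls(edge_driver_path, "https://blog.youkuaiyun.com/beautiful77moon?type=blog") would then return the list of links, which can be printed one per line.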
