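1. Static web page scraping
(1) Fetching page data
1) Fetching static page source with the requests library
(2) Parsing page data
1) Parsing pages with the bs4 library (BeautifulSoup)
2) Parsing text with the lxml library (XPath)
3) Parsing pages with the parsel library (CSS selectors)
4) Parsing page content with regular expressions (the re library)
2. Dynamic web page scraping
(1) Fetching page data
1) Fetching dynamic page source with the requests library
(2) Simulated login
(3) Scraping dynamic pages with the Selenium library
1) Installing and configuring Selenium and WebDriver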
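As the outline above shows, the upcoming posts fall into two parts: scraping static web pages and scraping dynamic web pages. The image and video scraping posts I published earlier both belong to dynamic page scraping. My explanations there may not have been very clear, but I will go over that material again, more carefully, in later posts. Doing so helps me consolidate what I have learned and also gives me a chance to exchange ideas with readers.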
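Today's topic is how to use the requests library to fetch the source of a static web page and save it as an HTML file. First, import requests (note: the library must be installed in your Python environment before it can be imported, for example with pip install requests; alternatively, install Anaconda, which ships with it. I won't go into the details here). Next, set the page's url (its address) and the request headers, then store the result of the request in response. Taking one site as an example, the following code fetches that site's page source: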
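import requests

# Target page, plus a browser-like User-Agent so the request resembles a normal visit
url = 'https://www.bilibili.com/'
headers = {
    'user-agent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari'
}

# Send the GET request; give up after 10 seconds and raise on an HTTP error status
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
response.encoding = 'utf-8'  # decode the response body as UTF-8

# Write the decoded page source to an HTML file
with open('./bilibili.html', 'w', encoding='utf-8') as f:
    f.write(response.text)
print("Page source fetched successfully")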
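The script sets response.encoding before it reads response.text. The reason is that requests builds response.text by decoding the raw body bytes (available as response.content) with that encoding. Here is a minimal sketch of the effect, using plain bytes instead of a live request; the sample string and the variable names raw, good, and bad are assumptions made up for illustration:

```python
# Raw bytes as a server might send them (hypothetical sample, UTF-8 encoded).
raw = "哔哩哔哩".encode("utf-8")

# Decoding with the matching codec recovers the original text.
good = raw.decode("utf-8", errors="replace")

# Decoding the same bytes with the wrong codec (ISO-8859-1 here) does not
# fail, but every byte becomes its own character, so the text is garbled.
bad = raw.decode("iso-8859-1", errors="replace")

print(good == "哔哩哔哩")   # True
print(bad == good)          # False
print(len(raw), len(bad))   # 12 12
```

In the script, assigning response.encoding = 'utf-8' plays the role of picking the matching codec. If the saved file ever shows garbled Chinese, the encoding used for decoding is the first thing worth checking.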
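The requests script above is all the Python code needed to fetch the site's page source. The key point is response.text: it is an attribute, not a method, and it holds the response body decoded into a string. Running the script writes bilibili.html, which begins as follows: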
<!DOCTYPE html>
<html lang="zh-CN" class="gray">
<head>
<meta charset="UTF-8" />
<title>哔哩哔哩 (゜-゜)つロ 干杯~-bilibili</title>
<meta
name="description"
content="哔哩哔哩(bilibili.com)是国内知名的视频弹幕网站,这里有及时的动漫新番,活跃的ACG氛围,有创意的Up主。大家可以在这里找到许多欢乐。"
/>
<meta
name="keywords"
content="bilibili,哔哩哔哩,哔哩哔哩动画,哔哩哔哩弹幕网,弹幕视频,B站,弹幕,字幕,AMV,MAD,MTV,ANIME,动漫,动漫音乐,游戏,游戏解说,二次元,游戏视频,ACG,galgame,动画,番组,新番,初音,洛天依,vocaloid,日本动漫,国产动漫,手机游戏,网络游戏,电子竞技,ACG燃曲,ACG神曲,追新番,新番动漫,新番吐槽,巡音,镜音双子,千本樱,初音MIKU,舞蹈MMD,MIKUMIKUDANCE,洛天依原创曲,洛天依翻唱曲,洛天依投食歌,洛天依MMD,vocaloid家族,OST,BGM,动漫歌曲,日本动漫音乐,宫崎骏动漫音乐,动漫音乐推荐,燃系mad,治愈系mad,MAD MOVIE,MAD高燃"
/>
<meta name="renderer" content="webkit" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="spm_prefix" content="333.1007" />
<meta name="referrer" content="no-referrer-when-downgrade" />
<meta name="applicable-device" content="pc">
<meta http-equiv="Cache-Control" content="no-transform" />
<meta http-equiv="Cache-Control" content="no-siteapp" />
<meta name="server_render" content="is_server_render" />
<link rel="dns-prefetch" href="//s1.hdslb.com" />
<link rel="apple-touch-icon" href="https://i0.hdslb.com/bfs/static/jinkela/long/images/512.png" />
<link rel="shortcut icon" href="https://i0.hdslb.com/bfs/static/jinkela/long/images/favicon.ico" />
<link rel="canonical" href="https://www.bilibili.com/" />
<link rel="alternate" media="only screen and (max-width: 640px)" href="https://m.bilibili.com" />
<link
rel="stylesheet"
href="//s1.hdslb.com/bfs/static/jinkela/long/font/medium.css"
media="print"
onload="this.media='all'"
/>
<link
rel="stylesheet"
href="//s1.hdslb.com/bfs/static/jinkela/long/font/regular.css"
media="print"
onload="this.media='all'"
/>
<script>window._BiliGreyResult={"method":"base","grayVersion":"38821"}</script><script src="//s1.hdslb.com/bfs/static/laputa-home/client/assets/svgfont.2cee4853.js" async></script><script src="https://www.bilibili.com/gentleman/polyfill.js?features=es2015%2Ces2016%2Ces2017%2Ces2018%2Ces2019%2Ces2020%2Ces2021%2Ces2022%2CglobalThis&flags=gated"></script>
<script type="text/javascript" src="//s1.hdslb.com/bfs/seed/jinkela/short/bmg/register/fallback.js"></script>
<link rel="stylesheet" href="//s1.hdslb.com/bfs/seed/jinkela/short/bili-theme/map.css"/>
<link rel="stylesheet" href="//s1.hdslb.com/bfs/seed/jinkela/short/bili-theme/light_u.css"/>
<link id="__css-map__" rel="stylesheet" href="//s1.hdslb.com/bfs/seed/jinkela/short/bili-theme/light.css"/>
<script>window.__SERVER_CONFIG__={"serverBuvid":"","homeFeedColumn":"","browserResolution":"","isModern":true,"aiexp":"","remove_channel_lift":0,"ab_