Fetching Static Web Page Source with the requests Library

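1. Static web page scraping

(1) Fetching page data

1) Fetching static page source with the requests library

(2) Parsing page data

1) Parsing pages with bs4 (BeautifulSoup)

2) Parsing text with lxml (XPath)

3) Parsing pages with parsel (CSS selectors)

4) Parsing page content with regular expressions (the re module)

2. Dynamic web page scraping

(1) Fetching page data

1) Fetching dynamic page source with the requests library

(2) Simulated login

(3) Scraping dynamic pages with Selenium

1) Installing and configuring Selenium and WebDriver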

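As the outline above shows, the upcoming posts fall into two parts: scraping static pages and scraping dynamic pages. The image and video scraping posts I published earlier both belong to the dynamic part. My explanations there may not have been very clear, but I will go over that material again more carefully in later posts, both to consolidate what I have learned and to exchange ideas with readers.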

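Today's topic is how to use the requests library to fetch the source of a static web page and save it as an HTML file. First, import requests (note: the library must be installed in your Python environment before it can be imported; alternatively you can simply install Anaconda, which I won't go into here). Next, set the page's url and the request headers, then store what the server sends back in a variable named response. The example below fetches the page source of one site's home page: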

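import requests
# from bs4 import BeautifulSoup

# Target page, plus a browser-like User-Agent header sent with the request
url = 'https://www.bilibili.com/'
headers = {
    'user-agent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari'
}

# Send the GET request and decode the response body as UTF-8
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'

# Write the decoded page source to a local HTML file
with open('./bilibili.html', 'w', encoding='utf-8') as f:
    f.write(response.text)
print("Page source saved successfully")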
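A note on why the script sets response.encoding: as far as I understand it, response.text is simply the raw bytes in response.content decoded with response.encoding, and requests may guess that encoding from the response headers. When the guess is wrong, Chinese text comes out garbled. The small sketch below imitates this without touching the network; the decode_body helper and the sample bytes are made up for illustration and are not part of requests.

```python
# Stand-in for response.content: the raw bytes of a UTF-8 encoded page fragment.
raw = "<title>哔哩哔哩</title>".encode("utf-8")


def decode_body(content, encoding):
    """Hypothetical helper: decode raw bytes the way response.text does,
    replacing any bytes that are invalid in the chosen encoding."""
    return content.decode(encoding, errors="replace")


good = decode_body(raw, "utf-8")       # correct encoding: readable text
bad = decode_body(raw, "iso-8859-1")   # wrong encoding: mojibake

print(good == "<title>哔哩哔哩</title>")
print(len(good), len(bad))  # each Chinese character becomes three wrong characters
```

Forcing the encoding to utf-8 before reading response.text, as the script does, avoids the second case for pages that really are UTF-8.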

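The script that saves bilibili.html is all the Python needed to fetch the site's page source. The key point is response.text, which is an attribute rather than a method: it holds the response body decoded into a string. Let's run it and look at the result: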

<!DOCTYPE html>
<html lang="zh-CN" class="gray">
  <head>
    <meta charset="UTF-8" />
    <title>哔哩哔哩 (゜-゜)つロ 干杯~-bilibili</title>
    <meta
      name="description"
      content="哔哩哔哩(bilibili.com)是国内知名的视频弹幕网站,这里有及时的动漫新番,活跃的ACG氛围,有创意的Up主。大家可以在这里找到许多欢乐。"
    />
    <meta
      name="keywords"
      content="bilibili,哔哩哔哩,哔哩哔哩动画,哔哩哔哩弹幕网,弹幕视频,B站,弹幕,字幕,AMV,MAD,MTV,ANIME,动漫,动漫音乐,游戏,游戏解说,二次元,游戏视频,ACG,galgame,动画,番组,新番,初音,洛天依,vocaloid,日本动漫,国产动漫,手机游戏,网络游戏,电子竞技,ACG燃曲,ACG神曲,追新番,新番动漫,新番吐槽,巡音,镜音双子,千本樱,初音MIKU,舞蹈MMD,MIKUMIKUDANCE,洛天依原创曲,洛天依翻唱曲,洛天依投食歌,洛天依MMD,vocaloid家族,OST,BGM,动漫歌曲,日本动漫音乐,宫崎骏动漫音乐,动漫音乐推荐,燃系mad,治愈系mad,MAD MOVIE,MAD高燃"
    />
    <meta name="renderer" content="webkit" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <meta name="spm_prefix" content="333.1007" />
    <meta name="referrer" content="no-referrer-when-downgrade" />
    <meta name="applicable-device" content="pc">
    <meta http-equiv="Cache-Control" content="no-transform" />
    <meta http-equiv="Cache-Control" content="no-siteapp" />
    <meta name="server_render" content="is_server_render" />

    <link rel="dns-prefetch" href="//s1.hdslb.com" />
    <link rel="apple-touch-icon" href="https://i0.hdslb.com/bfs/static/jinkela/long/images/512.png" />
    <link rel="shortcut icon" href="https://i0.hdslb.com/bfs/static/jinkela/long/images/favicon.ico" />
    <link rel="canonical" href="https://www.bilibili.com/" />
    <link rel="alternate" media="only screen and (max-width: 640px)" href="https://m.bilibili.com" />
    <link
      rel="stylesheet"
      href="//s1.hdslb.com/bfs/static/jinkela/long/font/medium.css"
      media="print"
      onload="this.media='all'"
    />
    <link
      rel="stylesheet"
      href="//s1.hdslb.com/bfs/static/jinkela/long/font/regular.css"
      media="print"
      onload="this.media='all'"
    />
    <script>window._BiliGreyResult={"method":"base","grayVersion":"38821"}</script><script src="//s1.hdslb.com/bfs/static/laputa-home/client/assets/svgfont.2cee4853.js" async></script><script src="https://www.bilibili.com/gentleman/polyfill.js?features=es2015%2Ces2016%2Ces2017%2Ces2018%2Ces2019%2Ces2020%2Ces2021%2Ces2022%2CglobalThis&flags=gated"></script>   
    <script type="text/javascript" src="//s1.hdslb.com/bfs/seed/jinkela/short/bmg/register/fallback.js"></script>
    
    <link rel="stylesheet" href="//s1.hdslb.com/bfs/seed/jinkela/short/bili-theme/map.css"/>
    <link rel="stylesheet" href="//s1.hdslb.com/bfs/seed/jinkela/short/bili-theme/light_u.css"/>
    <link id="__css-map__" rel="stylesheet" href="//s1.hdslb.com/bfs/seed/jinkela/short/bili-theme/light.css"/>
  
    <script>window.__SERVER_CONFIG__={"serverBuvid":"","homeFeedColumn":"","browserResolution":"","isModern":true,"aiexp":"","remove_channel_lift":0,"ab_
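The script assumes the request always succeeds. A slightly more defensive version might look like the sketch below, which adds a timeout, checks the HTTP status, and reports failures instead of crashing. The function names fetch_html and save_page and the 10-second timeout are my own choices for illustration; the requests calls themselves (requests.get with timeout, raise_for_status, requests.RequestException) are standard.

```python
import requests


def fetch_html(url, headers=None, timeout=10):
    """Return the page source as text; raise on network or HTTP errors."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()      # raises requests.HTTPError on 4xx/5xx
    response.encoding = "utf-8"      # assumption: the page is UTF-8
    return response.text


def save_page(url, path, headers=None):
    """Fetch url and write it to path. Return True on success, False otherwise."""
    try:
        html = fetch_html(url, headers=headers)
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, and bad status codes.
        print(f"Request failed: {exc}")
        return False
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return True
```

With the url and headers from the script above, the call would be save_page(url, './bilibili.html', headers).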