Python Web Crawler in Practice: Fetching and Parsing Web Page Data

Article link: https://blog.youkuaiyun.com/shi_jiaye/article/details/119249300

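Fetching Web Page Data

Fetching web page data means retrieving useful data or files from resources on the network. The basic process is to download the page content or file, and then extract the data you need from it, for example with regular-expression matching.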

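1. The urllib library

urllib is a package in Python's standard library. With it you can read the content of a web page in much the same way as you read a local text file. It contains the following four modules:

·urllib.request: opens and reads URLs (request module)
·urllib.error: exceptions raised by urllib.request (error handling module)
·urllib.parse: URL parsing module
·urllib.robotparser: robots.txt parsing module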
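
As a small illustration of one of these modules, the sketch below uses urllib.parse to split a URL into its parts and to build a query string. It runs offline. The URL is the review-page address used later in this article; the variable names are only illustrative.

```python
from urllib.parse import urlencode, urljoin, urlparse

# Split a URL into its components.
parts = urlparse("https://movie.douban.com/subject/30435124/comments?status=P")
print(parts.scheme)  # https
print(parts.netloc)  # movie.douban.com
print(parts.path)    # /subject/30435124/comments
print(parts.query)   # status=P

# Build a query string from a dict instead of concatenating strings by hand.
query = urlencode({"start": 0, "limit": 20})
print(query)  # start=0&limit=20

# Resolve a relative path against a base URL.
joined = urljoin("https://movie.douban.com/nowplaying/", "hangzhou/")
print(joined)  # https://movie.douban.com/nowplaying/hangzhou/
```

Building query strings with urlencode also takes care of escaping special characters, which manual concatenation does not.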

2. Common methods of the urllib.request module

Basic steps for using the urllib.request module:

(1) Import the urllib.request module
from urllib import request

(2) Connect to the target site and send the request
resp = request.urlopen("http://<site IP address>")

(3) Read the page source
print(resp.read().decode())

Use urllib.request.urlopen() to connect to a site and fetch the page source:

import urllib.request
response = urllib.request.urlopen("http://www.baidu.com/")
print(response.info())  # response headers
print('\n**************************\n')
print(response.getcode())  # HTTP status code
print('\n**************************\n')
print(response.read())  # raw page content as bytes
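
Network requests can fail, so in practice urlopen() is usually wrapped in error handling. The following is a minimal sketch of how that might look with urllib.error. The helper name fetch_text and its parameters are assumptions for illustration, not part of the original article. A data: URL is used so that the example runs without a network connection.

```python
from urllib import error, request


def fetch_text(url, timeout=10, encoding="utf-8"):
    """Return the decoded body of url, or None if the request fails."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode(encoding)
    except error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 418.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f"HTTP error {exc.code}: {exc.reason}")
    except error.URLError as exc:
        # No valid answer was obtained (unknown scheme, DNS failure, ...).
        print(f"Could not open {url}: {exc.reason}")
    return None


# A data: URL keeps the demonstration offline; a real page URL is used the same way.
print(fetch_text("data:text/plain;charset=utf-8,hello"))  # hello
```

Returning None on failure is only one possible design; re-raising the exception may be preferable when the caller needs to tell the failure modes apart.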

3. The BeautifulSoup module

(1) Installing BeautifulSoup

BeautifulSoup is not part of the Python standard library and must be installed with pip:

  pip install beautifulsoup4

(2) Basic elements of BeautifulSoup
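
The original table for this subsection was an image and is not available. As a rough stand-in, the sketch below shows the element types BeautifulSoup generally works with (Tag, its name and attrs, NavigableString, and Comment). That this matches the contents of the lost table is an assumption, and the HTML snippet is made up for the demonstration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

html = '<p class="title"><b>The demo Python Project.</b><!-- a comment --></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                 # Tag: an HTML element
print(isinstance(tag, Tag))  # True
print(tag.name)              # p
print(tag.attrs)             # {'class': ['title']}

text = tag.b.string          # NavigableString: the text inside a tag
print(isinstance(text, NavigableString))  # True
print(text)                  # The demo Python Project.

comment = tag.contents[1]    # Comment: an HTML comment node
print(isinstance(comment, Comment))  # True
```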

(3) The "tag tree"

When parsing a web page, BeautifulSoup is used to traverse the HTML content.
Consider the following HTML document:

<html>
    <head>
        ....
    </head>
    <body>
        <p class="title"> The demo Python Project.</p>
        <p class="course"> 
            Python is a programming language.
            <a href="http://www.icourse163.com"> Basic Python </a>
            <a href="http://www.python.org"> Advanced Python </a>
        </p>
    </body>
</html>

(4) Upward traversal attributes of the BeautifulSoup tag tree
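
Upward traversal moves from a tag towards the root of the tree. The sketch below uses the .parent and .parents attributes on a shortened version of the sample document; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
<p class="course">Python is a programming language.
<a href="http://www.icourse163.com"> Basic Python </a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

link = soup.a
print(link.parent.name)  # p

# .parents iterates over every ancestor, ending at the document object itself.
ancestors = [parent.name for parent in link.parents]
print(ancestors)  # ['p', 'body', 'html', '[document]']
```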

(5) Downward traversal attributes of the BeautifulSoup tag tree
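
Downward traversal moves from a tag to the nodes it contains. The sketch below uses .contents, .children and .descendants on a compact version of the sample document (whitespace removed so that the node counts are easy to follow).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = ('<body><p class="title">The demo Python Project.</p>'
        '<p class="course">Python is a programming language.'
        '<a href="http://www.icourse163.com">Basic Python</a>'
        '<a href="http://www.python.org">Advanced Python</a></p></body>')
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# .children iterates over direct children only.
child_names = [child.name for child in body.children]
print(child_names)  # ['p', 'p']

# .contents is the same set of direct children, as a list.
course = body.contents[1]
print(len(course.contents))  # 3: one text node and two <a> tags

# .descendants walks the whole subtree, including text nodes.
descendant_tags = [node.name for node in body.descendants if isinstance(node, Tag)]
print(descendant_tags)  # ['p', 'p', 'a', 'a']
```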

(6) Information extraction methods of BeautifulSoup objects
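
The methods used most often later in this article are find_all() and attribute access on tags. The sketch below shows them, together with find() and get_text(), on the sample document. Which methods the original (missing) table listed is not known, so this selection is an assumption.

```python
from bs4 import BeautifulSoup

html = ('<body><p class="title">The demo Python Project.</p>'
        '<p class="course">Python is a programming language.'
        '<a href="http://www.icourse163.com">Basic Python</a>'
        '<a href="http://www.python.org">Advanced Python</a></p></body>')
soup = BeautifulSoup(html, "html.parser")

# find_all() returns every matching tag; attributes are read like dict keys.
hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)  # ['http://www.icourse163.com', 'http://www.python.org']

# find() returns only the first match, or None when nothing matches.
title = soup.find("p", class_="title")
print(title.get_text(strip=True))  # The demo Python Project.
missing = soup.find("p", class_="missing")
print(missing)  # None
```

Because find() returns None for a missing tag, it is worth checking the result before reading attributes from it.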

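Introduction to Web Crawlers

1. Web crawlers

A web crawler (also called a web spider or web robot) is a program that automatically fetches information from the World Wide Web according to a set of rules.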

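2. Basic workflow of a crawler

(1) Send the request

Use an HTTP library to send a Request to the target site. The request can carry extra information such as headers. Then wait for the server to respond.

(2) Get the response content

If the server responds normally, you get a Response whose content is the page you asked for. It may be HTML, a JSON string, or binary data such as an image or video.

(3) Parse the content

If the content is HTML, parse it with regular expressions or an HTML parsing library. If it is JSON, convert it directly into a JSON object. If it is binary data, save it or process it further.

(4) Save the data

The data can be saved in many forms: as plain text, in a database, or as a file in a specific format.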
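
The four steps above can be sketched in a few lines using only the standard library. This is an illustrative outline under stated assumptions, not the article's crawler: the function name crawl_once, the sample JSON payload and the output file name are all made up, and a data: URL stands in for a real server so that the example runs offline.

```python
import json
import os
import tempfile
from urllib import request
from urllib.parse import quote


def crawl_once(url, out_path):
    # (1) Send the request, with an extra header attached.
    req = request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    # (2) Get the response content as raw bytes.
    with request.urlopen(req, timeout=10) as resp:
        raw = resp.read()
    # (3) Parse the content; here the body is assumed to be JSON.
    data = json.loads(raw.decode("utf-8"))
    # (4) Save the data, in this case as a JSON file.
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    return data


# Hypothetical payload, percent-encoded into a data: URL for an offline demo.
payload = json.dumps({"title": "demo", "rating": 7.5})
demo_url = "data:application/json;charset=utf-8," + quote(payload)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = os.path.join(tmp_dir, "result.json")
    result = crawl_once(demo_url, out_file)
    with open(out_file, encoding="utf-8") as f:
        saved = json.load(f)

print(result)  # {'title': 'demo', 'rating': 7.5}
```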

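3. Scraping reviews of the latest films

The page https://movie.douban.com/nowplaying/hangzhou recommends newly released films that are currently showing.

The main steps for scraping the site are:

(1) Get the HTML source of the page;
(2) Process the page and extract the relevant information;
(3) Parse the data and output the result.

(1) Get the HTML source of the page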

from urllib import request
resp = request.urlopen('https://movie.douban.com/nowplaying/hangzhou/')
html_data = resp.read().decode('utf-8')
print(html_data)

This request fails with HTTP error 418, which comes from the site's anti-crawler mechanism. To get past it, imitate a browser by sending a User-Agent header with the request:

from urllib import request
from urllib.request import Request

url = 'https://movie.douban.com/nowplaying/hangzhou/'
header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.55'}
ret = Request(url, headers=header)  # build a Request that carries the browser-like header
resp = request.urlopen(ret)  # send the Request to the target site
html_data = resp.read().decode('utf-8')
# print(html_data)
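
To check what a Request object will actually send before going to the network, its headers can be inspected. The sketch below is an offline illustration; the helper build_request and the shortened User-Agent string are assumptions made for the example. Note that Request stores header names in capitalized form, so the lookup key is "User-agent".

```python
from urllib import request

# Shortened example value; any realistic browser string can be used instead.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url, user_agent=USER_AGENT):
    """Create a Request that identifies itself with a browser-like User-Agent."""
    return request.Request(url, headers={"User-Agent": user_agent})


req = build_request("https://movie.douban.com/nowplaying/hangzhou/")
print(req.get_method())              # GET
print(req.has_header("User-agent"))  # True
print(req.get_header("User-agent"))  # Mozilla/5.0 (Windows NT 10.0; Win64; x64)
```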

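(2) Parse the HTML and extract the required data

In Python, the BeautifulSoup library is used to parse HTML. It is called as follows:

BeautifulSoup(html, "html.parser")

The first argument is the HTML to extract data from, and the second names the parser to use. After that, find_all() reads the content of the HTML tags.

The output of step (1) shows that the title, rating, cast and other details of the film 《白蛇2：青蛇劫起》 are stored in an <li class="list-item"> tag, and that the <li> tags are in turn stored inside a <div id="nowplaying"> tag.

So the code for reading the film information is as follows (continuing from the code in step (1)):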

from bs4 import BeautifulSoup as bs
soup = bs(html_data, 'html.parser')
nowplaying_movie = soup.find_all('div', id='nowplaying')
nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')
# print(nowplaying_movie_list[0])  # the first film

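The data-subject attribute of each <li> tag holds the film's id, and the alt attribute of the <img> tag holds the film's name. These two attributes therefore give us the id and name of each film.

(Note: the film id is needed to open the film's short-review page, which is why it is extracted here.)

The code that extracts the id and name is:

nowplaying_list = []
for item in nowplaying_movie_list:
    nowplaying_dict = {}
    nowplaying_dict['id'] = item['data-subject']  # film id
    for tag_img_item in item.find_all('img'):
        nowplaying_dict['name'] = tag_img_item['alt']  # film name
    nowplaying_list.append(nowplaying_dict)  # append once per film, not once per <img>
# print(nowplaying_list)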
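
The same extraction can be tried offline on a small HTML fragment. The fragment below is hypothetical: it only imitates the structure described above (a div with id nowplaying containing li.list-item tags), and the second film entry and the poster file names are invented for the demonstration. The function name extract_films is likewise an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the structure of the real page.
sample_html = """
<div id="nowplaying">
  <ul>
    <li class="list-item" data-subject="30435124">
      <img src="poster1.jpg" alt="白蛇2：青蛇劫起">
    </li>
    <li class="list-item" data-subject="00000000">
      <img src="poster2.jpg" alt="Example Film">
    </li>
  </ul>
</div>
"""


def extract_films(html):
    """Return a list of {'id': ..., 'name': ...} dicts, one per film."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="nowplaying")
    if container is None:
        return []  # the page layout is not what we expected
    films = []
    for item in container.find_all("li", class_="list-item"):
        img = item.find("img")
        films.append({
            "id": item.get("data-subject"),
            "name": img.get("alt") if img is not None else None,
        })
    return films


films = extract_films(sample_html)
print(films)
```

Using .get() instead of indexing means a missing attribute yields None rather than raising KeyError, which can be convenient when the page layout changes.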

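(3) Parse the page data and output the result

Next, parse the short-review page of a film. For example, the short-review URL for 《白蛇2：青蛇劫起》 is:
https://movie.douban.com/subject/30435124/comments?status=P
where 30435124 is the film's id.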
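
A review URL for any film and page can be built from the film id. The sketch below is one possible helper; the name comments_url is an assumption, and the start and limit query parameters follow the ones used in this article's own code rather than any documented API.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/subject/"


def comments_url(movie_id, page_num, page_size=20):
    """Build the short-review URL for a 1-based page number."""
    if page_num < 1:
        raise ValueError("page_num must be 1 or greater")
    query = urlencode({"start": (page_num - 1) * page_size, "limit": page_size})
    return f"{BASE_URL}{movie_id}/comments?{query}"


print(comments_url("30435124", 1))
# https://movie.douban.com/subject/30435124/comments?start=0&limit=20
print(comments_url("30435124", 3))
# https://movie.douban.com/subject/30435124/comments?start=40&limit=20
```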

Looking at the HTML of the short-review page for 《白蛇2：青蛇劫起》, you can see that the review data sits inside div tags whose class is comment.

comment_div_lits = []
requrl = 'https://movie.douban.com/subject/' + \
          nowplaying_list[0]['id'] + \
          '/comments' + \
          '?' +'start=0' + \
          '&limit=20'
ret1 = Request(requrl, headers=header)  # build the Request with the same browser-like header
resp1 = request.urlopen(ret1)
html_data1 = resp1.read().decode('utf-8')
soup1 = bs(html_data1, 'html.parser')
comment_div_lits = soup1.find_all('div', class_='comment')
print(comment_div_lits)

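At this point, the comment_div_lits list holds the HTML of every div tag whose class is comment.

eachCommentList = []
for item in comment_div_lits:
    first_p = item.find('p')  # find() returns None when there is no <p>, instead of raising IndexError
    if first_p is not None:
        eachCommentList.append(first_p)
print(eachCommentList)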
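
To get only the review text rather than whole tags, the text of each review can be read with get_text(). The sketch below runs on a hypothetical fragment: the span with class short is the element the article's complete code looks for, while the review sentences and the function name extract_reviews are invented for the demonstration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the review markup; the last div has no review text.
sample_html = """
<div class="comment"><p><span class="short">Great animation.</span></p></div>
<div class="comment"><p><span class="short">  The plot is a bit thin.  </span></p></div>
<div class="comment"><p></p></div>
"""


def extract_reviews(html):
    """Return the text of every span.short found inside a div.comment."""
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for div in soup.find_all("div", class_="comment"):
        short = div.find("span", class_="short")
        if short is not None:  # skip comment blocks without review text
            reviews.append(short.get_text(strip=True))
    return reviews


reviews = extract_reviews(sample_html)
print(reviews)  # ['Great animation.', 'The plot is a bit thin.']
```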

Complete code:

from urllib import request
from urllib.request import Request
from bs4 import BeautifulSoup as bs
# Parse the now-playing page and return the list of films
def getNowPlayingMovie_list():
    url = 'https://movie.douban.com/nowplaying/hangzhou/'
    header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.55'}
    ret = Request(url, headers=header)
    resp = request.urlopen(ret)
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    nowplaying_movie = soup.find_all('div', id='nowplaying')
    nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')
    nowplaying_list = []
    for item in nowplaying_movie_list:
        nowplaying_dict = {}
        nowplaying_dict['id'] = item['data-subject']
        for tag_img_item in item.find_all('img'):
            nowplaying_dict['name'] = tag_img_item['alt']
        nowplaying_list.append(nowplaying_dict)  # append once per film, not once per <img>
    return nowplaying_list
# Fetch one page of short reviews for a film
def getCommentsById(movieId, pageNum):
    eachCommentList = []
    if pageNum > 0:
        start = (pageNum - 1) * 20
    else:
        return eachCommentList  # invalid page number: return an empty list
    requrl = 'https://movie.douban.com/subject/' + \
             movieId + '/comments' + \
             '?' + 'start=' + str(start) + '&limit=20'
    # print(requrl)
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.55'}
    ret = Request(requrl, headers=header)
    resp = request.urlopen(ret)
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    comment_div_lits = soup.find_all('div', class_='comment')
    for item in comment_div_lits:
        short = item.find('span', class_="short")  # find() returns None if the span is missing
        if short is not None:
            eachCommentList.append(short.get_text(strip=True))  # keep only the review text
    return eachCommentList
def main():
    # Fetch the first 3 pages of reviews for the first film
    commentList = []
    NowPlayingMovie_list = getNowPlayingMovie_list()
    for i in range(3):
        num = i + 1
        commentList_temp = getCommentsById(NowPlayingMovie_list[0]['id'], num)
        commentList.append(commentList_temp)
    # Convert the collected data to a single string
    comments = ''
    for k in range(len(commentList)):
        comments = comments + (str(commentList[k])).strip()
    print(comments)
# Entry point
if __name__ == '__main__':
    main()