Python crawler study notes, part 2: using BeautifulSoup instead of re

Article link: https://blog.youkuaiyun.com/slyslyme/article/details/86610315

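This post walks through the basic principles and workflow of a web crawler: sending requests, receiving responses, parsing content, and saving data. It introduces common libraries such as urllib and BeautifulSoup, and touches on the data structures involved, masquerading as a browser, and third-party modules.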


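0. How a basic crawler works:

A web crawler is a program or script that automatically fetches information from websites according to a set of rules. It finds pages through their link addresses: starting from one page of a site, it reads the page content, collects the other links found there, follows those links to the next pages, and keeps looping until every page of the site has been fetched.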

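Basic workflow:

1. Send the request

Use an HTTP library to send a request to the target site. The request can carry extra information such as headers. Then wait for the server to respond.

2. Get the response content

When the server responds normally you get a Response. Its body may be HTML, a JSON string, binary data (images, video), and so on.

3. Parse the content

If the content is HTML, it can be parsed with regular expressions or an HTML parsing library. JSON can be converted directly into a JSON object and parsed from there.

4. Save the data

The data can be saved in many forms: as plain text, in a database, or as a file in a specific format.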

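Example:

① Use urllib's request module to open the URL and get the page's HTML document

② Open the page source in a browser and analyze the element nodes

③ Extract the data you want with Beautiful Soup or regular expressions

④ Store the data on local disk or in a database (fetch, analyze, store)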
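The four steps above can be sketched with the standard library alone. This is a minimal sketch, not a full crawler: the function names `fetch`, `parse`, and `save`, the sample payloads, and the output file name are all assumptions made for illustration. The demonstration at the bottom feeds canned bytes into `parse` so it runs without network access; `fetch` is what would replace them against a real site.

```python
import json
import re
import tempfile
import urllib.request
from pathlib import Path


def fetch(url, timeout=10):
    """Steps 1-2: send the request, return (content_type, raw_bytes)."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.getheader("Content-Type") or "", resp.read()


def parse(content_type, raw):
    """Step 3: JSON becomes a Python object; HTML yields its <title> text."""
    text = raw.decode("utf-8", errors="replace")
    if "json" in content_type:
        return json.loads(text)
    match = re.search(r"<title>(.*?)</title>", text, re.S | re.I)
    return match.group(1).strip() if match else None


def save(result, path):
    """Step 4: persist the parsed result as UTF-8 JSON text."""
    Path(path).write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")


# Offline demonstration with canned responses (hypothetical sample data).
html_result = parse("text/html; charset=utf-8", b"<html><title> Demo page </title></html>")
json_result = parse("application/json", b'{"items": [1, 2, 3]}')

out_file = Path(tempfile.gettempdir()) / "crawler_sketch_output.json"
save({"title": html_result, "data": json_result}, out_file)
print(out_file.read_text(encoding="utf-8"))
```

Against a live site, `parse(*fetch(url))` would chain the first three steps together.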

1. urllib

Official urllib.request documentation

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

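The key part is the return value: for an HTTP or HTTPS URL this function returns an http.client.HTTPResponse object, which offers various methods, such as the read() method used here. (The cafile, capath and cadefault parameters in the signature above were deprecated in Python 3.6 and removed in Python 3.13; newer code passes context instead.)

Implementation: build a URL in Python and open the connection with urllib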

import urllib.parse
import urllib.request

params = {}
params['word'] = '三生三世十里桃花'

# Encode the plain string into URL query-string format
url_values = urllib.parse.urlencode(params)
url = "http://www.baidu.com/s?"
full_url = url + url_values

# Fetch the page as bytes, then decode it to text
raw = urllib.request.urlopen(full_url).read()
html = raw.decode('utf-8')
print(html)
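To look at the object that urlopen returns without depending on Baidu being reachable, the same request can be pointed at a throwaway server on the local machine. This is only a sketch: `EchoHandler`, the `/s` path, and the local address are assumptions made for the demonstration, and the handler simply echoes back the decoded `word` parameter.

```python
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical test handler: replies with the decoded 'word' parameter."""

    def do_GET(self):
        query = urllib.parse.urlparse(self.path).query
        word = urllib.parse.parse_qs(query).get("word", [""])[0]
        body = word.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Port 0 lets the operating system pick any free port.
server = HTTPServer(("127.0.0.1", 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# urlencode percent-encodes the non-ASCII value as UTF-8 bytes.
query = urllib.parse.urlencode({"word": "三生三世十里桃花"})
full_url = "http://127.0.0.1:%d/s?%s" % (server.server_port, query)

with urllib.request.urlopen(full_url, timeout=5) as resp:
    resp_type = type(resp).__name__          # class name of the returned object
    status = resp.status                     # HTTP status code
    content_type = resp.getheader("Content-Type")
    text = resp.read().decode("utf-8")       # read() returns bytes

server.shutdown()
server.server_close()

print(query)
print(resp_type, status, content_type, text)
```

Besides read(), the response exposes `status` and `getheader()`, which is what the crawler in the next section relies on to skip non-HTML content.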

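2. A brief look at the data structures used

Queues and sets

Python's list is enough to act as a queue: append() adds an element to the tail, indexing reads the head just like an array, and pop(0) removes the head. pop(0) is slow, though, because every remaining element has to shift forward, so the official Python documentation recommends collections.deque for queues.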

from collections import deque
queue=deque(["Eric","John","Michael"])
queue.append("xixi")
print(queue)
queue.popleft()
print(queue)
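As a small illustration of the point about list versus deque, the sketch below drains the same names through both structures. The order that comes out is identical; the difference is only in cost, since `list.pop(0)` is O(n) per call while `deque.popleft()` is O(1). The variable names are made up for this example.

```python
from collections import deque

names = ["Eric", "John", "Michael"]

# A list works as a queue, but pop(0) shifts every remaining element.
list_queue = list(names)
list_queue.append("xixi")
list_order = [list_queue.pop(0) for _ in range(len(list_queue))]

# deque.popleft() removes the head in constant time, same FIFO order.
deque_queue = deque(names)
deque_queue.append("xixi")
deque_order = [deque_queue.popleft() for _ in range(len(deque_queue))]

print(list_order == deque_order)
print(deque_order)
```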

Sets in Python

To avoid crawling the same site twice we use set(), an unordered collection that contains no duplicate elements.

import re
import urllib.request
from collections import deque

queue = deque()
visited = set()

url = 'https://www.zhihu.com/question/28259314/answer/538491618'  # entry page; can be replaced with another
queue.append(url)
visited.add(url)  # mark on enqueue so the same URL is never queued twice
cnt = 0

# Compile the link pattern once, outside the loop
linkre = re.compile('href="(.+?)"')

while queue:
    url = queue.popleft()  # dequeue the head of the queue
    print('Fetched so far: ' + str(cnt) + '   now fetching <---  ' + url)
    cnt += 1

    # Use try/except so that one failed request or decode does not abort the crawl
    try:
        urlop = urllib.request.urlopen(url, timeout=10)
        if 'html' not in (urlop.getheader('Content-Type') or ''):
            continue
        data = urlop.read().decode('utf-8')
    except Exception:
        continue

    # Extract every link on the page with a regular expression, check whether
    # it has been seen already, and if not add it to the crawl queue
    for x in linkre.findall(data):
        if 'http' in x and x not in visited:
            visited.add(x)
            queue.append(x)
            print('Queued --->  ' + x)
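The queue-plus-set pattern is easier to check when the network is taken out of the picture. The sketch below runs the same breadth-first traversal over a small in-memory "site"; `FAKE_PAGES`, the `example.test` URLs, and the `crawl` function are hypothetical stand-ins, not part of the original code. The pages link to each other in a cycle, so the traversal only terminates because of the visited set.

```python
import re
from collections import deque

# Hypothetical in-memory "site": URL -> HTML. Pages a and b link to each other.
FAKE_PAGES = {
    "http://example.test/": '<a href="http://example.test/a">a</a> <a href="http://example.test/b">b</a>',
    "http://example.test/a": '<a href="http://example.test/b">b</a> <a href="/relative">skipped</a>',
    "http://example.test/b": '<a href="http://example.test/a">a</a> <a href="http://example.test/">home</a>',
}

LINK_RE = re.compile(r'href="(.+?)"')


def crawl(start, pages):
    """Breadth-first traversal; returns URLs in the order they were visited."""
    queue = deque([start])
    visited = {start}  # mark on enqueue so a URL is never queued twice
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in LINK_RE.findall(pages.get(url, "")):
            # Same filter as the crawler above: absolute links only
            if link.startswith("http") and link not in visited:
                visited.add(link)
                queue.append(link)
    return order


order = crawl("http://example.test/", FAKE_PAGES)
print(order)
```

Each page appears exactly once in the result even though it is linked from several places, and the relative link is skipped by the `http` filter.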

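3. Masquerading as a browser

Add a User-Agent to the headers when sending the GET request

Background:

HTTP messages come in two kinds, request messages and response messages:

Anatomy of an HTTP request message:

So what we need to do is add a User-Agent when the Python crawler sends a request to Baidu, declaring that it is a browser

There are several ways to add headers to a GET request

Option 1: simple, but hard to extend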

req = urllib.request.Request(url, headers = {
    'Connection': 'Keep-Alive',
    'Accept': 'text/html, application/xhtml+xml, */*',
    'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko'
})
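Constructing a Request does not send anything, so the headers can be inspected offline. The sketch below compares the User-Agent that urllib announces by default with the one set on the Request; the variable names are chosen for this example only. Note that urllib normalizes header names with `str.capitalize()`, so the stored key is `User-agent`.

```python
import urllib.request

UA = "Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko"

# What a default opener announces, of the form "Python-urllib/3.x".
default_headers = dict(urllib.request.build_opener().addheaders)
default_ua = default_headers["User-agent"]

# Building a Request only records the headers; no traffic is sent here.
req = urllib.request.Request("http://www.baidu.com/", headers={"User-Agent": UA})
sent_ua = req.get_header("User-agent")  # key as stored after capitalize()

print(default_ua)
print(sent_ua)
```

A server that filters on the default `Python-urllib` string would see the browser string instead once this Request is passed to urlopen.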

Option 2: use build_opener to create a custom opener. The advantage is that it is easy to extend; the code below, for example, adds automatic cookie handling

import urllib.request
import http.cookiejar

# head: dict of request headers
def makeMyOpener(head = {
    'Connection': 'Keep-Alive',
    'Accept': 'text/html, application/xhtml+xml, */*',
    'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko'
}):
    # The cookie jar stores cookies from responses and replays them on later requests
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    # addheaders expects a list of (name, value) tuples
    opener.addheaders = list(head.items())
    return opener

oper = makeMyOpener()
uop = oper.open('http://www.baidu.com/', timeout = 10)  # timeout is in seconds
data = uop.read()
print(data.decode())
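What "automatic cookie handling" means can be observed with a local test server that records the Cookie header it receives and then sets one. This is a sketch under assumptions: `CookieHandler`, the cookie `sid=abc123`, and the local address are invented for the demonstration. The opener is built the same way as in makeMyOpener above.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

received_cookies = []  # Cookie header seen by the server on each request


class CookieHandler(BaseHTTPRequestHandler):
    """Hypothetical test handler: records the Cookie header, then sets one."""

    def do_GET(self):
        received_cookies.append(self.headers.get("Cookie"))
        body = b"ok"
        self.send_response(200)
        self.send_header("Set-Cookie", "sid=abc123; Path=/")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), CookieHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
opener.addheaders = [("User-Agent", "Mozilla/5.0")]

# First visit: the client has no cookie yet; the server sets one.
with opener.open(url, timeout=5) as resp:
    resp.read()
# Second visit: the jar adds the stored cookie to the request.
with opener.open(url, timeout=5) as resp:
    resp.read()

server.shutdown()
server.server_close()

print(received_cookies)
print([(c.name, c.value) for c in jar])
```

The first entry in `received_cookies` is None and the second carries the cookie, without any cookie-related code in the calling script; that is the extension point build_opener provides.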

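4. Fetching and parsing with third-party modules

Use requests in place of urllib, and BeautifulSoup in place of the re module

BeautifulSoup is convenient because, with the browser's developer tools, we can locate what we want by tag, class, or id. It can extract the body text directly or search the whole document, and it also supports regular expressions

Official BeautifulSoup documentation

(The docs look simple, but when it came to actually writing code I had no idea where to start, sob~)

A BeautifulSoup example: scraping the 优快云 home page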

from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen("https://www.youkuaiyun.com/").read().decode('utf-8')
soup = BeautifulSoup(html,"html.parser")
titles = soup.select("h3[class='company_name'] a")  # CSS selector
for title in titles:
    print(title.get_text(), title.get('href'))  # link text, href attribute

Scraped output:

免费报名 | 区块链设计与智能合约开发 https://edu.youkuaiyun.com/huiyiCourse/detail/952
优快云《2019区块链开发者报告》 https://blog.youkuaiyun.com/Blockchain_lemon/article/details/85604263
100+Python编程题给你练~（附答案） https://blog.youkuaiyun.com/dQCFKyQDXYm3F8rB0/article/details/86610486
30篇文章通关数据科学与人工智能 https://www.tinymind.cn/articles/3972?from=csdn02
2019 前端程序员必备的 9 大技能！ https://blog.youkuaiyun.com/csdnnews/article/details/86587166
不要在爬虫犯罪的边缘疯狂试探！ https://blog.youkuaiyun.com/csdnnews/article/details/86610620
程序员必须掌握数据结构与算法 https://gitbook.cn/gitchat/column/5b6d05446b66e3442a2bfa7b?utm_source=jr1901151
入门机器学习必备知识 https://gitbook.cn/gitchat/column/5ad70dea9a722231b25ddbf8?utm_source=jr1901152
AI/Python/大数据/区块链开发者社群！ https://blog.youkuaiyun.com/优快云edu/article/details/84327908
2019年人工智能发展趋势！ https://edu.youkuaiyun.com/topic/ai30?utm_source=home4
中年程序员被裁了，他哭了！ https://bbs.youkuaiyun.com/forums/ProgrammerStory?utm_source=sy
联想如何将VR体验带到教育和医疗行业 https://qualcomm.youkuaiyun.com/
小米快应用专区 https://bss.youkuaiyun.com/m/topic/mi_app
人工智能、机器学习和认知计算入门指南 http://ibmuniversity.youkuaiyun.com/
迅雷链技术沙龙 https://bss.youkuaiyun.com/m/topic/xunlei
百度云技术专区 https://baidu.youkuaiyun.com/
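A live home page changes over time, so the selector above may stop matching. The lookup styles mentioned earlier (by tag, class, id, CSS selector, and regular expression) can be tried on a fixed HTML string instead. The markup below is hypothetical, loosely modelled on the `h3.company_name` structure the example selects; it assumes the `bs4` package is installed.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the selector used above.
HTML = """
<html><body>
  <div id="main">
    <h3 class="company_name"><a href="https://example.test/one">First post</a></h3>
    <h3 class="company_name"><a href="https://example.test/two">Second post</a></h3>
    <h3 class="other"><a href="/local">Not selected</a></h3>
  </div>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# CSS selector, same shape as in the example above
pairs = [(a.get_text(), a.get("href"))
         for a in soup.select("h3[class='company_name'] a")]

# Lookup by id, by tag plus class, and by a regular expression on an attribute
main_div = soup.find(id="main")
headings = soup.find_all("h3", class_="company_name")
absolute_links = soup.find_all("a", href=re.compile(r"^https?://"))

print(pairs)
print(main_div.name, len(headings), len(absolute_links))
```

Compared with the `href="(.+?)"` regular expression used in section 2, these queries are tied to the document structure rather than to raw text, which is the main reason to prefer BeautifulSoup over re here.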