Python Web Scraping: Beautiful Soup Fundamentals (with a Hands-On Example)

Python Web Scraping: Beautiful Soup in Practice


CC 4.0 BY-SA license

Tags:

#python #regular-expressions

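This article covers the basics of the Python scraping library Beautiful Soup: installation, the main parsers, basic usage, ways to traverse HTML content, searching the document tree, and the find() method. A worked example shows how to scrape computer listings from JD.com, including price, name, and ID.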

Python Web Scraping: Beautiful Soup Basics

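Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying a document.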

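Note that Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You therefore rarely need to think about encodings. Only when a document does not declare its encoding and Beautiful Soup cannot detect it do you need to state the original encoding yourself.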
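The encoding behaviour described above can be sketched as follows. This is a minimal illustration, not part of the original article: the GBK-encoded sample string is made up, and `from_encoding` is used here on the assumption that the input bytes carry no charset declaration of their own.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a small page encoded as GBK, with no charset declaration.
raw = "<html><body><p>百度一下</p></body></html>".encode("gbk")

# from_encoding tells Beautiful Soup how the original bytes are encoded.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.get_text()            # text is exposed as a Unicode str
encoding = soup.original_encoding   # the encoding used to decode the input
out = soup.encode()                 # serialized bytes, UTF-8 by default

print(text, encoding, type(out))
```

If the encoding is declared in the document or can be detected, the `from_encoding` argument can simply be left out.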


1. Installing the Beautiful Soup library

Install the Beautiful Soup 4 library with pip:

pip install beautifulsoup4
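To confirm that the install worked, a quick check along these lines can be run. It is only a sketch, and relies on nothing beyond the package being importable under the name `bs4`.

```python
import bs4

# The package is installed as beautifulsoup4 but imported as bs4.
version = bs4.__version__
print(version)
```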

2. Main parsers for the BeautifulSoup library

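In practice:

import requests
from bs4 import BeautifulSoup

# BeautifulSoup parses markup, not URLs, so fetch the page source first
html = requests.get('https://www.baidu.com').text
bs = BeautifulSoup(html, 'html.parser')  # the second argument selects the parser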
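The second argument to `BeautifulSoup` names the parser. `html.parser` ships with Python, while `lxml` and `html5lib` are separate packages that must be installed before they can be used. The sketch below shows one way to pick whichever is available. The helper `parse_with_first_available` and its preference order are hypothetical, introduced only for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def parse_with_first_available(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first installed parser."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # Raised when the requested parser is not installed; try the next one.
            continue
    raise RuntimeError("no parser available")

# Deliberately unclosed tags: every parser repairs them, each in its own way.
soup, used = parse_with_first_available("<p>Hello<b>world")
print(used, soup.p.get_text())
```

Because `html.parser` is part of the standard library, the loop always ends with a usable parser. The parsers may build slightly different trees for malformed markup, so naming one explicitly keeps results reproducible across machines.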

3. Basic usage of BeautifulSoup

Take part of the source of the Baidu search page as an example:

<!DOCTYPE html>
<html>
<head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type" />
  <meta content="IE=Edge" http-equiv="X-UA-Compatible" />
  <meta content="always" name="referrer" />
  <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css" />
  <title>百度一下，你就知道 </title>
</head>
<body link="#0000cc">
 <div id="wrapper">
  <div id="head">
    <div class="head_wrapper">
     <div id="u1">
      <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>
      <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
      <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>
      <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>
      <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>
      <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>
     </div>
    </div>
  </div>
 </div>
</body>
</html>

Combining requests with BeautifulSoup's HTML parser, the page can be parsed as follows:

import requests
from bs4 import BeautifulSoup

# Load the page source with the requests library
r = requests.get('https://www.baidu.com')
r.encoding = r.apparent_encoding
html = r.text

bs = BeautifulSoup(html, 'html.parser')

print(bs.prettify())    # print the parsed page in prettified form

The output looks like this:

<!DOCTYPE html>
<!--STATUS OK-->
<html>
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="always" name="referrer"/>
  <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
  <title>
   百度一下，你就知道
  </title>
 </head>
 <body link="#0000cc">
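Once a page is parsed, individual tags and attributes can be read from the tree. The following sketch inlines a trimmed copy of the sample page so that it runs without network access. Which tags it pulls out is an illustrative choice, not something the original text prescribes.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample page, inlined so the sketch runs offline.
html = """
<html>
<head><title>百度一下，你就知道 </title></head>
<body link="#0000cc">
 <div id="u1">
  <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>
  <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>
  <a class="bri" href="//www.baidu.com/more/" name="tj_briicon">更多产品 </a>
 </div>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

title = soup.title.string.strip()   # text inside the <title> tag
body_link = soup.body["link"]       # attributes are read like dictionary keys
nav = soup.find("div", id="u1")     # first matching tag, or None if absent
hrefs = [a["href"] for a in nav.find_all("a", class_="mnav")]

print(title)
print(body_link)
print(hrefs)
```

`find()` returns only the first match, while `find_all()` returns every match as a list, which is why the link with class `bri` is not included in `hrefs`.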
