Using BeautifulSoup4

Latest recommended article published on 2025-07-02 13:31:46

逆向中的菜鸟

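License: CC 4.0 BY-SA

Tags: python, web scraping methods, BeautifulSoup

Article link: https://blog.youkuaiyun.com/weixin_42281813/article/details/90299780

This article covers BeautifulSoup for Python web scraping. It shows how to suppress printed warnings and import the module, then walks through turning a page into an object, formatting output with prettify(), extracting text with get_text() and string, finding tags with find_all() and find(), and selecting elements with select(). It also covers the attributes used to read a node's attributes and text content.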


BeautifulSoup

Suppressing warning messages when printing

import warnings
warnings.filterwarnings("ignore")
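
The blanket filter above silences every warning in the process. A narrower option is sketched below, assuming the warning you want to hide is the one bs4 emits when no parser is named (a UserWarning subclass). The markup string is made up for illustration.

```python
import warnings

from bs4 import BeautifulSoup

# Silence UserWarning only inside this block; other warnings and other
# parts of the program are unaffected.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    # No parser is named here, which normally triggers a warning.
    soup = BeautifulSoup("<p>hello</p>")

print(soup.p.get_text())
```

Naming the parser explicitly, as the rest of this article does, avoids that particular warning altogether.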

html = """
<tbody>
<tr class="h">
    <td class="l" width="374">职位名称</td>
    <td>职位类别</td>
    <td>人数</td>
    <td>地点</td>
    <td>发布时间</td>
</tr>

<tr class="even">
    <td class="l square"><a target="_blank" href="position_detail.php?id=44569&amp;keywords=Python&amp;tid=0&amp;lid=0">MIG16-基础架构工程师（北京）</a></td>
    <td>技术类</td>
    <td>1</td>
    <td>北京</td>
    <td>2018-09-29</td>
</tr>

<tr class="odd">
    <td class="l square"><a target="_blank" href="position_detail.php?id=44570&amp;keywords=Python&amp;tid=0&amp;lid=0">MIG16-数据系统高级开发工程师</a></td>
    <td>技术类</td>
    <td>1</td>
    <td>北京</td>
    <td>2018-09-29</td>
</tr>

<tr class="even">
    <td class="l square"><a target="_blank" href="position_detail.php?id=44567&amp;keywords=Python&amp;tid=0&amp;lid=0">MIG16-基础架构工程师（北京）</a></td>
    <td>技术类</td>
    <td>1</td>
    <td>北京</td>
    <td>2018-09-29</td>
</tr>

<tr class="odd">
    <td class="l square"><a target="_blank" href="position_detail.php?id=44559&amp;keywords=Python&amp;tid=0&amp;lid=0">18796-专项技术测试(深圳）</a><span class="hot">&nbsp;</span></td>
    <td>技术类</td>
    <td>2</td>
    <td>深圳</td>
    <td>2018-09-29</td>
</tr>
</tbody>
"""

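Importing BeautifulSoup
from bs4 import BeautifulSoup
Converting the page into an object: the constructor takes the markup to parse, either as a string or as an open file handle (it does not download a URL for you, so fetch the page first), and optionally the name of a parser such as 'lxml'.

bs = BeautifulSoup(html, 'lxml')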
- **prettify():** returns the parsed page as an indented, formatted string
  print(bs.prettify())
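
As a minimal sketch of the two steps above, the snippet below parses a small made-up table and pretty-prints it. It swaps in Python's built-in 'html.parser' so that it runs without lxml installed; the sample markup and variable names are assumptions for illustration.

```python
import io

from bs4 import BeautifulSoup

# Hypothetical sample markup, much shorter than the table above.
sample = "<table><tr class='h'><td>Title</td><td>City</td></tr></table>"

# The first argument is the markup itself: a string ...
soup = BeautifulSoup(sample, "html.parser")
# ... or a file-like object that bs4 reads for you.
soup_from_file = BeautifulSoup(io.StringIO(sample), "html.parser")

# prettify() returns a str with one tag or text node per line.
pretty = soup.prettify()
print(pretty)
```

Because prettify() adds newlines and indentation, it is meant for inspecting a page, not for producing text to parse again.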
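Ways to extract text:
- **get_text():** returns all of the text inside a tag
  t = bs.select('a')
  for i in t:
      print(i.get_text())
- **string:** an attribute (not a method) that also returns a tag's text; it is None when the tag has more than one child
  t = bs.select('a')
  for i in t:
      print(i.string)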
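
The difference between the two shows up when a tag has more than one child. The sketch below uses made-up markup and the built-in 'html.parser' (an assumption, so that it runs without lxml).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link contains text plus a nested <b> tag.
sample = """
<table>
<tr><td><a href="a.php">Engineer</a></td></tr>
<tr><td><a href="b.php">Tester <b>(Shenzhen)</b></a></td></tr>
</table>
"""
soup = BeautifulSoup(sample, "html.parser")

links = soup.select("a")
for link in links:
    # get_text() joins all descendant text; .string needs a single child.
    print(repr(link.get_text()), repr(link.string))
```

For the second link, get_text() still returns the combined text, while string is None because the tag holds two children.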
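The find_all() method:
- **find_all:** returns every tag with the given name; with no other filter, bs.find_all('tr') returns all tr tags
  bs.find_all('tag_name')
- limit caps the result at the first n matching tags
  bs.find_all('tr', limit=2)
  - **Indexing:** the result behaves like a list, so it can be indexed the same way
    bs.find_all('tr')[0]
- Filter tr tags by class. Because class is a reserved word in Python, the keyword argument is class_='name'
  bs.find_all('tr', class_='h')
- Select tr tags that have class=even and also id=feng (the sample above has no such id, so this returns an empty list)
  bs.find_all('tr', class_='even', id='feng')
- **get_text():** returns all of the text inside a tag, including the text of its descendant tags
  t = bs.find_all('tr')[2]  # find_all returns a list, so index one element first; calling get_text() on the list raises AttributeError
  print(t.get_text())
- A string can be passed as a separator, which makes the result easier to read
  t = bs.find_all('tr')[2]
  print(t.get_text('='))
  Passing strip=True as well strips leading and trailing whitespace from each piece of text
  t = bs.find_all('tr')[2]
  print(t.get_text('=====', strip=True))
- To read a node's attributes, such as the href of a link or the name, class, or id of any other element, index the tag with the attribute name
  A misspelled attribute name raises KeyError, e.g. KeyError: 'traget'
  t = bs.find_all('a')
  for i in t:
      print(i['target'])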
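
The filters above can be tried together on a small table. This is a sketch under stated assumptions: the markup is made up (including the id "feng", which the sample above does not contain), and the built-in 'html.parser' stands in for lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the id "feng" is added here only for illustration.
sample = """
<table>
<tr class="h"><td>Title</td><td>City</td></tr>
<tr class="even" id="feng"><td><a target="_blank" href="detail.php?id=1">Engineer</a></td><td>Beijing</td></tr>
<tr class="odd"><td><a href="detail.php?id=2">Tester</a></td><td>Shenzhen</td></tr>
</table>
"""
soup = BeautifulSoup(sample, "html.parser")

rows = soup.find_all("tr")                            # every tr
first_two = soup.find_all("tr", limit=2)              # first two matches
header = soup.find_all("tr", class_="h")              # filter by class
feng = soup.find_all("tr", class_="even", id="feng")  # class and id
print(len(rows), len(first_two), len(header), len(feng))

# Join the text pieces of one row with a separator, stripping whitespace.
row_text = rows[1].get_text("|", strip=True)
print(row_text)

# tag["attr"] raises KeyError when the attribute is missing;
# tag.get("attr") returns None instead.
targets = [link.get("target") for link in soup.find_all("a")]
print(targets)
```

Using get() is a reasonable default when not every tag is guaranteed to carry the attribute, as with the second link here.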
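The find() method:
- find() returns only the first matching tag. The result is a single tag, not a list, so indexing it by position raises KeyError
  bs.find('tr')
- **contents attribute:** returns all direct children of a node, both child tags and text nodes, as a list
  t = bs.find('tr')
  print(t.contents)  # a list
  The list can be iterated
  for i in t.contents:
      print(i.string)
- **children attribute:** used the same way as contents, but it returns an iterator
  t = bs.find('tr')
  The result includes the whitespace text nodes between tags
  for i in t.children:
      print(i.string)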
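
The whitespace nodes mentioned above are easy to see on a small example. The sketch below uses made-up markup and the built-in 'html.parser' (an assumption), and filters children down to real tags with isinstance.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with newlines and indentation between the cells.
sample = """<table><tr class="h">
    <td>Title</td>
    <td>City</td>
</tr></table>"""
soup = BeautifulSoup(sample, "html.parser")

row = soup.find("tr")
# contents is a list holding both tags and whitespace text nodes.
print(type(row.contents).__name__, len(row.contents))

# Keep only the child tags, dropping the whitespace text nodes.
cells = [child for child in row.children if isinstance(child, Tag)]
cell_texts = [cell.string for cell in cells]
print(cell_texts)

# find() returns None when nothing matches, so check before using it.
missing = soup.find("div")
print(missing)
```

Checking for None before chaining further calls avoids an AttributeError when a page lacks the expected tag.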
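The select() method:
- select() picks elements with CSS selectors: by tag name, by class, by id, or by attributes such as name and href. It returns a list, which can be traversed with a for ... in ... loop
- Selectors can be chained to reach descendant nodes; loop over the result and index each element i to get its href
  t = bs.select('td a')
  for i in t:
      print(i['href'])
- Find tags by attribute, for example the a nodes whose href equals index.html
  a = bs.select("a[href='index.html']")
- Select img elements that are direct children of a div
  img = bs.select("div > img")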
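
The selectors above can be checked against a page that actually contains a div, an img, and an index.html link. The markup below is made up for that purpose, and the built-in 'html.parser' is assumed in place of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one img directly under the div, one nested in a <p>.
sample = """
<div id="main">
  <img src="logo.png">
  <p><img src="nested.png"></p>
  <table><tr><td class="l square"><a href="index.html">Home</a></td>
  <td><a href="detail.php?id=1">Detail</a></td></tr></table>
</div>
"""
soup = BeautifulSoup(sample, "html.parser")

hrefs = [a["href"] for a in soup.select("td a")]  # descendant selector
exact = soup.select('a[href="index.html"]')       # attribute selector
direct_imgs = soup.select("div > img")            # direct children only
by_class = soup.select("td.square")               # class selector
by_id = soup.select("#main")                      # id selector
first_link = soup.select_one("td a")              # first match or None

print(hrefs)
print([img["src"] for img in direct_imgs])
```

select() always returns a list, even for a single match, while select_one() returns the first matching tag directly.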
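The strings attribute:
- string only works when a node contains a single text node or a single child. strings yields the text of every text node under the node, which may include newlines, spaces, and other whitespace-only text
  o = r.select('div')[1]  # r: a BeautifulSoup object for a page with at least two div elements
- Index the result first, otherwise: AttributeError: 'list' object has no attribute 'strings'

  for i in o.strings:
      print(i)
To skip newlines and spaces, use the stripped_strings attribute instead
o = r.select('div')[0]
for i in o.stripped_strings:
    print(i)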
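
A short comparison of the three attributes, sketched on made-up markup with the built-in 'html.parser' (an assumption). The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a div with two paragraphs and whitespace between them.
sample = """<div>
  <p>first</p>
  <p>second</p>
</div>"""
soup = BeautifulSoup(sample, "html.parser")

block = soup.select("div")[0]  # index the list to get a single tag

raw = list(block.strings)             # includes whitespace-only text nodes
clean = list(block.stripped_strings)  # whitespace stripped, empty ones dropped
print(raw)
print(clean)

# .string is None here because the div has more than one child.
print(block.string)
```

Both strings and stripped_strings are generators, so wrap them in list() when you need to index or count the results.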