11. BeautifulSoup Basics

zmjames2000

Published 2019-09-10 10:22:55

License: CC 4.0 BY-SA

Categories: web scraping, python. Tags: python, BeautifulSoup, xpath, regex

Article link: https://blog.youkuaiyun.com/zmjames2000/article/details/100690828

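This article covers three ways to extract data from web pages: regular expressions, xpath, and BeautifulSoup. The example code fetches a page's source with urllib.request, then uses BeautifulSoup to pretty-print and parse it, including how to get specific tags, attributes, and content.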

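Regular expressions, xpath, and BeautifulSoup are interchangeable for this job: each can extract the same data from a page.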
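
As a rough illustration of that equivalence, the sketch below pulls the same title out of a small inline HTML string twice, once with a regular expression and once with BeautifulSoup. The sample markup and the variable names are made up for this example. xpath is left out because it needs a separate library such as lxml.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample page, not taken from the original article.
html = "<html><head><title>hello</title></head><body><a href='/x'>link</a></body></html>"

# Regular expression: capture the text between the title tags.
match = re.search(r"<title>(.*?)</title>", html, re.S)
title_by_regex = match.group(1) if match else None

# BeautifulSoup: parse the document, then read the tag's text.
soup = BeautifulSoup(html, "html.parser")
title_by_bs = soup.title.string

print(title_by_regex, title_by_bs)
```

The regular expression works on the raw string and breaks easily if the markup changes, while BeautifulSoup works on a parsed tree, which is usually the more forgiving choice for real pages.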

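from bs4 import BeautifulSoup as bsf
import urllib.request

# 'xxxx.com' is a placeholder; urlopen needs a full URL including the scheme
data = urllib.request.urlopen('http://xxxx.com').read().decode('utf-8', 'ignore')
bs = bsf(data, 'html.parser')  # parse the page; naming the parser avoids a warning
print(bs.prettify())  # pretty-printed output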

bs.title  # bs.<tag name> returns the first matching tag, e.g. <title>hello</title>
bs.title.name  # 'title'
bs.title.string  # 'hello'

bs.a.attrs  # dict of all attributes on the first <a>
bs.a["class"]  # same as bs.a.get("class"): the value of class="xxx", returned as a list, ['xxx']
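
The attribute lookups above can be tried on an inline snippet. This is a minimal sketch on made-up markup; the `html` string and the variable names are assumptions for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample tag, not taken from the original article.
html = '<a class="btn primary" href="/home" id="top">Home</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

all_attrs = link.attrs        # dict of every attribute on the tag
classes = link["class"]       # class is multi-valued, so this is a list
href = link.get("href")       # single-valued attributes come back as strings
missing = link.get("title")   # None; link["title"] would raise KeyError

print(all_attrs)
print(classes, href, missing)
```

The two forms differ only when the attribute is absent: `get` returns None, while the subscript form raises KeyError.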

bs.find_all('a')  # list of all <a> tags
bs.find_all(['a', 'u'])  # list of all <a> and <u> tags
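
A short sketch of `find_all` on made-up markup, showing that a list of names matches any of those tags and that results come back in document order. The `html` string and the variable names are assumptions for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the original article.
html = "<p><a href='/1'>one</a> <u>two</u> <b>three</b> <a href='/2'>four</a></p>"
soup = BeautifulSoup(html, "html.parser")

# A list of names matches any of the listed tags.
tags = soup.find_all(["a", "u"])
texts = [tag.get_text() for tag in tags]

# find_all returns Tag objects, so attributes are read per tag.
hrefs = [a.get("href") for a in soup.find_all("a")]

print(texts)
print(hrefs)
```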

k1 = bs.ul.contents  # list of the direct children of <ul>
k2 = bs.ul.children  # iterator over the same children
children = list(k2)
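
The difference between `.contents` and `.children` shows up most clearly on markup that contains line breaks. The sketch below uses a made-up `<ul>`; the `html` string and the variable names are assumptions for the example.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample list, not taken from the original article.
html = "<ul>\n<li>a</li>\n<li>b</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")

# .contents is a list and includes the newline text nodes between the <li> tags.
contents = soup.ul.contents

# .children yields the same nodes lazily; keep only the real tags.
children_iter = soup.ul.children
items = [child for child in children_iter if isinstance(child, Tag)]
item_texts = [li.string for li in items]

# The iterator is now exhausted, so a second pass yields nothing.
leftover = list(children_iter)

print(len(contents), item_texts, leftover)
```

If the children are needed more than once, use `.contents` or store the result of `list(...)`, as the snippet above does.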