Python3---BeautifulSoup

License: CC 4.0 BY-SA

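This article introduces the network request methods and data extraction methods commonly used in web scraping, including libraries such as urllib and requests, and parsing page content with regular expressions, BeautifulSoup, lxml, and similar tools. It also shows how to perform file and directory operations with the Python standard library.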

# Ways to make network requests when scraping: urllib (module), requests (library), scrapy, pyspider (frameworks)
# Ways to extract data when scraping: regular expressions, bs4, lxml, xpath, css

from bs4 import BeautifulSoup

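# Argument 1: the HTML source (a string or an open file object), which is parsed into a document tree object.
# Argument 2: the parser used to build the tree; here the lxml parser (the lxml package must be installed).
with open('index.html', encoding='utf-8') as f:
    html = BeautifulSoup(f, 'lxml')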

# print(html.title)
# print(html.a)
#
# # Get all attributes of a tag
# # {'href': '/', 'id': 'result_logo', 'onmousedown': "return c({'fm':'tab','tab':'logo'})"}
# print(html.a.attrs)
#
# # Get a single attribute
# print(html.a.get('id'))

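# To get multiple tags, traverse the document tree
# print(html.head.contents)  # a list of the direct children

# print(html.head.children) # an iterator over the direct children (e.g. a list_iterator object), not a list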
# for ch in html.head.children:
#     print(ch)

# descendants: iterates over all nested nodes recursively, not only the direct children
# print(html.head.descendants)

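# find_all: returns a list of all matching elements
# find: returns only the first matching element (or None)
# get_text: all text inside the tag, including text in child tags
# select: finds elements by CSS selector and returns a list
# string: the tag's text when it has a single child; None when text is mixed with other tags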
print(html.select('.two')[0].get_text())
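The difference between `string` and `get_text` noted in the comments above can be checked with a small sketch. The markup below is hypothetical (it is not the author's `index.html`), and the built-in `html.parser` is used instead of `lxml` only so the example runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this illustration.
SAMPLE = (
    '<div>'
    '<p class="one">plain text</p>'
    '<p class="two">text with <b>bold</b> inside</p>'
    '</div>'
)

# Assumption: html.parser is swapped in for lxml to avoid the extra install.
soup = BeautifulSoup(SAMPLE, 'html.parser')

one = soup.select('.one')[0]
two = soup.select('.two')[0]

one_string = one.string    # the tag has a single text child, so its text is returned
two_string = two.string    # text is mixed with a <b> tag, so this is None
two_text = two.get_text()  # concatenates all text, including text in child tags

print(repr(one_string), repr(two_string), repr(two_text))
```

When a tag might contain nested markup, `get_text()` is usually the safer choice, since `string` silently yields `None` in that case.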

# print(help(html))

# find_all: find a group of elements by tag name
res = html.find_all('a')
# print(res)

# select: takes a CSS selector and returns a list of matches (most CSS selector syntax is supported)
res = html.select('.one')[0]
# print(res.get_text())
# print(res.get('class'))

res = html.select('.two')[0]
print(res)
# next_sibling may be a whitespace text node rather than the next tag
print('----', res.next_sibling)
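As a rough illustration of `find_all`, `select`, and sibling navigation together, the sketch below parses a small inline document. The list markup and class names are assumptions made up for the example, and `html.parser` again stands in for `lxml`. It also shows why `next_sibling` often prints what looks like an empty line: the node right after a tag is frequently the newline between two tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this illustration.
SAMPLE = """<ul>
<li class="one"><a href="/a" id="first">A</a></li>
<li class="two"><a href="/b">B</a></li>
<li class="three"><a href="/c">C</a></li>
</ul>"""

# Assumption: html.parser is swapped in for lxml to avoid the extra install.
soup = BeautifulSoup(SAMPLE, 'html.parser')

# find_all returns every <a> tag; get() reads one attribute from each.
hrefs = [a.get('href') for a in soup.find_all('a')]

# select takes a CSS selector and always returns a list.
two = soup.select('.two')[0]

# next_sibling is the very next node, here the newline text between the <li> tags.
raw_sibling = two.next_sibling

# find_next_sibling skips text nodes and returns the next matching tag.
tag_sibling = two.find_next_sibling('li')

print(hrefs)
print(repr(raw_sibling))
print(tag_sibling.get('class'))
```

If the next tag is what you want, `find_next_sibling` avoids having to skip whitespace nodes by hand.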


import os

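os.mkdir('abc') # create abc in the current directory (6-7 here); raises FileExistsError if abc already exists
os.chdir('abc') # move into abc
os.mkdir('123') # create the directory 123 inside abc

os.chdir(os.path.pardir) # go back to the parent directory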

os.mkdir('erf')
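The same directory layout can also be built without changing the working directory. The sketch below is one possible variant, not the author's approach: the helper name `make_tree` is hypothetical, and a temporary directory is used as the base so the example cleans up after itself.

```python
import os
import tempfile


def make_tree(base):
    """Create base/abc/123 and base/erf without calling os.chdir.

    make_tree is a hypothetical helper written for this illustration.
    """
    nested = os.path.join(base, 'abc', '123')
    sibling = os.path.join(base, 'erf')
    # makedirs creates intermediate directories; exist_ok=True avoids
    # FileExistsError when the directories are already there.
    os.makedirs(nested, exist_ok=True)
    os.makedirs(sibling, exist_ok=True)
    return nested, sibling


with tempfile.TemporaryDirectory() as base:
    nested, sibling = make_tree(base)
    make_tree(base)  # running it again does not raise
    created = sorted(os.listdir(base))
    nested_ok = os.path.isdir(nested)

print(created)
```

Building paths with `os.path.join` keeps the script's working directory unchanged, which matters when later code opens files by relative path.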