[Python] BS4 and a KDS Image Crawler

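This article shows how to use the Python library BeautifulSoup (BS4) to extract data from HTML and XML files. It walks through the basics of BS4, including creating a BeautifulSoup object and extracting specific tags, strings, and comments, with sample code for each.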

BS4

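BeautifulSoup is a Python library for extracting data from HTML or XML. It converts a document into a tree structure (much like a DOM), in which every node is a Python object of one of the following four types:

  1. BeautifulSoup <class 'bs4.BeautifulSoup'>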
  2. Tag <class 'bs4.element.Tag'>
  3. NavigableString <class 'bs4.element.NavigableString'>
  4. Comment <class 'bs4.element.Comment'>

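One way to picture how these four classes relate is in terms of sets (this is not accurate as a description of the class hierarchy):

  • BeautifulSoup is the universal set (a BeautifulSoup object is created by passing the document in as an argument) and contains Tag as a subset
  • Tag contains NavigableString as a subset
  • Comment is a special kind of NavigableString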
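
The set picture above can be compared with the real class hierarchy. The following is a minimal sketch, not taken from the original post; it assumes the built-in 'html.parser' and a made-up one-line document.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# A made-up document: one plain string and one comment inside a <p> tag
soup = BeautifulSoup("<p>text<!-- note --></p>", "html.parser")

# The two children of <p> are a NavigableString and a Comment
child_types = [type(child).__name__ for child in soup.p.children]
print(child_types)

# In the class hierarchy, BeautifulSoup derives from Tag,
# and Comment derives from NavigableString
soup_is_tag = issubclass(BeautifulSoup, Tag)
comment_is_string = issubclass(Comment, NavigableString)
print(soup_is_tag, comment_is_string)
```

So a BeautifulSoup object can be searched with the same methods as any Tag, and a Comment can be handled like any other string node.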

Usage

The first argument to BeautifulSoup is the document; the second argument specifies which parser to use.

import re

import requests
from bs4 import BeautifulSoup

url = 'http://m.kdslife.com/club/'
# Fetch the whole HTTP response
response = requests.get(url)
# First argument: the HTML document; second argument: select the lxml parser.
# The call returns a BeautifulSoup object.
soup = BeautifulSoup(response.text, 'lxml')
print(soup.name)
# [document]
print(type(soup))
# <class 'bs4.BeautifulSoup'>

Sample code for Tag objects

# BeautifulSoup --> Tag
# Get the Tag object for <title>
res = soup.title
print(res)
# <title>KDS Life</title>

res = soup.title.name
print(res)
# title

# Attributes of a Tag object
res = soup.section
print(type(res))
# <class 'bs4.element.Tag'>

print(res['class'])
# ['forum-head-hot', 'clearfix']

# All attributes of the section Tag, returned as a dict
print(res.attrs)
# {'class': ['forum-head-hot', 'clearfix']}

Sample code for NavigableString objects

# A NavigableString holds the text inside a Tag
res = soup.title
print(res.string)
# KDS Life
print(type(res.string))
# <class 'bs4.element.NavigableString'>

Sample code for Comment objects

# A Comment is a special kind of NavigableString
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
# Use a separate name so that `soup` keeps referring to the KDS page
comment_soup = BeautifulSoup(markup, 'lxml')
comment = comment_soup.b.string
print(type(comment))
# <class 'bs4.element.Comment'>

BS4 Parser

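If no parser is specified, BeautifulSoup picks one automatically from those installed, in this order of preference: 'lxml' -> 'html5lib' -> 'html.parser'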
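
Naming the parser explicitly keeps results reproducible, since the automatic choice depends on what is installed on each machine. A minimal sketch, assuming the built-in 'html.parser' and a made-up fragment:

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment with no <html> or <body> element
fragment = "<a href='/x'>link</a>"

# Name the parser explicitly instead of relying on automatic selection
soup = BeautifulSoup(fragment, "html.parser")

# html.parser keeps a fragment as a fragment: no <html>/<body> wrapper is added
has_html_wrapper = soup.find("html") is not None
print(has_html_wrapper)
print(soup.a["href"])
```

Other parsers may build a different tree from the same input (lxml, for example, wraps a fragment in html and body elements), which is another reason to pin the parser.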


Commonly used Tag methods

find_all()

find_all(name, attrs, recursive, text, **kwargs): the code says it best. (Since Beautiful Soup 4.4.0 the text argument is called string.)

# find_all() filters the tree and returns a list of matches
# (an empty list if nothing matches)
title = soup.find_all('title')
print(title)
# [<title>KDS Life</title>]

# A string passed as the second argument is matched against the CSS class
res = soup.find_all('div', 'topAd')
print(res)

# Find all elements whose id is 'topAd'
res = soup.find_all(id='topAd')
print(res)
# [<div id="topAd">...</div>]

# Find all <img> tags whose 'src' attribute matches the given pattern
res = soup.find_all('img', src=re.compile(r'^http://club-img', re.I))
print(res)
# [<img src="http://club-img.kdslife.com/attach/1k0/gs/a/o41gty-1coa.png@0o_1l_600w_90q.src"/>,
# ...]
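
The same filters can be tried without network access. The sketch below is an illustration only: the inline HTML and the example.com hosts are made up, and 'html.parser' is used so no extra parser is needed.

```python
import re

from bs4 import BeautifulSoup

# Made-up HTML that loosely mimics the structure used in the examples above
html = """
<div class="topAd" id="topAd">ad</div>
<img src="http://club-img.example.com/a.png">
<img src="http://icon.example.com/user1.png">
"""
soup = BeautifulSoup(html, "html.parser")

# A string as the second positional argument filters on the CSS class
by_class = soup.find_all("div", "topAd")
# A keyword argument filters on the attribute of that name
by_id = soup.find_all(id="topAd")
# A compiled regular expression is matched against the attribute value
by_src = soup.find_all("img", src=re.compile(r"^http://club-img", re.I))
# limit stops the search after the given number of matches
first_img = soup.find_all("img", limit=1)
# No match yields an empty list rather than None
missing = soup.find_all("table")

print(by_class, by_id, by_src, first_img, missing)
```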

select()

# CSS selectors
# Select the element whose id is 'wrapperto'
res = soup.select('#wrapperto')
print(res)
# [<div class="swiper-wrapper clearfix" id="wrapperto"></div>]

# Select <img> tags that have a 'src' attribute
res = soup.select('img[src]')
print(res)
# [<img alt="" src="http://icon.pch-img.net/kds/club_m/club/icon/user1.png"/>,
#  <img src="http://club-img.kdslife.com/attach/1k0/gs/a/o41gty-1coa.png@0o_1l_600w_90q.src"/>]

# Select <img> tags whose 'src' attribute equals the given URL
# (quote the value, because it contains ':' and '/')
res = soup.select('img[src="http://icon.pch-img.net/kds/club_m/club/icon/user1.png"]')
print(res)
# [<img alt="" src="http://icon.pch-img.net/kds/club_m/club/icon/user1.png"/>]
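
For experimenting with selectors offline, here is a small sketch on made-up HTML (the example.com URLs are placeholders, not real KDS addresses). It also shows select_one() and a prefix match, neither of which appears in the original examples.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the last <img> deliberately has no src attribute
html = """
<div class="swiper-wrapper clearfix" id="wrapperto"></div>
<img alt="" src="http://icon.example.com/user1.png">
<img src="http://club-img.example.com/a.png">
<img alt="no source">
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.select("#wrapperto")                                    # id selector
with_src = soup.select("img[src]")                                   # attribute present
exact = soup.select('img[src="http://icon.example.com/user1.png"]')  # exact value
prefix = soup.select('img[src^="http://club-img"]')                  # value prefix
first = soup.select_one("img[src]")                                  # first match or None

print(len(by_id), len(with_src), len(exact), len(prefix), first["src"])
```

The prefix selector does roughly the same job as the regular expression passed to find_all() earlier, except that it is case-sensitive.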

Other

# get_text()
markup = '<a href="http://example.com/">\n a link to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'lxml')
res = soup.get_text()
print(res)
#  a link to example.com

res = soup.i.get_text()
print(res)
# example.com

# .stripped_strings
res = soup.stripped_strings
print(list(res))
# ['a link to', 'example.com']
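
get_text() also accepts a separator and a strip flag, which often removes the need to post-process stripped_strings. A short sketch reusing the markup above, with 'html.parser' assumed:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\n a link to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# strip=True trims each text fragment and drops the empty ones;
# the first argument is the separator placed between the fragments
joined = soup.get_text(" ", strip=True)
piped = soup.get_text("|", strip=True)

print(joined)
print(piped)
```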

Finally, here is a simple KDS image crawler

A KDS image spider
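
The crawler's code is not reproduced in this text. What follows is only a hypothetical sketch of how such a crawler could be assembled from the pieces above, not the author's implementation: the function names extract_image_urls and crawl, the timeout value, and the use of 'html.parser' are all assumptions, and the src pattern is borrowed from the find_all() example.

```python
import re

import requests
from bs4 import BeautifulSoup

# Pattern borrowed from the find_all() example; adjust it to the real page
IMG_SRC = re.compile(r"^http://club-img", re.I)


def extract_image_urls(html):
    """Return the src of every <img> whose src matches IMG_SRC, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for tag in soup.find_all("img", src=IMG_SRC):
        if tag["src"] not in urls:
            urls.append(tag["src"])
    return urls


def crawl(url, timeout=10):
    """Fetch one page and return the image URLs found on it (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return extract_image_urls(response.text)
```

Calling crawl('http://m.kdslife.com/club/') would return a list of image URLs that could then be downloaded one by one with requests.get(). Keeping the parsing in its own function means it can be checked on saved HTML without touching the network.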


Note

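  • BeautifulSoup detects the document's encoding and converts the document to Unicode automatically. The soup.original_encoding attribute holds the encoding it detected.
  • Input is converted to Unicode; output is encoded as UTF-8 by default.
  • BeautifulSoup itself does not evaluate XPath expressions, but it can be used alongside a library that does, such as lxml.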
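
The first two notes can be illustrated with a small sketch. The GBK-encoded document below is made up for the purpose, and 'html.parser' is assumed; the encoding is taken from the meta charset declaration in the bytes.

```python
from bs4 import BeautifulSoup

# A made-up page that declares its charset and is stored as GBK bytes
html = '<html><head><meta charset="gbk"></head><body><p>美图</p></body></html>'
raw = html.encode("gbk")

# Passing bytes lets BeautifulSoup work out the encoding and decode to Unicode
soup = BeautifulSoup(raw, "html.parser")
detected = soup.original_encoding
text = soup.p.string

# encode() with no argument serializes the tag as UTF-8 bytes
utf8_bytes = soup.p.encode()

print(detected, text, utf8_bytes)
```

Note that original_encoding is None when the input is already a Unicode string, as it is with response.text from requests; pass response.content instead to let BeautifulSoup handle the decoding.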