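When scraping a website, an XPath expression alone or a regular expression alone is often not convenient enough, and the extracted data comes out poorly. In those cases we need to convert the data between types (string, parsed element object, bytes) so that whichever tool fits best can be applied. This may not be clear to those who are still new to it, so here is a quick walkthrough.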
from lxml import etree
text = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
"""
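# Parse the string into an lxml element object
# (the HTML parser also repairs broken markup, such as the missing </li> above)
html = etree.HTML(text)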
print(type(html))
print(html)
# Serialize the element object back into bytes
text1 = etree.tostring(html)
print(type(text1))
print(text1)
# Decode the bytes into a str
text2 = text1.decode("utf-8")
print(type(text2))
print(text2)
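
To show why these conversions are useful, here is a minimal sketch that combines both tools: XPath selects a node from the parsed object, the node is serialized back into a string, and a regular expression then runs on that string. The `snippet` variable, the `item-1` selector, and the `href` pattern are illustrative choices and not part of the example above. Passing `encoding="unicode"` to `etree.tostring` returns a `str` directly, so the separate `decode` step can be skipped.

```python
import re
from lxml import etree

# Illustrative input, shortened from the example above
snippet = """
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
</ul>
"""

# str -> element object: parse once, then query with XPath
root = etree.HTML(snippet)
items = root.xpath('//li[@class="item-1"]')

# element -> str: encoding="unicode" yields str instead of bytes
fragment = etree.tostring(items[0], encoding="unicode")

# str -> regex: extract the href value from the serialized fragment
hrefs = re.findall(r'href="([^"]+)"', fragment)
print(type(fragment), hrefs)
```

Here `hrefs` holds the single link found inside the selected `<li>`. The same pattern works in the other direction: a string captured by a regex can be passed to `etree.HTML` again if XPath is the better fit for the next step.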