Simple uses of etree in Python's lxml

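This article shows how to parse HTML with the lxml library, focusing on etree.HTML() and etree.tostring(). The former converts an HTML string into an _Element object for further processing; the latter converts an _Element object back into a string and is especially handy for extracting plain text.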

I mostly use lxml's etree when walking a DOM tree with XPath; it makes it easy to pull the content I want out of HTML source.

Here I cover the two functions I use most often: etree.HTML() and etree.tostring().

1. etree.HTML()

etree.HTML() parses an HTML document given as a string and returns it as an _Element object. On an _Element object you can then call methods such as getparent(), remove(), and xpath().
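The three methods mentioned above can be sketched together in a few lines. The markup and the id values here are made up for illustration and do not come from the article; this is only a minimal sketch of how the calls fit together.

```python
from lxml import etree

# Hypothetical sample markup, not taken from the article.
html = '<div><p id="keep">keep me</p><p id="ad">drop me</p></div>'

# etree.HTML() wraps the fragment and returns the <html> root element.
root = etree.HTML(html)

# xpath() returns a list of matches; take the unwanted <p>.
ad = root.xpath('//p[@id="ad"]')[0]

# getparent() gives the enclosing <div>; remove() detaches the child from it.
parent = ad.getparent()
parent.remove(ad)

remaining = root.xpath('//p/text()')
print(parent.tag, remaining)
```

Note that remove() also drops any tail text that directly follows the removed element, which does not matter in this sample but can in real pages.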

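To extract content from HTML source with XPath, first convert the source into an _Element object, then call its xpath() method. For example, take this very simple piece of HTML: "<h1>this is a test</h1>". To get the text inside the h1 tag, you can do this: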

from lxml import etree

html = '<h1>this is a test</h1>'

# Parse the HTML string into an _Element object
root = etree.HTML(html)

# Use an XPath expression to get the text inside the h1 tag
text = root.xpath('//h1/text()')

print('result is:', text)

Output:

result is: ['this is a test']

As the output shows, xpath() returns a list, so when a single match is expected you usually take only the first item of that list.
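Indexing with [0] raises IndexError when nothing matches, so a small guard can help. The helper name first below is hypothetical, not part of lxml, and the sketch assumes the XPath expression yields a list (expressions such as string(...) or count(...) return a single value instead).

```python
from lxml import etree


def first(nodes, default=None):
    """Return the first XPath match, or `default` when the list is empty."""
    return nodes[0] if nodes else default


root = etree.HTML('<h1>this is a test</h1>')

title = first(root.xpath('//h1/text()'))
# There is no <h2> in the document, so the default is returned.
missing = first(root.xpath('//h2/text()'), default='')

print(repr(title), repr(missing))
```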

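2. etree.tostring()

etree.tostring() converts an _Element object into a string. I usually reach for it when a simple XPath expression cannot get the content I want. For example, change the HTML above slightly: "<h1>this <span>is a </span>test</h1>". How do you get the text inside h1 now? Try "//h1/text()" (save the HTML above, open it in Firefox, and enter the XPath expression in FirePath):

FirePath's match hint shows that the XPath expression "//h1/text()" picks up only "this " and "test" from the h1 tag. Let's check this in code: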

from lxml import etree

html = '<h1>this <span>is a </span>test</h1>'

root = etree.HTML(html)

# text() selects only the text nodes that are direct children of h1
text = root.xpath('//h1/text()')

print('result is:', text)

Output:

result is: ['this ', 'test']

Indeed, with this expression xpath() returns only part of the text in h1. Now try "//h1//text()" in code:

from lxml import etree

html = '<h1>this <span>is a </span>test</h1>'

root = etree.HTML(html)

# //text() selects the text nodes of h1 and of all its descendants
text = root.xpath('//h1//text()')

print('result is:', text)

Output:

result is: ['this ', 'is a ', 'test']

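The "//h1//text()" expression does return everything we want, but as a list, so the items still have to be joined together, which is a little tedious. This is where etree.tostring() comes in. It accepts several parameters, including element_or_tree, encoding, and method; with method set to 'text', it returns all the text contained in the _Element object. So you can do this: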

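from lxml import etree

html = '<h1>this <span>is a </span>test</h1>'

root = etree.HTML(html)

# Find the h1 element first, then use etree.tostring() to collect all of its text
h1_nodes = root.xpath('//h1')

# xpath() returns a list; we need its first item, the _Element for the h1 tag.
# On Python 3, encoding='unicode' makes tostring() return str instead of bytes.
result = etree.tostring(h1_nodes[0], method='text', encoding='unicode')

print('result is:', result)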

Output:

result is: this is a test

With etree.tostring(), the problem is solved quite easily.
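One detail worth knowing: by default etree.tostring() also serializes the element's tail, that is, any text that follows its closing tag. The sketch below uses made-up markup with trailing text after the h1 to show the difference, and compares two other ways of collecting the text, itertext() and the XPath string() function. The sample HTML is an assumption for illustration only.

```python
from lxml import etree

# Hypothetical markup: " trailing" is tail text that follows </h1>.
root = etree.HTML('<div><h1>this <span>is a </span>test</h1> trailing</div>')
h1 = root.xpath('//h1')[0]

# Default behaviour: the tail text after </h1> is included.
with_tail = etree.tostring(h1, method='text', encoding='unicode')

# with_tail=False restricts the output to the element itself.
no_tail = etree.tostring(h1, method='text', encoding='unicode', with_tail=False)

# itertext() yields the text inside h1, including its descendants' text.
joined = ''.join(h1.itertext())

# The XPath string() function returns the string value of the first match.
via_xpath = root.xpath('string(//h1)')

print(repr(with_tail))
print(repr(no_tail), repr(joined), repr(via_xpath))
```

When the element is followed by other text, passing with_tail=False (or using one of the two alternatives) avoids picking up content that lies outside the tag.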
