Question
How do I remove the class attribute from HTML using Python and lxml?
Example
I have:
Lorem ipsum dolor sit amet, consectetur adipisicing elit
I want:
Lorem ipsum dolor sit amet, consectetur adipisicing elit
What I've tried so far
I have looked at lxml.html.clean.Cleaner, but it has no method for stripping class attributes. You can set safe_attrs_only=True, but that does not remove the class attribute.
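For anyone wondering why safe_attrs_only=True leaves class alone: as far as I can tell, the Cleaner only drops attributes that are missing from its whitelist, and class is on the default whitelist in lxml.html.defs.safe_attrs. A quick check, assuming your lxml install ships lxml.html.defs (the name reduced_attrs is only mine, for illustration):

```python
from lxml.html import defs

# If this prints True, class counts as "safe", so safe_attrs_only=True keeps it
print('class' in defs.safe_attrs)

# A whitelist without class; reduced_attrs is a hypothetical name, not part of lxml
reduced_attrs = defs.safe_attrs - {'class'}
print('class' in reduced_attrs)
```

Cleaner also accepts a safe_attrs argument in reasonably recent releases, so passing a reduced set like this together with safe_attrs_only=True should strip class. That part is not shown here because where Cleaner is imported from differs between lxml versions.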
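Extensive searching has turned up nothing workable. I think the fact that "class" is a term in both HTML and Python muddies the search results further. Many of the results also seem to deal strictly with XML.
I am also open to other Python modules that offer a friendlier interface.
Thanks very much.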
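Solution
Thanks to @Dan Roberts's answer, I came up with the following solution. I am posting it for anyone who lands here in the future trying to solve the same problem.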
import lxml.html

# Our html string we want to remove the class attribute from.
# NOTE: the markup around the text was lost from the original post; the <p> tag
# and the class name "example" below are assumed placeholders.
html_string = '<p class="example">Lorem ipsum dolor sit amet, consectetur adipisicing elit</p>'

# Parse the html
html = lxml.html.fromstring(html_string)

# Print out our "Before" (encoding='unicode' returns a str rather than bytes)
print(lxml.html.tostring(html, encoding='unicode'))

# .xpath below gives us a list of all elements that have a class attribute
# xpath syntax explained:
# // = select all tags that match our expression regardless of location in doc
# * = match any tag
# [@class] = keep only the tags that have a class attribute
for tag in html.xpath('//*[@class]'):
    # For each element with a class attribute, remove that class attribute
    tag.attrib.pop('class')

# Print out our "After"
print(lxml.html.tostring(html, encoding='unicode'))
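
As an alternative to the xpath loop above, lxml.etree also has a strip_attributes helper that removes the named attributes from an element and all of its descendants. Here is a minimal sketch of wrapping it in a function; the function name strip_class_attributes and the sample markup are my own, not from the original answer:

```python
import lxml.html
from lxml import etree


def strip_class_attributes(html_string):
    """Return html_string with every class attribute removed (hypothetical helper)."""
    root = lxml.html.fromstring(html_string)
    # Removes 'class' from root and all descendants; other attributes are untouched
    etree.strip_attributes(root, 'class')
    return lxml.html.tostring(root, encoding='unicode')


# Sample markup made up for illustration
sample = '<div class="outer"><p class="inner">Hi</p><span id="x">there</span></div>'
result = strip_class_attributes(sample)
print(result)
```

The xpath version is handier if you later want to be selective, for example only touching certain tags, while strip_attributes is shorter when you simply want the attribute gone everywhere.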