Python crawler: using regular expressions to strip unwanted page elements

Latest recommended article published on 2024-10-19 22:54:18

weixin_33701564

Tags: crawler, python, php

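This article describes a way to clean scraped data with regular expressions: it removes tags of a specific format from a page so that only the clean text remains. It works through examples of differently formatted data and gives a solution.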

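When writing a crawler, we often do not want HTML comments or other page elements to show up in the scraped content. Is there a good way to get rid of them?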

For example, consider the following cases.

Case 1
<ahref="http://artso.artron.net/auction/search_auction.php?keyword=%E6%9E%97%E7%BB%8D%E5%91%A8"target="_blank">林绍周（明）</a>辑</td>

Desired result: 林绍周（明）辑


Case 2

<ahref="http://artso.artron.net/auction/search_auction.php?keyword=%E9%92%9F%E6%83%BA"target="_blank">钟惺（明）</a><ahref="http://artso.artron.net/auction/search_auction.php?keyword=%E8%B0%AD%E5%85%83%E6%98%A5"target="_blank">谭元春</a>辑</td>

Desired result: 钟惺（明）谭元春辑

Case 3

<ahref="http://artso.artron.net/auction/search_auction.php?keyword=%E8%90%A7%E5%A8%B4"target="_blank">萧娴（1902～1997）</a></td>

Desired result: 萧娴（1902～1997）

For all three cases, you can try the regex sub method to extract the text:

# -*- coding: utf-8 -*-
import re

newline = """<ahref="http://artso.artron.net/auction/search_auction.php?keyword=%E6%96%87%E7%83%BA"target="_blank">文烺</a><ahref="http://artso.artron.net/auction/search_auction.php?keyword=%E6%9D%8E%E9%93%A0"target="_blank">李铠</a>等</td>"""

# Match each opening anchor tag, from <ahref= through target="_blank">
re_comment = re.compile('<ahref=[^>]*target="_blank">')
newlines = re_comment.sub('', newline)
# Replace each </a> with a space and drop the trailing </td>
print(newlines.replace('</a>', ' ').replace('</td>', ''))

Output:

C:\Python27\python.exe C:/Users/xuchunlin/PycharmProjects/A9_25/haiwai__guanwang/0/qq.py
文烺 李铠 等

Process finished with exit code 0
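
The same idea can be wrapped in a small reusable function and checked against the three cases above. This is only a sketch, not the author's code: the names ANCHOR_OPEN, CLOSING_TAGS and strip_anchor_tags are hypothetical, it assumes Python 3, and its pattern also accepts an ordinary `<a href` with a space, which the samples above do not contain. Unlike the snippet above, it removes `</a>` without inserting a space, which matches the desired results listed for the three cases.

```python
import re

# Hypothetical names, for illustration only.
# \s* lets the pattern match both "<ahref=" (as in the samples) and "<a href=".
ANCHOR_OPEN = re.compile(r'<a\s*href=[^>]*>')
CLOSING_TAGS = re.compile(r'</a>|</td>')

BASE = 'http://artso.artron.net/auction/search_auction.php?keyword='


def strip_anchor_tags(fragment):
    """Remove opening anchor tags and closing </a>, </td> tags, keeping the text."""
    without_open = ANCHOR_OPEN.sub('', fragment)
    return CLOSING_TAGS.sub('', without_open).strip()


samples = [
    # Case 1
    '<ahref="' + BASE + '%E6%9E%97%E7%BB%8D%E5%91%A8"target="_blank">林绍周（明）</a>辑</td>',
    # Case 2
    '<ahref="' + BASE + '%E9%92%9F%E6%83%BA"target="_blank">钟惺（明）</a>'
    '<ahref="' + BASE + '%E8%B0%AD%E5%85%83%E6%98%A5"target="_blank">谭元春</a>辑</td>',
    # Case 3
    '<ahref="' + BASE + '%E8%90%A7%E5%A8%B4"target="_blank">萧娴（1902～1997）</a></td>',
]

for sample in samples:
    print(strip_anchor_tags(sample))
# 林绍周（明）辑
# 钟惺（明）谭元春辑
# 萧娴（1902～1997）
```

Regular expressions are adequate for small, regular fragments like these. For full pages with nested or irregular markup, an HTML parser is usually the more robust choice.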
