Extracting Information from Rich Text and Removing HTML or XML Tags

Latest recommended article published on 2024-11-30 17:46:57


License: CC 4.0 BY-SA

Article tags:

#xml #html #java #javascript

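To extract information from rich text and strip its HTML or XML tags, you can use a variety of programming languages and libraries. The sections below show example approaches in several popular languages.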
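
To make the goal concrete first, here is a minimal, dependency-free sketch of tag removal that uses only Python's standard-library `html.parser` module. It is not taken from the original article: the class name `TagStripper` and the helper `strip_tags` are hypothetical names chosen for illustration, and the approach is a simple event-based baseline rather than a full parse tree.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper, not a library class)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        self._chunks.append(data)

    def get_text(self):
        # Join the collected text runs in document order.
        return "".join(self._chunks)


def strip_tags(markup: str) -> str:
    """Return the text content of an HTML/XML fragment with the tags removed."""
    parser = TagStripper()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


print(strip_tags('<p class="title"><b>The Dormouse\'s story</b></p>'))
# prints: The Dormouse's story
```

This baseline keeps the contents of `<script>` and `<style>` elements as ordinary text and does not normalize whitespace, so a dedicated library is usually the better choice for real-world pages.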

1. Python (using `BeautifulSoup`)

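BeautifulSoup is a Python library for extracting data from HTML or XML documents. It builds a parse tree that can be traversed and searched to pull out the data you need.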

from bs4 import BeautifulSoup  
  
html_doc = """  
<html><head><title>The Dormouse's story</title></head>  
<body>  
<p class="title"><b>The Dormouse's story</b></p>  
<p class="story">Once upon a time there were three little sisters; and their names were  
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,  
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and  
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;  
and they lived at the bottom of a well.</p>  
</body></html>  
"""  
  
soup = BeautifulSoup(html_doc, 'html.parser')  
  
# Use get_text() to strip all tags and keep only the text
text = soup.get_text()