Algorithm: Finding the Start and End Positions of Content in HTML

Original article, last modified 2022-05-24 16:00:00

License: CC 4.0 BY-SA

First published 2021-11-22 14:03:24

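This code splits an HTML string into tags and text segments, then reports where each text segment starts and ends. It pushes characters onto a stack one at a time. Whenever a ">" arrives and the stack already holds a "<", it pops the whole stack, cutting at each "<", so that the tag is separated from any text that preceded it. The resulting segments are collected in a result list. A second pass walks that list, skips the segments that are tags (those that begin with "<" and end with ">"), and prints the start and end offsets of each remaining text segment together with the text itself. Note that the code does not distinguish opening tags from closing tags, and a ">" that appears in text with no "<" before it, as in "div data >aaa", stays part of the text.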
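The offsets this code reports are half-open, which matches Python slicing: start is the index of the first character of a text segment and end is one past its last character. A minimal illustration of that convention, using a made-up string (the name sample and its value are assumptions for this sketch, not part of the original code):

```python
# Hypothetical sample string, chosen only to illustrate the offset convention.
sample = "<p>hi</p>"

# "<p>" occupies indices 0..2, so the text "hi" starts at 3 and ends before 5.
start, end = 3, 5
text = sample[start:end]
print(start, end, text)  # 3 5 hi
```

Because end is exclusive, end - start is the length of the segment, and consecutive segments can be chained by using one segment's end as the next one's start.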

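# -*- coding: utf-8 -*-

html = """<div><div>div data >aaa<p>ppp data<a>hello,world.</a></p></div></div>"""

if __name__ == "__main__":
    # Pass 1: split the string into segments (tags and runs of text).
    stack = []
    result = []
    for elem in html:
        stack.append(elem)
        # A ">" closes a tag only if a "<" is waiting on the stack.
        if elem == ">" and "<" in stack:
            tmp_result = []
            tmp_data = []
            # Pop from the end; each "<" marks the start of a segment.
            while stack:
                e = stack.pop()
                tmp_data.insert(0, e)
                if e == "<":
                    tmp_result.insert(0, tmp_data)
                    tmp_data = []

            # Anything left over came before the first "<": it is plain text.
            if tmp_data:
                tmp_result.insert(0, tmp_data)
            if tmp_result:
                result.extend(tmp_result)

    # Text after the final ">" would otherwise stay on the stack and be lost.
    if stack:
        result.append(stack)

    # Pass 2: accumulate offsets and keep only the non-tag segments.
    result_index = []
    current_index = 0
    for index, record in enumerate(result):
        length = len(record)
        if not (record[0] == "<" and record[-1] == ">"):
            result_index.append([index, current_index, current_index + length])

        current_index = current_index + length

    # Print each text segment's index in result, its [start, end) offsets, and the text.
    for index, start, end in result_index:
        print(index, start, end)
        print(html[start:end])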
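For comparison, the same offsets can be computed without an explicit stack. The following is a minimal sketch of an alternative, not the approach used above: the names TAG and text_spans are hypothetical, and the sketch assumes that a tag is any "<...>" run containing no other angle brackets. It agrees with the stack version on the sample string, but it can split text differently when a stray "<" appears inside text.

```python
import re

# Assumption for this sketch: a tag is "<", then any characters other than
# "<" or ">", then ">". Real HTML (comments, quoted attributes) is not handled.
TAG = re.compile(r"<[^<>]*>")


def text_spans(html):
    """Return (start, end) offsets of the text found between tags."""
    spans = []
    pos = 0  # end of the previous tag, i.e. where the next text run may begin
    for match in TAG.finditer(html):
        if match.start() > pos:
            spans.append((pos, match.start()))
        pos = match.end()
    # Text after the last tag, if any.
    if pos < len(html):
        spans.append((pos, len(html)))
    return spans


html = "<div><div>div data >aaa<p>ppp data<a>hello,world.</a></p></div></div>"
for start, end in text_spans(html):
    print(start, end, html[start:end])
# 10 23 div data >aaa
# 26 34 ppp data
# 37 49 hello,world.
```

Neither version is a real HTML parser. For anything beyond simple, well-formed input, a dedicated parser such as the standard library's html.parser module is likely the safer choice.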