Using SGMLParser to extract the text of href links from a web page

Latest recommended article published 2025-08-08 17:00:29

made_in_chn

License: CC 4.0 BY-SA

Tags: attributes hyperlink list import class object

Article link: https://blog.youkuaiyun.com/made_in_chn/article/details/1046049

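This article presents a simple SGML parser that fetches the links on a given web page together with their descriptive text. It subclasses sgmllib.SGMLParser and overrides a few of its methods to handle the HTML anchor tag, collecting every link on the page along with its description.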

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sgmllib

class MyParser(sgmllib.SGMLParser):
    "A simple parser class."
    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."

        sgmllib.SGMLParser.__init__(self, verbose)
        self.urls = []
        self.descriptions = []
        self.inside_a_element = 0
        self.starting_description = 0
        self.href = ""

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."

        for name, value in attributes:
            if name == "href":
                self.urls.append(value)
                self.inside_a_element = 1
                self.href = value
                self.starting_description = 1


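    def end_a(self):
        "Record the end of a hyperlink."

        # Close the description only if this link had an href and some text;
        # otherwise there is no open "<a href=...>" string to terminate.
        if self.inside_a_element and not self.starting_description:
            self.descriptions[-1] += "</a>"
        self.inside_a_element = 0


    def handle_data(self, data):
        "Handle the textual 'data'."
        if self.inside_a_element:
            if self.starting_description:
                # First text chunk of this link: open a new description.
                self.descriptions.append("<a href=%s>" % self.href + data)
                self.starting_description = 0
            else:
                # Later chunks (e.g. after a nested tag) extend the same link.
                self.descriptions[-1] += data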

    def get_urls(self):
        "Return the list of urls."

        return self.urls

    def get_descriptions(self):
        "Return a list of descriptions."

        return self.descriptions

# sgmllib is already imported above; only urllib is still needed here.
import urllib

# Get something to work with.
f = urllib.urlopen("http://dzh.mop.com/dwdzh/list_46_0_0.html")
s = f.read()
#print s
# Try and process the page.
# The class should have been defined first, remember.
myparser = MyParser()
myparser.parse(s)

# Get the urls.
#print myparser.get_urls()
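# Print each link description; avoid naming the loop variable 'list',
# which would shadow the built-in type.
for description in myparser.get_descriptions():
    print description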
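
The script above targets Python 2: the sgmllib module was removed in Python 3. As a rough sketch of the same idea on Python 3, the standard html.parser.HTMLParser class can be subclassed in much the same way. This is a substitute technique, not the original author's code; the names LinkTextParser, extract_links and sample are made up for illustration, and the sample HTML is an inline string rather than a fetched page.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._chunks = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Only anchors that carry an href start a new link.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._chunks = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) is gathered as well.
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # Store the link once its closing </a> is reached.
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._chunks).strip()))
            self._href = None


def extract_links(html):
    """Hypothetical helper: parse an HTML string and return its links."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Illustrative input, not taken from the page used in the article.
sample = ('<p><a href="/a.html">First <b>bold</b> link</a> and '
          '<a name="x">anchor</a> <a href="/b.html">Second</a></p>')
for href, text in extract_links(sample):
    print(href, text)
```

Buffering the text chunks and storing the pair only in handle_endtag keeps the text of nested tags attached to the right link, and anchors without an href are skipped entirely.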