Python Web Scraping Basics (Using HTMLParser)


License: CC 4.0 BY-SA

Tags:

#python #web-scraping #HTMLParser


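This article walks through a concrete scraping example. It uses HTMLParser and regular expressions to collect the titles and links of popular articles, then parses the content of the article the reader selects. It shows how far Python's standard library alone can go for page parsing and data scraping.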

The program scrapes the popular articles listed under the 今日神段 tag on pinyin.sogou.com, using HTMLParser and regular expressions.

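from html.parser import HTMLParser
import urllib.request
import re


class Scraper(HTMLParser):
    """Collect the link, target and title of every <a> tag that has a title."""

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            attrs = dict(attrs)
            if "title" in attrs:
                try:
                    page = {}
                    page["link"] = attrs["href"]
                    page["target"] = attrs["target"]
                    page["title"] = attrs["title"]
                    #page["article-id"]=attrs["article-id"]
                    message.append(page)  # module-level list defined below
                except KeyError:
                    # The tag has a title but lacks href or target: report and skip it.
                    print("Skipped a tag with missing attributes:")
                    print(attrs)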


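message = []  # filled in by Scraper.handle_starttag
url = "http://pinyin.sogou.com/zimeiti/tag/%E4%BB%8A%E6%97%A5%E7%A5%9E%E6%AE%B5"
webpage = urllib.request.urlopen(url).read().decode()
parser = Scraper()
parser.feed(webpage)
parser.close()  # close once, after all the data has been fed
# Note: the trailing [<br|</span>] is a character class that matches a single
# character from that set; it is not an alternation between "<br" and "</span>".
pat = '">(&nbsp; )?([^a-z\nA-Z<&]*?)(&nbsp)?[<br|</span>]'
while True:
    # Print the table of contents, numbered from 1.
    for index, each in enumerate(message, start=1):
        print("page:%2d title:%s" % (index, each['title']))
    try:
        num = int(input("Enter the number of the article to read: "))
    except ValueError:
        print("Please enter a number.")
        continue
    if not 1 <= num <= len(message):
        print("No article with that number.")
        continue
    nextpage = urllib.request.urlopen("http://pinyin.sogou.com" + message[num - 1]["link"]).read().decode()
    matches = re.findall(pat, nextpage)
    for ts in matches:
        if ts[1] == '0':
            break  # stop before the ads
        if ts[1] != '':
            print(ts[1])
    print('')
    print('')
    input("Press Enter to return to the article list")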
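
To try the link-collection step without touching the network, here is a minimal sketch that feeds a hand-written HTML snippet to an HTMLParser subclass. The class name LinkCollector, the sample markup, and the /post/... paths are all made up for illustration and are not taken from the real page. Unlike the script above, it keeps the results on the parser instance rather than in a global list, which is one possible design and not the only one.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href/title pairs from <a> tags that carry a title attribute."""

    def __init__(self):
        super().__init__()
        self.links = []  # stored on the instance instead of in a global list

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        if "title" in attrs and "href" in attrs:
            self.links.append({"link": attrs["href"], "title": attrs["title"]})


# Hypothetical sample markup, not copied from the real site.
SAMPLE_HTML = """
<ul>
  <li><a href="/post/1" target="_blank" title="First post">First post</a></li>
  <li><a href="/post/2" title="Second post">Second post</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
collector.close()

# Same listing format as the main script.
for index, page in enumerate(collector.links, start=1):
    print("page:%2d title:%s" % (index, page["title"]))
```

The third link has no title attribute, so it is ignored, just as the main script only records anchors that have one.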