A Simple Crawler Architecture
Sequence diagram (figure omitted)
- URL manager
Manages the set of URLs waiting to be crawled and the set of URLs that have already been crawled.
Keeping these two collections (crawled and not-yet-crawled) separate prevents the same page from being crawled twice and stops the crawler from looping; a minimal sketch follows.
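For concreteness, the scheduler shown at the end of this post calls add_new_url, add_new_urls, has_new_url and get_new_url on the URL manager. A minimal sketch built on two sets (an illustration only, not the code from the download link) could look like this:

class UrlManager(object):
    """Keeps one set of URLs still to crawl and one set of URLs already crawled."""
    def __init__(self):
        self.new_urls = set()  # not yet crawled
        self.old_urls = set()  # already crawled

    def add_new_url(self, url):
        # only accept a URL that has never been seen in either set
        if url is not None and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls:
            for url in urls:
                self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        # move the URL from the "new" set to the "old" set so it is never crawled twice
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url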
- Web page downloader
A tool that downloads the page behind a given URL from the internet to the local machine.
Implemented here with Python's urllib2 module.
An example of a page downloader:
#coding=utf-8
import urllib2
import cookielib

url = "http://www.baidu.com"
request = urllib2.Request(url)
request.add_header("user-agent", "Mozilla/5.0")  # pretend to be an ordinary browser
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)  # make urllib2 handle cookies
response = urllib2.urlopen(request)
print response.getcode()  # HTTP status code; 200 means success
print cj  # cookies collected during the request
fout = open("baidu.txt", "w")
fout.write(response.read())  # save the page content to a local file
fout.close()
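The scheduler at the end of this post expects the downloader to expose a download(url) method that returns the page content (or None on failure). Wrapping the urllib2 calls above into such a class might look like the following sketch; it is an illustration rather than the code from the download link:

# -*- coding: utf-8 -*-
import urllib2

class HtmlDownloader(object):
    """Downloads a page and returns its content as a string, or None on failure."""
    def download(self, url):
        if url is None:
            return None
        request = urllib2.Request(url)
        request.add_header("user-agent", "Mozilla/5.0")  # plain browser user agent
        response = urllib2.urlopen(request)
        if response.getcode() != 200:  # treat anything other than HTTP 200 as a failure
            return None
        return response.read()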
- Web page parser (BeautifulSoup)
BeautifulSoup is a powerful third-party Python library for parsing web pages; it can use either html.parser or lxml as its underlying parser.
The parser's job is to parse the downloaded page content and extract both the valuable data and any new URLs, which the scheduler keeps adding back to the URL manager.
A small example:
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser',from_encoding='utf-8')
for node in soup.find_all('a'):  # find every <a> tag
    print node.name, node['href'], node.get_text()
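In the scheduler below, the parser is called as parser.parse(new_url, html_cont) and must return a pair (new_urls, new_data). The sketch that follows shows one way to write it with BeautifulSoup for Baidu Baike entry pages; the /view/\d+.htm link pattern and the lemmaWgt-lemmaTitle-title / lemma-summary class names are assumptions about the page structure and may well be outdated, which is exactly why the closing note recommends re-analyzing the target page.

# -*- coding: utf-8 -*-
import re
import urlparse
from bs4 import BeautifulSoup

class HtmlParser(object):
    def parse(self, page_url, html_cont):
        if page_url is None or html_cont is None:
            return set(), None
        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data

    def _get_new_urls(self, page_url, soup):
        new_urls = set()
        # assumed pattern: entry links such as /view/123456.htm
        links = soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
        for link in links:
            new_url = urlparse.urljoin(page_url, link['href'])  # make the link absolute
            new_urls.add(new_url)
        return new_urls

    def _get_new_data(self, page_url, soup):
        data = {'url': page_url}
        # assumed selectors: the entry title and the summary paragraph
        title_node = soup.find('dd', class_='lemmaWgt-lemmaTitle-title')
        if title_node is not None and title_node.find('h1') is not None:
            data['title'] = title_node.find('h1').get_text()
        summary_node = soup.find('div', class_='lemma-summary')
        if summary_node is not None:
            data['summary'] = summary_node.get_text()
        return data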
Before crawling, first analyze the structure of the target page (the page-analysis screenshot is omitted here).
Finally, here is the scheduler code:
import url_manager, html_downloader, html_parser, html_outputer

class SpiderMain:
    def __init__(self):
        self.urls = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownloader()
        self.parser = html_parser.HtmlParser()
        self.outputer = html_outputer.HtmlOutputer()

    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print 'craw %d: %s' % (count, new_url)
                html_cont = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_cont)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)
                if count == 100:  # stop after 100 pages
                    break
                count = count + 1
            except Exception, e:  # skip pages that fail to download or parse
                print e
                print 'craw failed'
        self.outputer.output_html()

if __name__ == '__main__':
    root_url = 'http://baike.baidu.com/view/21087.htm'
    spider = SpiderMain()
    spider.craw(root_url)
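The one remaining module the scheduler imports is html_outputer. A minimal sketch that collects the parsed dictionaries and dumps them into an HTML table (again an illustration, assuming the data layout produced by the parser sketch above) could be:

# -*- coding: utf-8 -*-

class HtmlOutputer(object):
    """Collects the data dictionaries produced by the parser and writes them out as an HTML table."""
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is not None:
            self.datas.append(data)

    def output_html(self):
        fout = open('output.html', 'w')
        fout.write('<html><head><meta charset="utf-8"></head><body><table>')
        for data in self.datas:
            fout.write('<tr>')
            fout.write('<td>%s</td>' % data.get('url', ''))
            fout.write('<td>%s</td>' % data.get('title', '').encode('utf-8'))
            fout.write('<td>%s</td>' % data.get('summary', '').encode('utf-8'))
            fout.write('</tr>')
        fout.write('</table></body></html>')
        fout.close()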
The full code can be downloaded here: http://download.youkuaiyun.com/detail/zxc123e/9506792
A portion of the crawled content (output screenshot omitted).
Note: web pages are constantly being updated and restructured, so if an exception occurs while the crawler runs, re-analyze the target page and adjust the program accordingly before it can crawl correctly.