A Simple Crawler Architecture
Sequence diagram (figure omitted)
- URL manager
Manages the set of URLs waiting to be crawled and the set of URLs that have already been crawled.
Keeping these two collections (crawled and not-yet-crawled) separate prevents the same page from being crawled twice and stops the crawler from looping; a minimal sketch follows.
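For concreteness, the scheduler shown at the end of this post calls add_new_url, add_new_urls, has_new_url and get_new_url on the URL manager. A minimal sketch built on two sets (an illustration only, not the code from the download link) could look like this:

class UrlManager(object):
    """Keeps one set of URLs still to crawl and one set of URLs already crawled."""
    def __init__(self):
        self.new_urls = set()  # not yet crawled
        self.old_urls = set()  # already crawled

    def add_new_url(self, url):
        # only accept a URL that has never been seen in either set
        if url is not None and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls:
            for url in urls:
                self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        # move the URL from the "new" set to the "old" set so it is never crawled twice
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url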
- Web page downloader
A tool that downloads the page behind a given URL from the internet to the local machine.
Implemented here with Python's urllib2 module.
An example of a page downloader:
#coding=utf-8
import urllib2
import cookielib

url = "http://www.baidu.com"
request = urllib2.Request(url)
request.add_header("user-agent", "Mozilla/5.0")  # pretend to be an ordinary browser
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)  # make urllib2 handle cookies
response = urllib2.urlopen(request)
print response.getcode()  # HTTP status code; 200 means success
print cj  # cookies collected during the request
fout = open("baidu.txt", "w")
fout.write(response.read())  # save the page content to a local file
fout.close()
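The scheduler at the end of this post expects the downloader to expose a download(url) method that returns the page content (or None on failure). Wrapping the urllib2 calls above into such a class might look like the following sketch; it is an illustration rather than the code from the download link:

# -*- coding: utf-8 -*-
import urllib2

class HtmlDownloader(object):
    """Downloads a page and returns its content as a string, or None on failure."""
    def download(self, url):
        if url is None:
            return None
        request = urllib2.Request(url)
        request.add_header("user-agent", "Mozilla/5.0")  # plain browser user agent
        response = urllib2.urlopen(request)
        if response.getcode() != 200:  # treat anything other than HTTP 200 as a failure
            return None
        return response.read()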
- Web page parser (BeautifulSoup)
BeautifulSoup is a powerful third-party Python library for parsing web pages; it can use either html.parser or lxml as its underlying parser.
The parser's job is to parse the downloaded page content and extract both the valuable data and any new URLs, which the scheduler keeps adding back to the URL manager.
A small example:
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser',from_encoding='utf-8')
for node in soup.find_all('a'):  # find every <a> tag
    print node.name, node['href'], node.get_text()
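In the scheduler below, the parser is called as parser.parse(new_url, html_cont) and must return a pair (new_urls, new_data). The sketch that follows shows one way to write it with BeautifulSoup for Baidu Baike entry pages; the /view/\d+.htm link pattern and the lemmaWgt-lemmaTitle-title / lemma-summary class names are assumptions about the page structure and may well be outdated, which is exactly why the closing note recommends re-analyzing the target page.

# -*- coding: utf-8 -*-
import re
import urlparse
from bs4 import BeautifulSoup

class HtmlParser(object):
    def parse(self, page_url, html_cont):
        if page_url is None or html_cont is None:
            return set(), None
        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data

    def _get_new_urls(self, page_url, soup):
        new_urls = set()
        # assumed pattern: entry links such as /view/123456.htm
        links = soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
        for link in links:
            new_url = urlparse.urljoin(page_url, link['href'])  # make the link absolute
            new_urls.add(new_url)
        return new_urls

    def _get_new_data(self, page_url, soup):
        data = {'url': page_url}
        # assumed selectors: the entry title and the summary paragraph
        title_node = soup.find('dd', class_='lemmaWgt-lemmaTitle-title')
        if title_node is not None and title_node.find('h1') is not None:
            data['title'] = title_node.find('h1').get_text()
        summary_node = soup.find('div', class_='lemma-summary')
        if summary_node is not None:
            data['summary'] = summary_node.get_text()
        return data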
Before crawling, first analyze the structure of the target page (the page-analysis screenshot is omitted here).
Finally, here is the scheduler code:
import url_manager, html_downloader, html_parser, html_outputer

class SpiderMain:
    def __init__(self):
        self.urls = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownloader()
        self.parser = html_parser.HtmlParser()
        self.outputer = html_outputer.HtmlOutputer()

    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print 'craw %d: %s' % (count, new_url)
                html_cont = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_cont)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)
                if count == 100:  # stop after 100 pages
                    break
                count = count + 1
            except Exception, e:  # skip pages that fail to download or parse
                print e
                print 'craw failed'
        self.outputer.output_html()

if __name__ == '__main__':
    root_url = 'http://baike.baidu.com/view/21087.htm'
    spider = SpiderMain()
    spider.craw(root_url)
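The one remaining module the scheduler imports is html_outputer. A minimal sketch that collects the parsed dictionaries and dumps them into an HTML table (again an illustration, assuming the data layout produced by the parser sketch above) could be:

# -*- coding: utf-8 -*-

class HtmlOutputer(object):
    """Collects the data dictionaries produced by the parser and writes them out as an HTML table."""
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is not None:
            self.datas.append(data)

    def output_html(self):
        fout = open('output.html', 'w')
        fout.write('<html><head><meta charset="utf-8"></head><body><table>')
        for data in self.datas:
            fout.write('<tr>')
            fout.write('<td>%s</td>' % data.get('url', ''))
            fout.write('<td>%s</td>' % data.get('title', '').encode('utf-8'))
            fout.write('<td>%s</td>' % data.get('summary', '').encode('utf-8'))
            fout.write('</tr>')
        fout.write('</table></body></html>')
        fout.close()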
The full code can be downloaded here: http://download.youkuaiyun.com/detail/zxc123e/9506792
A portion of the crawled content (output screenshot omitted).
Note: web pages are constantly being updated and restructured, so if an exception occurs while the crawler runs, re-analyze the target page and adjust the program accordingly before it can crawl correctly.