Developing a Simple Crawler in Python, by CL (Part 1)

License: CC 4.0 BY-SA

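I. Introduction to Crawlers

Crawler: a program that automatically fetches information from the Internet.

Value: data on the Internet, put to your own use.

What crawled data can power: a news aggregation reader, a funniest-stories app, a prettiest-girls picture site, a book price comparison site, a collection of Python technical articles.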

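II. A Simple Crawler Architecture

1. Crawler scheduler ----> URL manager ----> page downloader ----> page parser ----> valuable data

The three middle components (URL manager ----> page downloader ----> page parser) form a loop; the end goal of that loop is to produce valuable data.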
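The loop described above can be sketched as follows. This is a minimal illustration in Python 3, not the author's implementation: the names `crawl`, `download`, `parse`, and `max_pages` are assumptions, and the downloader and parser are stand-ins backed by an in-memory dictionary so the sketch runs without network access.

```python
from collections import deque


def crawl(root_url, download, parse, max_pages=10):
    """Scheduler loop: URL manager -> downloader -> parser -> valuable data.

    `download(url)` returns page content or None on failure (hypothetical).
    `parse(url, html)` returns (new_urls, data) (hypothetical).
    """
    pending = deque([root_url])  # URLs waiting to be fetched
    seen = {root_url}            # every URL ever queued, so none repeats
    results = []                 # valuable data collected so far

    while pending and len(results) < max_pages:
        url = pending.popleft()
        html = download(url)
        if html is None:         # download failed, skip this page
            continue
        new_urls, data = parse(url, html)
        results.append(data)
        for new_url in new_urls:
            if new_url not in seen:
                seen.add(new_url)
                pending.append(new_url)
    return results


# Stand-in "site": each page maps to (links on the page, data on the page).
FAKE_SITE = {
    "a": (["b", "c"], "data-a"),
    "b": (["a", "c"], "data-b"),
    "c": ([], "data-c"),
}


def fake_download(url):
    # The "page content" is simply the key into FAKE_SITE.
    return url if url in FAKE_SITE else None


def fake_parse(url, html):
    return FAKE_SITE[html]
```

Note that pages "a" and "b" link to each other; the `seen` set is what keeps the loop from crawling in circles.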

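2. Run flow

3. URL manager

URL manager: manages the set of URLs waiting to be fetched and the set of URLs already fetched.

- Prevents fetching the same page twice, and prevents crawling in cycles.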
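One way to keep the two collections is a pair of in-memory Python sets. The sketch below assumes that approach; the class name `UrlManager` and its method names are hypothetical and not taken from the article.

```python
class UrlManager:
    """Tracks URLs waiting to be fetched and URLs already fetched."""

    def __init__(self):
        self.new_urls = set()  # waiting to be fetched
        self.old_urls = set()  # already fetched

    def add_new_url(self, url):
        # Ignore empty URLs and anything already queued or already fetched.
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return bool(self.new_urls)

    def get_new_url(self):
        # Move one URL from the waiting set to the fetched set.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Because a URL moves to `old_urls` the moment it is handed out, adding it again later is a no-op, which is what prevents both repeated and cyclic fetching.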

Ways to implement the URL manager:

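4. Page downloader

Page downloader: a tool that downloads the web page behind a URL from the Internet to the local machine.

Which page downloaders does Python have?

urllib2 (an official base module of Python 2; Python 3 provides the same functionality as urllib.request) and requests (a third-party package, more powerful).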

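Downloading a page with urllib2, method 1: the simplest approach

import urllib2

# Request the URL directly
response = urllib2.urlopen('http://www.baidu.com')

# Get the status code; 200 means the request succeeded
print response.getcode()

# Read the page content
print response.read()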
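The code above is Python 2. As a rough equivalent, assuming Python 3 (where `urllib2` no longer exists and `urllib.request` takes its place), the same request could look like the sketch below. The function name `fetch` and the `timeout` default are assumptions added for illustration.

```python
import urllib.request


def fetch(url, timeout=10):
    """Return (status_code, body_bytes) for the given URL."""
    # The `with` block closes the connection once the body has been read.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.getcode(), response.read()
```

A call such as `status, body = fetch('http://www.baidu.com')` would then check `status == 200` before using `body`, which holds the raw bytes of the page.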

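Downloading a page with urllib2, method 2: adding data and HTTP headers

import urllib
import urllib2

url = 'http://www.baidu.com'

# Create a Request object
request = urllib2.Request(url)

# Attach data; add_data takes a single URL-encoded string, sent as the POST body
request.add_data(urllib.urlencode({'a': '1'}))

# Add an HTTP header (the header name is User-Agent, with a hyphen)
request.add_header('User-Agent', 'Mozilla/5.0')

# Send the request and get the result
response = urllib2.urlopen(request)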
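For readers on Python 3, a sketch of the same idea with `urllib.request` follows. The helper name `build_request` is hypothetical. Nothing is sent over the network here; the request is only constructed so its method, body, and header can be inspected.

```python
import urllib.parse
import urllib.request


def build_request(url, form, user_agent="Mozilla/5.0"):
    """Build a request carrying form data and a User-Agent header."""
    # The body must be bytes; supplying it turns the request into a POST.
    data = urllib.parse.urlencode(form).encode("utf-8")
    request = urllib.request.Request(url, data=data)
    request.add_header("User-Agent", user_agent)
    return request


request = build_request("http://www.baidu.com", {"a": "1"})
```

Passing `request` to `urllib.request.urlopen` would send it, just as `urllib2.urlopen(request)` does in the Python 2 version.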

Downloading a page with urllib2, method 3: adding handlers for special scenarios

(HTTPCookieProcessor, ProxyHandler, HTTPSHandler, HTTPRedirectHandler)

----> opener = urllib2.build_opener(handler)

----> urllib2.install_opener(opener)

Full code:

import urllib2, cookielib

# Create a cookie container
cj = cookielib.CookieJar()

# Create an opener that stores cookies in the container
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# Install the opener into urllib2
urllib2.install_opener(opener)

# Access the page with the cookie-enabled urllib2
response = urllib2.urlopen('http://www.baidu.com/')
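In Python 3 the two modules used above are named `http.cookiejar` and `urllib.request`. The sketch below assumes Python 3; the helper name `make_cookie_opener` is hypothetical. It only builds the opener and performs no network access.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return an opener that remembers cookies, plus its cookie jar."""
    cookie_jar = http.cookiejar.CookieJar()
    handler = urllib.request.HTTPCookieProcessor(cookie_jar)
    opener = urllib.request.build_opener(handler)
    return opener, cookie_jar


opener, cookie_jar = make_cookie_opener()
```

Calling `opener.open(url)` uses the cookie handler for that one opener only, whereas `urllib.request.install_opener(opener)` makes every later `urlopen` call use it, as in the article's version. Which of the two fits better depends on whether the rest of the program should share the cookies.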

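5. Page parser

Page parser: a tool that extracts valuable data from a web page.

HTML page string ----> page parser ----> valuable data and a list of new URLs

Which page parsers does Python have?

Regular expressions (fuzzy matching), and html.parser, BeautifulSoup, and lxml (structured parsing, based on the DOM tree).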
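To make the difference between the two styles concrete, the sketch below extracts links from the same snippet both ways, using only the Python 3 standard library. The HTML snippet and the class name `LinkCollector` are made up for illustration.

```python
import re
from html.parser import HTMLParser

HTML = '<p><a href="/a.html">A</a> and <a class="x" href="/b.html">B</a></p>'

# Fuzzy approach: treat the page as plain text and pattern-match it.
regex_links = re.findall(r'href="([^"]+)"', HTML)


class LinkCollector(HTMLParser):
    """Structured approach: react to tags as the parser walks the markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


collector = LinkCollector()
collector.feed(HTML)
structured_links = collector.links
```

Both approaches find the same two links here. The regular expression, however, would also match an `href` on any other tag or inside a comment, while the structured version only looks at real `a` tags.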

Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Installing BeautifulSoup (either command works):

1. pip install beautifulsoup4

2. easy_install beautifulsoup4

Using BeautifulSoup: this blog post is a useful reference (http://blog.youkuaiyun.com/watsy/article/details/14161201)

Example code:

from bs4 import BeautifulSoup
import re
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
The Dormouse's story
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
...
"""
# Build the DOM tree with the built-in html.parser backend
soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')

# Find every <a> tag and print its tag name, href attribute, and text
links = soup.find_all('a')
for link in links:
    print link.name, link['href'], link.get_text()

# Find a single <a> tag by its id attribute
link_node = soup.find('a', id='link2')
print link_node['href'], link_node.get_text()

# Find the first <a> tag whose href matches a regular expression
zhengze = soup.find('a', href=re.compile(r'ill'))
print zhengze.get_text()