Web Scraping with Python: Reading Notes (Part 1)

Latest recommended article published on 2024-11-02 00:00:00

TerryDev

Category: Web Crawlers. Tags: python, crawler

Article link: https://blog.youkuaiyun.com/sinat_28723265/article/details/82156849

1.2.1 Installing BeautifulSoup

On macOS, first run
```
$ sudo easy_install pip
```
to install pip, Python's package manager, then run
```
$ pip install beautifulsoup4
```

1.2.2 Running BeautifulSoup

Example:

```
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
# Name the parser explicitly; otherwise BeautifulSoup picks one and warns
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)
```

Output:
```
<h1>An Interesting Title</h1>
```

Likewise, each of the following expressions selects the same h1 tag:

```
bsObj.html.body.h1
bsObj.body.h1
bsObj.html.h1
```
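
To check that these access paths really do land on the same tag, here is a minimal sketch that parses an inline HTML string instead of fetching the page. The `sample_html` content is made up for illustration and only approximates page1.html.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page, so no network access is needed
sample_html = (
    "<html><head><title>A Useful Page</title></head>"
    "<body><h1>An Interesting Title</h1><div>Lorem ipsum</div></body></html>"
)
soup = BeautifulSoup(sample_html, "html.parser")

# Each path walks a different route through the tree to the first h1
paths = [soup.h1, soup.html.h1, soup.body.h1, soup.html.body.h1]

# True when every path returns the very same Tag object
same_tag = all(tag is soup.h1 for tag in paths)
print(same_tag)
```

Each attribute access returns the first matching descendant, so the shorter forms are just shortcuts for the fully spelled-out path.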

1.2.3 Connecting Reliably

A crawler can run into all kinds of errors, so it needs exception handling.
```
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
```
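Two main things can go wrong on this line:
1. The page is not found on the server (or there was an error retrieving it)
2. The server is not found

In the first case urlopen raises an HTTPError, which we can handle like this:

```
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # Return a null value, break, or fall back to another plan
else:
    # The program continues. Note: if you return or break in the
    # exception handler above, you do not need the else statement
    pass
```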

Example with exception handling added:

```
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup


def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        # The page could not be retrieved
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        # bsObj.body is None, so there is no h1 to read
        return None
    return title


title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)
```
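
The examples above catch only HTTPError, which covers the first failure case. For the second case (the server cannot be reached), urlopen raises URLError instead. The sketch below handles both. The function name `fetch_html` and its `opener` parameter are hypothetical, added only so the function can be tried without a network connection.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch_html(url, opener=urlopen):
    """Return the response body as bytes, or None if the request fails.

    `opener` is a hypothetical hook for illustration; it defaults to urlopen.
    """
    try:
        response = opener(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error:", e.code)
        return None
    except URLError as e:
        # No server could be reached (unknown host, refused connection, ...)
        print("Server could not be reached:", e.reason)
        return None
    return response.read()
```

A call such as `fetch_html("http://www.pythonscraping.com/pages/page1.html")` then returns either the page bytes or None, and the caller needs only a single None check.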

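2.2 Another Serving of BeautifulSoup (finding content by CSS class)

The following example scrapes the green text inside the page's span tags:

```
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
# The "lxml" parser requires the lxml package to be installed
bsObj = BeautifulSoup(html.read(), "lxml")
green_spans = bsObj.find_all("span", class_="green")
for span in green_spans:
    print(span.get_text())
```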
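
The same class-based lookup can be tried offline. This is a minimal sketch on a made-up fragment (`sample_html` is an assumption, not the real warandpeace.html). It also shows that get_text() drops any nested tags and keeps only the text.

```python
from bs4 import BeautifulSoup

# Made-up fragment in the style of the page: names in green, speech in red
sample_html = (
    '<p><span class="red">Well, Prince</span> said '
    '<span class="green">Anna <i>Pavlovna</i></span> to '
    '<span class="green">the prince</span>.</p>'
)
soup = BeautifulSoup(sample_html, "html.parser")

# class_ (with the trailing underscore) avoids clashing with Python's keyword
green_names = [span.get_text() for span in soup.find_all("span", class_="green")]
print(green_names)
```

Because get_text() discards markup, it is usually best called last, after all tag-based searching is done.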

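2.2.1 find() and findAll() with BeautifulSoup

These two functions are very similar, as their definitions in the BeautifulSoup documentation show:

```
findAll(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)
```

For example, the following code returns a list of all the heading tags in the HTML document: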
```
.findAll({"h1","h2","h3","h4","h5","h6"})
```
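The attributes argument takes a Python dictionary of a tag's attributes and their matching values. For example, the following call returns both the green and the red span tags in the HTML document: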
```
.findAll("span", {"class":{"green", "red"}})
```
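The recursive argument is a Boolean. How many levels of the HTML document's tag structure do you want to search? If recursive is True, findAll looks through the children of the tag it is called on, and the children's children, for tags that match your parameters. If recursive is False, findAll looks only at that tag's direct children (the top-level tags when it is called on the whole document). By default findAll searches recursively (recursive defaults to True); it is generally best to leave this as is, unless you really know what you need and scraping speed matters.
The text argument is different in that it matches on the text content of the tags rather than on their attributes. For example, to count how many times "the prince" appears surrounded by tags on the earlier page, we can replace the previous findAll call with the following:
```
nameList = bsObj.findAll(text="the prince")
print(len(nameList))
```
The output is "7".
The limit argument, obviously, applies only to findAll; find is equivalent to findAll with limit set to 1. Set it if you only care about the first x results from the page.
There is also a keyword argument that lets you select tags that have a particular attribute. For example: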
```
allText = bsObj.findAll(id="text")
print(allText[0].get_text())
```
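
To see all of these arguments side by side, here is a minimal sketch on a made-up fragment (`sample_html` is an assumption for illustration). It uses find_all and string, the current Beautiful Soup 4 spellings of findAll and text; the older names used in these notes are kept as aliases.

```python
from bs4 import BeautifulSoup

# Made-up fragment; the real page is much longer
sample_html = (
    '<div id="text">'
    "<h1>War and Peace</h1>"
    "<h2>Chapter 1</h2>"
    '<p><span class="red">Well, Prince</span> said '
    '<span class="green">Anna Pavlovna</span> to '
    '<span class="green">the prince</span></p>'
    "</div>"
)
soup = BeautifulSoup(sample_html, "html.parser")

# tag: a list of names matches any of them
headings = soup.find_all(["h1", "h2", "h3"])

# attributes: a list of values matches any of them
colored = soup.find_all("span", {"class": ["green", "red"]})

# recursive=False: only direct children of the object being searched,
# and the only direct child of the document here is the div
top_level = soup.find_all("span", recursive=False)

# string (formerly text): matches text nodes equal to the given string
exact_text = soup.find_all(string="the prince")

# limit: stop after the first match
first_green = soup.find_all("span", class_="green", limit=1)

# keyword: filter on an attribute by name
by_id = soup.find_all(id="text")

print(len(headings), len(colored), len(top_level), len(exact_text))
```

Note that the string filter matches the whole text node exactly, which is why "Well, Prince" is not counted.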

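2.2.3 Navigating Trees

Finding tags by their location in the document is what tree navigation is for.

1. Dealing with children and other descendants

In general, BeautifulSoup functions always deal with the descendants of the currently selected tag. For example, bsObj.body.h1 selects the first h1 tag that is a descendant of the body tag; it will not find tags located outside the body.
Similarly, bsObj.div.findAll("img") finds the first div tag in the document, then retrieves a list of all the img tags that are descendants of that div.

If you want only the direct children, you can use the .children attribute: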

```
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "lxml")
for child in bsObj.find("table", {"id": "giftList"}).children:
    print(child)
```

This code prints all the product rows in the giftList table.
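
The difference between children and descendants is easiest to see by counting them. The sketch below uses a made-up, whitespace-free table (`sample_html` is an assumption and far smaller than page3.html) and compares .children with the .descendants attribute.

```python
from bs4 import BeautifulSoup

# Made-up miniature of the gift table, written without whitespace between tags
sample_html = (
    '<table id="giftList">'
    "<tr><th>Item</th><th>Cost</th></tr>"
    "<tr><td>Vegetable Basket</td><td>$15.00</td></tr>"
    "<tr><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>"
    "</table>"
)
soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", {"id": "giftList"})

# Direct children only: the three rows
children = list(table.children)

# Every nested node: rows, cells, and the text inside each cell
descendants = list(table.descendants)

print(len(children), len(descendants))
```

On the real page the whitespace between tags also appears as text nodes, so the counts there will be higher than in this compact sample.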

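2. Dealing with siblings

BeautifulSoup's next_siblings attribute makes it trivial to collect data from tables, especially ones with title rows:

```
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.pythonscraping.com/pages/page3.html"

html = urlopen(url)
soup = BeautifulSoup(html, "lxml")

for sibling in soup.find("table", id="giftList").tr.next_siblings:
    print(sibling)
```

This code prints every product row in the table except the first row, the title row: a tag is never its own sibling, and next_siblings yields only the siblings that come after it.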

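As a complement to next_siblings, previous_siblings is often helpful when the last tag in a group of siblings is the one that is easy to select.
There are also next_sibling and previous_sibling, which work much like next_siblings and previous_siblings but return a single tag rather than a group of them.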
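
As a rough illustration of all four sibling attributes, the sketch below works on a made-up, whitespace-free table (`sample_html` is an assumption, not the real page3.html).

```python
from bs4 import BeautifulSoup

# Made-up miniature of the gift table, written without whitespace between tags
sample_html = (
    '<table id="giftList">'
    "<tr><th>Item</th><th>Cost</th></tr>"
    "<tr><td>Vegetable Basket</td><td>$15.00</td></tr>"
    "<tr><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>"
    "</table>"
)
soup = BeautifulSoup(sample_html, "html.parser")
header_row = soup.find("table", id="giftList").tr

# Everything after the title row, in document order
data_rows = list(header_row.next_siblings)
items = [row.td.get_text() for row in data_rows]

# Starting from the last row and walking backwards (nearest sibling first)
last_row = data_rows[-1]
rows_before_last = list(last_row.previous_siblings)

# The singular forms return one node, or None when there is nothing there
first_data_row = header_row.next_sibling
nothing_before_header = header_row.previous_sibling

print(items)
```

On the real page the whitespace between rows also counts as a sibling, so the singular attributes may return a text node rather than a tag.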

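3. Dealing with parents

Occasionally, in special cases, you will also use BeautifulSoup's parent-finding attributes, parent and parents. For example:

```
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
img = bsObj.find("img", {"src": "../img/gifts/img1.jpg"})
print(img.parent.previous_sibling.get_text())
```

The diagram below shows part of the structure of the HTML page we are working with, with the steps numbered: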

• <tr>
— <td>
— <td> 
— <td>(3)
    — "$15.00" (4) 
— <td>(2)
    — <img src="../img/gifts/img1.jpg"> (1)
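(1) Select the image tag with src="../img/gifts/img1.jpg";
(2) Select the image tag's parent (in this example, a <td> tag);
(3) Select the previous_sibling of that <td> tag (in this example, the <td> tag containing the dollar price);
(4) Select the text inside that tag, "$15.00".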
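
The four steps can be traced one at a time on an inline copy of that row. This is a minimal sketch: `sample_html` is a made-up, whitespace-free reconstruction of the row, and the "Description" cell text is a placeholder.

```python
from bs4 import BeautifulSoup

# Made-up single row mirroring the structure in the diagram
sample_html = (
    "<table><tr>"
    "<td>Vegetable Basket</td>"
    "<td>Description</td>"
    "<td>$15.00</td>"
    '<td><img src="../img/gifts/img1.jpg"></td>'
    "</tr></table>"
)
soup = BeautifulSoup(sample_html, "html.parser")

img = soup.find("img", {"src": "../img/gifts/img1.jpg"})  # step (1)
cell = img.parent                                         # step (2)
price_cell = cell.previous_sibling                        # step (3)
price = price_cell.get_text()                             # step (4)
print(price)

# parents (plural) walks all the way up the tree from the image tag
ancestor_names = [tag.name for tag in img.parents]
print(ancestor_names[:3])
```

Splitting the chain into named steps makes it easier to see which step returned None when a lookup like this fails on a real page.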