This chapter covers three main approaches to managing data, which suit the vast majority of applications. If you plan to build the back end of a website or your own API, you will probably want your scraper to write data into a database. If you need a quick and simple way to collect documents from the web and save them to disk, you will probably want to create a file stream. And if you want to be alerted about occasional events, or to receive a daily summary of the data collected that day, just have the scraper send you an email!
5.1 Media Files
Drawbacks of storing only a file's URL:
- URLs embedded in your own site or app that point to files hosted on someone else's site are called hotlinks. Hotlinking can bring you no end of trouble, and most sites take measures to block it.
- Because the linked files live on someone else's server, your application runs at the mercy of that server.
- Hotlinked content is easy to change. If you hotlink an image on your blog and the remote server notices, the image may well be swapped for something embarrassing. And if you store URLs to use later, by the time you need them the links may be dead or point to something completely unrelated.
- On the other hand, real web browsers don't just request a page's HTML and move on; they also download every resource the page references. Downloading files therefore makes your scraper look more like a person browsing the site, which can actually work in your favor.
In Python 3.x, urllib.request.urlretrieve downloads a file given its URL:
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html, "html.parser")
imageLocation = bsObj.find("a", {"id":"logo"}).find("img")["src"]
urlretrieve(imageLocation, "logo.jpg")
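Hardcoding the output name "logo.jpg" only works for one known file. As a minimal sketch (the helper name is my own, and it assumes the URL ends in a usable filename), you could derive the local filename from the URL itself before calling urlretrieve:

from urllib.parse import urlparse
import os

def filenameFromUrl(fileUrl, default="download.bin"):
    # Use the last path segment of the URL as the local filename
    path = urlparse(fileUrl).path
    name = os.path.basename(path)
    return name if name else default

urlretrieve(imageLocation, filenameFromUrl(imageLocation))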
Downloading every resource on a page that has a src attribute:
# -*- coding: utf-8 -*-
import os
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup
downloadDirectory = "downloaded"
baseUrl = "http://pythonscraping.com"
def getAbsoluteURL(baseUrl, source):
    # Normalize the src value into an absolute URL on the target site
    if source.startswith("http://www."):
        url = "http://" + source[11:]
    elif source.startswith("http://"):
        url = source
    elif source.startswith("www."):
        url = "http://" + source[4:]
    else:
        url = baseUrl + "/" + source
    # Skip resources hosted on other sites
    if baseUrl not in url:
        return None
    return url
def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
    # Mirror the URL path under the local download directory
    path = absoluteUrl.replace("www.", "")
    path = path.replace(baseUrl, "")
    path = downloadDirectory + path
    directory = os.path.dirname(path)
    if not os.path.exists(directory):
        os.makedirs(directory)
    return path
html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html, "html.parser")
downloadList = bsObj.findAll(src=True)
for download in downloadList:
    fileUrl = getAbsoluteURL(baseUrl, download["src"])
    if fileUrl is not None:
        print(fileUrl)
        urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory))
Note: this program does not check the type of the files it downloads before saving them, so be careful what you let it fetch. Don't run it with administrator privileges, back up important files regularly, and don't keep sensitive information on the disk it writes to.
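If you want to be more selective about what gets saved, one approach (a sketch of my own; the extension whitelist below is an illustrative assumption, not from the book) is to skip any URL whose extension is not on an allowed list before calling urlretrieve, reusing getAbsoluteURL and getDownloadPath from the script above:

from urllib.parse import urlparse
import os

# Illustrative whitelist of extensions considered safe to download
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".css", ".js"}

def isAllowed(fileUrl):
    extension = os.path.splitext(urlparse(fileUrl).path)[1].lower()
    return extension in ALLOWED_EXTENSIONS

for download in downloadList:
    fileUrl = getAbsoluteURL(baseUrl, download["src"])
    if fileUrl is not None and isAllowed(fileUrl):
        urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory))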
5.2 Storing Data to CSV
CSV (comma-separated values) is one of the most common formats for storing spreadsheet-style data. Writing a CSV file with Python's csv module:
import csv

csvFile = open("test.csv", 'w+', newline='')
try:
    writer = csv.writer(csvFile)
    writer.writerow(('number', 'number plus 2', 'number times 2'))
    for i in range(10):
        writer.writerow((i, i+2, i*2))
finally:
    csvFile.close()
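To check what was written, you can read the file back with csv.reader (a minimal sketch; the printed layout is just a list of strings per row):

import csv

with open("test.csv", newline='') as csvFile:
    reader = csv.reader(csvFile)
    for row in reader:
        # Each row comes back as a list of string values
        print(row)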
Converting an HTML table into a CSV file:
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://en.wikipedia.org/wiki/Comparison_of_text_editors")
bsObj = BeautifulSoup(html, "html.parser")
# The main comparison table is the first table on the page
table = bsObj.findAll("table", {"class":"wikitable"})[0]
rows = table.findAll("tr")
csvFile = open("editors.csv", 'wt', newline="", encoding='utf-8')
writer = csv.writer(csvFile)
try:
    for row in rows:
        csvRow = []
        for cell in row.findAll(['td', 'th']):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)
finally:
    csvFile.close()
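The script above only exports the first wikitable on the page. If a page has several, a minimal sketch (the per-table file naming is my own choice) is to loop over all of them, reusing bsObj from the script above:

tables = bsObj.findAll("table", {"class": "wikitable"})
for index, table in enumerate(tables):
    # One CSV file per table: editors_0.csv, editors_1.csv, ...
    with open("editors_{}.csv".format(index), 'wt', newline="", encoding='utf-8') as csvFile:
        writer = csv.writer(csvFile)
        for row in table.findAll("tr"):
            writer.writerow([cell.get_text() for cell in row.findAll(['td', 'th'])])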
5.3 MySQL
Install MySQL (here via apt on Ubuntu/Debian), log in, and create a database and table to hold scraped pages:
$ sudo apt-get install mysql-server
mysql -u root -p
create database scraping;
use scraping;
create table pages (id BIGINT(7) NOT NULL AUTO_INCREMENT, title VARCHAR(200), content VARCHAR(10000), created TIMESTAMP DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (id));
describe pages;
insert into pages (title, content) VALUES ("Test page title", "This is some test page content. It can be up to 10,000 characters long.");
select * from pages where id=1;
select * from pages where title like "%test%";
delete from pages where id = 1;
Integrating with Python (via the PyMySQL package)
import pymysql
conn = pymysql.connect(host='127.0.0.1', unix_socket='/var/run/mysqld/mysqld.sock', user='root', passwd='dong', db='mysql')
cur = conn.cursor()
cur.execute("use scraping")
cur.execute("select * from pages where id=2")
print(cur.fetchone())
cur.close()
conn.close()
The connection/cursor model is a common pattern in database programming. The connection, besides connecting to the database, also sends database information, handles rollbacks, creates new cursor objects, and so on. A single connection can have many cursors; each cursor keeps track of certain state information, such as which database it is currently using. If you have multiple databases and need to write to all of them, you may need several cursors to manage that. A cursor also holds the result of the last query it executed; by calling cursor methods such as cur.fetchone(), you can retrieve that result.
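As a minimal sketch of that pattern (the connection parameters are placeholders, and the try/finally structure is my own way of making sure both objects are closed):

import pymysql

conn = pymysql.connect(host='127.0.0.1', user='root', passwd='password', db='scraping', charset='utf8')
cur = conn.cursor()
try:
    # The cursor executes queries and holds the result of the last one
    cur.execute("select title from pages where id = %s", (1,))
    print(cur.fetchone())
    # Writes only become permanent after commit on the connection
    cur.execute("insert into pages (title, content) values (%s, %s)", ("A title", "Some content"))
    conn.commit()
finally:
    cur.close()
    conn.close()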
# -*- coding: utf-8 -*-
import pymysql
conn = pymysql.connect(host='127.0.0.1', unix_socket='/var/run/mysqld/mysqld.sock', user='root', passwd='dong', db='mysql', charset='utf8')
cur = conn.cursor()
cur.execute("use scraping")
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re
import json
random.seed(datetime.datetime.now().timestamp())
def store(title, content):
    # Parameterized query; the driver handles quoting, so no quotes around %s
    cur.execute("insert into pages (title, content) values (%s, %s)", (title, content))
    cur.connection.commit()
# https://en.wikipedia.org/wiki/Python_(programming_language)
def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    title = bsObj.find("h1").get_text()
    content = bsObj.find("div", {"id":"mw-content-text"}).find("p").get_text()
    store(title, content)
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))
links = getLinks("/wiki/Kevin_Bacon")
try:
    while len(links) > 0:
        newArticle = links[random.randint(0, len(links)-1)].attrs['href']
        print(newArticle)
        links = getLinks(newArticle)
finally:
    cur.close()
    conn.close()
The "Six Degrees of Wikipedia" game in MySQL
A link from page A to page B is stored as: insert into links (fromPageId, toPageId) values (A, B)
This calls for a database with two tables, one for pages and one for links; each table has its own auto-increment ID and a creation timestamp:
create table pages ( id INT NOT NULL AUTO_INCREMENT, url VARCHAR(255) NOT NULL, created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY(id));
create table links (id INT NOT NULL AUTO_INCREMENT, fromPageId INT NULL, toPageId INT NULL, created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (id));
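Once pages and links are stored this way, hops through the graph can be read back with plain SQL. A sketch, assuming a cursor cur connected to the wikipedia database (as in the script below) and using page id 1 as a placeholder:

# Pages reachable in one link from page 1
cur.execute("select toPageId from links where fromPageId = %s", (1,))
oneHop = [row[0] for row in cur.fetchall()]

# Pages reachable in two links, via a self-join on the links table
cur.execute("select l2.toPageId from links l1 "
            "join links l2 on l1.toPageId = l2.fromPageId "
            "where l1.fromPageId = %s", (1,))
twoHops = [row[0] for row in cur.fetchall()]
print(len(oneHop), len(twoHops))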
# -*- coding: utf-8 -*-
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import pymysql
conn = pymysql.connect(host='127.0.0.1', unix_socket='/var/run/mysqld/mysqld.sock', user='root', passwd='dong', db='mysql', charset='utf8')
cur = conn.cursor()
cur.execute("use wikipedia")
def insertPageIfNotExists(url):
    # Return the page's id, inserting a new row only if the URL is not already stored
    cur.execute("select * from pages where url = %s", (url,))
    if cur.rowcount == 0:
        print("Inserting page: " + url)
        cur.execute("insert into pages (url) values (%s)", (url,))
        conn.commit()
        return cur.lastrowid
    else:
        return cur.fetchone()[0]

def insertLink(fromPageId, toPageId):
    # Store the link only if this (from, to) pair has not been seen before
    cur.execute("select * from links where fromPageId = %s and toPageId = %s",
                (int(fromPageId), int(toPageId)))
    if cur.rowcount == 0:
        print("Inserting link: %s -> %s" % (int(fromPageId), int(toPageId)))
        cur.execute("insert into links (fromPageId, toPageId) values (%s, %s)",
                    (int(fromPageId), int(toPageId)))
        conn.commit()
pages = set()
def getLinks(pageUrl, recursionLevel):
    global pages
    if recursionLevel > 4:
        return
    pageId = insertPageIfNotExists(pageUrl)
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)((?!:).)*$")):
        insertLink(pageId, insertPageIfNotExists(link.attrs['href']))
        if link.attrs['href'] not in pages:
            # A new page: add it to the set and follow the article links inside it
            newPage = link.attrs['href']
            pages.add(newPage)
            getLinks(newPage, recursionLevel + 1)
getLinks("/wiki/Kevin_Bacon", 0)
cur.close()
conn.close()
Note that this program may take several days to finish running.
5.4 Email
Email is transmitted via SMTP (Simple Mail Transfer Protocol), so to send it you need access to a server running SMTP. Sending a simple message:
import smtplib
from email.mime.text import MIMEText
msg = MIMEText("The body of the email is here")
msg['Subject'] = "An Email Alert"
msg['From'] = "ryan@pythonscraping.com"
msg['To'] = "webmaster@pythonscraping.com"
s = smtplib.SMTP('localhost')
s.send_message(msg)
s.quit()
Python has two packages for sending email: smtplib and email. The script below checks whether it is Christmas yet and sends an alert email as soon as it is:
import smtplib
from email.mime.text import MIMEText
from bs4 import BeautifulSoup
from urllib.request import urlopen
import time
def sendMail(subject, body):
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = "christmas_alerts@pythonscraping.com"
    msg['To'] = "ryan@pythonscraping.com"
    s = smtplib.SMTP('localhost')
    s.send_message(msg)
    s.quit()
bsObj = BeautifulSoup(urlopen("https://isitchristmas.com/"), "html.parser")
while bsObj.find("a", {"id":"answer"}).attrs['title'] == "NO":
    print("It is not Christmas yet.")
    time.sleep(3600)
    bsObj = BeautifulSoup(urlopen("https://isitchristmas.com/"), "html.parser")
sendMail("It's Christmas!",
         "According to https://isitchristmas.com, it is Christmas!")
This program checks https://isitchristmas.com/ once an hour (the site answers, based on the date, whether today is Christmas) and sends the alert email as soon as the answer is no longer NO.
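SMTP('localhost') only works if a mail server is running on the machine itself. If it isn't, a sketch of sending through an external, authenticated SMTP server instead (the host, port, addresses, and password below are placeholders you would replace with your provider's values):

import smtplib
from email.mime.text import MIMEText

def sendMailViaProvider(subject, body):
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = "alerts@example.com"   # placeholder sender
    msg['To'] = "me@example.com"         # placeholder recipient
    # SMTP over SSL on port 465; many providers also accept STARTTLS on 587
    s = smtplib.SMTP_SSL("smtp.example.com", 465)
    s.login("alerts@example.com", "app-password")
    s.send_message(msg)
    s.quit()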