This chapter covers three main approaches to managing data, which suit the vast majority of applications. If you plan to build the back end of a website or your own API, you will probably want your scraper to write data into a database. If you need a quick and simple way to collect documents from the web and save them to disk, you will probably want to create a file stream. And if you want to be alerted about occasional events, or to receive a daily summary of the data collected that day, just have the scraper send you an email!
5.1 Media Files
Drawbacks of storing only a file's URL:
- URLs embedded in your own site or app that point to files hosted on someone else's site are called hotlinks. Hotlinking can bring you no end of trouble, and most sites take measures to block it.
- Because the linked files live on someone else's server, your application runs at the mercy of that server.
- Hotlinked content is easy to change. If you hotlink an image on your blog and the remote server notices, the image may well be swapped for something embarrassing. And if you store URLs to use later, by the time you need them the links may be dead or point to something completely unrelated.
- On the other hand, real web browsers don't just request a page's HTML and move on; they also download every resource the page references. Downloading files therefore makes your scraper look more like a person browsing the site, which can actually work in your favor.
In Python 3.x, urllib.request.urlretrieve downloads a file given its URL:
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html, "html.parser")
imageLocation = bsObj.find("a", {"id":"logo"}).find("img")["src"]
urlretrieve(imageLocation, "logo.jpg")
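Hardcoding the output name "logo.jpg" only works for one known file. As a minimal sketch (the helper name is my own, and it assumes the URL ends in a usable filename), you could derive the local filename from the URL itself before calling urlretrieve:

from urllib.parse import urlparse
import os

def filenameFromUrl(fileUrl, default="download.bin"):
    # Use the last path segment of the URL as the local filename
    path = urlparse(fileUrl).path
    name = os.path.basename(path)
    return name if name else default

urlretrieve(imageLocation, filenameFromUrl(imageLocation))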
Downloading every resource on a page that has a src attribute:
# -*- coding: utf-8 -*-
import os
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup
downloadDirectory = "downloaded"
baseUrl = "http://pythonscraping.com"
def getAbsoluteURL(baseUrl, source):
    # Normalize the src value into an absolute URL on the target site
    if source.startswith("http://www."):
        url = "http://" + source[11:]
    elif source.startswith("http://"):
        url = source
    elif source.startswith("www."):
        url = "http://" + source[4:]
    else:
        url = baseUrl + "/" + source
    # Skip resources hosted on other sites
    if baseUrl not in url:
        return None
    return url
def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
    # Mirror the URL path under the local download directory
    path = absoluteUrl.replace("www.", "")
    path = path.replace(baseUrl, "")
    path = downloadDirectory + path
    directory = os.path.dirname(path)
    if not os.path.exists(directory):
        os.makedirs(directory)
    return path
html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html, "html.parser")
downloadList = bsObj.findAll(src=True)
for download in downloadList:
    fileUrl = getAbsoluteURL(baseUrl, download["src"])
    if fileUrl is not None:
        print(fileUrl)
        urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory))
Note: this program does not check the type of the files it downloads before saving them, so be careful what you let it fetch. Don't run it with administrator privileges, back up important files regularly, and don't keep sensitive information on the disk it writes to.
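If you want to be more selective about what gets saved, one approach (a sketch of my own; the extension whitelist below is an illustrative assumption, not from the book) is to skip any URL whose extension is not on an allowed list before calling urlretrieve, reusing getAbsoluteURL and getDownloadPath from the script above:

from urllib.parse import urlparse
import os

# Illustrative whitelist of extensions considered safe to download
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".css", ".js"}

def isAllowed(fileUrl):
    extension = os.path.splitext(urlparse(fileUrl).path)[1].lower()
    return extension in ALLOWED_EXTENSIONS

for download in downloadList:
    fileUrl = getAbsoluteURL(baseUrl, download["src"])
    if fileUrl is not None and isAllowed(fileUrl):
        urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory))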
5.2 Storing Data to CSV
CSV (comma-separated values) is one of the most common formats for storing spreadsheet-style data. Writing a CSV file with Python's csv module:
import csv

csvFile = open("test.csv", 'w+', newline='')
try:
    writer = csv.writer(csvFile)
    writer.writerow(('number', 'number plus 2', 'number times 2'))
    for i in range(10):
        writer.writerow((i, i+2, i*2))
finally:
    csvFile.close()
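To check what was written, you can read the file back with csv.reader (a minimal sketch; the printed layout is just a list of strings per row):

import csv

with open("test.csv", newline='') as csvFile:
    reader = csv.reader(csvFile)
    for row in reader:
        # Each row comes back as a list of string values
        print(row)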
Converting an HTML table into a CSV file:
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://en.wikipedia.org/wiki/Comparison_of_text_editors")
bsObj = BeautifulSoup(html, "html.parser")
# The main comparison table is the first table on the page
table = bsObj.findAll("table", {"class":"wikitable"})[0]
rows = table.findAll("tr")
csvFile = open("editors.csv", 'wt', newline="", encoding='utf-8')
writer = csv.writer(csvFile)
try:
    for row in rows:
        csvRow = []
        for cell in row.findAll(['td', 'th']):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)
finally:
    csvFile.close()
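The script above only exports the first wikitable on the page. If a page has several, a minimal sketch (the per-table file naming is my own choice) is to loop over all of them, reusing bsObj from the script above:

tables = bsObj.findAll("table", {"class": "wikitable"})
for index, table in enumerate(tables):
    # One CSV file per table: editors_0.csv, editors_1.csv, ...
    with open("editors_{}.csv".format(index), 'wt', newline="", encoding='utf-8') as csvFile:
        writer = csv.writer(csvFile)
        for row in table.findAll("tr"):
            writer.writerow([cell.get_text() for cell in row.findAll(['td', 'th'])])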
5.3 MySQL
Install MySQL (here via apt on Ubuntu/Debian), log in, and create a database and table to hold scraped pages:
$ sudo apt-get install mysql-server
mysql -u root -p
create database scraping;
use scraping;
create table pages (id BIGINT(7) NOT NULL AUTO_INCREMENT, title VARCHAR(200), content VARCHAR(10000), created TIMESTAMP DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (id));
describe pages;
insert into pages (title, content) VALUES ("Test page title", "This is some test page content. It can be up to 10,000 characters long.");
select * from pages where id=1;
select * from pages where title like "%test%";
delete from pages where id = 1;
Integrating with Python (via the PyMySQL package)
import pymysql
conn = pymysql.connect(host='127.0.0.1', unix_socket='/var/run/mysqld/mysqld.sock', user='root', passwd='dong', db='mysql')
cur = conn.cursor()
cur.execute("use scraping")
cur.execute("select * from pages where id=2")
print(cur.fetchone())
cur.close()
conn.close()
The connection/cursor model is a common pattern in database programming. The connection, besides connecting to the database, also sends database information, handles rollbacks, creates new cursor objects, and so on. A single connection can have many cursors; each cursor keeps track of certain state information, such as which database it is currently using. If you have multiple databases and need to write to all of them, you may need several cursors to manage that. A cursor also holds the result of the last query it executed; by calling cursor methods such as cur.fetchone(), you can retrieve that result.
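As a minimal sketch of that pattern (the connection parameters are placeholders, and the try/finally structure is my own way of making sure both objects are closed):

import pymysql

conn = pymysql.connect(host='127.0.0.1', user='root', passwd='password', db='scraping', charset='utf8')
cur = conn.cursor()
try:
    # The cursor executes queries and holds the result of the last one
    cur.execute("select title from pages where id = %s", (1,))
    print(cur.fetchone())
    # Writes only become permanent after commit on the connection
    cur.execute("insert into pages (title, content) values (%s, %s)", ("A title", "Some content"))
    conn.commit()
finally:
    cur.close()
    conn.close()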
# -*- coding: utf-8 -*-
import pymysql
conn = pymysql.connect(host='127.0.0.1', unix_socket='/var/run/mysqld/mysqld.sock', user='root', passwd='dong', db='mysql', charset='utf8')
cur = conn.cursor()
cur.execute("use scraping")
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re
import json
random.seed(datetime.datetime.now().timestamp())
def store(title, content):
    # Parameterized query; the driver handles quoting, so no quotes around %s
    cur.execute("insert into pages (title, content) values (%s, %s)", (title, content))
    cur.connection.commit()
# https://en.wikipedia.org/wiki/Python_(programming_language)
def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    title = bsObj.find("h1").get_text()
    content = bsObj.find("div", {"id":"mw-content-text"}).find("p").get_text()
    store(title, content)
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))
links = getLinks("/wiki/Kevin_Bacon")
try:
    while len(links) > 0:
        newArticle = links[random.randint(0, len(links)-1)].attrs['href']
        print(newArticle)
        links = getLinks(newArticle)
finally:
    cur.close()
    conn.close()
The "Six Degrees of Wikipedia" game in MySQL
A link from page A to page B is stored as: insert into links (fromPageId, toPageId) values (A, B)
This calls for a database with two tables, one for pages and one for links; each table has its own auto-increment ID and a creation timestamp:
create table pages ( id INT NOT NULL AUTO_INCREMENT, url VARCHAR(255) NOT NULL, created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY(id));
create table links (id INT NOT NULL AUTO_INCREMENT, fromPageId INT NULL, toPageId INT NULL, created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (id));
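Once pages and links are stored this way, hops through the graph can be read back with plain SQL. A sketch, assuming a cursor cur connected to the wikipedia database (as in the script below) and using page id 1 as a placeholder:

# Pages reachable in one link from page 1
cur.execute("select toPageId from links where fromPageId = %s", (1,))
oneHop = [row[0] for row in cur.fetchall()]

# Pages reachable in two links, via a self-join on the links table
cur.execute("select l2.toPageId from links l1 "
            "join links l2 on l1.toPageId = l2.fromPageId "
            "where l1.fromPageId = %s", (1,))
twoHops = [row[0] for row in cur.fetchall()]
print(len(oneHop), len(twoHops))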
# -*- coding: utf-8 -*-
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import pymysql
conn = pymysql.connect(host='127.0.0.1', unix_socket='/var/run/mysqld/mysqld.sock', user='root', passwd='dong', db='mysql', charset='utf8')
cur = conn.cursor()
cur.execute("use wikipedia")
def insertPageIfNotExists(url):
    # Return the page's id, inserting a new row only if the URL is not already stored
    cur.execute("select * from pages where url = %s", (url,))
    if cur.rowcount == 0:
        print("Inserting page: " + url)
        cur.execute("insert into pages (url) values (%s)", (url,))
        conn.commit()
        return cur.lastrowid
    else:
        return cur.fetchone()[0]

def insertLink(fromPageId, toPageId):
    # Store the link only if this (from, to) pair has not been seen before
    cur.execute("select * from links where fromPageId = %s and toPageId = %s",
                (int(fromPageId), int(toPageId)))
    if cur.rowcount == 0:
        print("Inserting link: %s -> %s" % (int(fromPageId), int(toPageId)))
        cur.execute("insert into links (fromPageId, toPageId) values (%s, %s)",
                    (int(fromPageId), int(toPageId)))
        conn.commit()
pages = set()
def getLinks(pageUrl, recursionLevel):
    global pages
    if recursionLevel > 4:
        return
    pageId = insertPageIfNotExists(pageUrl)
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)((?!:).)*$")):
        insertLink(pageId, insertPageIfNotExists(link.attrs['href']))
        if link.attrs['href'] not in pages:
            # A new page: add it to the set and follow the article links inside it
            newPage = link.attrs['href']
            pages.add(newPage)
            getLinks(newPage, recursionLevel + 1)
getLinks("/wiki/Kevin_Bacon", 0)
cur.close()
conn.close()
Note that this program may take several days to finish running.
5.4 Email
Email is transmitted via SMTP (Simple Mail Transfer Protocol), so to send it you need access to a server running SMTP. Sending a simple message:
import smtplib
from email.mime.text import MIMEText
msg = MIMEText("The body of the email is here")
msg['Subject'] = "An Email Alert"
msg['From'] = "ryan@pythonscraping.com"
msg['To'] = "webmaster@pythonscraping.com"
s = smtplib.SMTP('localhost')
s.send_message(msg)
s.quit()
Python has two packages for sending email: smtplib and email. The script below checks whether it is Christmas yet and sends an alert email as soon as it is:
import smtplib
from email.mime.text import MIMEText
from bs4 import BeautifulSoup
from urllib.request import urlopen
import time
def sendMail(subject, body):
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = "christmas_alerts@pythonscraping.com"
    msg['To'] = "ryan@pythonscraping.com"
    s = smtplib.SMTP('localhost')
    s.send_message(msg)
    s.quit()
bsObj = BeautifulSoup(urlopen("https://isitchristmas.com/"), "html.parser")
while bsObj.find("a", {"id":"answer"}).attrs['title'] == "NO":
    print("It is not Christmas yet.")
    time.sleep(3600)
    bsObj = BeautifulSoup(urlopen("https://isitchristmas.com/"), "html.parser")
sendMail("It's Christmas!",
         "According to https://isitchristmas.com, it is Christmas!")
This program checks https://isitchristmas.com/ once an hour (the site answers, based on the date, whether today is Christmas) and sends the alert email as soon as the answer is no longer NO.
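SMTP('localhost') only works if a mail server is running on the machine itself. If it isn't, a sketch of sending through an external, authenticated SMTP server instead (the host, port, addresses, and password below are placeholders you would replace with your provider's values):

import smtplib
from email.mime.text import MIMEText

def sendMailViaProvider(subject, body):
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = "alerts@example.com"   # placeholder sender
    msg['To'] = "me@example.com"         # placeholder recipient
    # SMTP over SSL on port 465; many providers also accept STARTTLS on 587
    s = smtplib.SMTP_SSL("smtp.example.com", 465)
    s.login("alerts@example.com", "app-password")
    s.send_message(msg)
    s.quit()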