These are study notes based on the book 《Python爬虫开发与项目实战》 (Python Crawler Development and Project Practice), with some modifications to the code from the book.
JSON
Saving data to a JSON file uses the dump and dumps functions. For details, see json — JSON encoder and decoder.
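As a quick illustration of the difference: json.dumps returns the encoded JSON as a string, while json.dump writes it straight to a file object. A minimal sketch; the data dict and demo.json filename are just placeholders:
import json

data = {'title': '盗墓笔记', 'chapters': 2}  # made-up sample data

# dumps returns the encoded JSON as a str
s = json.dumps(data, ensure_ascii=False)
print(s)

# dump writes the encoded JSON straight to a file object
with open("demo.json", "w", encoding="utf-8") as fp:
    json.dump(data, fp, ensure_ascii=False)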
import json
import requests
from bs4 import BeautifulSoup

url = "http://www.seputu.com/"  # the site to crawl
req = requests.get(url)
bs = BeautifulSoup(req.text, "html.parser")
content = []
for m in bs.find_all(class_="mulu"):  # find every tag with class="mulu"
    h2 = m.find("h2")                 # find the h2 tag inside it
    if h2 is not None:
        h2_title = h2.string          # extract the section title
        content_list = []
        for h in m.find(class_="box").find_all("a"):
            href = h.get("href")      # extract the link
            title = h.get("title")    # extract each chapter title
            content_list.append({'href': href, 'title': title})  # collect into the list
        content.append({'title': h2_title, 'content': content_list})
with open("盗墓笔记.json", "w", encoding="utf-8") as fp:
    json.dump(content, fp, indent=4, ensure_ascii=False)  # indent=4 is the most readable; ensure_ascii=False keeps the Chinese text readable
The output is shown below:
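Reading the file back is the mirror image: json.load parses a file object into Python objects. A minimal sketch, assuming the 盗墓笔记.json written above:
import json

with open("盗墓笔记.json", encoding="utf-8") as fp:
    content = json.load(fp)  # parses the JSON file back into a list of dicts
print(content[0]['title'])  # the title of the first section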
CSV
To save the scraped data to a CSV file, use the csv library. The code is as follows:
import requests
from bs4 import BeautifulSoup
import csv

url = "http://www.seputu.com/"  # the site to crawl
req = requests.get(url)
bs = BeautifulSoup(req.text, "html.parser")
header = ['title', 'sub_title', 'href']  # the column titles of the file
f = open("盗墓笔记.csv", 'w', newline='', encoding='utf-8')  # create a csv file; newline='' avoids blank rows on Windows
f_csv = csv.writer(f)
f_csv.writerow(header)
for m in bs.find_all(class_="mulu"):  # find every tag with class="mulu"
    h2 = m.find("h2")                 # find the h2 tag inside it
    if h2 is not None:
        h2_title = h2.string          # extract the section title
        for h in m.find(class_="box").find_all("a"):
            href = h.get("href")      # extract the link
            title = h.get("title")    # extract each chapter title
            f_csv.writerow((h2_title, title, href))  # write one row per chapter
f.close()  # close the file
The output is as follows:
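Since each chapter naturally maps to a dict, csv.DictWriter is a common alternative to csv.writer. A minimal sketch; the rows list below is made-up sample data:
import csv

header = ['title', 'sub_title', 'href']
rows = [{'title': '盗墓笔记1', 'sub_title': '第一章', 'href': 'http://example.com/ch1'}]  # hypothetical sample row

with open("demo.csv", 'w', newline='', encoding='utf-8') as f:
    f_csv = csv.DictWriter(f, fieldnames=header)
    f_csv.writeheader()    # writes the header row from fieldnames
    f_csv.writerows(rows)  # each dict becomes one data row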
The code to open and read the csv file is as follows:
import csv

with open("盗墓笔记.csv", newline='', encoding='utf-8') as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)  # the first row is the header
    print(headers)
    for row in f_csv:
        print("row is ", row)
The output is as follows:
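csv.DictReader is the counterpart of DictWriter: each row comes back as a dict keyed by the header, so columns can be accessed by name instead of position. A minimal sketch against the file written above:
import csv

with open("盗墓笔记.csv", newline='', encoding='utf-8') as f:
    f_csv = csv.DictReader(f)  # uses the first row as field names
    for row in f_csv:
        print(row['title'], row['href'])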
Multimedia File Extraction
This basically means things like downloading images. The following snippet was found online:
import re
import urllib.request

def download_page(url):
    # fetch the raw bytes of a URL
    req = urllib.request.Request(url)
    res = urllib.request.urlopen(req)
    data = res.read()
    return data

def get_img(html):
    # non-greedy match, so each URL stops at its own .jpg
    regx = r"http://\S+?\.jpg"
    pattern = re.compile(regx)
    img_urls = re.findall(pattern, html.decode('utf-8', errors='ignore'))
    num = 1
    for img in img_urls:
        image = download_page(img)
        with open("%s.jpg" % num, "wb") as fp:
            fp.write(image)
        print("Downloading image %s" % num)
        num += 1
    return

url = "http://www.ivsky.com/tupian/xuexi_t2316/"
html = download_page(url)
get_img(html)
The output is as follows:
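The same download can also be done with requests, streaming the body in chunks so a large image is never held fully in memory. A minimal sketch; the image URL here is a hypothetical placeholder:
import requests

def download_image(img_url, filename):
    res = requests.get(img_url, stream=True)  # stream=True fetches the body lazily
    with open(filename, "wb") as fp:
        for chunk in res.iter_content(chunk_size=1024):  # iterate over 1 KB chunks
            fp.write(chunk)

download_image("http://www.ivsky.com/some_image.jpg", "demo.jpg")  # hypothetical URL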
Following the book, this version uses the urlretrieve function instead.
import urllib.request
import requests
from lxml import etree

url = "http://www.ivsky.com/tupian/shishangnvhai_t19861/index_5.html"
req = requests.get(url)
html = etree.HTML(req.text)
img_urls = html.xpath('.//img/@src')  # extract the src attribute of every img tag
i = 0
for image in img_urls:
    urllib.request.urlretrieve(image, 'img' + str(i) + '.jpg')  # download straight to a local file
    print("Downloading image %s" % i)
    i += 1
The output is shown below:
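One caveat with this version: the src attributes pulled out by XPath may be relative paths, while urlretrieve needs absolute URLs. urllib.parse.urljoin resolves a relative src against the page URL; a minimal sketch with a hypothetical src value:
from urllib.parse import urljoin

page_url = "http://www.ivsky.com/tupian/shishangnvhai_t19861/index_5.html"
src = "/img/demo.jpg"  # hypothetical relative src attribute
print(urljoin(page_url, src))  # -> http://www.ivsky.com/img/demo.jpg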