IP proxies
If your IP gets banned, route your requests through a proxy.
http://www.goubanjia.com/
HttpConnectionPool errors
- Causes:
    - High-frequency requests in a short period got the IP banned
    - The connections in the HTTP connection pool have been exhausted
- Solutions:
    - Use a proxy
    - Add Connection: "close" to the request headers (see the sketch below)
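A minimal sketch of both fixes with requests; the proxy address below is only a placeholder, substitute a live one from a provider such as goubanjia:
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36",
    # Ask the server to close the connection after each response so stale
    # connections do not pile up in the pool
    "Connection": "close",
}
# Placeholder proxy, not a live one
proxies = {
    "https": "https://1.2.3.4:8888",
}
response = requests.get("https://www.baidu.com/", headers=headers, proxies=proxies, timeout=5)
print(response.status_code)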
Data parsing
Data parsing is what lets us build focused crawlers.
Ways to implement data parsing
- Regular expressions: fast to crawl with, but slow to write
- bs4
- xpath: the most broadly applicable
- pyquery
The general principle of data parsing
- The data we crawl is stored inside tags and in their attributes
- Locate the tag
- Extract its text or attribute values
Let's scrape some Qiushibaike videos with a simple regex
import requests
import re
import os
from urllib import request

dirName = './videos/'
if not os.path.exists(dirName):
    os.mkdir(dirName)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"
}
url = 'https://www.qiushibaike.com/video/'
response = requests.get(url=url, headers=headers)
page_text = response.text
# Target markup: <source src="//qiubai-video.qiushibaike.com/0Y2D9HPF1XN2MA91_hd.mp4" type='video/mp4' />
ex = '<source src="(.*?)" type.*?>'
page_list_video = re.findall(ex, page_text, re.S)
for v in page_list_video:
    v = "https:" + v
    video_name = dirName + v.split("/")[-1]
    # Option 1: download with requests
    # response_video = requests.get(v, headers=headers).content  # bytes
    # with open(video_name, "wb") as fp:
    #     fp.write(response_video)
    # Option 2: download with urllib
    request.urlretrieve(url=v, filename=video_name)
bs4 parsing
How bs4 parsing works
- Instantiate a BeautifulSoup object and load the page source to be parsed into it
- Call the BeautifulSoup object's methods and attributes to locate tags and extract data
- Environment setup
    - pip install bs4
    - pip install lxml
Instantiating BeautifulSoup
- BeautifulSoup(fp, 'lxml'): loads the data of a locally stored HTML file into the BeautifulSoup instance
- BeautifulSoup(page_text, 'lxml'): loads page source fetched from the web into the BeautifulSoup instance
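A minimal sketch of the two instantiation forms; the local file name ./test.html is only a placeholder:
from bs4 import BeautifulSoup
import requests

# Local file (placeholder name)
with open('./test.html', 'r', encoding='utf-8') as fp:
    local_soup = BeautifulSoup(fp, 'lxml')

# Page source fetched from the web
page_text = requests.get('https://www.shicimingju.com/book/sanguoyanyi.html').text
web_soup = BeautifulSoup(page_text, 'lxml')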
Locating tags
- soup.tagName: locates the first tagName tag that appears
- Locating by attribute:
    - soup.find('tagName', attrName='value')
    - soup.find_all('tagName', attrName='value'): returns a list
    - (use class_ for the class attribute, since class is a Python keyword)
- Locating with selectors:
    - Hierarchy selectors: > means one level, a space means any number of levels
    - soup.select('selector')
    - soup.select('#feng')
    - soup.select('.tang > ul > li')
    - soup.select('.tang li')
- Extracting text
    - string: only the tag's own (direct) text
    - text: all text, including descendants'
    - a_tag = soup.select('#feng')[0].string
    - a_tag = soup.select('#feng')[0].text
- Extracting attributes
    - tagName['attrName']
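A quick demo of the operations above; the HTML snippet is made up for illustration:
from bs4 import BeautifulSoup

html = '''
<div class="tang">
  <ul>
    <li><a id="feng" href="http://example.com">Li Bai <span>(Tang)</span></a></li>
    <li><a href="http://example.com/2">Du Fu</a></li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.a)                          # first <a> tag
print(soup.find('a', id='feng'))       # locate by attribute
print(soup.find_all('a'))              # all <a> tags, as a list
print(soup.select('.tang > ul > li'))  # > : one level per step
print(soup.select('.tang a'))          # space: any number of levels
print(soup.select('#feng')[0].string)  # direct text only -> None here (the tag has a child tag)
print(soup.select('#feng')[0].text)    # all text -> 'Li Bai (Tang)'
print(soup.select('#feng')[0]['href']) # attribute value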
Let's download Romance of the Three Kingdoms with BeautifulSoup
from bs4 import BeautifulSoup
import requests

url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
}
response_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(response_text, 'lxml')
a_list = soup.select('.book-mulu > ul > li > a')
fp = open('./sg.txt', 'w', encoding='utf-8')
for a in a_list:
    title = a.string
    content_url = "https://www.shicimingju.com" + a['href']
    content_text = requests.get(url=content_url, headers=headers).text
    soup = BeautifulSoup(content_text, 'lxml')
    content = soup.find('div', class_="chapter_content").text
    fp.write(f"{title}\n{content}\n\n")
    print(f'{title} downloaded!')
fp.close()
xpath parsing
How xpath parsing works
- Instantiate an etree object and load the page source to be parsed into it
- Call the etree object's xpath method with different kinds of xpath expressions to locate tags and extract data
- Environment setup
    - pip install lxml
- Instantiating an etree object
    - etree.parse("text.html"): for a locally stored HTML file
    - etree.HTML(page_text): for page source fetched from the web
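A minimal sketch of the two instantiation forms; the local file name is only a placeholder, and an HTMLParser is passed so parse() tolerates real-world HTML that is not well-formed XML:
import requests
from lxml import etree

# Local file (placeholder name); the HTMLParser keeps parse() from choking on loose HTML
local_tree = etree.parse("text.html", etree.HTMLParser())

# Page source fetched from the web
page_text = requests.get("https://www.aqistudy.cn/historydata/").text
web_tree = etree.HTML(page_text)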
xpath expressions
- A leading / means the expression must locate tags level by level, starting from the root tag
    - tree.xpath('/html/body/div/p')
- A leading // means the expression can locate tags from any position
    - tree.xpath('//p')
- A non-leading / represents one level
- A non-leading // crosses multiple levels
    - tree.xpath('/html/body//p')
- The pipe | can join two expressions, which makes an xpath expression more broadly applicable
- Locating tags
    - By attribute: //tagName[@attrName="value"]
        - //div[@class='song']
    - By index: //tagName[index], where indices start at 1
- Extracting text:
    - /text(): direct text only
        - tree.xpath('//a[@id="feng"]/text()')[0]
    - //text(): all text
- Extracting attributes: /@attrName
    - tree.xpath('//a[@id="feng"]/@href')
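A quick demo of these expressions; the HTML snippet is made up for illustration:
from lxml import etree

html = '''
<html><body>
  <div class="song">
    <p>first</p><p>second</p>
    <a id="feng" href="http://example.com">Du Mu</a>
  </div>
</body></html>
'''
tree = etree.HTML(html)
print(tree.xpath('/html/body/div/p'))          # level by level from the root
print(tree.xpath('//p'))                       # from any position
print(tree.xpath('/html/body//a'))             # // crosses intermediate levels
print(tree.xpath("//div[@class='song']"))      # locate by attribute
print(tree.xpath('//p[1]/text()'))             # index starts at 1 -> ['first']
print(tree.xpath('//a[@id="feng"]/text()')[0]) # direct text -> 'Du Mu'
print(tree.xpath('//div//text()'))             # all text under the div
print(tree.xpath('//a[@id="feng"]/@href')[0])  # attribute value
print(tree.xpath('//p/text() | //a/text()'))   # | joins two expressions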
Let's scrape Qiushibaike text posts with xpath
import requests
from lxml import etree

url = 'https://www.qiushibaike.com/text/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
}
response_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(response_text)
div_list = tree.xpath("//div[@class='article block untagged mb15 typs_hot']")
for div in div_list:
    author = div.xpath("./div/a[2]/h2/text()")[0]  # a relative xpath, scoped to this div
    content = div.xpath("./a[1]/div/span/text()")  # <br> tags split the text into several pieces
    content = "".join(content)
    print(author, content)
Joining two expressions with the pipe |
import requests
from lxml import etree
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"
}
url = "https://www.aqistudy.cn/historydata/"
response_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(response_text)
citys = tree.xpath("//div[@class='bottom']/ul/li/a/text() | //div[@class='bottom']/ul/div[2]/li/a/text()")
print(citys)
Fixing garbled Chinese text
data.encode("iso-8859-1").decode("gbk")  # requests often guesses iso-8859-1, so round-trip back to the raw bytes and decode with the page's real encoding (gbk here)
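A short sketch of the fix in context, using the netbian page from the example below (its real encoding is gbk):
import requests

response = requests.get("http://pic.netbian.com/4kmeinv/index.html")
garbled = response.text  # decoded as iso-8859-1, so Chinese characters are garbled

# Fix a single string: go back to the raw bytes, then decode with the real encoding
fixed = garbled.encode("iso-8859-1").decode("gbk")

# Or fix the whole response up front by telling requests the real encoding
response.encoding = "gbk"
fixed_page = response.text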
Let's scrape some pictures of beautiful women
import requests
from lxml import etree
import os

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"
}
dirName = './mv'
if not os.path.exists(dirName):
    os.mkdir(dirName)
url = "http://pic.netbian.com/4kmeinv/index_%d.html"
for page in range(1, 10):
    if page == 1:
        new_url = 'http://pic.netbian.com/4kmeinv/index.html'  # the first page has no page number
    else:
        new_url = url % page
    response_text = requests.get(url=new_url, headers=headers).text
    tree = etree.HTML(response_text)
    a_list = tree.xpath("//div[@class='slist']/ul/li/a")
    for i in a_list:
        pic_path = "http://pic.netbian.com/" + i.xpath("./img/@src")[0]
        pic_name = i.xpath("./b/text()")[0] + ".jpg"
        pic_name = pic_name.encode("iso-8859-1").decode("gbk")  # fix the garbled Chinese file name
        pic = requests.get(url=pic_path, headers=headers).content
        with open(dirName + "/" + pic_name, "wb") as fp:
            fp.write(pic)