Summary of Web Crawler Code Problems

Latest recommended article published on 2021-07-06 07:44:47

EatonL

Licensed under CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/bazhidao0031/article/details/89431173

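When searching for a tag by its class with BeautifulSoup's find function, write the keyword argument as class_ rather than class, because class is a reserved word in Python.

For example: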

html: <div class="search"></div>
BeautifulSoup: result=html.find("div",class_="search")
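
The call above can be tried end to end with a small sketch. The markup string and the variable names same and missing are made up for illustration; the built-in "html.parser" is used so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from any real site.
markup = '<div class="search"><a href="/q">query</a></div><div class="other"></div>'
html = BeautifulSoup(markup, "html.parser")

# class_ sidesteps the reserved word "class".
result = html.find("div", class_="search")

# Passing a dict through attrs is an equivalent spelling.
same = html.find("div", attrs={"class": "search"})

# find returns None when nothing matches, so check before using the result.
missing = html.find("div", class_="absent")

if result is not None:
    print(result.a["href"])
```

Because find returns None on a miss, guarding the result before indexing it avoids an AttributeError or TypeError further down in a crawler.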

Choose the parser argument passed to BeautifulSoup according to your needs.

Parser	Typical usage	Advantages	Disadvantages
Python's html.parser	BeautifulSoup(markup, "html.parser")	Batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2)	Not very lenient (before Python 2.7.3 or 3.2.2)
lxml's HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient	External C dependency
lxml's XML parser	BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")	Very fast; the only currently supported XML parser	External C dependency
html5lib	BeautifulSoup(markup, "html5lib")	Extremely lenient; parses pages the same way a web browser does; creates valid HTML5	Very slow; external Python dependency
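
One way to act on the table is to prefer a faster or more lenient parser when it is installed and fall back to the built-in one otherwise. The helper pick_parser below is a hypothetical name introduced for this sketch, not part of BeautifulSoup; it only checks whether the parser's package can be imported.

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first parser whose package is installed, else "html.parser"."""
    for name in preferred:
        # find_spec returns None when the top-level package is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


parser = pick_parser()
print(parser)
```

The returned string can then be passed as the second argument, as in BeautifulSoup(markup, pick_parser()). Keep in mind that different parsers can build different trees from the same malformed page, so a silent fallback may change what a crawler extracts.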

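When saving scraped data to a CSV file, passing encoding='utf-8' often still produces garbled Chinese text, typically when the file is opened in Excel. In that case change the encoding to utf-8-sig, which writes a byte order mark at the start of the file.

For example: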

test.to_csv(path,encoding='utf-8-sig')
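
The difference between the two encodings can be seen without pandas. The sketch below uses only the standard library csv module (a substitution for the to_csv call above); the rows, file names, and helper functions are made up for illustration.

```python
import codecs
import csv
import os
import tempfile

rows = [["name", "city"], ["张三", "北京"]]  # illustrative data


def write_csv(path, encoding):
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding=encoding) as f:
        csv.writer(f).writerows(rows)


def first_bytes(path, n=3):
    with open(path, "rb") as f:
        return f.read(n)


tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "plain.csv")
bom_path = os.path.join(tmp_dir, "bom.csv")
write_csv(plain_path, "utf-8")
write_csv(bom_path, "utf-8-sig")

# Only the utf-8-sig file starts with the three-byte UTF-8 BOM.
print(first_bytes(plain_path) == codecs.BOM_UTF8)
print(first_bytes(bom_path) == codecs.BOM_UTF8)
```

When reading such a file back in Python, open it with encoding="utf-8-sig" as well so that the byte order mark is not left attached to the first field.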

Opening an https URL with request.urlopen raises the following error:

ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)

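According to what I found online, the cause is that the target site uses a self-signed certificate. There are two workarounds; both disable certificate verification, so use them only for sites you trust.
The first sets a global default: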

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

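The second passes verify=False (this applies when the request is made with the requests library instead of urlopen):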

requests.get('https://www.baidu.com/', verify=False)
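
A narrower alternative to the global default is to build an unverified context and pass it only to the calls that need it. This is a sketch under that assumption; make_unverified_context and fetch are hypothetical helper names, and no request is actually sent here.

```python
import ssl
import urllib.request


def make_unverified_context():
    """Build an SSL context that skips certificate and hostname checks."""
    ctx = ssl.create_default_context()
    # check_hostname must be turned off before verify_mode can be CERT_NONE.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def fetch(url, context, timeout=10):
    """Fetch url with the given SSL context and return the raw body bytes."""
    with urllib.request.urlopen(url, context=context, timeout=timeout) as resp:
        return resp.read()


insecure_ctx = make_unverified_context()
print(insecure_ctx.verify_mode == ssl.CERT_NONE)
```

A call such as fetch(url, insecure_ctx) then skips verification for that one request, while every other HTTPS request in the process keeps the default checks.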

In Python 3, the following raises an error:

if dict.has_key(key1):

The reason is that Python 3 removed dict.has_key; use the in operator instead:

if key1 in dict:
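
A short sketch of the membership test, with dict.get shown as a companion for the common case of reading a key that may be absent. The dictionary contents and the name headers are made up for illustration; a name other than dict is used so that the built-in type is not shadowed.

```python
headers = {"User-Agent": "demo-crawler", "Accept": "text/html"}  # illustrative data
key1 = "User-Agent"

# Membership test that replaces the removed has_key method.
if key1 in headers:
    print(headers[key1])

# dict.get returns a default instead of raising KeyError for a missing key.
referer = headers.get("Referer", "")

# In Python 3 the method no longer exists on dict objects.
print(hasattr(headers, "has_key"))
```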

When tags are nested and find_all has to be applied repeatedly, the correct form is:

result1=html.find_all("div",class_="xxx")
result2=result1[0].find_all("div",class_="xxxx")

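The reason is that find_all returns a list (a ResultSet) rather than a single bs4 Tag, so an element has to be selected by index before find_all can be called again.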
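
The nested lookup can be sketched as follows, including a loop for the case where every outer block matters and not only the first. The markup and the names outer, inner_first, and inner_all are made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup with two outer blocks.
markup = """
<div class="xxx"><div class="xxxx">a</div><div class="xxxx">b</div></div>
<div class="xxx"><div class="xxxx">c</div></div>
"""
html = BeautifulSoup(markup, "html.parser")

# find_all returns a list-like ResultSet, which may be empty.
outer = html.find_all("div", class_="xxx")

# Index into the list before searching again; guard against an empty result.
inner_first = outer[0].find_all("div", class_="xxxx") if outer else []

# To cover every outer block, loop instead of taking only index 0.
inner_all = [
    tag.get_text()
    for block in outer
    for tag in block.find_all("div", class_="xxxx")
]
print(inner_all)
```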

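When fetching a page with the Requests package, the following two problems can be handled as shown.
If the site's encoding is not known in advance, set it from the detected encoding to avoid garbled text: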

response=requests.get('xxx')
response.encoding = response.apparent_encoding
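
The failure this assignment avoids can be reproduced offline. For a text response whose headers declare no charset, Requests falls back to ISO-8859-1, and decoding GBK bytes that way yields mojibake. The sketch below uses only the standard library; the byte string stands in for the body of a hypothetical GBK-encoded page.

```python
# Body bytes of a hypothetical page that was served as GBK without a charset header.
raw = "中文页面".encode("gbk")

# ISO-8859-1 maps every byte to a character, so this never fails but is wrong.
wrong = raw.decode("iso-8859-1")

# Decoding with the page's real encoding recovers the text.
right = raw.decode("gbk")

print(wrong == "中文页面")
print(right == "中文页面")
```

apparent_encoding guesses the encoding from the body bytes, so it is a heuristic; when the site's encoding is known, assigning it directly to response.encoding is more reliable.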

If the following error is raised:

HTTPSConnectionPool(host='***', port=443): Max retries exceeded with url: ******(Caused by SSLError(SSLError(1, u'[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:579)'),))

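A possible cause is that connections are being kept alive, and the workaround I used is shown below. Note that the message itself reports a certificate verification failure, and that requests.Session does not document a keep_alive attribute, so that assignment probably has no effect; sending a Connection: close header is the usual way to turn keep-alive off.

s = requests.session()
s.keep_alive = False                 # kept from the original; likely a no-op
s.headers['Connection'] = 'close'    # asks the server to close the connection after each request
response = s.get('xxx')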

Sometimes the following error appears while handling the decoded text, for example when printing it:

UnicodeEncodeError: 'gbk' codec can't encode character '\u203e' in position 37: illegal multibyte sequence

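I have not pinned down the exact cause. The message shows an encoding failure under GBK, which most likely happens when the text is written to a console whose encoding is GBK and it contains a character (here '\u203e') that GBK cannot represent.
A fairly effective fix is to rewrap standard output as UTF-8 before printing: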

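import io
import sys

import requests

# Rewrap stdout so that print() encodes as UTF-8 instead of the console default (e.g. GBK)
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

s = requests.session()
s.keep_alive = False  # kept from the original; requests.Session does not document this attribute
response = s.get(url)  # url is the target address, defined elsewhere
response.encoding = response.apparent_encoding  # use the encoding detected from the body
text = response.text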
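
The error can be reproduced without a network request or a GBK console, which helps confirm that the problem lies in encoding the output. The sketch below uses only the standard library; the sample string is made up around the character named in the error message.

```python
text = "overline: \u203e"  # contains the character from the error message

# Encoding to GBK fails because GBK has no code for U+203E.
try:
    text.encode("gbk")
    failed = False
except UnicodeEncodeError:
    failed = True

# errors="replace" substitutes "?" for anything the codec cannot represent.
replaced = text.encode("gbk", errors="replace")

# UTF-8 can represent every Unicode character, so this always succeeds.
utf8 = text.encode("utf-8")

print(failed)
print(replaced)
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") is a shorter way to get the same effect as rewrapping sys.stdout, provided stdout is still the standard text stream.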

(To be continued)