urllib.urlopen returns nothing

License: CC 4.0 BY-SA

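This post describes a way to fetch the page source of a site by imitating the Firefox browser. It uses the urllib2 library to get past the restrictions some sites place on crawlers, and is meant for sites that limit direct access.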

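Some sites do not want bots scraping their data, so a direct call to urlopen may come back with nothing. The function below fetches the page source by sending the request the way a Firefox browser would.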
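One way to see why a site can single out a plain urlopen call: by default the library announces itself in the User-Agent header. The snippet below is only an illustrative sketch and is written for Python 3's urllib.request rather than the Python 2 urllib2 used in this post; `default_user_agent` is a hypothetical helper name, not part of any library.

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent that a freshly built opener would send."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) tuples added to every request
    headers = {name.lower(): value for name, value in opener.addheaders}
    return headers.get("user-agent", "")


# Prints a string of the form "Python-urllib/<version>"
print(default_user_agent())
```

A server that filters on this header can refuse or empty the response, which is why the function in this post replaces it with a browser string.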

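# Python 2 only: urllib2 was merged into urllib.request in Python 3.
import urllib2

def getUrlRespHtml(url):
    # Headers taken from a Firefox 3.6 session so the request looks like a browser.
    # The original 'Host': 'John' entry is dropped: urllib2 sets Host from the URL itself.
    heads = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
             'Accept-Charset': 'GB2312,utf-8;q=0.7,*;q=0.7',
             'Accept-Language': 'zh-cn,zh;q=0.5',
             'Cache-Control': 'max-age=0',
             'Connection': 'keep-alive',
             'Keep-Alive': '115',
             'Referer': url,
             'User-Agent': 'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.14) Gecko/20110221 Ubuntu/10.10 (maverick) Firefox/3.6.14'}

    # Cookie-aware opener, installed as the default for later urllib2.urlopen calls.
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
    urllib2.install_opener(opener)
    req = urllib2.Request(url)
    # addheaders expects a list of (name, value) pairs.
    opener.addheaders = heads.items()
    respHtml = opener.open(req).read()
    # Assumes the page is GBK-encoded, as the example site was; adjust for other sites.
    return respHtml.decode('gbk').encode('utf-8')

html = getUrlRespHtml("http://mil.news.baidu.com/")
print(html)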
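For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched with urllib.request. This is a rough port under stated assumptions, not the original author's code: the names `FIREFOX_HEADERS`, `build_browser_opener`, `decode_body` and `get_url_resp_html` are made up for illustration, the header set is trimmed to the ones that matter most, and GBK is kept only as a fallback when the server does not declare a charset.

```python
import urllib.request
from http.cookiejar import CookieJar

# Browser-like headers (hypothetical constant); the User-Agent is the one from the post.
FIREFOX_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-cn,zh;q=0.5",
    "User-Agent": "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.14) "
                  "Gecko/20110221 Ubuntu/10.10 (maverick) Firefox/3.6.14",
}


def build_browser_opener(headers=None):
    """Build a cookie-aware opener that sends browser-like headers."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))
    # Replace the default "Python-urllib/x.y" User-Agent with our headers.
    opener.addheaders = list((headers or FIREFOX_HEADERS).items())
    return opener


def decode_body(raw, charset=None, fallback="gbk"):
    """Decode response bytes, falling back to GBK if no charset is known."""
    return raw.decode(charset or fallback, errors="replace")


def get_url_resp_html(url, timeout=10):
    """Fetch url with browser-like headers and return the page as str."""
    opener = build_browser_opener()
    # Referer is set per request, mirroring the original function.
    req = urllib.request.Request(url, headers={"Referer": url})
    with opener.open(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset()
        return decode_body(resp.read(), charset)


# Usage (needs network access):
#     html = get_url_resp_html("http://mil.news.baidu.com/")
#     print(html)
```

The Host header is left out on purpose, since the library derives it from the URL. Whether a given site accepts these headers today is not something this sketch can guarantee.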