While scraping Douban I ran into their anti-scraping measures, so I tried some code found online to get around them. The snippet below seems to bypass the blocking for now.
import urllib2

# Mimic a real browser: Douban returns 403 to the default urllib2 User-Agent.
req_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:28.0) Gecko/20100101 Firefox/28.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Cookie': 'bid="hsM53yRAjQ8"; __utma=30149280.907110929.1386117661.1398322932.1398335444.20; __utmz=30149280.1398167843.17.13.utmcsr=baidu|utmccn=(organic)|utmcmd=organic|utmctr=urllib2%20403; ll="118281"; __utma=223695111.1156190174.1396328833.1398322932.1398335444.11; __utmz=223695111.1396588375.4.4.utmcsr=douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/; __utmb=30149280.1.10.1398335444; __utmc=30149280; __utmb=223695111.1.10.1398335444; __utmc=223695111',
    'Host': 'movie.douban.com',
}

req_timeout = 5
req = urllib2.Request(url, None, req_header)      # url: the Douban page to fetch
resp = urllib2.urlopen(req, None, req_timeout)
html = resp.read()  # raw response bytes; gzip-compressed if the server honours Accept-Encoding
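A likely culprit: the header above advertises 'Accept-Encoding': 'gzip, deflate', but urllib2 never decompresses responses, so resp.read() hands back gzip bytes that look like garbage when written to disk. A minimal sketch of the fix (using the Python 3 gzip helpers for brevity; under Python 2 you would wrap the bytes in StringIO and use gzip.GzipFile, the idea is the same):

```python
import gzip

# Simulate a gzip-compressed HTTP body, which is what Douban sends back
# when the request advertises 'Accept-Encoding: gzip, deflate'.
page = '<html><head><title>豆瓣电影</title></head></html>'
body = gzip.compress(page.encode('utf-8'))

# Saving `body` directly produces "garbage"; decompress first whenever the
# response carries a 'Content-Encoding: gzip' header.
content_encoding = 'gzip'  # in real code: resp.headers.get('Content-Encoding')
if content_encoding == 'gzip':
    html_bytes = gzip.decompress(body)
else:
    html_bytes = body

html = html_bytes.decode('utf-8')  # Douban pages declare charset=utf-8
print(html)
```

Alternatively, simply drop 'Accept-Encoding' from the request headers and the server will send the body uncompressed.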
But the page source saved this way is mojibake. How can this be fixed?
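The usual answer has two steps: decompress when the response is gzip-encoded, then decode the bytes with the charset the server actually declares (Douban serves utf-8) instead of writing raw bytes to disk. A sketch of the second step, a helper that picks the charset from the Content-Type header, falls back to the HTML <meta> tag, then to utf-8 (detect_charset is an illustrative name, not a library function):

```python
import re

def detect_charset(content_type, html_bytes):
    """Guess the page encoding: HTTP header first, then <meta>, then utf-8."""
    m = re.search(r'charset=([\w-]+)', content_type or '')
    if m:
        return m.group(1)
    # Only the head of the page is scanned; the <meta> tag sits there.
    m = re.search(br'charset=["\']?([\w-]+)', html_bytes[:1024])
    if m:
        return m.group(1).decode('ascii')
    return 'utf-8'

print(detect_charset('text/html; charset=utf-8', b''))       # utf-8
print(detect_charset('text/html', b'<meta charset="gbk">'))  # gbk
```

With the charset known, `html_bytes.decode(detect_charset(...))` yields a proper unicode string that can be re-encoded as utf-8 before saving.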
I collected some references:

Python's classic problem: Chinese mojibake
http://hi.baidu.com/yobin/item/166e3a46537781d3c1a59257

Fixing garbled output when fetching pages with python urllib2
http://www.zhxl.me/1409.html

Things to watch out for when crawling web pages with Python
http://blog.raphaelzhang.com/2012/03/issues-in-python-crawler/

Chinese mojibake problems in python
http://blog.youkuaiyun.com/nwpulei/article/details/8581678