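1. urllib2.URLError: <urlopen error no host given>
I tracked down the cause, and it was a silly mistake: the value of url was wrong, so urlopen had no host to connect to.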
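One way to catch this kind of mistake before calling urlopen is to check that the URL actually contains a host. The sketch below is only an illustration: has_host is a hypothetical helper name that does not appear in the original post, and the import fallback is there only so the same snippet runs on both Python 2 and Python 3.

```python
try:
    from urlparse import urlparse  # Python 2, matching the urllib2 code in this post
except ImportError:
    from urllib.parse import urlparse  # Python 3 equivalent


def has_host(url):
    """Return True if url has both a scheme and a non-empty host part.

    has_host is a hypothetical helper used for illustration only.
    """
    parts = urlparse(url)
    return bool(parts.scheme) and bool(parts.netloc)


print(has_host("http://write.blog.youkuaiyun.com"))  # True
print(has_host("http://"))  # False: scheme present, host missing
```

If has_host returns False, the url variable is worth inspecting before it is passed to urllib2.urlopen.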
2. Source code
import urllib2

# Request the page using urllib2's default headers
file2 = urllib2.urlopen("http://write.blog.youkuaiyun.com")
content = file2.read()
print content
Traceback: urllib2.HTTPError: HTTP Error 403: forbidden
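The site blocks crawlers, so the request has to be disguised as one coming from a browser by sending a browser User-Agent header.
Modified code: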
import urllib2

# Send a browser-style User-Agent so the server does not reject the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
req = urllib2.Request(url="http://write.blog.youkuaiyun.com", headers=headers)
file2 = urllib2.urlopen(req)
content = file2.read()
print content
The page content is now fetched successfully.
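Both errors in this post can also be caught in code rather than read from a traceback. The following is a minimal sketch under a few assumptions: build_request, fetch, and USER_AGENT are hypothetical names introduced here, the User-Agent value is the one used above with its spacing tidied, and the import fallback only exists so the snippet also runs on Python 3.

```python
try:
    import urllib2  # Python 2, as used in this post
except ImportError:
    import urllib.request as urllib2  # Python 3 module exposing the same names

# Browser-style User-Agent string, as in the modified code above
USER_AGENT = ("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) "
              "Gecko/20091201 Firefox/3.5.6")


def build_request(url, user_agent=USER_AGENT):
    """Build a Request that carries a browser-style User-Agent header."""
    return urllib2.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Return the response body, or None if the request fails."""
    try:
        response = urllib2.urlopen(build_request(url))
        return response.read()
    except urllib2.HTTPError as e:
        # The server answered with an error status, e.g. 403
        print("HTTP error: %d" % e.code)
    except urllib2.URLError as e:
        # No usable response at all, e.g. a missing host in the URL
        print("URL error: %s" % e.reason)
    return None
```

HTTPError is caught first because it is a subclass of URLError. Calling fetch("http://write.blog.youkuaiyun.com") would then return the page body or print which of the two errors occurred.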