1. Fetch the page source from a URL:
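import re
import urllib

def getHtml(url):
    agent = ''  # placeholder for a User-Agent string; not used yet
    page = urllib.urlopen(url)  # Python 2 API
    html = page.read()
    page.close()
    return html

html = None
try:
    html = getHtml(url='https://www.zhihu.com/question/20899988')
    # html is a raw byte string; call html.decode('utf-8') if text is needed
except Exception:
    print 'getHtml fail'
if html is not None:
    print html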
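The listing above uses Python 2 calls (`urllib.urlopen`, the `print` statement). As a rough Python 3 counterpart, the sketch below fetches a page with `urllib.request`, sends a User-Agent header, and decodes the body to text. The names `fetch_html` and `DEFAULT_AGENT`, the agent string, and the 10-second timeout are assumptions made for illustration; they do not come from the original notes.

```python
import urllib.request

# Hypothetical default; any browser-like User-Agent string could be used instead.
DEFAULT_AGENT = "Mozilla/5.0 (compatible; example-fetcher)"


def fetch_html(url, agent=DEFAULT_AGENT, timeout=10):
    """Return the body at `url` decoded to text (illustrative helper)."""
    request = urllib.request.Request(url, headers={"User-Agent": agent})
    with urllib.request.urlopen(request, timeout=timeout) as page:
        raw = page.read()
        # Use the charset declared by the response, falling back to UTF-8.
        charset = page.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

A call such as `fetch_html('https://www.zhihu.com/question/20899988')` would stand in for `getHtml(...)`. Whether the site answers an anonymous request is a separate question.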
2. Download images from the fetched page
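def getImg(html):
    # Capture the src value of every attribute that points at a .jpg file.
    reg = r'src="(.+?\.jpg)"'
    pat = re.compile(reg)
    imgList = re.findall(pat, html)
    # Save each image in the current directory as 1.jpg, 2.jpg, ...
    x = 1
    for imgurl in imgList:
        urllib.urlretrieve(imgurl, '%s.jpg' % x)
        x += 1
    return imgList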
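A possible Python 3 variant of the same idea is sketched below. It separates finding the image URLs from downloading them, and resolves relative `src` values against the page URL with `urljoin`, since `urlretrieve` needs an absolute URL. The function names, the `base_url` parameter, and the slightly tightened pattern (`[^"]+?` instead of `.+?`, so a match cannot run across several attributes) are assumptions, not part of the original.

```python
import re
import urllib.request
from urllib.parse import urljoin

# Tightened variant of the getImg pattern: the capture cannot cross a quote.
JPG_SRC = re.compile(r'src="([^"]+?\.jpg)"')


def find_jpg_urls(html, base_url):
    """Return absolute URLs for every src="....jpg" attribute in `html`."""
    return [urljoin(base_url, src) for src in JPG_SRC.findall(html)]


def download_images(urls):
    """Save each URL as 1.jpg, 2.jpg, ... in the current directory."""
    for index, url in enumerate(urls, start=1):
        urllib.request.urlretrieve(url, "%d.jpg" % index)
```

Keeping `find_jpg_urls` free of network access makes it easy to check on a small HTML string before calling `download_images` on the result.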
3. Simulate a login before scraping
Background:
HTTP message headers: Understanding HTTP message headers
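To make the role of HTTP headers in a simulated login more concrete, here is a minimal Python 3 sketch. It assumes a site that accepts a form POST and then tracks the session with cookies. The helper names, the form field names used in the usage note, and the default User-Agent string are all hypothetical; the real login form of any given site has to be inspected to find the fields it expects.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_session_opener():
    """Return an opener that stores response cookies and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, fields, agent="Mozilla/5.0"):
    """Build a POST request carrying form fields and a User-Agent header."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    headers = {
        "User-Agent": agent,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    # Supplying `data` makes urllib send the request as a POST.
    return urllib.request.Request(login_url, data=body, headers=headers)
```

In use, one would call `opener.open(build_login_request(login_url, {'username': ..., 'password': ...}))` once, then fetch protected pages with the same `opener` so that the session cookies are sent along. Many sites also require extra fields such as a CSRF token, which this sketch does not handle.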