download
https://www.python.org/downloads/release/python-352/
A simple web crawler in Python
http://www.cnblogs.com/fnng/p/3576154.html
Fix for the missing api-ms-win-crt-runtime-l1-1-0.dll
https://www.microsoft.com/zh-cn/download/confirmation.aspx?id=48145
can't use a string pattern on a bytes-like object
imglist = re.findall(imgre,html.decode('GBK'))
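A minimal sketch of why this error appears and two ways around it. The sample HTML bytes and the example.com URL are made up for illustration; a real page returned by page.read() behaves the same way because read() returns bytes.

```python
import re

# Hypothetical sample standing in for the bytes returned by page.read()
html = b'<img src="http://example.com/a.jpg" pic_ext="jpeg">'
pattern = r'src="(.+?\.jpg)" pic_ext'

# A str pattern applied to bytes raises TypeError
try:
    re.findall(pattern, html)
    raised = False
except TypeError:
    raised = True

# Fix 1: decode the bytes to str before matching
from_text = re.findall(pattern, html.decode('utf-8'))

# Fix 2: keep the bytes and use a bytes pattern instead
from_bytes = re.findall(pattern.encode('ascii'), html)

print(raised, from_text, from_bytes)
```

Decoding first is usually more convenient here, since urlretrieve expects the URL as a str.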
inconsistent use of tabs and spaces in indentation
Replace the tabs with spaces.
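Most editors can convert tabs to spaces, but the same thing can be sketched in a few lines of Python. The helper name tabs_to_spaces and the width of 4 are assumptions for illustration; only leading whitespace is touched, so tabs inside string literals stay as they are.

```python
def tabs_to_spaces(source, width=4):
    """Expand tabs in the leading whitespace of every line to spaces."""
    fixed = []
    for line in source.splitlines(keepends=True):
        body = line.lstrip(' \t')
        indent = line[:len(line) - len(body)]
        fixed.append(indent.expandtabs(width) + body)
    return ''.join(fixed)

sample = "if x:\n\tprint(x)\n"
print(repr(tabs_to_spaces(sample)))
```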
UnicodeDecodeError: 'gbk' codec can't decode byte 0xaf in position 197: illegal multibyte sequence
html.decode('utf-8')
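When the page encoding is not known in advance, one option is to try the likely codecs in turn. This is only a sketch: the helper name decode_html and the candidate list are assumptions, and trial decoding can occasionally pick the wrong codec because some byte sequences are valid in more than one encoding.

```python
def decode_html(raw, encodings=('utf-8', 'gbk')):
    """Return raw decoded with the first codec that accepts it."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing matched: fall back to the first codec and replace bad bytes
    return raw.decode(encodings[0], errors='replace')

print(decode_html('\u4e2d\u6587'.encode('gbk')))
```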
The following works with Python 3.5.2:
#coding=utf-8
import urllib.request
import re

def getHtml(url):
    # Fetch the page and return its raw bytes
    page = urllib.request.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    # Decode the bytes first so the str pattern can be applied
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    imglist = re.findall(imgre,html.decode('utf-8'))
    x = 0
    for imgurl in imglist:
        urllib.request.urlretrieve(imgurl,'D://%s.jpg' % x)
        x+=1
    print(x)

html = getHtml("http://tieba.baidu.com/p/2460150866")
getImg(html)
If the page uses the GBK character set, change the decoding accordingly:
charset=gbk
#coding=utf-8
import urllib.request
import re
import datetime

def getHtml(url):
    # Fetch the page and return its raw bytes
    page = urllib.request.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    # This page declares charset=gbk, so decode with 'gbk'
    reg = r'file="(.+?\.jpg)"'
    imgre = re.compile(reg)
    imglist = re.findall(imgre,html.decode('gbk'))
    x = 0
    for imgurl in imglist:
        urllib.request.urlretrieve(imgurl,'D://06_Download//py//%s.jpg' % x)
        x+=1
    print("Total files downloaded:",x)

starttime= datetime.datetime.now()
html = getHtml("http://www.cmfish.com/bbs/forum.php?mod=viewthread&tid=306167&extra=page%3D1")
getImg(html)
usetime= datetime.datetime.now()-starttime
print('Elapsed time:',usetime)
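Rather than editing the codec by hand for each site, the charset declared in the page can be read from the raw bytes. A rough sketch, assuming the declaration appears within the first 2048 bytes; the helper name detect_charset and the 'utf-8' default are assumptions for illustration.

```python
import re

def detect_charset(raw, default='utf-8'):
    """Return the charset named in the page's meta tag, or default."""
    match = re.search(rb'charset=["\']?([\w-]+)', raw[:2048], re.IGNORECASE)
    return match.group(1).decode('ascii').lower() if match else default

# Hypothetical sample resembling the head of a GBK page
sample = b'<meta http-equiv="Content-Type" content="text/html; charset=gbk" />'
print(detect_charset(sample))
```

With this in place, html.decode(detect_charset(html)) could replace the hard-coded html.decode('gbk'). When the server sends the charset in its Content-Type header, page.headers.get_content_charset() may also give the answer without scanning the body.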