Learning Python Web Scraping

Original post · latest recommended article published 2022-09-05 21:01:32 · 546 views

License: CC 4.0 BY-SA

Download Python 3.5.2
https://www.python.org/downloads/release/python-352/

A simple web scraper in Python (reference article)
http://www.cnblogs.com/fnng/p/3576154.html

Fix for the missing api-ms-win-crt-runtime-l1-1-0.dll error (install the Microsoft download below)
https://www.microsoft.com/zh-cn/download/confirmation.aspx?id=48145

TypeError: can't use a string pattern on a bytes-like object
page.read() returns bytes, so decode them to str before matching:
imglist = re.findall(imgre,html.decode('GBK'))
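
The error can be reproduced without any network access. The following is a minimal sketch, assuming a made-up HTML snippet (`raw` is a hypothetical stand-in for the bytes returned by `page.read()`, and the URL is invented):

```python
import re

# Hypothetical stand-in for the bytes returned by page.read().
raw = '<img src="http://example.com/a.jpg" pic_ext="jpeg">'.encode('gbk')

imgre = re.compile(r'src="(.+?\.jpg)"')

try:
    re.findall(imgre, raw)  # str pattern applied to bytes: raises TypeError
except TypeError as err:
    error_message = str(err)
    print(error_message)

# Decoding to str first makes the pattern and the subject the same type.
imglist = re.findall(imgre, raw.decode('gbk'))
print(imglist)
```

An alternative would be to compile a bytes pattern instead, but decoding first keeps the extracted URLs as ordinary strings.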

TabError: inconsistent use of tabs and spaces in indentation
Replace the tabs with spaces so that every line is indented the same way.
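
Most editors can do this replacement automatically. If a script is more convenient, here is a rough sketch; the helper name `tabs_to_spaces` and the width of 4 are assumptions, not something from the original notes:

```python
def tabs_to_spaces(source, width=4):
    """Expand tabs in the leading whitespace of each line (hypothetical helper)."""
    fixed = []
    for line in source.splitlines(keepends=True):
        body = line.lstrip(' \t')
        indent = line[:len(line) - len(body)]
        # Only the indentation is touched; tabs inside the line are kept.
        fixed.append(indent.expandtabs(width) + body)
    return ''.join(fixed)


sample = "def f():\n\treturn 1\n"
print(tabs_to_spaces(sample))
```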

UnicodeDecodeError: 'gbk' codec can't decode byte 0xaf in position 197: illegal multibyte sequence
The page is not GBK-encoded; decode it as UTF-8 instead:
html.decode('utf-8')
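
When the encoding is not known in advance, one option is to try several candidates in turn. This is only a sketch: the helper `decode_html` and its default candidate list are assumptions, and trial decoding is a heuristic, since some byte sequences are valid in more than one encoding.

```python
def decode_html(raw, encodings=('utf-8', 'gbk')):
    """Return the first clean decode among the candidate encodings (hypothetical helper)."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: substitute undecodable bytes rather than fail.
    return raw.decode(encodings[0], errors='replace')


print(decode_html('中文'.encode('gbk')))
print(decode_html('中文'.encode('utf-8')))
```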

The following code works with Python 3.5.2:

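# coding=utf-8
import re
import urllib.request


def getHtml(url):
    # Fetch the page and return its raw bytes.
    page = urllib.request.urlopen(url)
    html = page.read()
    return html


def getImg(html):
    # The pattern matches the .jpg URLs of the images on this Tieba page.
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    # Decode the bytes first; a str pattern cannot search a bytes object.
    imglist = re.findall(imgre, html.decode('utf-8'))
    x = 0
    for imgurl in imglist:
        # Save the images as D:/0.jpg, D:/1.jpg, ...
        urllib.request.urlretrieve(imgurl, 'D:/%s.jpg' % x)
        x += 1
    print(x)


html = getHtml("http://tieba.baidu.com/p/2460150866")
getImg(html)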
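
Note that `urlretrieve` raises an error if the target folder does not exist. A small sketch of building the save path safely follows; the helper `target_path` is hypothetical, and a temporary folder stands in for the `D:` paths used in these notes:

```python
import os
import tempfile


def target_path(directory, index):
    """Create `directory` if needed and return the path for image number `index`."""
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, '%s.jpg' % index)


# Demo with a throwaway folder; a real run would pass its own download folder.
demo_dir = os.path.join(tempfile.mkdtemp(), 'py')
first = target_path(demo_dir, 0)
print(first)
```

Inside the download loop, the result of `target_path(folder, x)` would replace the hard-coded file name passed to `urlretrieve`.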

If the page uses the GBK character set, that is, its source declares
charset=gbk
then change the decoding accordingly:

# coding=utf-8
import datetime
import re
import urllib.request


def getHtml(url):
    # Fetch the page and return its raw bytes.
    page = urllib.request.urlopen(url)
    html = page.read()
    return html


def getImg(html):
    # The pattern assumes the image URL is stored in a file="..." attribute.
    reg = r'file="(.+?\.jpg)"'
    imgre = re.compile(reg)
    # The page declares charset=gbk, so decode as GBK.
    imglist = re.findall(imgre, html.decode('gbk'))
    x = 0
    for imgurl in imglist:
        # The target folder must already exist.
        urllib.request.urlretrieve(imgurl, 'D:/06_Download/py/%s.jpg' % x)
        x += 1
    print("Total files downloaded:", x)


starttime = datetime.datetime.now()
html = getHtml("http://www.cmfish.com/bbs/forum.php?mod=viewthread&tid=306167&extra=page%3D1")
getImg(html)
usetime = datetime.datetime.now() - starttime
print('Time taken:', usetime)
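
Instead of checking the page source by hand, the declared charset can be read from the downloaded bytes. The following is a rough sketch; the helper `sniff_charset`, the 2048-byte window, and the UTF-8 default are all assumptions:

```python
import re


def sniff_charset(raw, default='utf-8'):
    """Look for charset=... near the top of the page (hypothetical helper)."""
    match = re.search(rb'charset=["\']?([\w-]+)', raw[:2048], re.IGNORECASE)
    return match.group(1).decode('ascii').lower() if match else default


page = b'<meta http-equiv="Content-Type" content="text/html; charset=gbk" />'
print(sniff_charset(page))
```

The result could then be passed to `html.decode(...)` in `getImg`, although a page may declare a charset that does not match its actual bytes.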
