python11_王文友软件 - CSDN Blog

Article link: https://blog.youkuaiyun.com/big_data_study/article/details/102309201

import re
import urllib.request
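'''
urlretrieve(url, local_file_path)  # download a web page straight to a local file
urlcleanup()  # clear the cache (temporary files) left behind by urlretrieve
info()  # header information for the current response
getcode()  # HTTP status code of the response
geturl()  # URL of the page that was fetched
'''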

urlretrieve(url, local_file_path): download a web page directly to a local file

data=urllib.request.urlretrieve("http://www.baidu.com","D:/百度网盘/down.html")
print(data)  # a (local file path, response headers) tuple
'''
The page was downloaded successfully and saved as an HTML file.
'''
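As a minimal sketch of what `urlretrieve` returns: the value printed above is a two-item tuple holding the local file path and the response headers. The `data:` URL and the temporary directory below are stand-ins chosen only so the example runs without network access; they are not from the original post.

```python
import os
import tempfile
import urllib.request

# Assumption: a data: URL replaces the real web page so the sketch runs offline.
source_url = "data:text/html;charset=utf-8,hello%20crawler"

with tempfile.TemporaryDirectory() as workdir:
    # Hypothetical target path inside a throwaway directory.
    target = os.path.join(workdir, "down.html")

    # urlretrieve returns (local file path, headers object).
    saved_path, headers = urllib.request.urlretrieve(source_url, target)
    path_matches = saved_path == target

    # Read the saved copy back to confirm the content arrived on disk.
    with open(saved_path, encoding="utf-8") as fh:
        saved_text = fh.read()

content_type = headers.get_content_type()
print(path_matches, content_type, saved_text)
```

Unpacking the tuple this way makes it easy to check the headers before deciding what to do with the saved file.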

# urlcleanup(): call it directly, no arguments needed
urllib.request.urlcleanup()
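To illustrate what `urlcleanup()` actually removes, here is a small sketch. When `urlretrieve` is called without a file name, it stores the download in a temporary file, and `urlcleanup()` deletes such files. The `data:` URL is an assumed stand-in so the example needs no network access.

```python
import os
import urllib.request

# Assumption: a data: URL replaces a real page so the sketch runs offline.
source_url = "data:text/plain,temporary%20copy"

# With no filename argument, urlretrieve writes to a temporary file.
temp_path, _headers = urllib.request.urlretrieve(source_url)
existed_before = os.path.exists(temp_path)

# urlcleanup() deletes the temporary files created by earlier urlretrieve calls.
urllib.request.urlcleanup()
exists_after = os.path.exists(temp_path)

print(existed_before, exists_after)
```

Files saved to an explicit path, as in the download example earlier, are not touched by `urlcleanup()`.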

info(): information about the current response, i.e. a summary in the form of its HTTP headers

file=urllib.request.urlopen("https://read.douban.com/provider/all")
print(file.info())
'''
Date: Mon, 07 Oct 2019 07:11:41 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 72239
Connection: close
…
'''
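Besides printing the whole header block, individual fields can be read from the object that `info()` returns, which behaves like an email-style message. The sketch below assumes a `data:` URL in place of the douban page so it runs offline; for a real HTTP response the same accessors apply.

```python
import urllib.request

# Assumption: a data: URL stands in for the real page so no network is needed.
demo_url = "data:text/html;charset=utf-8,hello%20crawler"

with urllib.request.urlopen(demo_url) as resp:
    headers = resp.info()

    # Header lookup by name is case-insensitive and returns a string.
    length = int(headers["Content-Length"])

    # Helpers for splitting the Content-Type header into its parts.
    content_type = headers.get_content_type()
    charset = headers.get_content_charset()

    # Use the declared charset to decode the body.
    body = resp.read().decode(charset)

print(content_type, charset, length, body)
```

Reading the charset from the headers avoids hard-coding an encoding when decoding the page body.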

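getcode(): returns the HTTP status code of the current page. A status code of 200 means the page was served normally and can be crawled; other codes generally mean it cannot be crawled as-is. Note that urlopen raises urllib.error.HTTPError for 4xx and 5xx responses instead of returning them.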

file=urllib.request.urlopen("https://read.douban.com/provider/all")
#print(file.info())
print(file.getcode())
'''
200
'''
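Because `urlopen` raises `urllib.error.HTTPError` for error statuses rather than returning them, a status check usually needs a `try`/`except`. The following is a sketch under stated assumptions: `DemoHandler`, `fetch_status`, and the local test server are hypothetical names introduced here so the example runs without contacting a real site.

```python
import http.server
import threading
import urllib.error
import urllib.request


class DemoHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical demo handler: /ok answers 200, any other path answers 404."""

    def do_GET(self):
        if self.path == "/ok":
            body = b"fine"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_status(url):
    """Return the HTTP status for url, whether urlopen returns or raises."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.getcode()
    except urllib.error.HTTPError as err:
        code = err.code  # the status code is carried on the exception
        err.close()
        return code


# Ignore any system proxy so loopback requests go straight to the demo server.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({}))
)

# Port 0 lets the operating system pick a free port.
server = http.server.HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

ok_status = fetch_status(base + "/ok")
missing_status = fetch_status(base + "/missing")
print(ok_status, missing_status)

server.shutdown()
server.server_close()
```

In a crawler, a helper like this makes it possible to log or skip failing pages instead of letting one bad URL stop the whole run.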

geturl(): returns the URL of the page that was actually fetched (after a redirect, this is the final URL)

file=urllib.request.urlopen("https://read.douban.com/provider/all")
#print(file.info())
#print(file.getcode())
print(file.geturl())
'''
https://read.douban.com/provider/all
'''
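In the run above the returned URL equals the requested one, so the value of `geturl()` is easier to see when a redirect is involved. This sketch assumes a hypothetical local server (`RedirectHandler`, with made-up paths `/old` and `/new`) so it can run offline; `urlopen` follows the redirect automatically and `geturl()` reports where it ended up.

```python
import http.server
import threading
import urllib.request


class RedirectHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical demo handler: /old redirects to /new, which answers 200."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"moved here"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Ignore any system proxy so loopback requests go straight to the demo server.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({}))
)

# Port 0 lets the operating system pick a free port.
server = http.server.HTTPServer(("127.0.0.1", 0), RedirectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

requested_url = base + "/old"
with urllib.request.urlopen(requested_url, timeout=5) as resp:
    final_url = resp.geturl()      # URL after the redirect was followed
    final_status = resp.getcode()  # status of the final response

print(requested_url)
print(final_url)

server.shutdown()
server.server_close()
```

Comparing the requested URL with `geturl()` is a simple way to detect that a page has moved.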