Basic framework for Python requests crawlers
### Crawler basics
- requests: sends HTTP requests and fetches HTML pages;
- robots: the Robots Exclusion Standard that well-behaved crawlers honor (see the sketch after this list);
- Beautiful Soup: parses HTML pages;
- re: extraction with regular expressions;
- Scrapy[^1]: a framework for large-scale crawling;
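Before crawling a site, it is polite to check its robots.txt first. A minimal sketch using the standard-library `urllib.robotparser` (the Baidu URLs are just sample values matching the host used below; any site works):

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt, then ask whether a URL may be crawled.
rp = RobotFileParser()
rp.set_url('http://www.baidu.com/robots.txt')
rp.read()

# can_fetch(user_agent, url) -> True if the rules allow this agent to fetch the URL
print(rp.can_fetch('*', 'http://www.baidu.com/s?wd=python'))
```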
### Basic requests functions
import requests

url = 'http://www.baidu.com'
response = requests.get(url)
print(response.status_code)        # 200
print(type(response))              # <class 'requests.models.Response'>
print(response.headers)            # {'Date': 'Mon, 21 Jan 2019 12:29:05 GMT', 'Connection': 'Keep-Alive', ...}
print(response.encoding)           # ISO-8859-1: guessed from the HTTP headers; this is the default when no charset is declared
print(response.apparent_encoding)  # utf-8: fallback encoding inferred from the response body
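Because the headers-based guess is often wrong for Chinese pages, a common pattern (used in the framework below) is to override `encoding` with `apparent_encoding` before reading `.text`:

```python
import requests

response = requests.get('http://www.baidu.com')
# Decode .text with the charset detected from the body, not the headers' guess.
response.encoding = response.apparent_encoding
print(response.text[:200])  # the first 200 characters, now decoded correctly
```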
### Basic requests framework
# Generic code framework for fetching a page safely
import requests

def getHtmlText(url):
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # raise an HTTPError if the status code is not 200
        response.encoding = response.apparent_encoding
        return response.text
    except requests.RequestException:
        return 'Exceptional error occurred'

if __name__ == '__main__':
    url = 'http://file.dl01.zxxk.com//OutFile/20190122/11265993164061984.doc?mkey=5e997e6187fcdc82217253aed14a5676705'
    print(getHtmlText(url))
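A quick way to see the except branch in action is to request a page that does not exist; assuming the server answers with a 404, `raise_for_status()` throws and the fallback string is returned (the URL below is a hypothetical example):

```python
# A 404 response triggers raise_for_status(), so the except branch runs.
print(getHtmlText('http://www.baidu.com/no-such-page'))  # -> 'Exceptional error occurred'
```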
### Basic image download framework
import os
import requests

url = "http://image.ngchina.com.cn/2019/0122/20190122124507342.jpg"
root = "D://pics//"
path = root + url.split("/")[-1]  # name the local file after the last URL segment
try:
    if not os.path.exists(root):  # create the target folder if it does not exist
        os.mkdir(root)
    if not os.path.exists(path):  # only download if the file is not already there
        r = requests.get(url)
        with open(path, 'wb') as f:
            f.write(r.content)    # r.content holds the binary body of the response
        print('download success')
    else:
        print('file already exists')
except Exception:
    print('download failed')
### Crawling a Chinese university ranking site with requests and bs4
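The original notes break off here. As a hedged sketch of what such a crawler typically looks like, the safe-fetch helper from above can be combined with BeautifulSoup to pull rows out of a ranking table. The URL, the `tbody`/`td` tag structure, and the column count below are placeholders, since the target page's markup is not given in the notes:

```python
import requests
from bs4 import BeautifulSoup


def get_html_text(url):
    # Same safe-fetch pattern as the generic framework above.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ''


def parse_ranking(html):
    # Collect the first three cells of each table row; the 'tbody'/'td'
    # structure is an assumption about the page, not taken from the notes.
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    body = soup.find('tbody')
    if body is None:
        return rows
    for tr in body.find_all('tr'):
        tds = tr.find_all('td')
        if tds:
            rows.append([td.get_text(strip=True) for td in tds[:3]])
    return rows


if __name__ == '__main__':
    url = 'http://www.example.com/university-ranking.html'  # placeholder URL
    for row in parse_ranking(get_html_text(url))[:10]:
        print(row)
```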