Python Web Scraping in Practice (1): Scraping HD Wallpapers from an Image Site

Latest recommended article published on 2024-05-01 22:01:25

sky_on_the_way

License: CC 4.0 BY-SA

Category: Python web scraping in practice. Tags: python

Article link: https://blog.youkuaiyun.com/weixin_41780080/article/details/106412728

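The robots.txt protocol:

Before crawling a site, first check whether a robots.txt file exists in the site's root directory. If it does, a well-behaved crawler uses its contents to decide which parts of the site it may access; if it does not, crawlers may access every page on the site that is not password-protected. For example, in http://pic.netbian.com/robots.txt, the directories and files listed under Disallow must not be crawled.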
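The check described above can be sketched with the standard library's `urllib.robotparser`. This is only a minimal illustration: the rules in `SAMPLE_ROBOTS_TXT` and the `example.com` URLs are hypothetical and are not the actual contents of the site's robots.txt, and `build_parser` is a helper name introduced here.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical rules for illustration only (not taken from a real site).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = build_parser(SAMPLE_ROBOTS_TXT)

# A path under a Disallow rule must not be crawled.
print(parser.can_fetch("*", "http://example.com/private/page.html"))  # False

# A path that no rule matches may be crawled.
print(parser.can_fetch("*", "http://example.com/index.html"))  # True
```

To check a live site instead of a string, one option is to call `parser.set_url(...)` with the robots.txt URL and then `parser.read()` before calling `can_fetch`.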

#!/usr/bin/env python
# coding: utf-8
# @Desc  : Scrape wallpapers from the 彼岸图网 site (pic.netbian.com)

import requests
from bs4 import BeautifulSoup
import os
import re

# Create the directory if it does not already exist
def createDir(filePath):
    if not os.path.exists(filePath):
        os.makedirs(filePath)

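# Get the wallpaper hyperlinks
def getAndSavehref(url,pagenum):
    try:
        # Build paths with os.path.join instead of "\img", which relies on
        # an invalid escape sequence and only works on Windows
        createDir(os.path.join(os.getcwd(), "img"))
        f = open(os.path.join(os.getcwd(), "img", "html_href.txt"), 'w', encoding='utf-8')
        # Request headers (browser-like User-Agent)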
        head = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
        }
        for i in range(1, pagenum):
            if i == 1:
                r = requests.get(url, headers=head)
            else:
                r = requests.get(