Python web scraping turns out to be this simple. A couple of days ago a friend asked me to help her figure out what was wrong with a weather-scraping program written in Python. I had heard that Python is great for web scraping, but somehow I had never found the time to try it, so I took this opportunity to learn, and it really is simpler than I expected. This post records my first Python scraper.
References:
- Step-by-step: scraping weather forecasts for each city in Shandong with Python + Scrapy
- Python 3 getting-started notes (1): installing and running on Windows
- "pip install scrapy" fails: how to install Scrapy correctly
- NameError: name 'urlopen' is not defined
Install Python
- Windows download page
- Run the installer
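After the installer finishes, you can confirm in cmd that both the interpreter and pip are on the PATH:

python --version
pip --version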
Install Scrapy
- Upgrade pip
python -m pip install --upgrade pip
- Install wheel
pip install wheel
- Install lxml
lxml wheel download: pick the file that matches your installed Python version (as shown in the figure).
After downloading, right-click the file, choose Properties, then the Security tab, and copy the full file path shown there. In cmd, run: pip install <file path>.
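For example, assuming the wheel was saved to C:\Downloads (the path and file name below are only illustrative; use the file you actually downloaded):

pip install C:\Downloads\lxml-4.2.1-cp37-cp37m-win_amd64.whl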
- Install pyOpenSSL
pyOpenSSL wheel download.
After downloading, copy the file path the same way (Properties, then Security) and run in cmd: pip install <file path>.
- Install Twisted
Twisted wheel download; install it from the downloaded file path in the same way.
- Install pywin32
pywin32 download; run the downloaded installer.
- Install Scrapy
Run in cmd: pip install scrapy
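If the installation succeeded, the scrapy command is now available; a quick check:

scrapy version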
First spider project
- Create the project: scrapy startproject sdWeatherSpider
- Enter the project folder and run the command below to create the spider.
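Given the spider name everyCityinSD and the domain www.weather.com.cn used later in this project, the command would presumably be:

scrapy genspider everyCityinSD www.weather.com.cn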
- Directory structure:
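For reference, a freshly generated Scrapy project has roughly this layout (everyCityinSD.py appears under spiders/ after the genspider step):

sdWeatherSpider/
    scrapy.cfg
    sdWeatherSpider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            everyCityinSD.py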
- Open http://www.weather.com.cn/shandong/index.shtml, right-click to view the page source, and locate the part shown in the figure (the links to each city's forecast page).
- Open http://www.weather.com.cn/weather/101120101.shtml, view the page source, and locate the part shown in the figure (the forecast list that the spider's XPath expressions target below).
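To verify this analysis before writing the spider, a small standalone sketch can preview the city links that the regular expression (the same one used in the spider below) extracts from the index page:

import re
from urllib.request import urlopen

url = 'http://www.weather.com.cn/shandong/index.shtml'
with urlopen(url) as fp:
    contents = fp.read().decode()

# each match is a (forecast page URL, city name) pair
pattern = '<a title=".*?" href="(.+?)" target="_blank">(.+?)</a>'
for link, city in re.findall(pattern, contents)[:5]:
    print(city, link)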
- Modify items.py to define the fields to scrape:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class SdweatherspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    city = scrapy.Field()
    weather = scrapy.Field()
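A scrapy.Item behaves like a dictionary restricted to the declared fields, which is why the spider below assigns item['city'] and item['weather'] directly; assigning to an undeclared key raises a KeyError. A tiny illustration with made-up values:

item = SdweatherspiderItem()
item['city'] = '济南'        # declared field, OK
item['weather'] = '...'      # declared field, OK
# item['temperature'] = 20   # not declared, raises KeyError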
- Modify the spider file everyCityinSD.py to define how the content is scraped; the extraction rules follow from the page analysis above:
# -*- coding: utf-8 -*-
import re
from urllib.request import urlopen

import scrapy

from sdWeatherSpider.items import SdweatherspiderItem


class EverycityinsdSpider(scrapy.Spider):
    name = 'everyCityinSD'
    allowed_domains = ['www.weather.com.cn']
    start_urls = []

    # walk the provincial index page and collect every city's forecast page
    url = r'http://www.weather.com.cn/shandong/index.shtml'
    with urlopen(url) as fp:
        contents = fp.read().decode()
    pattern = '<a title=".*?" href="(.+?)" target="_blank">(.+?)</a>'
    for url in re.findall(pattern, contents):
        start_urls.append(url[0])

    def parse(self, response):
        # handle the weather forecast page of one city
        item = SdweatherspiderItem()
        city = response.xpath('//div[@class="crumbs fl"]//a[3]//text()').extract()[0]
        item['city'] = city

        selector = response.xpath('//ul[@class="t clearfix"]')[0]
        # accumulate the forecast text for every day in the list
        weather = ''
        for li in selector.xpath('./li'):
            date = li.xpath('./h1//text()').extract()[0]
            cloud = li.xpath('./p[@title]//text()').extract()[0]
            # the high-temperature span is not always present on the page,
            # so its extraction is left disabled here
            # high = li.xpath('./p[@class="tem"]//span//text()').extract()[0]
            low = li.xpath('./p[@class="tem"]//i//text()').extract()[0]
            wind = li.xpath('./p[@class="win"]//em//span[1]/@title').extract()[0]
            wind = wind + li.xpath('./p[@class="win"]//i//text()').extract()[0]
            # weather = weather + date + ':' + cloud + ',' + high + r'/' + low + ',' + wind + '\n'
            weather = weather + date + ':' + cloud + ',' + r'/' + low + ',' + wind + '\n'
        item['weather'] = weather
        return [item]
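A note on the design: the spider fetches the provincial index page with urlopen while the class body is executed, i.e. at import time, which is why the urlopen import matters (see the NameError reference above). A slightly more conventional variant is to build the requests lazily; a minimal sketch of a start_requests method that could replace the class-level loop (it goes inside EverycityinsdSpider and reuses the same imports and regex):

    def start_requests(self):
        # fetch the index page only when the crawl actually starts
        index = 'http://www.weather.com.cn/shandong/index.shtml'
        with urlopen(index) as fp:
            contents = fp.read().decode()
        pattern = '<a title=".*?" href="(.+?)" target="_blank">(.+?)</a>'
        for link, _city in re.findall(pattern, contents):
            yield scrapy.Request(link, callback=self.parse)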
- Modify pipelines.py to write the scraped data to weather.txt:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class SdweatherspiderPipeline(object):
    def process_item(self, item, spider):
        # append each city's forecast to weather.txt in the working directory
        with open('weather.txt', 'a', encoding='utf8') as fp:
            fp.write(item['city'] + '\n')
            fp.write(item['weather'] + '\n\n')
        return item
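Since process_item reopens weather.txt for every city, another common Scrapy pattern is to open the file once per crawl with the open_spider/close_spider hooks; a sketch of that variation:

class SdweatherspiderPipeline(object):
    def open_spider(self, spider):
        # open the output file once when the crawl starts
        self.fp = open('weather.txt', 'a', encoding='utf8')

    def close_spider(self, spider):
        self.fp.close()

    def process_item(self, item, spider):
        self.fp.write(item['city'] + '\n')
        self.fp.write(item['weather'] + '\n\n')
        return item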
- Modify settings.py to register the pipeline that handles the scraped items (the value 1 is the pipeline's priority; lower numbers run earlier):
# -*- coding: utf-8 -*-
# Scrapy settings for sdWeatherSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'sdWeatherSpider'
SPIDER_MODULES = ['sdWeatherSpider.spiders']
NEWSPIDER_MODULE = 'sdWeatherSpider.spiders'
ITEM_PIPELINES = {
    'sdWeatherSpider.pipelines.SdweatherspiderPipeline': 1,
}
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'sdWeatherSpider (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'sdWeatherSpider.middlewares.SdweatherspiderSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'sdWeatherSpider.middlewares.SdweatherspiderDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'sdWeatherSpider.pipelines.SdweatherspiderPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
- Run the command scrapy crawl everyCityinSD to start the spider.
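Each city's forecast is appended to weather.txt in the directory the command is run from. To hide the verbose crawl log, the --nolog switch can be added:

scrapy crawl everyCityinSD --nolog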
- Full project code: https://download.youkuaiyun.com/download/qq_36135928/11232643