Python web scraping turns out to be this simple. A couple of days ago a friend asked me to help her figure out what was wrong with a weather-scraping program written in Python. I had heard that Python is great for web scraping, but somehow I had never found the time to try it, so I took this opportunity to learn, and it really is simpler than I expected. This post records my first Python scraper.
References:
- Step-by-step: scraping weather forecasts for each city in Shandong with Python + Scrapy
- Python 3 getting-started notes (1): installing and running on Windows
- "pip install scrapy" fails: how to install Scrapy correctly
- NameError: name 'urlopen' is not defined
Install Python
- Windows download page
- Run the installer
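After the installer finishes, you can confirm in cmd that both the interpreter and pip are on the PATH:

python --version
pip --version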
Install Scrapy
- Upgrade pip
python -m pip install --upgrade pip
- Install wheel
pip install wheel
- Install lxml
lxml wheel download: pick the file that matches your installed Python version (as shown in the figure).
After downloading, right-click the file, choose Properties, then the Security tab, and copy the full file path shown there. In cmd, run: pip install <file path>.
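For example, assuming the wheel was saved to C:\Downloads (the path and file name below are only illustrative; use the file you actually downloaded):

pip install C:\Downloads\lxml-4.2.1-cp37-cp37m-win_amd64.whl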
- Install pyOpenSSL
pyOpenSSL wheel download.
After downloading, copy the file path the same way (Properties, then Security) and run in cmd: pip install <file path>.
- Install Twisted
Twisted wheel download; install it from the downloaded file path in the same way.
- Install pywin32
pywin32 download; run the downloaded installer.
- Install Scrapy
Run in cmd: pip install scrapy
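If the installation succeeded, the scrapy command is now available; a quick check:

scrapy version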
First spider project
- Create the project: scrapy startproject sdWeatherSpider
- Enter the project folder and run the command below to create the spider.
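Given the spider name everyCityinSD and the domain www.weather.com.cn used later in this project, the command would presumably be:

scrapy genspider everyCityinSD www.weather.com.cn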
- Directory structure:
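For reference, a freshly generated Scrapy project has roughly this layout (everyCityinSD.py appears under spiders/ after the genspider step):

sdWeatherSpider/
    scrapy.cfg
    sdWeatherSpider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            everyCityinSD.py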
- Open http://www.weather.com.cn/shandong/index.shtml, right-click to view the page source, and locate the part shown in the figure (the links to each city's forecast page).
- Open http://www.weather.com.cn/weather/101120101.shtml, view the page source, and locate the part shown in the figure (the forecast list that the spider's XPath expressions target below).
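To verify this analysis before writing the spider, a small standalone sketch can preview the city links that the regular expression (the same one used in the spider below) extracts from the index page:

import re
from urllib.request import urlopen

url = 'http://www.weather.com.cn/shandong/index.shtml'
with urlopen(url) as fp:
    contents = fp.read().decode()

# each match is a (forecast page URL, city name) pair
pattern = '<a title=".*?" href="(.+?)" target="_blank">(.+?)</a>'
for link, city in re.findall(pattern, contents)[:5]:
    print(city, link)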
- Modify items.py to define the fields to scrape:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class SdweatherspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    city = scrapy.Field()
    weather = scrapy.Field()
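A scrapy.Item behaves like a dictionary restricted to the declared fields, which is why the spider below assigns item['city'] and item['weather'] directly; assigning to an undeclared key raises a KeyError. A tiny illustration with made-up values:

item = SdweatherspiderItem()
item['city'] = '济南'        # declared field, OK
item['weather'] = '...'      # declared field, OK
# item['temperature'] = 20   # not declared, raises KeyError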
- Modify the spider file everyCityinSD.py to define how the content is scraped; the extraction rules follow from the page analysis above:
# -*- coding: utf-8 -*-
import re
from urllib.request import urlopen

import scrapy

from sdWeatherSpider.items import SdweatherspiderItem


class EverycityinsdSpider(scrapy.Spider):
    name = 'everyCityinSD'
    allowed_domains = ['www.weather.com.cn']
    start_urls = []

    # walk the provincial index page and collect every city's forecast page
    url = r'http://www.weather.com.cn/shandong/index.shtml'
    with urlopen(url) as fp:
        contents = fp.read().decode()
    pattern = '<a title=".*?" href="(.+?)" target="_blank">(.+?)</a>'
    for url in re.findall(pattern, contents):
        start_urls.append(url[0])

    def parse(self, response):
        # handle the weather forecast page of one city
        item = SdweatherspiderItem()
        city = response.xpath('//div[@class="crumbs fl"]//a[3]//text()').extract()[0]
        item['city'] = city

        selector = response.xpath('//ul[@class="t clearfix"]')[0]
        # accumulate the forecast text for every day in the list
        weather = ''
        for li in selector.xpath('./li'):
            date = li.xpath('./h1//text()').extract()[0]
            cloud = li.xpath('./p[@title]//text()').extract()[0]
            # the high-temperature span is not always present on the page,
            # so its extraction is left disabled here
            # high = li.xpath('./p[@class="tem"]//span//text()').extract()[0]
            low = li.xpath('./p[@class="tem"]//i//text()').extract()[0]
            wind = li.xpath('./p[@class="win"]//em//span[1]/@title').extract()[0]
            wind = wind + li.xpath('./p[@class="win"]//i//text()').extract()[0]
            # weather = weather + date + ':' + cloud + ',' + high + r'/' + low + ',' + wind + '\n'
            weather = weather + date + ':' + cloud + ',' + r'/' + low + ',' + wind + '\n'
        item['weather'] = weather
        return [item]
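A note on the design: the spider fetches the provincial index page with urlopen while the class body is executed, i.e. at import time, which is why the urlopen import matters (see the NameError reference above). A slightly more conventional variant is to build the requests lazily; a minimal sketch of a start_requests method that could replace the class-level loop (it goes inside EverycityinsdSpider and reuses the same imports and regex):

    def start_requests(self):
        # fetch the index page only when the crawl actually starts
        index = 'http://www.weather.com.cn/shandong/index.shtml'
        with urlopen(index) as fp:
            contents = fp.read().decode()
        pattern = '<a title=".*?" href="(.+?)" target="_blank">(.+?)</a>'
        for link, _city in re.findall(pattern, contents):
            yield scrapy.Request(link, callback=self.parse)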
- Modify pipelines.py to write the scraped data to weather.txt:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class SdweatherspiderPipeline(object):
    def process_item(self, item, spider):
        # append each city's forecast to weather.txt in the working directory
        with open('weather.txt', 'a', encoding='utf8') as fp:
            fp.write(item['city'] + '\n')
            fp.write(item['weather'] + '\n\n')
        return item
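Since process_item reopens weather.txt for every city, another common Scrapy pattern is to open the file once per crawl with the open_spider/close_spider hooks; a sketch of that variation:

class SdweatherspiderPipeline(object):
    def open_spider(self, spider):
        # open the output file once when the crawl starts
        self.fp = open('weather.txt', 'a', encoding='utf8')

    def close_spider(self, spider):
        self.fp.close()

    def process_item(self, item, spider):
        self.fp.write(item['city'] + '\n')
        self.fp.write(item['weather'] + '\n\n')
        return item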
- Modify settings.py to register the pipeline that handles the scraped items (the value 1 is the pipeline's priority; lower numbers run earlier):
# -*- coding: utf-8 -*-
# Scrapy settings for sdWeatherSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'sdWeatherSpider'
SPIDER_MODULES = ['sdWeatherSpider.spiders']
NEWSPIDER_MODULE = 'sdWeatherSpider.spiders'
ITEM_PIPELINES = {
    'sdWeatherSpider.pipelines.SdweatherspiderPipeline': 1,
}
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'sdWeatherSpider (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'sdWeatherSpider.middlewares.SdweatherspiderSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'sdWeatherSpider.middlewares.SdweatherspiderDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'sdWeatherSpider.pipelines.SdweatherspiderPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
- Run the command scrapy crawl everyCityinSD to start the spider.
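Each city's forecast is appended to weather.txt in the directory the command is run from. To hide the verbose crawl log, the --nolog switch can be added:

scrapy crawl everyCityinSD --nolog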
- Full project code: https://download.youkuaiyun.com/download/qq_36135928/11232643