Preface
Some recent projects of mine need weather data for a location in China, so I wrote a small weather-scraping program.
I. Libraries and Websites Used
Libraries: requests and lxml.
Weather site to scrape: https://www.weatherol.cn/
JSON viewer for inspecting responses: https://www.json.cn/
Source for cityid values: https://blog.youkuaiyun.com/li_and_li/article/details/79602686
II. Scraping the Weather Data (dynamic)
1. Analyzing the API request
Open https://www.weatherol.cn/, open the browser developer tools, refresh the page, and filter the network panel by XHR. Several requests appear.
From these, the API serving the weather data is:
https://www.weatherol.cn/api/home/getCurrAnd15dAnd24h?cityid=101180301
The cityid parameter is the city's unified city code.
Next, paste the returned JSON into a viewer and see what it contains.
All 15 days of forecast data are in there, so we can pick out whichever fields we need.
2. Writing the scraper
The program's structure is simple: send a GET request with requests, then parse the response with parse_data().
The code:
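Before writing the full parser, it helps to confirm the nesting with a trimmed sample. The field names below (`forecast15d`, `temperature_am`, and so on) match what the API returned at the time of writing; the sample values are placeholders, not real data:

```python
# A trimmed sample mirroring the API response's nesting
# (field names from the live response; values are made up)
sample = {
    "data": {
        "current": {
            "current": {"weather": "晴", "temperature": "25"},
            "air": {"AQI": "40", "levelIndex": "优"},
            "tips": "天气不错",
        },
        "forecast15d": [
            {"temperature_am": "26", "temperature_pm": "15"},  # yesterday
            {"temperature_am": "28", "temperature_pm": "17"},  # today
        ],
    }
}

# Today's forecast sits at index 1; index 0 is yesterday
today = sample["data"]["forecast15d"][1]
print(today["temperature_pm"], "-", today["temperature_am"])  # low - high
```

Walking a sample like this by hand is a quick way to catch an off-by-one (such as reading yesterday's forecast) before it ends up in the parser.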
import json

import requests

# API endpoint
city_weather_url = 'http://www.weatherol.cn/api/home/getCurrAnd15dAnd24h'
# User-Agent string
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'


def get_headers():
    """
    Build the request headers.
    return: headers: dict
    """
    headers = {
        'User-Agent': user_agent
    }
    return headers


def parse_data(weather_data):
    """
    Parse the returned data and extract the useful fields.
    return: ret_dict: dict
    """
    current = weather_data['data']['current']['current']
    ret_dict = {}
    ret_dict['当前天气'] = current['weather']
    ret_dict['当前温度'] = current['temperature']
    # forecast15d[1] is today's forecast (index 0 is yesterday)
    high = weather_data['data']['forecast15d'][1]['temperature_am']
    low = weather_data['data']['forecast15d'][1]['temperature_pm']
    ret_dict['今日温度'] = f'{low} - {high}'
    ret_dict['风向'] = current['winddir']
    ret_dict['风速'] = current['windpower']
    ret_dict['气压'] = f"{current['airpressure']}hPa"
    ret_dict['湿度'] = f"{current['humidity']}%"
    aqi = weather_data['data']['current']['air']['AQI']
    level = weather_data['data']['current']['air']['levelIndex']
    ret_dict['空气质量'] = f'{aqi}/{level}'
    ret_dict['小提示'] = weather_data['data']['current']['tips']
    return ret_dict


def get_weather(city_id) -> dict:
    """
    Fetch the weather for the given city ID.
    """
    params = {
        'cityid': city_id
    }
    # Send the GET request
    response = requests.get(url=city_weather_url, headers=get_headers(),
                            params=params, timeout=10)
    # Parse the returned JSON string
    weather_data = json.loads(response.text)
    weather_dict = parse_data(weather_data)
    return weather_dict


def test():
    weather_dict = get_weather('101180301')
    print(weather_dict)


if __name__ == '__main__':
    test()
Give it a run:
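One caveat: the chained indexing in parse_data raises KeyError the moment the API changes shape. A small defensive helper (my own sketch, not part of the original program) makes the lookups fail soft instead:

```python
def deep_get(data, *keys, default=None):
    """Walk nested dicts/lists, returning `default` when any step is missing."""
    cur = data
    for key in keys:
        try:
            cur = cur[key]
        except (KeyError, IndexError, TypeError):
            return default
    return cur


# Sample response with only some of the expected fields present
weather_data = {"data": {"current": {"current": {"weather": "多云"}}}}

print(deep_get(weather_data, "data", "current", "current", "weather"))  # 多云
# The forecast15d branch is absent, so this falls back to the default:
print(deep_get(weather_data, "data", "forecast15d", 1, "temperature_am",
               default="N/A"))  # N/A
```

Each lookup in parse_data could be routed through a helper like this, at the cost of slightly more verbose code.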
III. Getting the cityid (static)
We can now fetch 15 days of weather from a cityid, but how do we get the cityid itself?
Plenty of people online have already compiled the IDs of every city in China. We only need to parse one of those lists, store it locally, and look IDs up from the local copy when needed.
Searching around, I found a blog post that is easy to scrape:
https://blog.youkuaiyun.com/li_and_li/article/details/79602686
Press F12 to inspect the page structure, and work out the XPath with the Xpath Helper extension.
The code:
import json

import requests
from lxml import etree

url = 'https://blog.youkuaiyun.com/li_and_li/article/details/79602686'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36'
}


def parse_html(html):
    """
    Parse the page and return the name and ID of every city on it.
    """
    et = etree.HTML(html)
    cities = et.xpath('//div[@id="content_views"]/p/text()')
    ret_dict = {}
    for city in cities:
        try:
            city_id, city_name = city.split(',')[:2]
            ret_dict[city_name] = city_id
        except ValueError:
            # Skip lines that are not "id,name" pairs
            print('err str: ' + city)
            continue
    return ret_dict


response = requests.get(url, headers=headers, timeout=10)
html = response.text
city_info = parse_html(html)
with open('city_id.json', 'w', encoding='utf8') as fp:
    fp.write(json.dumps(city_info, ensure_ascii=False))
The scraped result:
From now on we can look a cityid up in this file and feed it to the weather scraper written above!
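As a final sketch, here is how the two parts fit together: look a name up in the saved city_id.json and pass the resulting id on to get_weather(). The one-entry mapping written below stands in for the real scraped file, and the 郑州/101180101 pair is only an example:

```python
import json

# Stand-in for the scraped file; the real city_id.json holds every city
with open('city_id.json', 'w', encoding='utf8') as fp:
    json.dump({"郑州": "101180101"}, fp, ensure_ascii=False)


def load_city_ids(path='city_id.json'):
    """Read the name -> cityid mapping saved by the scraper."""
    with open(path, encoding='utf8') as fp:
        return json.load(fp)


city_ids = load_city_ids()
print(city_ids.get('郑州'))  # 101180101
# The id can then be fed straight into get_weather() from section II:
# get_weather(city_ids.get('郑州'))
```

Loading the whole mapping once and keeping it in memory is fine here, since the file only holds a few thousand name/id pairs.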