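With some free time today, I practiced again with the async libraries I recently learned. This time the target is the job listings on Tencent Careers. The data is loaded via AJAX, so the API endpoints have to be found by capturing network traffic. The overall approach: request the list pages to get the id (PostId) of each posting, then pass that id to the detail endpoint and extract the data there. It uses asyncio and aiohttp and runs noticeably faster than a synchronous version. It is still not perfect: one ClientSession should be enough for all requests, but in this program the session is constructed in two places (once in each fetch method), and I have not found a way around that. The code is below; anyone who has studied asyncio and aiohttp may find it worth a look.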
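"""
Asynchronously extract Tencent Careers data from its AJAX backend.
time: 2019-08-18 10:47:00
"""
import asyncio
import time

import aiohttp

a = time.time()  # start time, used to report the total run time at the end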


class Tecent(object):
    async def get_postid(self, url):
        """
        Fetch one list page and collect its PostIds for the detail requests.
        :param url: list-page API url
        :return: list of PostId strings
        """
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                resp = await resp.json()
        id_list = []
        # Data or Posts can be missing or empty, so fall back to empty containers
        posts = ((resp or {}).get('Data') or {}).get('Posts') or []
        for row in posts:
            post_id = row.get('PostId')
            item = f'{post_id}'
            id_list.append(item)
        return id_list

    async def get_json_data(self, url):
        """
        Extract the data of one posting.
        :param url: detail_url
        :return: one formatted line of text built from the JSON data
        """
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                resp = await resp.json()
        # The AJAX response is plain JSON, so the fields are easy to extract
        data = (resp or {}).get('Data') or {}
        # Default to '' so a missing field cannot break the concatenation below
        BGName = data.get('BGName') or ''
        CategoryName = data.get('CategoryName') or ''
        LastUpdateTime = data.get('LastUpdateTime') or ''
        LocationName = data.get('LocationName') or ''
        # PostURL = data.get('PostURL')
        RecruitPostName = data.get('RecruitPostName') or ''
        Responsibility = data.get('Responsibility') or ''
        Requirement = data.get('Requirement') or ''
        # Join the short fields into one summary column
        summary = BGName + "|" + CategoryName + "|" + LocationName + "|" + LastUpdateTime
        item = f'{RecruitPostName},{summary},{Responsibility},{Requirement} \n\n'
        return item
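
    def save_to_csv(self, data):
        """
        Append one formatted record to the output file.
        :param data: string returned by get_json_data
        """
        # data is already a formatted string, so it is written as-is
        # (the unused csv import has been dropped)
        with open('aiotecent111.csv', 'a', encoding='utf-8', newline='') as f:
            f.write(data)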

    async def main(self):
        # Build the list-page urls for pages 1-4 and fetch them concurrently
        list_url = [f'https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex={index}&pageSize=10' for index in range(1, 5)]
        tasks = [self.get_postid(url) for url in list_url]
        return await asyncio.gather(*tasks)
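

if __name__ == '__main__':
    tecent = Tecent()
    # Create one event loop and reuse it for both stages
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    results = loop.run_until_complete(tecent.main())  # stage 1: PostIds per list page
    detail_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?postId={0}&language=zh-cn'
    # Flatten the per-page id lists so every detail request runs concurrently
    # (the original loop awaited one detail url at a time and shadowed `results`)
    urls = [detail_url.format(post_id) for result in results for post_id in result]
    tasks = [tecent.get_json_data(url) for url in urls]
    details = loop.run_until_complete(asyncio.gather(*tasks))  # stage 2: detail records
    for detail in details:
        tecent.save_to_csv(detail)
    loop.close()
    b = time.time()
    print(b - a)  # total elapsed seconds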
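
On the session problem mentioned at the top: one possible way to get down to a single session is to create the ClientSession once and pass it into the fetch functions, so both the list stage and the detail stage reuse it. The following is only a minimal sketch under that assumption, not a tested replacement for the program above. The helper names `fetch_json`, `extract_post_ids`, and `crawl` are hypothetical; the two URLs are the ones already used in the script.

```python
import asyncio

LIST_URL = "https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex={0}&pageSize=10"
DETAIL_URL = "https://careers.tencent.com/tencentcareer/api/post/ByPostId?postId={0}&language=zh-cn"


async def fetch_json(session, url):
    """GET `url` with the shared session and decode the JSON body."""
    async with session.get(url) as resp:
        return await resp.json()


def extract_post_ids(payload):
    """Pull every PostId out of one list-page response (empty list if none)."""
    posts = ((payload or {}).get("Data") or {}).get("Posts") or []
    return [str(row["PostId"]) for row in posts if row.get("PostId")]


async def crawl(session, pages):
    """Stage 1: list pages -> ids. Stage 2: ids -> detail JSON. One session for both."""
    list_pages = await asyncio.gather(
        *(fetch_json(session, LIST_URL.format(page)) for page in pages)
    )
    post_ids = [pid for payload in list_pages for pid in extract_post_ids(payload)]
    return await asyncio.gather(
        *(fetch_json(session, DETAIL_URL.format(pid)) for pid in post_ids)
    )


async def main():
    # Imported here so the helpers above can be exercised without aiohttp installed.
    import aiohttp

    # The only place a session is constructed; it is shared by every request.
    async with aiohttp.ClientSession() as session:
        details = await crawl(session, range(1, 5))
    print(f"fetched {len(details)} detail records")
```

Running it would be `asyncio.run(main())`. Because the session is an argument rather than something each method builds for itself, the same functions can also be driven by a stand-in session object when testing without network access.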
That's the code. If you spot any problems, feel free to point them out and we can discuss them together.
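
One more note on the output file. The script writes comma-joined strings by hand, so any field that itself contains a comma or a line break (long text fields such as Responsibility may well do so) would spill into extra columns or rows when the file is opened as CSV. A hedged alternative, assuming the detail records are kept as dictionaries instead of pre-formatted strings, is to let the standard `csv` module do the quoting. The function name `write_rows` and the column order are illustrative; the field keys are the ones read in `get_json_data`.

```python
import csv
import io

# Column order is illustrative; the keys are the ones read in get_json_data.
FIELDS = ["RecruitPostName", "BGName", "CategoryName", "LocationName",
          "LastUpdateTime", "Responsibility", "Requirement"]


def write_rows(records, fileobj):
    """Write one CSV row per detail record, letting csv handle the quoting."""
    writer = csv.writer(fileobj)
    for record in records:
        data = record.get("Data") or {}
        writer.writerow([data.get(name, "") for name in FIELDS])


# Usage with an in-memory buffer and made-up sample data; a real run would pass
# open("aiotecent111.csv", "a", encoding="utf-8", newline="") instead.
sample = [{"Data": {"RecruitPostName": "Engineer",
                    "Responsibility": "Build, test\nand ship"}}]
buffer = io.StringIO()
write_rows(sample, buffer)

# Reading the text back shows the comma and the line break stayed in one cell.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
```

The trade-off is that `get_json_data` would have to return the parsed record rather than a ready-made string, which is a small change to the interface between the two methods.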