pool.map takes one function and one list, where each element of the list is a dictionary

Python concurrent crawler: a dispatch function calls a worker function to process page data and save it to Excel
This example shows a Python program that fetches page HTML with requests, parses it with lxml's etree, and saves the extracted data to an Excel workbook through the custom job function. The job1 function receives a dictionary and calls job for each name/URL pair in it. multiprocessing.dummy.Pool (a thread pool that exposes the multiprocessing interface) runs the jobs concurrently to speed up crawling. Finally, the program measures and prints the total elapsed time.

 1. That is, the function passed to pool.map returns the result of calling another function, and the actual data processing happens inside that other function.
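As a quick illustration of this pattern, here is a minimal, self-contained sketch (the names demo_job and demo_dispatch are placeholders and not part of the original code): each list element is a one-entry dict, and the dispatch function unpacks it and returns the worker's result.

from multiprocessing.dummy import Pool

def demo_job(name, url):
    # the real processing would happen here
    return name + ' -> ' + url

def demo_dispatch(item):
    # each dict contains exactly one name: url pair
    for name, url in item.items():
        return demo_job(name, url)

if __name__ == '__main__':
    data = [{'a': 'https://example.com/a'}, {'b': 'https://example.com/b'}]
    with Pool(2) as pool:
        print(pool.map(demo_dispatch, data))
        # ['a -> https://example.com/a', 'b -> https://example.com/b']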

import time
import requests
import pandas as pd
from lxml import etree
from multiprocessing.dummy import Pool

# job: fetch one page, extract each listing's info and save it to an Excel workbook
def job(name, url):
    writer = pd.ExcelWriter(name + '.xlsx')  # one workbook per name
    html = requests.get(url)                 # request the page
    jjhtml = etree.HTML(html.text)
    ld_html = jjhtml.xpath('//ul[@class="listContent"]/li')
    for index, ld in enumerate(ld_html):
        ld_age = ld.xpath('./div/div[3]/text()')
        ld_url = ld.xpath('./div/div[@class="title"]/a/@href')
        ldlist = []
        content = {'哦哦哦': ld_age[0], '哦哦哦url': ld_url[0]}
        ldlist.append(content)
        pf = pd.DataFrame(ldlist)
        order = ['哦哦哦', '哦哦哦url']
        pf = pf[order]                       # fix the column order
        pf.fillna(' ', inplace=True)         # fill missing values with a space
        pf.to_excel(writer, header=False, index=False, startrow=index + 3)
    writer.close()                           # write the workbook to disk

def job1(item):
    # each dict in url_list holds exactly one name -> url pair,
    # so returning inside the loop processes that single entry
    for name, url in item.items():
        return job(name, url)

url_list = [{'锦江': 'https://aa.aa.com/aaaaa/aa'}, {'青羊': 'https://bb.bb.com/bbbbb/bbb/'}, {'成华': 'https://cc.cc.com/ccccc/ccc/'}]

time1 = time.time()
pool = Pool(4)                      # thread pool with 4 workers
data_list = url_list
res = pool.map(job1, data_list)     # job returns None, so res is a list of None
time2 = time.time()
print(res)
pool.close()
pool.join()
print('Total time: ' + str(time2 - time1) + 's')
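One possible simplification (my own suggestion, not from the original post): since each dict carries only a single name/URL pair, storing the data as (name, url) tuples lets pool.starmap call job directly and removes the need for the job1 wrapper.

pairs = [('锦江', 'https://aa.aa.com/aaaaa/aa'),
         ('青羊', 'https://bb.bb.com/bbbbb/bbb/'),
         ('成华', 'https://cc.cc.com/ccccc/ccc/')]
with Pool(4) as pool:
    pool.starmap(job, pairs)   # calls job(name, url) for every tuple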
