Fetching HTML pages asynchronously: how can requests-html render asynchronous pages in a multithreaded environment?

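To build a scraper for pages whose content is loaded dynamically, requests-html provides a way to obtain the rendered page after its JavaScript has run. However, when I try to use AsyncHTMLSession by calling the arender() method inside a multithreaded implementation, the resulting HTML does not change.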
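A side note on mixing a thread pool with coroutines, since that combination is easy to get subtly wrong. The sketch below uses only the standard library; `work`, `via_executor` and `via_gather` are hypothetical names chosen for illustration and are not part of requests-html. It suggests that handing a coroutine function to `run_in_executor` only makes the worker thread build a coroutine object, while the actual work still runs later on the event loop.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def work(n):
    # Stand-in for an awaitable job such as "fetch and render one page".
    await asyncio.sleep(0)
    return n * 2


async def via_executor():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=2) as pool:
        # The worker thread merely calls work(3). Calling a coroutine
        # function does not run its body; it returns a coroutine object.
        result = await loop.run_in_executor(pool, work, 3)
        is_coro = asyncio.iscoroutine(result)
        # The body only executes here, on the event loop thread.
        value = await result
    return is_coro, value


async def via_gather():
    # Coroutines can be scheduled concurrently without any thread pool.
    return await asyncio.gather(work(1), work(2), work(3))
```

If this reading is right, the thread pool adds nothing for coroutine-based work, and `asyncio.gather` on the coroutines themselves is the simpler way to run them concurrently.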

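For example, at the URL given in the source code below, the table's HTML cells are empty by default. After the page's scripts run, which arender() simulates, I expect the values to be inserted into the markup, yet the printed source shows no visible change.

from pprint import pprint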
# from bs4 import BeautifulSoup
import asyncio
from timeit import default_timer
from concurrent.futures import ThreadPoolExecutor
from requests_html import AsyncHTMLSession, HTML

async def fetch(session, url):
    r = await session.get(url)
    await r.html.arender()
    return r.content

def parseWebpage(page):
    print(page)

async def get_data_asynchronous():
    urls = [
        'http://www.fpb.pt/fpb2014/!site.go?s=1&show=jog&id=258215'
    ]
    with ThreadPoolExecutor(max_workers=20) as executor:
        with AsyncHTMLSession() as session:
            # Set any session parameters here before calling `fetch`
            # Initialize the event loop
            loop = asyncio.get_event_loop()
            # Use list comprehension to create a list of
            # tasks to complete. The executor will run the `fetch`
            # function for each url in the `urls` list
            tasks = [
                await loop.run_in_executor(
                    executor,
                    fetch,
                    *(session, url)  # Allows us to pass in multiple arguments to `fetch`
                )
                for url in urls
            ]
            # Initializes the tasks to run and awaits their results
            for response in await asyncio.gather(*tasks):
                parseWebpage(response)

def main():
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(get_data_asynchronous())
    loop.run_until_complete(future)

main()
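One likely cause worth checking: `fetch` returns `r.content`, which is the response body as originally downloaded and is not touched by rendering. As far as I understand requests-html, the rendered markup is kept on the HTML object instead (`r.html.html` as text, `r.html.raw_html` as bytes). The sketch below is a hedged variant under that assumption; `fetch_rendered` and `fetch_all` are hypothetical names, and the functions only rely on the session offering an awaitable `get()`, so no thread pool is involved.

```python
import asyncio


async def fetch_rendered(session, url):
    """Fetch `url`, let its JavaScript run, and return the rendered HTML text."""
    r = await session.get(url)
    await r.html.arender()
    # Assumption: after arender() the HTML object holds the rendered
    # document, whereas r.content still holds the original bytes.
    return r.html.html


async def fetch_all(session, urls):
    """Render several pages concurrently on one event loop."""
    return await asyncio.gather(*(fetch_rendered(session, u) for u in urls))
```

With requests-html and its Chromium download available, this would presumably be driven by creating `session = AsyncHTMLSession()` and calling `session.loop.run_until_complete(fetch_all(session, urls))`, or through the session's own `run()` helper. I have not verified this against the page in question.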
