import asyncio
import re
from urllib.parse import urljoin

import aiohttp
from bs4 import BeautifulSoup
# Asynchronously fetch the HTML content of a single page
async def fetch_html(url, session):
    try:
        async with session.get(url) as response:
            if response.status == 200:
                return await response.text()
            else:
                print(f"Failed to fetch {url}, status code: {response.status}")
                return None
    except Exception as e:
        print(f"Request failed: {e}")
        return None
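
fetch_html takes the session as an argument so that a single aiohttp.ClientSession, with its shared connection pool, can serve many requests. As a minimal sketch (the fetch_many helper and its URL list are illustrative, not part of the original code), several pages can be fetched concurrently with asyncio.gather:

# Illustrative helper (not from the original article): fetch several pages
# concurrently over one shared session.
async def fetch_many(urls):
    async with aiohttp.ClientSession() as session:
        # gather() schedules all the fetch coroutines on the event loop at once
        pages = await asyncio.gather(*(fetch_html(url, session) for url in urls))
        return dict(zip(urls, pages))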
# Extract all links from the HTML
def extract_links(html, base_url):
    soup = BeautifulSoup(html, 'html.parser')
    links = set()
    # Find every <a> tag and pull out its href attribute
    for link in soup.find_all('a', href=True):
        href = link['href']
        # The original snippet breaks off here; a plausible completion is to
        # keep absolute links as-is and resolve relative ones against base_url
        if href.startswith('http'):
            links.add(href)
        else:
            links.add(urljoin(base_url, href))
    return links
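
Tying the two functions together, a short driver might look like the sketch below. The start URL is a placeholder, not one from the original article, and asyncio.run() (Python 3.7+) starts the event loop:

# A hedged end-to-end sketch; 'https://example.com' is a placeholder start URL
async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch_html('https://example.com', session)
        if html:
            for link in sorted(extract_links(html, 'https://example.com')):
                print(link)

if __name__ == '__main__':
    asyncio.run(main())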