Hugging Face Web Page Parsing and Download Crawler

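This article shows how to use the Python libraries requests and BeautifulSoup to parse the InternLM model page on the Hugging Face platform, extract its links, and download one of the model files.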


Parsing the web page:

import requests
from bs4 import BeautifulSoup

# URL of the target page
url = 'https://huggingface.co/internlm/internlm-20b/tree/main'

# Send a GET request
response = requests.get(url)

# Check whether the request succeeded
if response.status_code == 200:
    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find every link on the page
    for link in soup.find_all('a'):
        href = link.get('href')
        if href:  # skip anchors without an href
            print(href)
else:
    print("Page request failed, status code:", response.status_code)

/
/models
/datasets
/spaces
/docs
/pricing
/login
/join
/internlm
/internlm/internlm-20b
/models?pipeline_tag=text-generation
/models?library=transformers
/models?library=pytorch
/models?other=internlm
/models?other=feature-extraction
/models?other=custom_code
/models?license=license%3Aapache-2.0
/internlm/internlm-20b
/internlm/internlm-20b/tree/main
/internlm/internlm-20b/discussions
/internlm/internlm-20b/tree/main
/internlm/internlm-20b/commits/main
/internlm/internlm-20b/commits/main
/x54-729
/internlm/internlm-20b/commit/2d83118d863d24565da1f9c6c0fe99d3e882f25c
/internlm/internlm-20b/blob/main/.gitattributes
/internlm/internlm-20b/resolve/main/.gitattributes?download=true
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
/internlm/internlm-20b/blob/main/README.md
/internlm/internlm-20b/resolve/main/README.md?download=true
/internlm/internlm-20b/commit/509b748b2160d0571d067d85f8a21df018cdee29
/internlm/internlm-20b/commit/509b748b2160d0571d067d85f8a21df018cdee29
/internlm/internlm-20b/blob/main/config.json
/internlm/internlm-20b/resolve/main/config.json?download=true
/internlm/internlm-20b/commit/2d83118d863d24565da1f9c6c0fe99d3e882f25c
/internlm/internlm-20b/commit/2d83118d863d24565da1f9c6c0fe99d3e882f25c
/internlm/internlm-20b/blob/main/configuration_internlm.py
/internlm/internlm-20b/resolve/main/configuration_internlm.py?download=true
/internlm/internlm-20b/commit/53d4840ed4326a633e59501ba4ac3342757fed34
/internlm/internlm-20b/commit/53d4840ed4326a633e59501ba4ac3342757fed34
/internlm/internlm-20b/blob/main/generation_config.json
/internlm/internlm-20b/resolve/main/generation_config.json?download=true
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
/internlm/internlm-20b/blob/main/modeling_internlm.py
/internlm/internlm-20b/resolve/main/modeling_internlm.py?download=true
/internlm/internlm-20b/commit/c8f2f9979075c3ccd0399d042823ac719d545840
/internlm/internlm-20b/commit/c8f2f9979075c3ccd0399d042823ac719d545840
/internlm/internlm-20b/blob/main/pytorch_model-00001-of-00005.bin
/docs/hub/security-pickle
/internlm/internlm-20b/resolve/main/pytorch_model-00001-of-00005.bin?download=true
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
/internlm/internlm-20b/blob/main/pytorch_model-00002-of-00005.bin
/docs/hub/security-pickle
/internlm/internlm-20b/resolve/main/pytorch_model-00002-of-00005.bin?download=true
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
/internlm/internlm-20b/blob/main/pytorch_model-00003-of-00005.bin
/docs/hub/security-pickle
/internlm/internlm-20b/resolve/main/pytorch_model-00003-of-00005.bin?download=true
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
/internlm/internlm-20b/blob/main/pytorch_model-00004-of-00005.bin
/docs/hub/security-pickle
/internlm/internlm-20b/resolve/main/pytorch_model-00004-of-00005.bin?download=true
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
/internlm/internlm-20b/blob/main/pytorch_model-00005-of-00005.bin
/docs/hub/security-pickle
/internlm/internlm-20b/resolve/main/pytorch_model-00005-of-00005.bin?download=true
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
/internlm/internlm-20b/blob/main/pytorch_model.bin.index.json
/internlm/internlm-20b/resolve/main/pytorch_model.bin.index.json?download=true
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
/internlm/internlm-20b/blob/main/special_tokens_map.json
/internlm/internlm-20b/resolve/main/special_tokens_map.json?download=true
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
/internlm/internlm-20b/blob/main/tokenization_internlm.py
/internlm/internlm-20b/resolve/main/tokenization_internlm.py?download=true
/internlm/internlm-20b/commit/632df84a18d93aa5b40238a1472a8ffb38e2611c
/internlm/internlm-20b/commit/632df84a18d93aa5b40238a1472a8ffb38e2611c
/internlm/internlm-20b/blob/main/tokenizer.model
/internlm/internlm-20b/resolve/main/tokenizer.model?download=true
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
/internlm/internlm-20b/blob/main/tokenizer_config.json
/internlm/internlm-20b/resolve/main/tokenizer_config.json?download=true
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
/internlm/internlm-20b/commit/b8825fe3394608fe84f0f5eb6471454384fb83aa
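
In the output above, the links that actually download files are the ones whose path contains `/resolve/`. They are relative, so they need the site prefix before they can be requested. The following is a minimal sketch of that filtering step. The function name `extract_download_urls` and the `sample_hrefs` list are illustrative and not part of the original script; in practice the hrefs would come from the `soup.find_all('a')` loop.

```python
from urllib.parse import urljoin, urlsplit

BASE_URL = "https://huggingface.co"


def extract_download_urls(hrefs, base_url=BASE_URL):
    """Keep only '/resolve/' links and turn them into absolute URLs."""
    seen = set()
    urls = []
    for href in hrefs:
        # Look at the path only, so the '?download=true' query is ignored
        if "/resolve/" not in urlsplit(href).path:
            continue
        absolute = urljoin(base_url, href)
        if absolute not in seen:  # drop repeated links, keep first-seen order
            seen.add(absolute)
            urls.append(absolute)
    return urls


# A few hrefs copied from the printed output above (hypothetical sample)
sample_hrefs = [
    "/internlm/internlm-20b/blob/main/config.json",
    "/internlm/internlm-20b/resolve/main/config.json?download=true",
    "/docs/hub/security-pickle",
    "/internlm/internlm-20b/resolve/main/tokenizer.model?download=true",
    "/internlm/internlm-20b/resolve/main/config.json?download=true",
]

download_urls = extract_download_urls(sample_hrefs)
for download_url in download_urls:
    print(download_url)
```

Each resulting URL can be passed to a download routine one at a time.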

Download code:

import requests
from tqdm.auto import tqdm

file_url = 'https://huggingface.co/internlm/internlm-20b/resolve/main/pytorch_model-00001-of-00005.bin?download=true'

# Stream the download so the whole file is never held in memory at once
response = requests.get(file_url, stream=True)

# Check whether the request succeeded
if response.status_code == 200:
    file_path = 'pytorch_model-00001-of-00005.bin'
    # Read the file size from the final response. requests.head() does not
    # follow redirects by default, so a HEAD request on a redirecting URL
    # may report the wrong size.
    total_size = int(response.headers.get('content-length', 0))
    # Wrap the file's write() so tqdm can show a progress bar
    with tqdm.wrapattr(open(file_path, "wb"), "write", miniters=1,
                       total=total_size, desc=file_path) as fout:
        for chunk in response.iter_content(chunk_size=4096):
            fout.write(chunk)
    print("File download complete")
else:
    print("Download failed, status code:", response.status_code)
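
Model shards are large, so an interrupted download is expensive to restart from zero. The sketch below shows one possible way to derive the local file name from the URL and to resume a partial file with an HTTP `Range` header. It assumes the server honors range requests, which is not verified here, and falls back to a full download when it does not. The helper names `filename_from_url`, `resume_headers`, and `download_with_resume` are hypothetical and not part of the original script.

```python
import os
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path component of a URL, ignoring the query string."""
    return unquote(posixpath.basename(urlsplit(url).path))


def resume_headers(file_path):
    """Build a Range header asking only for the bytes not yet on disk."""
    if os.path.exists(file_path):
        existing = os.path.getsize(file_path)
        if existing > 0:
            return {"Range": f"bytes={existing}-"}
    return {}


def download_with_resume(url, chunk_size=1024 * 1024):
    """Download url into the current directory, resuming a partial file if possible."""
    import requests  # imported here so the helpers above work without it

    file_path = filename_from_url(url)
    headers = resume_headers(file_path)
    with requests.get(url, headers=headers, stream=True, timeout=30) as response:
        if response.status_code == 416:
            # Range Not Satisfiable: the local file is probably already complete
            return file_path
        response.raise_for_status()
        # 206 means the server honored the Range header, so append;
        # any other success status means a full body, so start over
        mode = "ab" if response.status_code == 206 else "wb"
        with open(file_path, mode) as fout:
            for chunk in response.iter_content(chunk_size=chunk_size):
                fout.write(chunk)
    return file_path
```

Calling `download_with_resume(file_url)` a second time after an interruption would then request only the missing bytes, provided the server supports it.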
