One-click download of the day's arXiv PDF files

yuliumt

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/yuliumt/article/details/78783418

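This post shows how to download, in one click, the astrophysics PDFs posted on arXiv that day. A Python script parses arXiv's HTML, reads the current date, and creates a folder for the papers at F:\ArXiv\year\month. The post also gives a regular-expression example and warns that the code may stop working if the structure of the arXiv page changes.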

(I study astronomy, so I use astrophysics as the example.)

arXiv URL: https://arxiv.org/list/astro-ph/new
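The script decides whether to download a paper by searching its title and abstract with a regular expression, defaulting to "GRB|FRB|GW". Here is a minimal sketch of that filter on plain strings, separate from any HTML parsing. The function name `matches_keywords` and the sample titles and abstracts are invented placeholders for illustration, not taken from arXiv or from the original script.

```python
import re


def matches_keywords(title, abstract, pattern="GRB|FRB|GW"):
    """Return True if the pattern occurs in the title or in the abstract."""
    return bool(re.search(pattern, title) or re.search(pattern, abstract))


# Invented placeholder entries: (title, abstract)
papers = [
    ("A repeating FRB in a dwarf galaxy", "We report ..."),
    ("Dust in nearby spirals", "We map dust lanes ..."),
    ("Kilonova light curves", "Follow-up of the GW170817 counterpart ..."),
]

# Keep only the titles whose title or abstract matches the pattern
selected = [title for title, abstract in papers if matches_keywords(title, abstract)]
print(selected)
```

Note that `re.search` is case-sensitive by default and matches substrings, so "GW" also matches inside "GW170817" but not "gw". Pass `flags=re.IGNORECASE` or add `\b` word boundaries to the pattern if you want different behavior.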

The complete code is as follows:

from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd  # was "as np", which shadowed the numpy alias
import urllib
import re
import os

html = urlopen("https://arxiv.org/list/astro-ph/new")
bsObj = BeautifulSoup(html, "lxml")

# Decide whether to download the PDF:
# return True if the pattern matches the title or the abstract.
def decide(title, abstract, regular):
    s1 = re.search(regular, str(title.get_text()))
    s2 = re.search(regular, str(abstract.get_text()))
    return s1 is not None or s2 is not None

# Use the default pattern if nothing is entered
regular = input("input regular expression:")
if regular == '':
    regular = "GRB|FRB|GW"

# Save the PDFs under the path `path`:
dateline = bsObj.find("h3")
year = '20' + dateline