How to Query arXiv Paper Metadata with the arXiv API in Python 3

Fetching math word problem papers with the arXiv API: XML parsing and category filtering

Original post · last modified 2024-04-02 19:51:21 · first published 2024-04-02 12:09:57

License: CC 4.0 BY-SA

Tags: #python #programming-language #ArXiv #API #XML #urllib

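This post shows how to call the arXiv API with Python's urllib library to fetch papers related to math word problems. It covers a basic query, parsing the XML response, and adding a category filter to get the latest papers, and shows how to extract paper metadata such as title, authors, and publication date from the XML.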

诸神缄默不语: index of personal 优快云 blog posts

arXiv API documentation:
arXiv API Access - arXiv info
arXiv API Basics - arXiv info
arXiv API User’s Manual - arXiv info

1. Imports

import urllib.request  # a bare "import urllib" does not load the request submodule
from urllib.parse import quote

from xml.dom.minidom import parseString

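2. Fetching the data

Note: on Windows this code fails with IncompleteRead: IncompleteRead(112176 bytes read)
I don't know why; the same code runs without problems on a Linux server, so I can only suspect the operating system. (http.client raises IncompleteRead when the connection ends before the full response body has arrived.)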
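If the error turns out to be transient, one possible workaround is to retry the read. The sketch below rests on that assumption only; it is not confirmed to fix the Windows case. `read_with_retry`, `open_response`, `attempts`, and `delay` are hypothetical names introduced for illustration and are not part of urllib.

```python
import http.client
import time


def read_with_retry(open_response, attempts=3, delay=3.0):
    """Read a whole response body, retrying when the read is cut short.

    open_response is any zero-argument callable that returns an object
    with .read(); the name and the retry policy are assumptions.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return open_response().read()
        except http.client.IncompleteRead as err:
            last_error = err
            if attempt < attempts - 1:
                time.sleep(delay)  # pause before asking the server again
    raise last_error


# Usage (needs network access, so it is left commented out):
# import urllib.request
# body = read_with_retry(lambda: urllib.request.urlopen(url))
```

Passing a callable rather than a URL keeps the retry logic independent of how the request is made, which also makes it easy to try out with a fake response.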

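1. A simple first query

Call the arXiv API with a keyword to get search results. Mind the quoting: the outer single quotes delimit the Python string, while the inner double quotes are part of the query itself (basic Python, so no further explanation here):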

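keyword = '"math word problem"'  # inner double quotes ask for an exact-phrase match
url = (
    "http://export.arxiv.org/api/query?search_query=all:"
    + keyword
    + "&start=0&max_results=1000&sortBy=lastUpdatedDate&sortOrder=descending"
)
# Percent-encode the spaces; every character listed in safe is left as-is
url = quote(url, safe='%/:=&?~#+!$,;@()*[]"')
# Close the HTTP response once the body has been read
with urllib.request.urlopen(url) as data:
    xml_text = data.read().decode("utf-8")

doc = parseString(xml_text)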

For background on URL escaping, see: Python 3 Quick Reference for Other Common APIs (continuously updated…)
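To see what the `quote` call actually does to the query URL, the following small sketch prints the escaped result without touching the network. The shortened `max_results=5` and the variable names `raw_url` and `escaped_url` are only for illustration.

```python
from urllib.parse import quote

keyword = '"math word problem"'
raw_url = (
    "http://export.arxiv.org/api/query?search_query=all:"
    + keyword
    + "&start=0&max_results=5"
)

# Characters listed in `safe` are left untouched; anything else outside
# letters, digits and "_.-~" is percent-encoded.
escaped_url = quote(raw_url, safe='%/:=&?~#+!$,;@()*[]"')
print(escaped_url)
# http://export.arxiv.org/api/query?search_query=all:"math%20word%20problem"&start=0&max_results=5

# With the default safe="/", the double quotes are escaped as well.
print(quote(keyword))
# %22math%20word%20problem%22
```

Only the spaces change in the first call, because the double quote is in the `safe` list.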

The returned doc is an xml.dom.minidom Document object. To inspect it, write it to a text file:
with open("arxiv.xml", "w", encoding="utf-8") as f:
    doc.writexml(f, addindent=" ", newl="\n")

2. Sample XML response

I find the element names fairly self-explanatory:

<?xml version="1.0" ?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="http://arxiv.org/api/query?search_query%3Dall%3A%22math%20word%20problem%22%26id_list%3D%26start%3D0%26max_results%3D1000" rel="self" type="application/atom+xml"/>
  <title type="html">ArXiv Query: search_query=all:&quot;math word problem&quot;&amp;id_list=&amp;start=0&amp;max_results=1000</title>
  <id>http://arxiv.org/api/omit</id>
  <updated>2024-04-01T00:00:00-04:00</updated>
  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">121</opensearch:totalResults>
  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
  <opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">1000</opensearch:itemsPerPage>
  <entry>
    <id>http://arxiv.org/abs/2402.17916v2</id>
    <updated>2024-03-30T04:16:20Z</updated>
    <published>2024-02-27T22:07:52Z</published>
    <title>LLM-Resistant Math Word Problem Generation via Adversarial Attacks</title>
    <summary>  Large language models (LLMs) have significantly transformed the educational
landscape. As current plagiarism detection tools struggle to keep pace with
LLMs' rapid advancements, the educational community faces the challenge of
assessing students' true problem-solving abilities in the presence of LLMs. In
this work, we explore a new paradigm for ensuring fair evaluation -- generating
adversarial examples which preserve the structure and difficulty of the
original questions aimed for assessment, but are unsolvable by LLMs. Focusing
on the domain of math word problems, we leverage abstract syntax trees to
structurally generate adversarial examples that cause LLMs to produce incorrect
answers by simply editing the numeric values in the problems. We conduct
experiments on various open- and closed-source LLMs, quantitatively and
qualitatively demonstrating that our method significantly degrades their math
problem-solving ability. We identify shared vulnerabilities among LLMs and
propose a cost-effective approach to attack high-cost models. Additionally, we
conduct automatic analysis on math problems and investigate the cause of
failure, offering a nuanced view into model's limitation.
</summary>
    <author>
      <name>Roy Xie</name>
    </author>
    <author>
      <name>Chengxuan Huang</name>
    </author>
    <author>
      <name>Junlin Wang</name>
    </author>
    <author>
      <name>Bhuwan Dhingra</name>
    </author>
    <arxiv:comment xmlns:arxiv="http://arxiv.org/schemas/atom">Code/data: https://github.com/ruoyuxie/adversarial_mwps_generation</arxiv:comment>
    <link href="http://arxiv.org/abs/2402.17916v2" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://arxiv.org/pdf/2402.17916v2" rel="related" type="application/pdf"/>
    <arxiv:primary_category xmlns:arxiv="http://arxiv.org/schemas/atom" term="cs.CL" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
  ...
</feed>
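The feed-level fields are useful for checking how many results exist in total versus how many came back. A minimal sketch, using a trimmed-down stand-in for the response so that it runs offline; `sample`, `sample_doc`, and `feed_int` are illustrative names, not part of any API.

```python
from xml.dom.minidom import parseString

# A trimmed-down stand-in for a real response, so this runs offline.
sample = """<?xml version="1.0" ?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">121</opensearch:totalResults>
  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
  <entry><id>http://arxiv.org/abs/2402.17916v2</id></entry>
</feed>"""

sample_doc = parseString(sample)


def feed_int(document, tag):
    # minidom matches on the full tag name, prefix included
    return int(document.getElementsByTagName(tag)[0].firstChild.data)


total = feed_int(sample_doc, "opensearch:totalResults")
start = feed_int(sample_doc, "opensearch:startIndex")
returned = len(sample_doc.getElementsByTagName("entry"))
print(total, start, returned)  # 121 0 1
```

When `total` is larger than `start + returned`, more results remain and can be requested with a higher `start` value.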

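3. Adding a category filter to get the latest papers

The arXiv API only provides the latest papers when a category is specified ("The current arXiv feeds only give you updates on new papers within the category you specify.")
(That said, in my own use I haven't noticed much difference in how quickly results update. I haven't used it heavily, though, so I'm not sure.)

For arXiv category IDs, see: Category Taxonomy

Code: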

keyword = '"math word problem"'
taxonomy = "cs.AI"
url = (
    "http://export.arxiv.org/api/query?search_query=all:"
    + keyword
    + "+AND+cat:"  # the leading "+" (a URL-encoded space) separates the phrase from AND
    + taxonomy
    + "&start=0&max_results=1000&sortBy=lastUpdatedDate&sortOrder=descending"
)
url = quote(url, safe='%/:=&?~#+!$,;@()*[]"')
with urllib.request.urlopen(url) as data:
    xml_text = data.read().decode("utf-8")

doc = parseString(xml_text)
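The same idea extends to several categories at once. The arXiv API User's Manual describes the Boolean operators AND, OR, and ANDNOT, with parentheses for grouping. The sketch below builds such a query under those rules; `build_search_query` is a hypothetical helper, and encoding spaces as %20 instead of + is an assumption that should behave the same for a query string.

```python
from urllib.parse import quote


def build_search_query(keyword, categories):
    """Combine an exact-phrase keyword with one or more category filters."""
    cat_terms = " OR ".join("cat:" + c for c in categories)
    if len(categories) > 1:
        cat_terms = "(" + cat_terms + ")"  # group the OR terms together
    return 'all:"' + keyword + '" AND ' + cat_terms


query = build_search_query("math word problem", ["cs.AI", "cs.CL"])
print(query)
# all:"math word problem" AND (cat:cs.AI OR cat:cs.CL)

# Encode everything except ":" so quotes, spaces and parentheses are escaped
url = (
    "http://export.arxiv.org/api/query?search_query="
    + quote(query, safe=":")
    + "&start=0&max_results=100&sortBy=lastUpdatedDate&sortOrder=descending"
)
print(url)
```

Building the readable query first and encoding it in one step keeps the operators easy to check by eye.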

4. arXiv query parameters in detail

Omitted for now; to be added.
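Until that section is written, here is a rough sketch of the parameters this post already uses: search_query, start, max_results, sortBy, and sortOrder. According to the arXiv API User's Manual, sortBy accepts relevance, lastUpdatedDate, or submittedDate, and sortOrder accepts ascending or descending. `build_url`, `BASE_URL`, and the page size of 100 are illustrative choices, and letting `urlencode` escape the colon and quotes is an assumption about what the server decodes.

```python
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/api/query"


def build_url(search_query, start=0, max_results=100,
              sort_by="lastUpdatedDate", sort_order="descending"):
    params = {
        "search_query": search_query,
        "start": start,              # zero-based index of the first result
        "max_results": max_results,  # number of results in this page
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    return BASE_URL + "?" + urlencode(params)


# Three consecutive pages of 100 results each
page_urls = [build_url('all:"math word problem"', start=s) for s in range(0, 300, 100)]
print(page_urls[1])
# http://export.arxiv.org/api/query?search_query=all%3A%22math+word+problem%22&start=100&max_results=100&sortBy=lastUpdatedDate&sortOrder=descending
```

Requesting smaller pages with an increasing `start` is an alternative to a single `max_results=1000` call.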

3. Parsing the XML data

Each element of total_list is a dict holding the metadata of one paper:

collection = doc.documentElement


def first_text(node, tag):
    """Return the text of the first <tag> under node, or "" if it is missing or empty."""
    found = node.getElementsByTagName(tag)
    if found and found[0].firstChild is not None:
        return found[0].firstChild.data
    return ""


total_list = []

for entry in collection.getElementsByTagName("entry"):
    paper = {
        "paper_url": first_text(entry, "id"),
        "updated_date": first_text(entry, "updated"),
        "publication_date": first_text(entry, "published"),
        # Titles and abstracts are hard-wrapped in the feed; flatten the line breaks
        "title": first_text(entry, "title").replace("\n", " "),
        "summary": first_text(entry, "summary").replace("\n", " ").strip(),
    }

    # One <author> element per author, each holding a <name>
    paper["authors"] = "; ".join(
        first_text(author, "name") for author in entry.getElementsByTagName("author")
    )

    # The comment is optional, so it falls back to an empty string
    paper["comment"] = first_text(entry, "arxiv:comment")

    # rel="alternate" is the abstract page; the PDF link has rel="related"
    for link in entry.getElementsByTagName("link"):
        rel = link.getAttribute("rel")
        href = link.getAttribute("href")
        if rel == "alternate":
            paper["alternate_link"] = href
        elif rel == "related" and link.getAttribute("type") == "application/pdf":
            paper["pdf_link"] = href

    primary_categories = entry.getElementsByTagName("arxiv:primary_category")
    paper["primary_category"] = (
        primary_categories[0].getAttribute("term") if primary_categories else ""
    )

    # All categories the paper is listed under
    paper["categories"] = "; ".join(
        category.getAttribute("term")
        for category in entry.getElementsByTagName("category")
    )

    total_list.append(paper)
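Since every record in total_list is a plain dict, standard-library tools apply directly. The sketch below uses a hand-written record in the same shape (values copied from the sample entry above) to parse the timestamps, sort by update time, and write CSV. The in-memory buffer, the column selection, and the names `records` and `parse_arxiv_time` are assumptions for illustration.

```python
import csv
import io
from datetime import datetime

records = [
    {
        "paper_url": "http://arxiv.org/abs/2402.17916v2",
        "updated_date": "2024-03-30T04:16:20Z",
        "publication_date": "2024-02-27T22:07:52Z",
        "title": "LLM-Resistant Math Word Problem Generation via Adversarial Attacks",
        "authors": "Roy Xie; Chengxuan Huang; Junlin Wang; Bhuwan Dhingra",
        "primary_category": "cs.CL",
        "categories": "cs.CL; cs.AI",
    }
]


def parse_arxiv_time(value):
    # Entry timestamps look like 2024-03-30T04:16:20Z
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


# Newest update first
records.sort(key=lambda r: parse_arxiv_time(r["updated_date"]), reverse=True)

# Swap the buffer for open("papers.csv", "w", newline="", encoding="utf-8") to write a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())
```

With real data, `records` would simply be `total_list`, provided every dict has the same keys.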