How to query arXiv paper metadata with the arXiv API in Python 3

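This post shows how to use Python's urllib library to call the arXiv API and retrieve math-related papers. It covers a quick start, the XML response format, and adding a category to get the latest papers, and it shows how to process the XML to extract paper metadata such as title, authors, and publication date.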

诸神缄默不语: index of my 优快云 blog posts

arXiv API documentation:
arXiv API Access - arXiv info
arXiv API Basics - arXiv info
arXiv API User’s Manual - arXiv info

1. Imports

import urllib.request
from urllib.parse import quote

from xml.dom.minidom import parseString

2. Fetching the data

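Note: running this code on Windows raised IncompleteRead: IncompleteRead(112176 bytes read)
I don't know why. The same code ran without problems on a Linux server, so my only guess is that it is related to the operating system. (http.client raises IncompleteRead when the connection closes before the whole response body has arrived, so the network path may also play a part.)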

1. Quick start

Call the arXiv API with a keyword and get back the search results (mind the quotes yourselves; basic Python knowledge is not covered here):

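# The inner double quotes are part of the query: they request a phrase match.
keyword = '"math word problem"'
url = (
    "http://export.arxiv.org/api/query?search_query=all:"
    + keyword
    + "&start=0&max_results=1000&sortBy=lastUpdatedDate&sortOrder=descending"
)
# Percent-encode the spaces while leaving the URL's own delimiters untouched.
url = quote(url, safe='%/:=&?~#+!$,;@()*[]"')
data = urllib.request.urlopen(url)

# Parse the Atom XML response into a DOM document.
doc = parseString(data.read().decode("utf-8"))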

For URL escaping, see: Python 3 quick reference for other commonly used APIs (continuously updated…)
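The IncompleteRead error noted at the top of this section can sometimes be worked around by simply retrying the request. Below is a minimal sketch of that idea, not a confirmed fix for the Windows problem. The helper name fetch_with_retry, its parameters, and the 3-second pause are all assumptions made for illustration; the opener argument exists only so the function can be exercised without a network connection.

```python
import time
import urllib.request
from http.client import IncompleteRead


def fetch_with_retry(url, retries=3, delay=3.0, opener=urllib.request.urlopen):
    """Return the response body as bytes, retrying when the read is cut short."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except IncompleteRead:
            if attempt == retries:
                raise  # out of attempts: surface the original error
            time.sleep(delay)  # assumed pause before trying again
```

With this helper, the fetch step would become doc = parseString(fetch_with_retry(url).decode("utf-8")).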

The return value doc is an xml.dom.minidom Document object. It can be written to a text file for inspection like this:
with open("arxiv.xml", "w", encoding="utf-8") as f:
    doc.writexml(f, addindent=" ", newl="\n")
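A parsed document already contains whitespace-only text nodes between its elements, so writing it out with addindent and newl tends to produce runs of blank lines. The sketch below shows one way to drop those nodes before pretty-printing. The helper name strip_whitespace_nodes and the tiny sample document are made up for illustration.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def strip_whitespace_nodes(node):
    """Recursively remove text nodes that contain only whitespace."""
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_whitespace_nodes(child)


# A small stand-in for the real API response.
sample = "<feed>\n  <entry>\n    <title>A</title>\n  </entry>\n</feed>"
sample_doc = parseString(sample)
strip_whitespace_nodes(sample_doc)
pretty = sample_doc.toprettyxml(indent="  ")
print(pretty)
```

The same function can be applied to doc before calling writexml.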

2. Example XML response

I find the element names fairly self-explanatory:

<?xml version="1.0" ?>
<feed xmlns="http://www.w3.org/2005/Atom">
  
  
  <link href="http://arxiv.org/api/query?search_query%3Dall%3A%22math%20word%20problem%22%26id_list%3D%26start%3D0%26max_results%3D1000" rel="self" type="application/atom+xml"/>
  
  
  <title type="html">ArXiv Query: search_query=all:&quot;math word problem&quot;&amp;id_list=&amp;start=0&amp;max_results=1000</title>
  
  
  <id>http://arxiv.org/api/omit</id>
  
  
  <updated>2024-04-01T00:00:00-04:00</updated>
  
  
  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">121</opensearch:totalResults>
  
  
  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
  
  
  <opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">1000</opensearch:itemsPerPage>
  
  
  <entry>
    
    
    <id>http://arxiv.org/abs/2402.17916v2</id>
    
    
    <updated>2024-03-30T04:16:20Z</updated>
    
    
    <published>2024-02-27T22:07:52Z</published>
    
    
    <title>LLM-Resistant Math Word Problem Generation via Adversarial Attacks</title>
    
    
    <summary>  Large language models (LLMs) have significantly transformed the educational
landscape. As current plagiarism detection tools struggle to keep pace with
LLMs' rapid advancements, the educational community faces the challenge of
assessing students' true problem-solving abilities in the presence of LLMs. In
this work, we explore a new paradigm for ensuring fair evaluation -- generating
adversarial examples which preserve the structure and difficulty of the
original questions aimed for assessment, but are unsolvable by LLMs. Focusing
on the domain of math word problems, we leverage abstract syntax trees to
structurally generate adversarial examples that cause LLMs to produce incorrect
answers by simply editing the numeric values in the problems. We conduct
experiments on various open- and closed-source LLMs, quantitatively and
qualitatively demonstrating that our method significantly degrades their math
problem-solving ability. We identify shared vulnerabilities among LLMs and
propose a cost-effective approach to attack high-cost models. Additionally, we
conduct automatic analysis on math problems and investigate the cause of
failure, offering a nuanced view into model's limitation.
</summary>
    
    
    <author>
      
      
      <name>Roy Xie</name>
      
    
    </author>
    
    
    <author>
      
      
      <name>Chengxuan Huang</name>
      
    
    </author>
    
    
    <author>
      
      
      <name>Junlin Wang</name>
      
    
    </author>
    
    
    <author>
      
      
      <name>Bhuwan Dhingra</name>
      
    
    </author>
    
    
    <arxiv:comment xmlns:arxiv="http://arxiv.org/schemas/atom">Code/data: https://github.com/ruoyuxie/adversarial_mwps_generation</arxiv:comment>
    
    
    <link href="http://arxiv.org/abs/2402.17916v2" rel="alternate" type="text/html"/>
    
    
    <link title="pdf" href="http://arxiv.org/pdf/2402.17916v2" rel="related" type="application/pdf"/>
    
    
    <arxiv:primary_category xmlns:arxiv="http://arxiv.org/schemas/atom" term="cs.CL" scheme="http://arxiv.org/schemas/atom"/>
    
    
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom"/>
    
    
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom"/>
    
  
  </entry>
  ...
</feed>
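The feed-level opensearch elements in the response say how many results matched and which slice was returned, which helps when deciding whether another request is needed. Here is a minimal sketch of reading them with minidom's namespace-aware lookup. The helper name read_paging_info and the shortened sample feed are assumptions for illustration.

```python
from xml.dom.minidom import parseString

OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"


def read_paging_info(doc):
    """Return totalResults, startIndex and itemsPerPage as ints (None if absent)."""
    info = {}
    for name in ("totalResults", "startIndex", "itemsPerPage"):
        nodes = doc.getElementsByTagNameNS(OPENSEARCH_NS, name)
        info[name] = int(nodes[0].firstChild.data) if nodes else None
    return info


# Shortened copy of the feed header shown in the example response.
sample_feed = (
    '<feed xmlns="http://www.w3.org/2005/Atom" '
    'xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">'
    "<opensearch:totalResults>121</opensearch:totalResults>"
    "<opensearch:startIndex>0</opensearch:startIndex>"
    "<opensearch:itemsPerPage>1000</opensearch:itemsPerPage>"
    "</feed>"
)
paging = read_paging_info(parseString(sample_feed))
print(paging)
```

In the example response, totalResults (121) is smaller than itemsPerPage (1000), so a single request already returned every match.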

3. Adding a category to get the latest papers

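The arXiv API only provides the latest papers when a category is specified ("The current arXiv feeds only give you updates on new papers within the category you specify.").
(That said, in my own use I did not notice much difference in how fresh the results were. I have not used the API heavily, though, so I cannot be sure.)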

For arXiv category IDs, see: Category Taxonomy

Code:

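keyword = '"math word problem"'
taxonomy = "cs.AI"
url = (
    "http://export.arxiv.org/api/query?search_query=all:"
    + keyword
    # The leading "+" separates the keyword from the AND operator.
    + "+AND+cat:"
    + taxonomy
    + "&start=0&max_results=1000&sortBy=lastUpdatedDate&sortOrder=descending"
)
# Percent-encode the spaces while leaving the URL's own delimiters untouched.
url = quote(url, safe='%/:=&?~#+!$,;@()*[]"')
data = urllib.request.urlopen(url)

# Parse the Atom XML response into a DOM document.
doc = parseString(data.read().decode("utf-8"))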
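As an alternative to concatenating the URL by hand, the query string can be assembled from a dict with urllib.parse.urlencode, which takes care of the escaping. The sketch below rests on a few assumptions. The function name build_query_url and the default of 100 results are made up for illustration, and I have not checked every encoding choice against the live API, although the percent-encoded forms match the self link shown in the example response.

```python
from urllib.parse import quote, urlencode

API_URL = "http://export.arxiv.org/api/query"


def build_query_url(keyword, category=None, start=0, max_results=100):
    """Build a query URL for a phrase search, optionally limited to a category."""
    search_query = f'all:"{keyword}"'
    if category:
        search_query += f" AND cat:{category}"
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
        "sortBy": "lastUpdatedDate",
        "sortOrder": "descending",
    }
    # quote_via=quote encodes spaces as %20 rather than "+".
    return API_URL + "?" + urlencode(params, quote_via=quote)


url = build_query_url("math word problem", category="cs.AI")
print(url)
```

Keeping start and max_results as arguments also makes it easy to request later pages by increasing start.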

4. arXiv query parameters in detail

Omitted for now; to be added.

3. Parsing the XML data

Each element of total_list is a dict holding the metadata of one paper:

collection = doc.documentElement

total_list = []

for entry in collection.getElementsByTagName("entry"):
    # now_list is a dict (despite its name) holding one paper's metadata.
    now_list = {}
    now_list["paper_url"] = entry.getElementsByTagName("id")[0].childNodes[0].data
    now_list["updated_date"] = (
        entry.getElementsByTagName("updated")[0].childNodes[0].data
    )
    now_list["publication_date"] = (
        entry.getElementsByTagName("published")[0].childNodes[0].data
    )
    now_list["title"] = (
        entry.getElementsByTagName("title")[0].childNodes[0].data.replace("\n", " ")
    )
    now_list["summary"] = (
        entry.getElementsByTagName("summary")[0].childNodes[0].data.replace("\n", " ").strip()
    )

    author_str = ""
    for author in entry.getElementsByTagName("author"):
        author_str += author.getElementsByTagName("name")[0].childNodes[0].data + "; "
    now_list["authors"] = author_str[:-2]

    comments = entry.getElementsByTagName("arxiv:comment")
    if comments:
        now_list["comment"] = comments[0].childNodes[0].data
    else:
        now_list["comment"] = ""

    links = entry.getElementsByTagName("link")
    for link in links:
        rel = link.getAttribute("rel")
        href = link.getAttribute("href")
        link_type = link.getAttribute("type")
        if rel == "alternate":
            now_list["alternate_link"] = href
        elif rel == "related" and link_type == "application/pdf":
            now_list["pdf_link"] = href

    primary_categories = entry.getElementsByTagName("arxiv:primary_category")
    if primary_categories:
        now_list["primary_category"] = primary_categories[0].getAttribute("term")
    else:
        now_list["primary_category"] = ""

    categories = entry.getElementsByTagName("category")
    category_list = []
    for category in categories:
        category_term = category.getAttribute("term")
        category_list.append(category_term)
    now_list["categories"] = "; ".join(category_list)

    total_list.append(now_list)
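The parsing loop above indexes childNodes[0] directly, which raises IndexError if an element is present but empty. One way to guard against that is a small helper that returns a default instead. This is only a sketch: the name get_text, its default argument, and the sample entry are assumptions for illustration.

```python
from xml.dom.minidom import parseString


def get_text(parent, tag, default=""):
    """Return the whitespace-normalised text of the first <tag>, or default."""
    nodes = parent.getElementsByTagName(tag)
    if not nodes:
        return default
    # Join all text children; an empty element simply has none.
    text = "".join(
        child.data
        for child in nodes[0].childNodes
        if child.nodeType == child.TEXT_NODE
    )
    # Collapse line breaks and repeated spaces into single spaces.
    return " ".join(text.split()) or default


# Made-up entry with a wrapped title, an empty comment, and no DOI element.
sample_entry = parseString(
    '<entry xmlns:arxiv="http://arxiv.org/schemas/atom">'
    "<title>A title\n  split across lines</title>"
    "<arxiv:comment></arxiv:comment>"
    "</entry>"
).documentElement

title = get_text(sample_entry, "title")
comment = get_text(sample_entry, "arxiv:comment")
doi = get_text(sample_entry, "arxiv:doi", default="n/a")
print(title, repr(comment), doi)
```

Inside the loop, a call such as get_text(entry, "summary") could replace the longer getElementsByTagName(...)[0].childNodes[0].data expressions.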