阿里云天池学习赛-零基础入门数据分析-学术前沿趋势分析(task3)


前言

本博客主要记录零基础入门数据分析-学术前沿趋势分析的自己的一些理解,主要是解题思路以及代码的解释。大赛地址:零基础入门数据分析-学术前沿趋势分析


一、赛题描述及数据说明

1:数据集的格式如下:

id:arXiv ID,可用于访问论文;
submitter:论文提交者;
authors:论文作者;
title:论文标题;
comments:论文页数和图表等其他信息;
journal-ref:论文发表的期刊的信息;
doi:数字对象标识符,https://www.doi.org;
report-no:报告编号;
categories:论文在 arXiv 系统的所属类别或标签;
license:文章的许可证;
abstract:论文摘要;
versions:论文版本;
authors_parsed:作者的信息。

2:数据集格式举例:

“root”:{
“id”:string"0704.0001"
“submitter”:string"Pavel Nadolsky"
“authors”:string"C. Bal’azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan"
“title”:string"Calculation of prompt diphoton production cross sections at Tevatron and LHC energies"
“comments”:string"37 pages, 15 figures; published version"
“journal-ref”:string"Phys.Rev.D76:013009,2007"
“doi”:string"10.1103/PhysRevD.76.013009"
“report-no”:string"ANL-HEP-PR-07-12"
“categories”:string"hep-ph"
“license”:NULL
“abstract”:string" A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark, gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to leading logarithmic accuracy. The region of phase space is specified in which the calculation is most reliable. Good agreement is demonstrated with data from the Fermilab Tevatron, and predictions are made for more detailed tests with CDF and DO data. Predictions are shown for distributions of diphoton pairs produced at the energy of the Large Hadron Collider (LHC). Distributions of the diphoton pairs from the decay of a Higgs boson are contrasted with those produced from QCD processes at the LHC, showing that enhanced sensitivity to the signal can be obtained with judicious selection of events."
“versions”:[
0:{
“version”:string"v1"
“created”:string"Mon, 2 Apr 2007 19:18:42 GMT"
}
1:{
“version”:string"v2"
“created”:string"Tue, 24 Jul 2007 20:10:27 GMT"
}]
“update_date”:string"2008-11-26"
“authors_parsed”:[
0:[
0:string"Balázs"
1:string"C."
2:string""]
1:[
0:string"Berger"
1:string"E. L."
2:string""]
2:[
0:string"Nadolsky"
1:string"P. M."
2:string""]
3:[
0:string"Yuan"
1:string"C. -P."
2:string""]]
}

二、论文代码统计(数据统计任务):统计所有论文类别下包含源代码论文的比例;

1.题目意思解读及整体思路分析

在读取数据集后,这样就得到了所有论文,接下只要分析哪个字段里会出现代码链接,查看数据集可以发现,论文中的’comments’字段和’abstract’字段会给出代码的链接而且一般都是类似这种https://github.com/wang159/FDIntegral_Table)。

2.各节代码展示与讲解

2.1:先读取数据集:

def readArxivFile(path, columns = ['id','submitter','authors','title','comments','journal-ref','doi','report-no','categories', 'license', 'abstract', 'versions',
       'update_date', 'authors_parsed'] , count = None):
    # 读取文件的函数,path:文件路径,columns:需要选择的列,count:读取行数
    data = []
    with open(path,'r') as f:
        for idx, line in enumerate(f):
            if idx == count:
                break
            d = json.loads(line)#把每条数据(json格式) --> 转换成python对象,这里是转换成字典类型
            d = {
   
   col: d[col] for col in columns}#更改d,只用获取原数据集中的一部分,即columns的部分
            data.append(d)

    data = pd.DataFrame(data)#将字典类型转换成DataFrame类型
    return data


	#读取100000条数据
	data = readArxivFile('arxiv-metadata-oai-2019.json',['id', 'authors', 'categories', 'authors_parsed']
评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值