Alibaba Cloud Tianchi Learning Competition - Data Analysis for Beginners - Academic Frontier Trend Analysis (task3)

Article link: https://blog.youkuaiyun.com/LLM1602/article/details/114498802


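Preface
I. Competition description and data overview
- 1: Dataset fields
- 2: Sample record
II. Paper code statistics (data statistics task): compute, for every paper category, the proportion of papers that include source code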

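Preface

This post records my own understanding of the beginner-level data analysis competition on academic frontier trend analysis, mainly the approach to the problem and an explanation of the code. Competition page: Data Analysis for Beginners - Academic Frontier Trend Analysis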

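I. Competition description and data overview

1: The dataset has the following fields:

id: arXiv ID, which can be used to access the paper;
submitter: the person who submitted the paper;
authors: the authors of the paper;
title: the paper title;
comments: number of pages, figures and other information about the paper;
journal-ref: information about the journal the paper was published in;
doi: digital object identifier, https://www.doi.org;
report-no: report number;
categories: the categories or tags the paper belongs to in the arXiv system;
license: the license of the article;
abstract: the paper abstract;
versions: the paper versions;
authors_parsed: parsed author information.

2: A sample record: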

{
  "id": "0704.0001",
  "submitter": "Pavel Nadolsky",
  "authors": "C. Bal'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan",
  "title": "Calculation of prompt diphoton production cross sections at Tevatron and LHC energies",
  "comments": "37 pages, 15 figures; published version",
  "journal-ref": "Phys.Rev.D76:013009,2007",
  "doi": "10.1103/PhysRevD.76.013009",
  "report-no": "ANL-HEP-PR-07-12",
  "categories": "hep-ph",
  "license": null,
  "abstract": " A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark, gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to leading logarithmic accuracy. The region of phase space is specified in which the calculation is most reliable. Good agreement is demonstrated with data from the Fermilab Tevatron, and predictions are made for more detailed tests with CDF and DO data. Predictions are shown for distributions of diphoton pairs produced at the energy of the Large Hadron Collider (LHC). Distributions of the diphoton pairs from the decay of a Higgs boson are contrasted with those produced from QCD processes at the LHC, showing that enhanced sensitivity to the signal can be obtained with judicious selection of events.",
  "versions": [
    {"version": "v1", "created": "Mon, 2 Apr 2007 19:18:42 GMT"},
    {"version": "v2", "created": "Tue, 24 Jul 2007 20:10:27 GMT"}
  ],
  "update_date": "2008-11-26",
  "authors_parsed": [
    ["Balázs", "C.", ""],
    ["Berger", "E. L.", ""],
    ["Nadolsky", "P. M.", ""],
    ["Yuan", "C. -P.", ""]
  ]
}
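
To make the nested fields concrete, here is a minimal sketch that parses an abbreviated copy of the sample record and reads values out of versions and authors_parsed. The interpretation of each authors_parsed entry as [last name, first name(s), suffix] is an assumption based on the sample, not something the competition page states.

```python
import json

# Abbreviated copy of the sample record; only a few fields are kept.
raw = """
{
  "id": "0704.0001",
  "categories": "hep-ph",
  "versions": [
    {"version": "v1", "created": "Mon, 2 Apr 2007 19:18:42 GMT"},
    {"version": "v2", "created": "Tue, 24 Jul 2007 20:10:27 GMT"}
  ],
  "authors_parsed": [["Balázs", "C.", ""], ["Berger", "E. L.", ""]]
}
"""

record = json.loads(raw)

# versions is a list of dicts; in this sample the last entry is the newest one
latest_version = record["versions"][-1]["version"]

# Assumption: each authors_parsed entry is [last name, first name(s), suffix]
last_names = [author[0] for author in record["authors_parsed"]]

print(latest_version)
print(last_names)
```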

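II. Paper code statistics (data statistics task): compute, for every paper category, the proportion of papers that include source code

1. Understanding the task and the overall approach

Reading the dataset gives us every paper. What remains is to work out which fields can contain a code link. Inspecting the data shows that the 'comments' and 'abstract' fields are where links to code appear, usually in a form like https://github.com/wang159/FDIntegral_Table.

2. Code walkthrough, section by section

2.1: Read the dataset first: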

import json
import pandas as pd

def readArxivFile(path, columns = ['id','submitter','authors','title','comments','journal-ref','doi','report-no','categories', 'license', 'abstract', 'versions',
       'update_date', 'authors_parsed'] , count = None):
    # Read the file. path: file path; columns: columns to keep;
    # count: number of lines to read (None reads the whole file)
    data = []
    with open(path,'r') as f:
        for idx, line in enumerate(f):
            if idx == count:
                break
            d = json.loads(line)  # parse one JSON record per line into a Python dict
            d = {col: d[col] for col in columns}  # keep only the requested columns
            data.append(d)

    data = pd.DataFrame(data)  # convert the list of dicts into a DataFrame
    return data


# Read the first 100000 records
data = readArxivFile('arxiv-metadata-oai-2019.json',['id', 'authors', 'categories', 'authors_parsed'],
                    100000)
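
To see how the count argument limits the number of records, the following sketch writes two made-up records to a temporary file, one JSON object per line, and reads them back. The helper read_json_lines and the sample records are hypothetical stand-ins that follow the same logic as readArxivFile; they are not part of the competition code.

```python
import json
import os
import tempfile

import pandas as pd


def read_json_lines(path, columns, count=None):
    """Read up to `count` JSON-lines records, keeping only `columns`."""
    rows = []
    with open(path, "r", encoding="utf-8") as f:
        for idx, line in enumerate(f):
            if idx == count:  # never true when count is None, so every line is read
                break
            record = json.loads(line)
            rows.append({col: record[col] for col in columns})
    return pd.DataFrame(rows)


# Two made-up records, written one JSON object per line
sample = [
    {"id": "a1", "categories": "cs.CV", "comments": "10 pages"},
    {"id": "a2", "categories": "math.AT", "comments": None},
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    for rec in sample:
        tmp.write(json.dumps(rec) + "\n")
    tmp_path = tmp.name

df_all = read_json_lines(tmp_path, ["id", "categories"])
df_one = read_json_lines(tmp_path, ["id", "categories"], count=1)
os.remove(tmp_path)

print(df_all.shape)
print(df_one.shape)
```

Reading only a slice of the file this way is handy while developing, since the full metadata file is large.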

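2.1.1: json.loads() converts JSON-formatted text into a Python object (here, a dictionary): each line of the file is one JSON record, and the result is a dict with the same keys and values.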
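
A minimal sketch of this conversion, using a shortened version of the sample record. Note that json.loads() takes a string, while json.load() reads from an open file object.

```python
import json

# One line of the file, shortened for illustration
line = '{"id": "0704.0001", "categories": "hep-ph", "license": null}'

record = json.loads(line)

print(type(record).__name__)   # the parsed object is a dict
print(record["categories"])
print(record["license"])       # JSON null becomes Python None
```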


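2.1.2: d = {col: d[col] for col in columns} is a dictionary comprehension. Each key is col, that is, one entry of columns, and its value is d[col]. Together they form a new dictionary that holds only the selected key-value pairs of the original record.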
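
A small sketch of the comprehension on a shortened record (the values come from the sample record; the choice of columns is only for illustration):

```python
record = {
    "id": "0704.0001",
    "submitter": "Pavel Nadolsky",
    "categories": "hep-ph",
    "license": None,
}
columns = ["id", "categories"]

# Build a new dict that contains only the requested keys
subset = {col: record[col] for col in columns}

print(subset)
```

If a name in columns is missing from the record, record[col] raises a KeyError, so the column list has to match the fields of the dataset.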

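2.2: A simple first filter: extract the page count of each paper and keep only the papers whose comments state a page count greater than 0, then do the same for the number of figures.
2.2.1: First extract from data the papers with a page count > 0. A note on regular expressions: if you need the basics, it is worth reading an introduction to regular expressions first; the patterns used below are quite easy to follow.
a: re.findall('[1-9][0-9]* pages', str(x)): finds every substring made of a digit 1-9, followed by any number of digits 0-9, a space, and the word pages;
b: x[0].replace(' pages', ''): replaces ' pages' in the string with the empty string, leaving only the number.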
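
The two steps can be tried on single strings before applying them to the whole column. In this sketch the helper name extract_pages is hypothetical; the pattern is the one described above.

```python
import re

PAGES_PATTERN = r"[1-9][0-9]* pages"


def extract_pages(comment):
    """Return the page count found in a comments string, or None if there is none."""
    matches = re.findall(PAGES_PATTERN, str(comment))
    if not matches:
        return None
    # Take the first match, e.g. '37 pages', and strip the suffix
    return float(matches[0].replace(" pages", ""))


print(extract_pages("37 pages, 15 figures; published version"))
print(extract_pages(None))                 # str(None) is 'None', which has no match
print(extract_pages("1 page, 2 figures"))  # the singular 'page' is not matched
```

The last call shows a limitation of the pattern: a one-page paper written as "1 page" is dropped by this filter.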

    data = readArxivFile('arxiv-metadata-oai-2019.json',['id','abstract','categories','comments'])

    # Extract the page count from the comments field, which looks like
    # "comments": "15 pages, 15 figures, 3 tables"; the regex matches the "N pages" part
    data['pages'] = data['comments'].apply(lambda x: re.findall(r'[1-9][0-9]* pages',str(x)))

    # Keep rows with at least one match; apply(len) calls len() on every element of data['pages']
    data = data[data['pages'].apply(len) > 0].copy()

    # Each element is a list, e.g. ['19 pages']; convert it to a number --> 19.0
    data['pages'] = data['pages'].apply(lambda x: float(x[0].replace(' pages','')))
    print(data['pages'])

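Before the conversion, each entry of data['pages'] is a list such as ['19 pages']; after the conversion it is a float such as 19.0.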

2.2.2: Next extract the number of figures. The method is essentially the same as in 2.2.1.

    # Extract the number of figures from the comments field
    data['figures'] = data['comments'].apply(lambda x: re.findall(r'[1-9][0-9]* figures',str(x)))
    data = data[data['figures'].apply(len)>0].copy()
    data['figures'] = data['figures'].apply(lambda x: float(x[0].replace(' figures','')))
    print(data['figures'])
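
The second character class in this pattern matters: it has to be [0-9] so that multi-digit counts are captured whole. A quick check on the comments string of the sample record, comparing it with a narrower [0] class:

```python
import re

comment = "37 pages, 15 figures; published version"

# [0]* only allows trailing zeros, so the match starts at the last digit
narrow = re.findall(r"[1-9][0]* figures", comment)

# [0-9]* allows any trailing digits, so the whole number is captured
full = re.findall(r"[1-9][0-9]* figures", comment)

print(narrow)  # ['5 figures']
print(full)    # ['15 figures']
```

With the narrow class, "15 figures" would be counted as 5 figures, which silently skews any statistic computed from the column.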

2.3: Select the papers that mention github
2.3.1: Since GitHub links appear in the 'comments' and 'abstract' fields, search those two fields directly for the string 'github', then concatenate the contents of 'comments' and 'abstract' into data_with_code['text'].

    # Extract code links. Only GitHub links are considered here, i.e. papers whose
    # abstract or comments contain the string 'github'; na=False treats missing values as no match
    data_with_code = data[
        data.comments.str.contains('github', na=False) | data.abstract.str.contains('github', na=False)
    ].copy()
    # Concatenate the abstract and comments of each paper
    data_with_code['text'] = data_with_code['abstract'].fillna('') + data_with_code['comments'].fillna('')
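
str.contains returns a missing value for rows where the field itself is empty, which is why the filter has to deal with NaN explicitly. The sketch below runs the same kind of filter on a few made-up rows, using the na=False argument so that missing fields count as "no match". The URLs and texts are placeholders, not real data.

```python
import pandas as pd

# Made-up rows; None stands for a missing comments field
df = pd.DataFrame({
    "abstract": [
        "Code at https://github.com/user/repo",
        "No code here",
        "Also no code",
    ],
    "comments": [None, "see github.com/user/other", None],
})

# na=False makes rows with a missing field evaluate to False instead of NaN
has_github = (
    df["comments"].str.contains("github", na=False)
    | df["abstract"].str.contains("github", na=False)
)

# .copy() gives an independent frame, so adding a column does not trigger
# pandas' chained-assignment warning
with_code = df[has_github].copy()
with_code["text"] = with_code["abstract"].fillna("") + with_code["comments"].fillna("")

print(has_github.tolist())
print(with_code["text"].tolist())
```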

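2.3.2: Use a regular expression to extract the GitHub links from data_with_code['text'], then visualize the result.

a: [a-zA-Z]+://github[^\s]*: matches one or more letters (for example http), then ://github, then any run of non-whitespace characters (\s matches any whitespace character, so [^\s] matches anything that is not whitespace).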
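
A minimal sketch of what this pattern does and does not match. The first URL is the example quoted earlier in this post; the remaining text is made up.

```python
import re

pattern = r"[a-zA-Z]+://github[^\s]*"

text = (
    "Code is available at https://github.com/wang159/FDIntegral_Table and more text. "
    "Mirror: github.com/no-scheme"
)

links = re.findall(pattern, text)
print(links)

# A link followed directly by punctuation keeps that punctuation,
# because [^\s]* consumes everything up to the next whitespace
trailing = re.findall(pattern, "see https://github.com/a/b.")
print(trailing)
```

Links written without a scheme (github.com/...) are not matched, and a sentence-ending period right after a link ends up inside the match, so the extracted strings may need a little cleaning before they are used as URLs.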

    # Use a regular expression to match the full GitHub links in each paper
    pattern = r'[a-zA-Z]+://github[^\s]*'
    data_with_code['code_flag'] = data_with_code['text'].str.findall(pattern).apply(len)

    # Keep the papers with exactly one matched link and plot their count per category
    data_with_code = data_with_code[data_with_code['code_flag'] == 1]
    plt.figure(figsize=(12,6))
    data_with_code.groupby(['categories'])['code_flag'].count().plot(kind='bar')
    plt.show()
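
The task statement asks for a proportion per category, while the bar chart shows counts. One possible way to get the proportion is sketched below on made-up data; the column name has_code and the sample rows are assumptions for illustration. The sketch counts a paper as having code when at least one link is found, which is a slightly different criterion from the code_flag == 1 filter used for the plot.

```python
import pandas as pd

# Made-up papers: category plus the combined abstract/comments text
papers = pd.DataFrame({
    "categories": ["cs.CV", "cs.CV", "cs.CV", "cs.LG", "cs.LG", "math.AT"],
    "text": [
        "code: https://github.com/user/a",
        "no link",
        "https://github.com/user/b and https://github.com/user/c",
        "no link",
        "https://github.com/user/d",
        "no link",
    ],
})

pattern = r"[a-zA-Z]+://github[^\s]*"

# True when at least one GitHub link is found in the text
papers["has_code"] = papers["text"].str.findall(pattern).apply(len) > 0

# The mean of a boolean column is the share of True values in each group
share = papers.groupby("categories")["has_code"].mean()
print(share)
```

On the real data the same groupby would have to run on the full set of papers, not only on data_with_code, since the denominator is every paper in the category.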

3. Full code

import seaborn as sns
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import json

def readArxivFile(path,column=['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',
       'report-no', 'categories', 'license', 'abstract', 'versions',
       'update_date', 'authors_parsed'],count=None):
    data = []
    with open(path,'r') as f:
        for idx, line in enumerate(f):
            if idx == count:
                break
            d = json.loads(line)
            d = {col:d[col] for col in column}
            data.append(d)

    data = pd.DataFrame(data)
    return data

if __name__ == "__main__":
    data = readArxivFile('arxiv-metadata-oai-2019.json',['id','abstract','categories','comments'])

    # Extract the page count from the comments field, which looks like
    # "comments": "15 pages, 15 figures, 3 tables"; the regex matches the "N pages" part
    data['pages'] = data['comments'].apply(lambda x: re.findall(r'[1-9][0-9]* pages',str(x)))

    # Keep rows with at least one match; apply(len) calls len() on every element of data['pages']
    data = data[data['pages'].apply(len) > 0].copy()

    # Each element is a list, e.g. ['19 pages']; convert it to a number --> 19.0
    data['pages'] = data['pages'].apply(lambda x: float(x[0].replace(' pages','')))
    print(data['pages'])
    # Summary statistics for the page counts
    print(data['pages'].describe().astype(int))  # count, mean, std, min, quartiles, max as integers

    # # Next, compute the average page count per category, using the main class
    # # of each paper's first category
    # # first category only: 'solv-int nlin.SI' -> 'solv-int'
    # data['categories'] = data['categories'].apply(lambda x: x.split(' ')[0])
    # # main class only: 'math.AT' -> 'math'
    # data['categories'] = data['categories'].apply(lambda x: x.split('.')[0])
    #
    # plt.figure(figsize=(12,6))
    # # groupby(grouping key)[data column].aggregation
    # data.groupby(data['categories'])['pages'].mean().plot(kind='bar')
    # plt.show()

    # Extract the number of figures from the comments field
    data['figures'] = data['comments'].apply(lambda x: re.findall(r'[1-9][0-9]* figures',str(x)))
    data = data[data['figures'].apply(len)>0].copy()
    data['figures'] = data['figures'].apply(lambda x: float(x[0].replace(' figures','')))
    print(data['figures'])

    # Extract code links. Only GitHub links are considered here, i.e. papers whose
    # abstract or comments contain the string 'github'; na=False treats missing values as no match
    data_with_code = data[
        data.comments.str.contains('github', na=False) | data.abstract.str.contains('github', na=False)
    ].copy()
    # Concatenate the abstract and comments of each paper
    data_with_code['text'] = data_with_code['abstract'].fillna('') + data_with_code['comments'].fillna('')

    # Use a regular expression to match the full GitHub links in each paper
    pattern = r'[a-zA-Z]+://github[^\s]*'
    data_with_code['code_flag'] = data_with_code['text'].str.findall(pattern).apply(len)

    # Keep the papers with exactly one matched link and plot their count per category
    data_with_code = data_with_code[data_with_code['code_flag'] == 1]
    plt.figure(figsize=(12,6))
    data_with_code.groupby(['categories'])['code_flag'].count().plot(kind='bar')
    plt.show()