Alibaba Cloud Tianchi Learning Competition: Introduction to Data Analysis for Beginners, Academic Frontier Trend Analysis (Task 5)

LLM1602

Published 2021-03-08 15:50:16


License: CC 4.0 BY-SA

Category: Tianchi Competition. Tags: python, data analysis

Article link: https://blog.youkuaiyun.com/LLM1602/article/details/114521790


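This post covers the Alibaba Cloud Tianchi data analysis competition, focusing on analyzing the relationships between paper authors. It first reads the dataset, extracts the author information, builds an undirected graph, and computes the largest connected component. It then walks through the code for reading the data, building the graph, computing node degrees, and plotting. Finally, code examples show how to analyze the links between authors, which provides a basis for academic frontier trend analysis.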


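Preface

This post records my own understanding of the beginner data analysis competition "Academic Frontier Trend Analysis", mainly the approach to the problem and an explanation of the code. Competition page: Introduction to Data Analysis for Beginners: Academic Frontier Trend Analysis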

I. Problem description and data description

1: The dataset has the following fields:

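id: arXiv ID, which can be used to access the paper;
submitter: the person who submitted the paper;
authors: the paper's authors;
title: the paper's title;
comments: number of pages, figures, and other information about the paper;
journal-ref: information about the journal in which the paper was published;
doi: digital object identifier, https://www.doi.org;
report-no: report number;
categories: the categories or tags of the paper in the arXiv system;
license: the license of the paper;
abstract: the paper's abstract;
versions: the paper's versions;
update_date: the date the record was last updated;
authors_parsed: the parsed author information.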

2: An example record from the dataset:

{
  "id": "0704.0001",
  "submitter": "Pavel Nadolsky",
  "authors": "C. Bal'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan",
  "title": "Calculation of prompt diphoton production cross sections at Tevatron and LHC energies",
  "comments": "37 pages, 15 figures; published version",
  "journal-ref": "Phys.Rev.D76:013009,2007",
  "doi": "10.1103/PhysRevD.76.013009",
  "report-no": "ANL-HEP-PR-07-12",
  "categories": "hep-ph",
  "license": null,
  "abstract": " A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark, gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to leading logarithmic accuracy. The region of phase space is specified in which the calculation is most reliable. Good agreement is demonstrated with data from the Fermilab Tevatron, and predictions are made for more detailed tests with CDF and DO data. Predictions are shown for distributions of diphoton pairs produced at the energy of the Large Hadron Collider (LHC). Distributions of the diphoton pairs from the decay of a Higgs boson are contrasted with those produced from QCD processes at the LHC, showing that enhanced sensitivity to the signal can be obtained with judicious selection of events.",
  "versions": [
    {"version": "v1", "created": "Mon, 2 Apr 2007 19:18:42 GMT"},
    {"version": "v2", "created": "Tue, 24 Jul 2007 20:10:27 GMT"}
  ],
  "update_date": "2008-11-26",
  "authors_parsed": [
    ["Balázs", "C.", ""],
    ["Berger", "E. L.", ""],
    ["Nadolsky", "P. M.", ""],
    ["Yuan", "C. -P.", ""]
  ]
}

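II. Author associations (data modeling task): model the relationships between paper authors and count the most frequently occurring author relationships.

1. Interpreting the task and the overall approach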

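Reading the dataset gives us all of the papers, and from there we can get the author information for each paper directly. Here we use the 'authors_parsed' field: we extract it, change its format, build a graph from it, and finally compute the largest connected component.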
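The task statement asks for the most frequently occurring author relationships. As a minimal sketch of one way to count them, assuming that a "relationship" means two authors appearing on the same paper, the snippet below counts unordered co-author pairs. The function name count_coauthor_pairs and the toy records are hypothetical, and this pairwise counting is a different technique from the first-author star graph built in the rest of this post.

```python
from collections import Counter
from itertools import combinations


def count_coauthor_pairs(papers):
    """Count how often each unordered pair of authors shares a paper.

    `papers` is a list of `authors_parsed` values, i.e. one list of
    [last name, first name, suffix] triples per paper.
    """
    pair_counts = Counter()
    for authors_parsed in papers:
        # Same name format as the rest of this post: "Last First"
        names = sorted({" ".join(a[:-1]) for a in authors_parsed})
        # Every unordered pair of distinct authors on this paper
        pair_counts.update(combinations(names, 2))
    return pair_counts


# Hypothetical toy records, not taken from the real dataset
toy_papers = [
    [["Li", "A.", ""], ["Wang", "B.", ""], ["Zhao", "C.", ""]],
    [["Li", "A.", ""], ["Wang", "B.", ""]],
    [["Zhao", "C.", ""]],
]

pair_counts = count_coauthor_pairs(toy_papers)
print(pair_counts.most_common(1))  # [(('Li A.', 'Wang B.'), 2)]
```

Sorting the names before pairing means ("Li A.", "Wang B.") and ("Wang B.", "Li A.") are counted as the same relationship.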

2. Code for each step, with explanations

2.1: First, read the dataset:

import json
import pandas as pd


def readArxivFile(path, columns=['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',
                                 'report-no', 'categories', 'license', 'abstract', 'versions',
                                 'update_date', 'authors_parsed'], count=None):
    # Read the file. path: file path; columns: columns to keep; count: number of lines to read
    data = []
    with open(path, 'r', encoding='utf-8') as f:
        for idx, line in enumerate(f):
            if idx == count:
                break
            d = json.loads(line)  # parse one JSON record into a Python object (here, a dict)
            d = {col: d[col] for col in columns}  # keep only the part of the record listed in columns
            data.append(d)

    data = pd.DataFrame(data)  # convert the list of dicts into a DataFrame
    return data


# Read 100000 records
data = readArxivFile('arxiv-metadata-oai-2019.json', ['id', 'authors', 'categories', 'authors_parsed'],
                     100000)

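2.1.1: json.loads(): converts JSON-formatted text into a Python object (here, a dict). In the figure below, the top is the JSON data and the bottom is the Python dict.

[Figure: a JSON record (top) and the corresponding Python dict (bottom)]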

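2.1.2: d = {col: d[col] for col in columns}: the keys of the new dict are col, i.e. each entry in columns, and the corresponding values are d[col]. This builds a new dict that keeps only some of the key-value pairs of the original record.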
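To make these two steps concrete, here is a small sketch that parses a single line and keeps only a subset of its keys. The line is a shortened version of the example record shown earlier, and the variable names line, record, and subset are only for illustration.

```python
import json

# One line of the JSON Lines file, shortened from the example record
line = '{"id": "0704.0001", "submitter": "Pavel Nadolsky", "categories": "hep-ph", "license": null}'

record = json.loads(line)  # JSON text -> Python dict; JSON null becomes None
columns = ["id", "categories"]
subset = {col: record[col] for col in columns}  # keep only the requested keys

print(type(record).__name__)  # dict
print(record["license"])      # None
print(subset)                 # {'id': '0704.0001', 'categories': 'hep-ph'}
```

Note that json.loads() parses a string, while json.load() reads from an open file object. Because the dataset stores one record per line, parsing each line with json.loads() fits better here.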

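2.2: Start by building an undirected graph from just 5 papers and drawing it.
2.2.1: itertuples(): iterates over the rows of the DataFrame as named tuples;
2.2.2: G.add_edge(authors[0], author): connects the first author of each paper to each of the remaining authors in turn.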

import networkx as nx

data = readArxivFile('arxiv-metadata-oai-2019.json', ['id', 'authors_parsed'], 200000)
print(data)
# Create an undirected graph
G = nx.Graph()
# Build the graph from only five papers
for row in data.iloc[:5].itertuples():
    # itertuples yields each row as a named tuple, e.g.
    # Pandas(Index=1, id='0704.0342', authors_parsed=[['Dugmore', 'B.', ''], ['Ntumba', 'PP.', '']])
    print(row)
    print(row[0])
    print(row[1])
    authors = row[2]
    print(authors)
    authors = [' '.join(x[:-1]) for x in authors]  # --> ['Dugmore B.', 'Ntumba PP.']

    # Connect the first author to every other author
    for author in authors[1:]:
        G.add_edge(authors[0], author)  # the first author is the central node

print(row) produces output in the format shown in the figure, so row[0] is the Index, row[1] is the id, and row[2] is authors_parsed.

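[Figure: output of print(row)]
Before conversion, authors looks like the figure below (each paper has several authors, and each inner [] holds one author's name; for example, the first paper has three authors):

[Figure: authors before conversion]

After conversion (after applying join()), authors looks like this:

[Figure: authors after conversion]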
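Since the original screenshots are not available, the following sketch reproduces the same steps on a tiny hand-made DataFrame. The two rows are hypothetical stand-ins that reuse author names quoted elsewhere in this post; they are not read from the real file.

```python
import pandas as pd

# Hypothetical two-row frame with the same columns used in this post
df = pd.DataFrame([
    {"id": "0704.0001",
     "authors_parsed": [["Balázs", "C.", ""], ["Berger", "E. L.", ""]]},
    {"id": "0704.0342",
     "authors_parsed": [["Dugmore", "B.", ""], ["Ntumba", "PP.", ""]]},
])

rows = list(df.itertuples())
second = rows[1]

# Positional access: 0 is the index, then the columns in order.
# Attribute access (second.id, second.authors_parsed) reads the same values.
print(second[0])  # 1
print(second[1])  # 0704.0342

# Drop the last element (the suffix) and join last name and first name
names = [" ".join(x[:-1]) for x in second[2]]
print(names)  # ['Dugmore B.', 'Ntumba PP.']
```

Accessing fields by name, for example row.authors_parsed, may be easier to read than row[2] and keeps working if the column order changes.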

2.2.3: Draw the undirected graph

import matplotlib.pyplot as plt

# Draw the graph
nx.draw(G, with_labels=True)
plt.show()

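2.3: Compute the degree of each node, plot it, and embed the largest connected component.
2.3.1: for n, d in G.degree(): n is the name of the node and d is the number of edges connected to that node.
2.3.2: loglog(): draws a plot with logarithmic scaling on both axes. When only one sequence is passed, matplotlib uses it as the y values and the indices 0, 1, 2, ... as the x values, so here the x axis is the rank and the y axis is the degree.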

# Plot the degree rank plot
degree_sequence = sorted([d for n, d in G.degree()], reverse=True)  # sort in descending order
# dmax = max(degree_sequence)  # the largest degree in the graph

plt.loglog(degree_sequence, "b-", marker="o")
plt.title("Degree rank plot")
plt.ylabel("degree")
plt.xlabel("rank")
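A quick way to see what loglog() does with a single sequence is to inspect the line it returns. The sketch below is only an illustration under a few assumptions: the degree values are made up, and the Agg backend is selected so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

degree_sequence = [5, 3, 3, 2, 1, 1]  # hypothetical degrees, sorted descending

fig, ax = plt.subplots()
(line,) = ax.loglog(degree_sequence, "b-", marker="o")

# With only one sequence, the x values default to the indices 0..N-1
x_used = [float(v) for v in line.get_xdata()]
y_used = [float(v) for v in line.get_ydata()]
print(x_used)  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
print(ax.get_xscale(), ax.get_yscale())  # log log
```

Because 0 has no logarithm, the first point has no proper position on a log axis. Passing explicit ranks that start at 1, for example ax.loglog(range(1, len(degree_sequence) + 1), degree_sequence), may be a cleaner choice.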

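2.3.3: Embed the largest connected component

sorted(nx.connected_components(G), key=len, reverse=True): sorts all connected components by size in descending order, so the first one is the largest and Gcc is the largest connected component.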

# Embed the largest connected component as an inset
plt.axes([0.45, 0.45, 0.45, 0.45])
# sorted(nx.connected_components(G), key=len, reverse=True): all connected components in descending order of size; Gcc is the largest one
Gcc = G.subgraph(sorted(nx.connected_components(G), key=len, reverse=True)[0])

pos = nx.spring_layout(Gcc)
plt.axis("off")
nx.draw_networkx_nodes(Gcc, pos, node_size=20)
nx.draw_networkx_edges(Gcc, pos, alpha=0.4)
plt.show()
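The degree sequence and the largest connected component can be checked by hand on a toy graph. This is a minimal sketch with made-up node names; it uses max(..., key=len) rather than sorting, which finds the same largest component without ordering all of them.

```python
import networkx as nx

# Hypothetical toy graph: one star-shaped component and one separate pair
G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("A", "C"), ("A", "D"),  # component with 4 nodes
    ("X", "Y"),                          # component with 2 nodes
])

degrees = dict(G.degree())  # node -> number of incident edges
degree_sequence = sorted(degrees.values(), reverse=True)

largest_nodes = max(nx.connected_components(G), key=len)
Gcc = G.subgraph(largest_nodes)

print(degrees["A"])         # 3
print(degree_sequence)      # [3, 1, 1, 1, 1, 1]
print(sorted(Gcc.nodes()))  # ['A', 'B', 'C', 'D']
```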

3. Complete code

import json

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd


def readArxivFile(path, columns=['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',
       'report-no', 'categories', 'license', 'abstract', 'versions',
       'update_date', 'authors_parsed'], count=None):
    # Read the file. path: file path; columns: columns to keep; count: number of lines to read
    data = []
    with open(path, 'r', encoding='utf-8') as f:
        for idx, line in enumerate(f):
            if idx == count:
                break
            d = json.loads(line)
            d = {col: d[col] for col in columns}
            data.append(d)

    data = pd.DataFrame(data)
    return data


if __name__ == "__main__":
    data = readArxivFile('arxiv-metadata-oai-2019.json', ['id', 'authors_parsed'], 200000)
    print(data)
    # Create an undirected graph
    G = nx.Graph()
    # Build the graph from only five papers
    for row in data.iloc[:5].itertuples():
        # Building the graph from 500 papers instead gives a more complete picture of the author
        # relationships; the largest connected component is then drawn, and the line plot shows
        # the node degrees.
        # itertuples yields each row as a named tuple, e.g.
        # Pandas(Index=1, id='0704.0342', authors_parsed=[['Dugmore', 'B.', ''], ['Ntumba', 'PP.', '']])
        print(row)
        print(row[0])
        print(row[1])
        authors = row[2]
        print(authors)
        authors = [' '.join(x[:-1]) for x in authors]  # --> ['Dugmore B.', 'Ntumba PP.']

        # Connect the first author to every other author
        for author in authors[1:]:
            G.add_edge(authors[0], author)  # the first author is the central node

    # Draw the graph
    nx.draw(G, with_labels=True)
    plt.show()

    try:
        # Shortest path between two nodes
        print(nx.dijkstra_path(G, 'Podsiadlowski Philipp', 'Rosswog Stephan'))
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        # Raised when the nodes are not connected or are not in the graph at all
        print('No path')

    # Plot the degree rank plot
    degree_sequence = sorted([d for n, d in G.degree()], reverse=True)  # node degrees in descending order
    # dmax = max(degree_sequence)  # the largest degree in the graph

    plt.loglog(degree_sequence, "b-", marker="o")
    plt.title("Degree rank plot")
    plt.ylabel("degree")
    plt.xlabel("rank")

    # Embed the largest connected component as an inset
    plt.axes([0.45, 0.45, 0.45, 0.45])
    # sorted(nx.connected_components(G), key=len, reverse=True): all connected components in descending order of size; Gcc is the largest one
    Gcc = G.subgraph(sorted(nx.connected_components(G), key=len, reverse=True)[0])

    pos = nx.spring_layout(Gcc)
    plt.axis("off")
    nx.draw_networkx_nodes(Gcc, pos, node_size=20)
    nx.draw_networkx_edges(Gcc, pos, alpha=0.4)
    plt.show()
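The shortest-path lookup in the complete code can fail in two ways: the two authors may sit in different connected components, or one of them may not be in the graph at all, which is likely when the graph is built from only five papers. The sketch below wraps the lookup in a helper that handles both cases. The name safe_shortest_path and the toy graph are hypothetical, and it uses nx.shortest_path, which on an unweighted graph should give a path of the same length as dijkstra_path.

```python
import networkx as nx


def safe_shortest_path(G, source, target):
    """Return the shortest path as a list of nodes, or None if there is none."""
    try:
        return nx.shortest_path(G, source, target)
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        # No route between the nodes, or a node is missing from the graph
        return None


# Hypothetical toy graph with two separate components
G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("X", "Y")])

print(safe_shortest_path(G, "A", "C"))  # ['A', 'B', 'C']
print(safe_shortest_path(G, "A", "X"))  # None (different components)
print(safe_shortest_path(G, "A", "Z"))  # None ("Z" is not in the graph)
```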

4. Points to note in the code:

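1) itertuples(): iterates over the rows of a DataFrame as named tuples; the first element is the index, followed by the column values;
2) degree_sequence = sorted([d for n,d in G.degree()], reverse=True)  # sorts the node degrees in descending order
3) plt.loglog(degree_sequence, "b-", marker="o"): draws the degree sequence with logarithmic scaling on both axes. Because only one sequence is passed, it is used as the y values and the indices 0, 1, 2, ... are used as the x values.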