Node Classification on Academic Papers with GraphScope: A Hands-On Tutorial

Article link: https://blog.youkuaiyun.com/gitblog_00035/article/details/148577612


GraphScope: A One-Stop Large-Scale Graph Computing System from Alibaba. Project address: https://gitcode.com/gh_mirrors/gr/GraphScope

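Introduction

This article shows how to use GraphScope, a graph computing system, to perform node classification on an academic citation network (the ogbn-mag dataset). Through one end-to-end example, it combines three computation modes to solve a practical problem: graph analytics, interactive querying, and graph neural networks.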

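Environment Setup and Data Loading

First, install the GraphScope package and import the required modules:

# Install graphscope (the leading "!" is notebook syntax; drop it in a regular shell)
!pip3 install graphscope

# Import the graphscope module
import graphscope
graphscope.set_option(show_log=False)  # turn off log output

# Load the ogbn-mag dataset
from graphscope.dataset import load_ogbn_mag
graph = load_ogbn_mag()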

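The ogbn-mag dataset is a heterogeneous network extracted from the Microsoft Academic Graph. It contains four types of nodes:

Paper (paper)
Author (author)
Institution (institution)
Field of study (field of study)

It also contains four types of relations between these nodes. The task is to predict the class of each paper (349 classes in total, each representing a conference or journal).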
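To make the schema concrete, here is a minimal sketch that writes the node types and relations down as plain Python data. The relation names follow the OGB description of ogbn-mag and are an assumption as far as GraphScope is concerned: the edge labels inside the loaded graph may be spelled differently. `relations_touching` is a hypothetical helper, not a GraphScope function.

```python
# Illustrative schema of ogbn-mag as plain Python data (not a GraphScope API).
# Relation spellings follow the OGB dataset description; labels in a loaded
# GraphScope graph may differ.
NODE_TYPES = ["paper", "author", "institution", "field_of_study"]

# Each entry is (source type, relation, destination type).
RELATIONS = [
    ("author", "affiliated_with", "institution"),
    ("author", "writes", "paper"),
    ("paper", "cites", "paper"),
    ("paper", "has_topic", "field_of_study"),
]

NUM_CLASSES = 349  # venues (conferences or journals) to predict for each paper


def relations_touching(node_type):
    """Return every relation whose source or destination is node_type."""
    return [r for r in RELATIONS if node_type in (r[0], r[2])]


paper_relations = relations_touching("paper")
for src, rel, dst in paper_relations:
    print(f"{src} -[{rel}]-> {dst}")
```

To see the labels actually used by the loaded graph, inspecting `graph.schema` should help.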

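Interactive Graph Queries

GraphScope supports the Gremlin query language, which we can use to explore the graph. For example, to count the papers co-authored by two specific authors (assumed here to have IDs 2 and 4307):

# Get the entry point for Gremlin queries
interactive = graphscope.gremlin(graph)

# Run the Gremlin query
papers = interactive.execute(
    "g.V().has('author', 'id', 2).out('writes').where(__.in('writes').has('id', 4307)).count()"
).one()
print("Number of papers co-authored by the two authors:", papers)

The query first finds the author with ID 2, traverses to every paper that author wrote, keeps only the papers that were also written by the author with ID 4307, and finally counts them.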
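The traversal logic can be mirrored in plain Python on a toy dataset. This is only an illustration of what the query computes, not how the Gremlin engine executes it; the paper IDs and the `count_coauthored` helper are made up for the example.

```python
# Toy data: author id -> set of paper ids reached through 'writes' edges.
# Only the two author ids (2 and 4307) come from the article; the rest is invented.
writes = {
    2: {"p1", "p2", "p3"},
    4307: {"p2", "p3", "p9"},
    17: {"p1"},
}


def count_coauthored(writes, author_a, author_b):
    """Count papers written by author_a that author_b also wrote."""
    papers_of_a = writes.get(author_a, set())  # .out('writes') from author_a
    papers_of_b = writes.get(author_b, set())
    # where(__.in('writes').has('id', author_b)) keeps papers that also
    # have an incoming 'writes' edge from author_b.
    return sum(1 for paper in papers_of_a if paper in papers_of_b)


shared = count_coauthored(writes, 2, 4307)
print(shared)  # 2 (p2 and p3)
```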

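Graph Analytics

Next, we run graph analytics to extract structural features:

First, extract the subgraph of papers published in the 2014-2020 window (the query uses Gremlin's inside(2014, 2020), which excludes both endpoints)
Project the subgraph to a simple graph (keeping only papers and citation edges)
Compute k-core and triangle counts as structural features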
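The year filter in the first step deserves a second look. Assuming standard TinkerPop semantics, `inside(low, high)` is exclusive at both ends, while `between(low, high)` includes the lower bound and excludes the upper one. The sketch below mimics both predicates in plain Python (these are stand-in functions, not the Gremlin API) to show which years each one matches.

```python
def inside(low, high):
    """Mimic Gremlin's inside(low, high): strictly between low and high."""
    return lambda value: low < value < high


def between(low, high):
    """Mimic Gremlin's between(low, high): low inclusive, high exclusive."""
    return lambda value: low <= value < high


years = range(2013, 2022)
matched_inside = [y for y in years if inside(2014, 2020)(y)]
matched_between = [y for y in years if between(2014, 2021)(y)]

print(matched_inside)   # [2015, 2016, 2017, 2018, 2019]
print(matched_between)  # [2014, 2015, 2016, 2017, 2018, 2019, 2020]
```

If the intent is to include papers from both 2014 and 2020, a predicate such as `between(2014, 2021)` would be one way to express it.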

# Extract the subgraph of papers in the year window
# (inside(2014, 2020) is exclusive at both ends)
sub_graph = interactive.subgraph("g.V().has('year', inside(2014, 2020)).outE('cites')")

# Project to a simple graph
simple_g = sub_graph.project(vertices={"paper": []}, edges={"cites": []})

# Compute k-core (k=5) and triangle counts
kc_result = graphscope.k_core(simple_g, k=5)
tc_result = graphscope.triangles(simple_g)

# Add the results to the graph as new features
sub_graph = sub_graph.add_column(kc_result, {"kcore": "r"})
sub_graph = sub_graph.add_column(tc_result, {"tc": "r"})

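k-core decomposition identifies the densely connected core of a network, while triangle counting reflects the degree of local clustering. Both are useful structural features.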
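For intuition, both measures can be computed by hand on a small graph. The following is a minimal pure-Python sketch, assuming an undirected graph stored as adjacency sets; it is not GraphScope's implementation, and the exact form of the values returned by `graphscope.k_core` and `graphscope.triangles` may differ from what is shown here.

```python
from itertools import combinations


def k_core_nodes(adj, k):
    """Return the vertices left after repeatedly removing vertices of degree < k."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            degree = sum(1 for u in adj[v] if u in alive)
            if degree < k:
                alive.discard(v)
                changed = True
    return alive


def triangle_counts(adj):
    """Return, for each vertex, the number of triangles it participates in."""
    counts = {}
    for v, nbrs in adj.items():
        # A triangle through v is a pair of v's neighbors that are adjacent.
        counts[v] = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
    return counts


# Toy graph: a triangle a-b-c with a tail c-d-e.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

core2 = k_core_nodes(adj, 2)
tri = triangle_counts(adj)
print(sorted(core2))  # ['a', 'b', 'c']
print(tri["c"], tri["d"])  # 1 0
```

Peeling away the tail leaves only the triangle in the 2-core, and only the triangle's vertices get a nonzero triangle count, which is why the two features together say something about how central and how clustered a paper is.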

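Graph Neural Network Training

Finally, we use a GraphSAGE model for node classification. The model combines the original features (128-dimensional word2vec vectors) with the newly added structural features (k-core and triangle count):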

# Define the feature columns
paper_features = [f"feat_{i}" for i in range(128)] + ["kcore", "tc"]

# Prepare the learning engine
lg = graphscope.graphlearn(
    sub_graph,
    nodes=[("paper", paper_features)],
    edges=[("paper", "cites", "paper")],
    gen_labels=[
        ("train", "paper", 100, (0, 75)),  # 75% training set
        ("val", "paper", 100, (75, 85)),   # 10% validation set
        ("test", "paper", 100, (85, 100)), # 15% test set
    ],
)

# Define the GraphSAGE training procedure
def train_sage(graph, node_type, edge_type, class_num, features_num,
              hops_num=2, nbrs_num=[25, 10], epochs=2,
              hidden_dim=256, in_drop_rate=0.5, learning_rate=0.01):
    # ... model definition and training code ...
    pass

# Run training
train_sage(lg, node_type="paper", edge_type="cites",
          class_num=349,      # output dimension (349 classes)
          features_num=130,   # input dimension (128 + 2)
)
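The body of `train_sage` is elided above. As a rough illustration of what the `hops_num` and `nbrs_num` parameters control, here is a minimal pure-Python sketch of GraphSAGE-style neighbor sampling with mean aggregation. It is an assumption-laden toy: there are no learned weights, no nonlinearity, and no training loop, sampling is done with replacement, and all function names and data are hypothetical rather than part of GraphScope or graph-learn.

```python
import random


def sample_neighbors(adj, node, num_samples, rng):
    """Sample a fixed number of neighbors with replacement.

    Nodes without neighbors fall back to sampling themselves.
    """
    nbrs = adj.get(node) or [node]
    return [rng.choice(nbrs) for _ in range(num_samples)]


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sage_embed(adj, feats, node, fanouts, rng):
    """Concatenate a node's representation with the mean of its sampled
    neighbors' representations, recursing once per entry in fanouts."""
    if not fanouts:
        return list(feats[node])
    own = sage_embed(adj, feats, node, fanouts[1:], rng)
    sampled = sample_neighbors(adj, node, fanouts[0], rng)
    nbr_repr = mean_vector(
        [sage_embed(adj, feats, n, fanouts[1:], rng) for n in sampled]
    )
    return own + nbr_repr


# Toy citation graph: node -> list of neighbors (made-up data).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [0.5, 0.5]}

rng = random.Random(0)
emb = sage_embed(adj, feats, 0, fanouts=[3, 2], rng=rng)
print(len(emb))  # 8: each hop doubles the width, 2 -> 4 -> 8
```

In this sketch `fanouts=[3, 2]` plays the role that `nbrs_num=[25, 10]` plays in the call above: the first entry is the number of neighbors sampled around the target node, the second the number sampled around each of those neighbors. A real model would apply a learned linear layer after each concatenation to keep the width at `hidden_dim`.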

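GraphSAGE is an inductive graph neural network that generates node embeddings by sampling and aggregating information from neighbors. In this example, we set: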