[Paper Close Reading] Towards Deeper Graph Neural Networks

Paper link: Towards Deeper Graph Neural Networks | Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

The English parts are typed entirely by hand, summarizing and paraphrasing the original paper. Spelling and grammar mistakes are hard to avoid; if you spot any, feel free to point them out in the comments. This post reads more like personal notes, so take it with a grain of salt.

Contents

1. TL;DR

1.1. Thoughts

2. Section-by-section close reading

2.1. Abstract

2.2. Introduction

2.3. Background and related works

2.3.1. Graph convolution operations

2.3.2. Related works

2.4. Empirical and theoretical analysis of deep GNNs

2.4.1. Quantitative metric for smoothness

2.4.2. Why deeper GNNs fail?

2.4.3. Theoretical analysis of very deep models

2.5. Deep adaptive graph neural network

2.6. Experimental studies

2.6.1. Datasets and setup

2.6.2. Overall results

2.6.3. Training set sizes

2.6.4. Model Depths

2.7. Conclusion

3. Reference


1. TL;DR

1.1. Thoughts

(1) Hard to comment on; honestly, this feels like a fairly simple idea to me.

2. Section-by-section close reading

2.1. Abstract

        ①They propose the Deep Adaptive Graph Neural Network (DAGNN) to address the over-smoothing problem that arises when a GNN becomes too deep

2.2. Introduction

        ①DAGNN adaptively integrates information from different propagation depths (receptive fields) to learn node representations

2.3. Background and related works

        ①Graph representation: G=\left ( V, E \right )

        ②V is the node set with n=\left | V \right | nodes, and E is the edge set with m=\left | E \right | edges

        ③Unweighted adjacency matrix A \in \mathbb{R}^{n \times n} (entries are 0 or 1 only)

        ④Degree matrix D\in\mathbb{R}^{n\times n} and D_{(i,i)}=\sum_{j}A_{(i,j)}

        ⑤The set of neighbors of node i is denoted N_i

        ⑥Node feature matrix X\in\mathbb{R}^{n\times d}

2.3.1. Graph convolution operations

        ①Traditional message passing:

\begin{aligned}&a_{i}^{(\ell)}=\mathrm{PROPAGATION}^{(\ell)}\left(\left\{\mathbf{x}_{i}^{(\ell-1)},\{\mathbf{x}_{j}^{(\ell-1)}|j\in\mathcal{N}_{i}\}\right\}\right)\\&x_{i}^{(\ell)}=\mathrm{TRANSFORMATION}^{(\ell)}\left(a_{i}^{(\ell)}\right).\end{aligned}
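
A minimal NumPy sketch of this two-step scheme (the function name, the mean aggregator, and ReLU are my own illustrative choices, not the paper's):

import numpy as np

def message_passing_layer(X, neighbors, W):
    """One generic GNN layer: PROPAGATION over {x_i} and {x_j : j in N_i}, then a shared TRANSFORMATION."""
    out = np.zeros(X.shape)
    for i in range(X.shape[0]):
        msgs = [X[i]] + [X[j] for j in neighbors[i]]   # node i together with its neighbors
        out[i] = np.mean(msgs, axis=0)                 # a_i = PROPAGATION(...), here a simple mean
    return np.maximum(out @ W, 0.0)                    # x_i = TRANSFORMATION(a_i), here ReLU(a_i W)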

        ②Convolution operator:

X^{(\ell)}=\sigma\left(\widehat{A}X^{(\ell-1)}W^{(\ell)}\right)

where {\widehat A}={\widetilde D}^{-{\frac{1}{2}}}{\widetilde A}{\widetilde D}^{-{\frac{1}{2}}}, \tilde{A}=A+I, and \widetilde{D} is the diagonal degree matrix of \widetilde{A}
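
A minimal NumPy sketch of this convolution operator (dense matrices for clarity, \sigma taken as ReLU; the function name is mine):

import numpy as np

def gcn_layer(A, X, W):
    """X^(l) = sigma(A_hat X^(l-1) W) with A_hat = D~^{-1/2} A~ D~^{-1/2} and A~ = A + I."""
    A_tilde = A + np.eye(A.shape[0])              # add self-loops
    d = A_tilde.sum(axis=1)                       # degrees of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))        # D~^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # symmetric normalization
    return np.maximum(A_hat @ X @ W, 0.0)         # sigma = ReLU here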

2.3.2. Related works

        ①Stacking many layers can make the representations of nodes in different classes indistinguishable

        ②They list some models designed to capture long-range (multi-hop) information

 demystify  v. to make less mysterious; to clarify; to enlighten

2.4. Empirical and theoretical analysis of deep GNNs

2.4.1. Quantitative metric for smoothness

        ①Distance metric between node i and node j (Euclidean distance between their normalized representations):

D(x_{i},x_{j})=\frac{1}{2}\left\|\frac{x_{i}}{\|x_{i}\|}-\frac{x_{j}}{\|x_{j}\|}\right\|

where \left \|\cdot \right \| denotes the Euclidean norm

        ②Smoothness metric:

SMV_i=\frac{1}{n-1}\sum_{j\in V,j\neq i}D(x_i,x_j)

Why do the authors think leaf nodes have larger smoothness metric values? Isn't this n the total number of nodes? Leaf nodes have fewer neighbors, so shouldn't their smoothness metric value be smaller? Or does n here denote the number of neighbors? I don't really think it is the latter.

        ③Smoothness metric of the whole graph G:

SMV_{G}=\frac{1}{n}\sum_{i\in V}SMV_{i}
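
A small NumPy sketch of both metrics as defined above (the function name is mine; the pairwise distance D and the node/graph averages follow the formulas directly):

import numpy as np

def smoothness_metric(X, eps=1e-12):
    """Returns per-node SMV_i and the graph-level SMV_G."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)            # x_i / ||x_i||
    D = 0.5 * np.linalg.norm(Xn[:, None, :] - Xn[None, :, :], axis=2)    # D(x_i, x_j)
    n = X.shape[0]
    smv_node = D.sum(axis=1) / (n - 1)   # the j = i term is 0, so this effectively sums over j != i
    smv_graph = smv_node.mean()          # SMV_G: average over all nodes
    return smv_node, smv_graph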

 periphery  n. the outer edge or boundary; circumference; the lateral surface of a cylinder

2.4.2. Why deeper GNNs fail?

        ①Datasets: Cora, CiteSeer and PubMed

        ②t-SNE visualization on different layers on Cora:

        ③Accuracy on Cora with different layers:

        ④Over-smoothing mainly occurs in sparse graphs

        ⑤Decoupling transformation from propagation:

Z=\mathrm{MLP}\left(X\right)\\X_{out}=\mathrm{softmax}\left(\widehat{A}^{k}Z\right)
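
A short sketch of this decoupled scheme, assuming A_hat is the normalized adjacency defined earlier and mlp is any feature-transformation callable (names are mine):

import numpy as np

def decoupled_forward(A_hat, X, mlp, k):
    """Z = MLP(X), then X_out = softmax(A_hat^k Z); the propagation adds no extra parameters."""
    Z = mlp(X)                                 # feature transformation, no graph involved
    H = Z
    for _ in range(k):                         # propagate k hops
        H = A_hat @ H
    H = H - H.max(axis=1, keepdims=True)       # numerically stable softmax over classes
    expH = np.exp(H)
    return expH / expH.sum(axis=1, keepdims=True)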

        ⑥Statistics of datasets, where the edge density is calculated by \frac{2m}{n^2}:

        ⑦Results with deeper models on Cora:

        ⑧Accuracy and smoothness on Cora:

2.4.3. Theoretical analysis of very deep models

        ①Two propagation mechanisms:

\hat{A}_{\oplus}=\tilde{D}^{-1}\tilde{A}

used mainly in GraphSAGE and DGCNN, and 

\widehat{A}_{\odot}=\widetilde{D}^{-\frac{1}{2}}\widetilde{A}\widetilde{D}^{-\frac{1}{2}}

in GCN
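
For comparison, a small sketch that builds both normalized matrices from a raw adjacency matrix (self-loops added as \widetilde{A}=A+I; the function name is mine):

import numpy as np

def normalized_adjacencies(A):
    """Row-stochastic D~^{-1} A~ (GraphSAGE/DGCNN style) vs. symmetric D~^{-1/2} A~ D~^{-1/2} (GCN)."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    A_rw = A_tilde / d[:, None]                                      # D~^{-1} A~
    A_sym = A_tilde * (d ** -0.5)[:, None] * (d ** -0.5)[None, :]    # D~^{-1/2} A~ D~^{-1/2}
    return A_rw, A_sym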

        ②They define \Psi(x)=\frac{x}{\mathrm{sum}(x)}, \Phi(x)=\frac{x}{\|x\|}, and e=[1,1,\cdots,1]\in\mathbb{R}^{1\times n}

        ③They introduce two theorems to prove the convergence of repeated propagation:

        ④The remaining theorems and proofs are omitted here

2.5. Deep adaptive graph neural network

        ①Steps of DAGNN:

\begin{aligned}&Z=\mathrm{MLP}\left(X\right)&&\in\mathbb{R}^{n\times c}\\&H_{\ell}=\widehat{A}^{\ell}Z,\ell=1,2,\cdots,k&&\in\mathbb{R}^{n\times c}\\&H=\mathrm{stack}\left(Z,H_{1},\cdots,H_{k}\right)&&\in\mathbb{R}^{n\times(k+1)\times c}\\&S=\sigma\left(Hs\right)&&\in\mathbb{R}^{n\times(k+1)\times1}\\&\widetilde{S}=\mathrm{reshape}\left(S\right)&&\in\mathbb{R}^{n\times1\times(k+1)}\\&X_{out}=\mathrm{softmax}\left(\mathrm{squeeze}\left(\widetilde{S}H\right)\right)&&\in\mathbb{R}^{n\times c},\end{aligned}

where c is the number of node classes, Z \in \mathbb{R}^{n \times c} is the transformed feature matrix, s\in\mathbb{R}^{c\times1} is a trainable projection vector, \sigma is set to the sigmoid function, and k is the maximum number of propagation hops (model depth)
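
A compact NumPy sketch of these steps (mlp and the variable names are placeholders; dropout and other training details are omitted):

import numpy as np

def dagnn_forward(A_hat, X, mlp, s, k):
    """DAGNN forward pass: mlp maps X to Z in R^{n x c}; s is the trainable projection vector of shape (c,)."""
    Z = mlp(X)                                   # Z = MLP(X)
    hops = [Z]
    for _ in range(k):                           # H_l = A_hat^l Z, l = 1..k
        hops.append(A_hat @ hops[-1])
    H = np.stack(hops, axis=1)                   # n x (k+1) x c
    S = 1.0 / (1.0 + np.exp(-(H @ s)))           # S = sigmoid(H s): retention score per hop, n x (k+1)
    out = np.einsum('nk,nkc->nc', S, H)          # squeeze(S~ H): adaptively weighted sum over hops
    out = out - out.max(axis=1, keepdims=True)   # softmax over the c classes
    expO = np.exp(out)
    return expO / expO.sum(axis=1, keepdims=True)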

        ②Since the appropriate number of hops is hard to fix in advance, they design an adaptive scoring mechanism based on the trainable projection vector

        ③⭐There is no extra fully connected classification layer at the end; X_{out} is used directly as the prediction

        ④The loss function can be:

L=-\sum_{i\in V_L}\sum_{p=1}^cY_{[i,p]}\ln X_{out[i,p]}

where V_L denotes the set of labeled nodes and Y\in\mathbb{R}^{n\times c} stores the ground-truth (one-hot) labels
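
A sketch of this masked cross-entropy loss, assuming one-hot labels Y and an index array for V_L (names are mine):

import numpy as np

def masked_cross_entropy(X_out, Y, labeled_idx, eps=1e-12):
    """L = - sum over i in V_L and p of Y[i, p] * ln X_out[i, p]."""
    return -np.sum(Y[labeled_idx] * np.log(X_out[labeled_idx] + eps))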

        ⑤Workflow of DAGNN:

(To be honest, why does this feel so hand-wavy to me? After all, each step incorporates all of the earlier hops, so by that logic one could simply assign lower weights to the deeper hops... since they are over-smoothed... still, it is better than nothing.)

2.6. Experimental studies

2.6.1. Datasets and setup

        ①Datasets: Cora, CiteSeer, PubMed, Coauthor CS, Coauthor Physics, Amazon Computers, Amazon Photo:

        ②Baselines: Logistic Regression (LogReg), Multilayer Perceptron (MLP), Label Propagation (LabelProp), Normalized Laplacian Label Propagation (LabelProp NL), ChebNet, Graph Convolutional Network (GCN), Graph Attention Network (GAT), Mixture Model Network (MoNet), GraphSAGE, APPNP, SGC

        ③Grid search over hyperparameters: k \in \left \{ 5,10,20 \right \}, weight\, decay \in \left \{ 0,2e-2,5e-3,5e-4,5e-5 \right \}, dropout\, rate \in \left \{ 0.5,0.8 \right \}
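
A tiny sketch that just enumerates this grid (the dictionary keys are my own naming); each of the 30 settings would then be evaluated on the validation set:

from itertools import product

search_space = {
    "k": [5, 10, 20],
    "weight_decay": [0, 2e-2, 5e-3, 5e-4, 5e-5],
    "dropout": [0.5, 0.8],
}
configs = [dict(zip(search_space, v)) for v in product(*search_space.values())]
print(len(configs))  # 30 candidate hyperparameter settings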

2.6.2. Overall results

        ①Uniform setting on Cora, CiteSeer, and PubMed: 20 labeled nodes per class for training, 500 nodes for validation, and 1000 nodes for testing. On the co-authorship and co-purchase datasets: 20 labeled nodes per class for training, 30 nodes per class for validation, and the rest for testing

        ②100 runs each for the fixed split and the random splits

        ③Performance comparison table on Cora, CiteSeer and PubMed:

        ④Performance comparison table on co-authorship and co-purchase:

2.6.3. Training set sizes

        ①They fix the number of propagation layers/hops in APPNP, SGC, and DAGNN to 10 and evaluate them on the relatively large dataset Cora under different training set sizes:

2.6.4. Model Depths

        ①To test how depths/hops influence smoothness:

        ②Fewer hops should be used for denser graphs

2.7. Conclusion

        Honestly, I don't quite get the "decoupling" the authors keep emphasizing. What exactly is decoupled? It feels just like an attention or scoring mechanism to me.

3. Reference

Liu, M., Gao, H. & Ji, S. (2020) 'Towards Deeper Graph Neural Networks', in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '20), pp. 338-348.
