DynWalks: Global Topology and Recent Changes Awareness Dynamic Network Embedding was published in 2019 and proposes an algorithm for embedding the nodes of real-world, dynamically evolving networks. Paper: https://arxiv.org/pdf/1907.11968.pdf.
This post does not walk through the algorithm itself in detail; it focuses on reading the source code: https://github.com/houchengbin/DynWalks
Contents
1. The argument parser in src/main.py
def parse_args():
    parser = ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatter, conflict_handler='resolve')
    """
    formatter_class: HelpFormatter class used to format and print help messages
    conflict_handler: string indicating how conflicting optional arguments are resolved
    """
    # ----------------------------------------- general settings --------------------------------------------------
    """
    '--graph': optional argument named graph
    default: default value of the '--graph' argument
    help: help text for the argument; argparse.SUPPRESS hides the argument from the help message
    """
    parser.add_argument('--graph', default='data/cora/cora_dyn_graphs.pkl',
                        help='graph/network')
    """
    label: node labels
    """
    parser.add_argument('--label', default='data/cora/cora_node_label_dict.pkl',
                        help='node label')
    """
    emb-dim: dimensionality of the node embeddings
    """
    parser.add_argument('--emb-dim', default=128, type=int,
                        help='node embeddings dimensions')
    """
    task: downstream task(s) to run
    default: defaults to "all"
    choices: container of allowed values; the argument must be one of them
        lp:   Link Prediction
        gr:   Graph Reconstruction
        nc:   Node Classification
        all:  run all tasks
        save: save the embedding results
    """
    parser.add_argument('--task', default='all', choices=['lp', 'gr', 'nc', 'all', 'save'],
                        help='choices of downstream tasks: lp, gr, nc, all, save')
    """
    emb-file: output file for the node embeddings
    """
    parser.add_argument('--emb-file', default='output/cora_DynWalks_128_embs.pkl',
                        help='node embeddings; suggest: data_method_dim_embs.pkl')
    # ------------------------------------------- method settings -----------------------------------------------------------
    """
    method: network embedding method
    """
    parser.add_argument('--method', default='DynWalks', choices=['DynWalks', 'DeepWalk', 'GraRep', 'HOPE'],
                        help='choices of Network Embedding methods')
    """
    limit: fraction of nodes whose embeddings are updated at each time step, the alpha parameter in the paper
    """
    parser.add_argument('--limit', default=0.1, type=float,
                        help='the limit of nodes to be updated at each time step i.e. $\alpha$ in our paper')
    """
    local-global: factor balancing local changes against global topology, the beta parameter in the paper
    """
    parser.add_argument('--local-global', default=0.5, type=float,
                        help='balancing factor for local changes and global topology; ranging [0.0, 1.0] i.e. $\beta$ in our paper')
    """
    scheme: how to select the nodes whose embeddings are updated at each time step
        scheme 1: newly appeared nodes + most affected nodes
        scheme 2: newly appeared nodes + randomly chosen nodes
        scheme 3: newly appeared nodes + most affected nodes + randomly chosen nodes
    """
    parser.add_argument('--scheme', default=3, type=int,
                        help='scheme 1: new + most affected; scheme 2: new + random; scheme 3: new + most affected + random')
    # walk based methods
    """
    num-walks: number of random walks started from each node, the r parameter in the paper
    """
    parser.add_argument('--num-walks', default=20, type=int,
                        help='# of random walks of each node')
    """
    walk-length: length of each random walk, the l parameter in the paper
    """
    parser.add_argument('--walk-length', default=80, type=int,
                        help='length of each random walk')
    # gensim word2vec parameters
    """
    window: window size of the SGNS model, the w parameter in the paper
    """
    parser.add_argument('--window', default=10, type=int,
                        help='window size of SGNS model')
    """
    negative: number of negative samples of the SGNS model, the m parameter in the paper
    """
    parser.add_argument('--negative', default=5, type=int,
                        help='negative samples of SGNS model')
    """
    workers: number of parallel worker processes
    """
    parser.add_argument('--workers', default=24, type=int,
                        help='# of parallel processes.')
    """
    seed: random seed
    """
    parser.add_argument('--seed', default=2019, type=int,
                        help='random seed')
    # other methods
    """
    Kstep: transition-step parameter of the GraRep model; raises an error unless emb_dim % Kstep == 0
    """
    parser.add_argument('--Kstep', default=4, type=int,
                        help='Kstep used in GraRep model, error if not emb_dim % Kstep == 0')
    args = parser.parse_args()
    return args
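As a quick sanity check of how this parser behaves, argparse can be driven programmatically by passing an argument list to parse_args. A minimal self-contained sketch reproducing three of the options above (trimmed for brevity, not the repo's full parser):

```python
from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter

# A trimmed-down version of the parser above, kept to three options for brevity
parser = ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatter, conflict_handler='resolve')
parser.add_argument('--method', default='DynWalks', choices=['DynWalks', 'DeepWalk', 'GraRep', 'HOPE'])
parser.add_argument('--limit', default=0.1, type=float)
parser.add_argument('--scheme', default=3, type=int)

# Equivalent to: python src/main.py --method DynWalks --limit 0.2
args = parser.parse_args(['--method', 'DynWalks', '--limit', '0.2'])
print(args.method, args.limit, args.scheme)   # DynWalks 0.2 3
```

Note that argparse maps a hyphenated option such as `--emb-dim` to the attribute `args.emb_dim`, which is why the code in main() accesses the underscored names.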
2. Node embedding
2.1 Preparing the data
src/main.py main()
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
print(f'Summary of all settings: {args}')  # print the configured arguments
# ---------------------------------------- STEP1: prepare data -------------------------------------------------------
print('\nSTEP1: start loading data......')  # load the data
t1 = time.time()  # current timestamp
G_dynamic = load_any_obj_pkl(args.graph)  # load the dynamic graph data
t2 = time.time()
print(f'STEP1: end loading data; time cost: {(t2-t1):.2f}s')  # loading finished; print the elapsed time
The dynamic graph here is a sequence of networkx graphs, one snapshot per time step, stored together in a pkl (pickle) file.
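To make the expected input format concrete, here is a minimal sketch (my own illustration, not the repo's code) that builds a toy two-snapshot dynamic graph, pickles it, and loads it back; `load_any_obj_pkl` presumably reduces to a plain unpickling like this, and the file name `toy_dyn_graphs.pkl` is made up:

```python
import pickle
import networkx as nx

# The expected input: a list of networkx graphs, one snapshot per time step
G0 = nx.Graph([(0, 1), (1, 2)])            # snapshot at t=0
G1 = nx.Graph([(0, 1), (1, 2), (2, 3)])    # snapshot at t=1: node 3 appears
G_dynamic = [G0, G1]

# Pickle all snapshots into one file, as in data/cora/cora_dyn_graphs.pkl
with open('toy_dyn_graphs.pkl', 'wb') as f:
    pickle.dump(G_dynamic, f)

# load_any_obj_pkl presumably boils down to an unpickling like this
with open('toy_dyn_graphs.pkl', 'rb') as f:
    G_loaded = pickle.load(f)

print(len(G_loaded), G_loaded[-1].number_of_nodes())  # 2 snapshots, 4 nodes at t_last
```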
2.2 Learning the embeddings
src/main.py main()
print('\nSTEP2: start learning embeddings......')  # start learning the embeddings
# print: 1. the embedding method used; 2. the number of snapshots (time steps) in the dynamic graph;
#        3. the numbers of nodes and edges of the first and last snapshots
print(f'The model used: {args.method} -------------------- \
\nThe # of dynamic graphs: {len(G_dynamic)}; \
\nThe # of nodes @t_init: {nx.number_of_nodes(G_dynamic[0])}, and @t_last {nx.number_of_nodes(G_dynamic[-1])} \
\nThe # of edges @t_init: {nx.number_of_edges(G_dynamic[0])}, and @t_last {nx.number_of_edges(G_dynamic[-1])}')
t1 = time.time()
model = None
if args.method == 'DynWalks':  # embed with DynWalks
    from libne import DynWalks
    # pass in the arguments
    model = DynWalks.DynWalks(G_dynamic=G_dynamic, limit=args.limit, local_global=args.local_global,
                              num_walks=args.num_walks, walk_length=args.walk_length, window=args.window,
                              emb_dim=args.emb_dim, negative=args.negative, workers=args.workers, seed=args.seed, scheme=args.scheme)
    # start training
    model.sampling_traning()
elif args.method == 'DeepWalk':  # embed with DeepWalk
    from libne import DeepWalk
    model = DeepWalk.DeepWalk(G_dynamic=G_dynamic, num_walks=args.num_walks, walk_length=args.walk_length, window=args.window,
                              negative=args.negative, emb_dim=args.emb_dim, workers=args.workers, seed=args.seed)
    model.sampling_traning()
elif args.method == 'GraRep':  # embed with GraRep
    from libne import GraRep
    model = GraRep.GraRep(G_dynamic=G_dynamic, emb_dim=args.emb_dim, Kstep=args.Kstep)
    model.traning()
elif args.method == 'HOPE':  # embed with HOPE
    from libne import HOPE
    model = HOPE.HOPE(G_dynamic=G_dynamic, emb_dim=args.emb_dim)
    model.traning()
else:
    print('method not found...')  # unknown embedding method
    exit(0)
t2 = time.time()
print(f'STEP3: end learning embeddings; time cost: {(t2-t1):.2f}s')  # print the time spent learning the embeddings
I have mainly studied the DynWalks algorithm, so for now only the DynWalks embedding path is annotated; comments for the other methods may be added later.
When args.method == 'DynWalks', the DynWalks class in the libne package is used:
src/libne/DynWalks.py
class DynWalks(object):
    def __init__(self, G_dynamic, limit, local_global, num_walks, walk_length, window,
                 emb_dim, negative, workers, seed, scheme):
        self.G_dynamic = G_dynamic.copy()  # a series of dynamic graphs
        self.emb_dim = emb_dim             # node embedding dimensionality
        self.num_walks = num_walks         # number of walks started from each node
        self.walk_length = walk_length     # length of each random walk
        self.window = window               # Skip-Gram parameter: window size
        self.workers = workers             # Skip-Gram parameter: number of worker processes
        self.negative = negative           # Skip-Gram parameter: number of negative samples
        self.seed = seed                   # Skip-Gram parameter: random seed
        self.scheme = scheme               # node-selection scheme
        self.limit = limit                 # fraction of nodes to select
        self.local_global = local_global   # balancing factor for local changes and global topology
        self.emb_dicts = []                # emb_dict @ t0, t1, ...; len(self.emb_dicts) == len(self.G_dynamic)
        self.reservoir = {}                # {nodeID: num of times affected, ...}
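The reservoir is the bookkeeping structure behind "most affected" node selection: roughly, whenever edges incident to an existing node change between two consecutive snapshots, that node's counter grows, and nodes with the highest counts are preferred for re-embedding. A rough sketch of that idea (my own illustration, not the repo's exact code; the helper name `update_reservoir` is made up):

```python
import networkx as nx

def update_reservoir(reservoir, G_prev, G_curr):
    """Toy illustration of the 'most affected' bookkeeping: count, per
    already-existing node, how many of its incident edges changed between
    two consecutive snapshots."""
    prev_edges = {tuple(sorted(e)) for e in G_prev.edges()}
    curr_edges = {tuple(sorted(e)) for e in G_curr.edges()}
    for u, v in prev_edges ^ curr_edges:      # edges added or deleted
        for node in (u, v):
            if node in G_prev:                # brand-new nodes are handled separately
                reservoir[node] = reservoir.get(node, 0) + 1
    return reservoir

G0 = nx.Graph([(0, 1), (1, 2)])
G1 = nx.Graph([(0, 1), (1, 2), (1, 3), (2, 3)])   # node 3 appears with two new edges
reservoir = update_reservoir({}, G0, G1)
most_affected = sorted(reservoir, key=reservoir.get, reverse=True)
print(reservoir)   # nodes 1 and 2 are each touched by one changed edge
```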
    def sampling_traning(self):
        # SGNS and suggested parameters to be tuned: size, window, negative, workers, seed
        # to tune other parameters, please read https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec
        # construct a gensim Word2Vec model (no corpus yet) with the SGNS settings
        w2v = gensim.models.Word2Vec(sentences=None, size=self.emb_dim, window=self.window, sg=1, hs=0, negative=self.negative, ns_exponent=0.75,
                                     alpha=0.025, min_alpha=0.0001, min_count=1, sample=0.001, iter=4, workers=self.workers, seed=self.seed,
                                     corpus_file=None, sorted_vocab=1, batch_words=10000, compute_loss=False,
                                     max_vocab_size=None, max_final_vocab=None, trim_rule=None)  # w2v constructor, default parameters
        for t in range(len(self.G_dynamic)):  # iterate over the snapshots, one per time step
            t1 = time.time()
            if t == 0:  # offline ---------------------------- at time step 0, run random walks over all nodes
                G0 = self.G_dynamic[t]  # the initial graph, i.e. the snapshot at time 0
                # simulate the random walks; each walk is treated as a sentence
                sentences = simulate_walks(nx_graph=G0, num_walks=self.num_walks, walk_length=self.walk_length)
                # convert the node IDs in the sentences to strings
                sentences = [[str(j) for j in i] for i in sentences]
                # build the Word2Vec vocabulary from the random walks (the listing is truncated here;
                # in the source this appears to continue with w2v.build_vocab and then w2v.train)
                w2v.build_vocab(sentences=sentences, update=False)
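`simulate_walks` is defined elsewhere in DynWalks.py; for intuition, here is a minimal uniform random walker that produces the same kind of "sentences" (my own sketch, not the repo's implementation; the function name `simulate_walks_sketch` is made up):

```python
import random
import networkx as nx

def simulate_walks_sketch(nx_graph, num_walks, walk_length, seed=2019):
    """Sketch of truncated uniform random walks: num_walks passes over all
    nodes, one walk of up to walk_length nodes starting from each node."""
    rng = random.Random(seed)
    walks = []
    nodes = list(nx_graph.nodes())
    for _ in range(num_walks):
        rng.shuffle(nodes)                      # fresh node order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(nx_graph.neighbors(walk[-1]))
                if not neighbors:               # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

G0 = nx.karate_club_graph()                     # 34 nodes
sentences = simulate_walks_sketch(G0, num_walks=2, walk_length=10)
sentences = [[str(j) for j in walk] for walk in sentences]   # as fed to Word2Vec
print(len(sentences))   # 2 walks per node -> 68 sentences
```

These string sentences are what build_vocab consumes; at later time steps DynWalks only re-walks the selected nodes and, presumably, extends the vocabulary incrementally (gensim's build_vocab accepts update=True for exactly this) before calling w2v.train again.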
