My complex networks course requires using BigClam for community detection.
from karateclub import BigClam
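The instructor had already given us the import, but at first I didn't really know how to use it. Luckily a classmate found a similar call, so I gave it a try.
# G is an undirected graph that has already been built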
model = BigClam()
model.fit(G)
Big_ms = model.get_memberships()
This raised an error, so I went and looked at the source code:
karateclub.community_detection.overlapping.bigclam — karateclub documentation
def fit(self, graph: nx.classes.graph.Graph):
    """
    Fitting a BigClam clustering model.
    Arg types:
        * **graph** *(NetworkX graph)* - The graph to be clustered.
    """
    self._set_seed()
    graph = self._check_graph(graph)
    number_of_nodes = graph.number_of_nodes()
    self._initialize_features(number_of_nodes)
    nodes = [node for node in graph.nodes()]
    for i in range(self.iterations):
        random.shuffle(nodes)
        for node in nodes:
            nebs = [neb for neb in graph.neighbors(node)]
            neb_features = self._embedding[nebs, :]
            node_feature = self._embedding[node, :]
            gradient = self._calculate_gradient(node_feature, neb_features)
            self._do_updates(node, gradient, node_feature)
It's clear enough: BigClam inherits from Estimator, and the error was raised during the call to _check_graph(), so next I looked at the Estimator source: karateclub/estimator.py at master · benedekrozemberczki/karateclub · GitHub
def _check_indexing(graph: nx.classes.graph.Graph):
    """Checking the consecutive numeric indexing."""
    numeric_indices = [index for index in range(graph.number_of_nodes())]
    node_indices = sorted([node for node in graph.nodes()])
    assert numeric_indices == node_indices, "The node indexing is wrong."

def _check_graph(self, graph: nx.classes.graph.Graph) -> nx.classes.graph.Graph:
    """Check the Karate Club assumptions about the graph."""
    self._check_indexing(graph)
    graph = self._ensure_integrity(graph)
    return graph
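That makes the cause clear too: for my graph G, numeric_indices and node_indices do not match.
numeric_indices simply counts from 0 up to the number of nodes minus one, while node_indices is the sorted list of node names. I built G from a list produced by some data processing, where every node carries its own name, so of course the two don't match.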
numeric_indices = [index for index in range(G.number_of_nodes())]
node_indices = sorted([node for node in G.nodes()])
print(numeric_indices)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,...,2707]
print(node_indices)
# [35, 40, 114, 117, 128, 130, 164, 288, 424,434,...,1155073]
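To make the mismatch easy to reproduce without the full dataset, here is a minimal sketch that restates the same comparison as a standalone function. The name `has_consecutive_indexing` and the two toy graphs are made up for illustration; they are not part of karateclub.

```python
import networkx as nx


def has_consecutive_indexing(graph: nx.Graph) -> bool:
    """Return True if the node names are exactly 0, 1, ..., n-1."""
    numeric_indices = list(range(graph.number_of_nodes()))
    node_indices = sorted(graph.nodes())
    return numeric_indices == node_indices


# Toy graph whose node names are arbitrary IDs (like the paper IDs above)
G_named = nx.Graph([(35, 40), (40, 114)])
# Toy graph with the same shape, but named 0..n-1
G_indexed = nx.Graph([(0, 1), (1, 2)])

print(has_consecutive_indexing(G_named))    # False
print(has_consecutive_indexing(G_indexed))  # True
```

Running a check like this before calling fit() should show up front whether a graph will trip the assertion.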
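I then skimmed the rest of the source. Roughly speaking, as a preparation step it adds an edge from every node of G to itself. These self-loops are added by iterating over the node count (not over the node names), so each node's name has to equal its index in the sorted node list before anything else can work. The fit() code above also uses node names directly as row indices into self._embedding, which relies on the same assumption. (I don't know why it was written this way... it feels rather clumsy to me.)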
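A small sketch of what goes wrong if self-loops are added by position on a graph whose nodes have arbitrary names. It assumes the preparation step adds an edge (i, i) for every i in range(n), as described above; the toy graph is made up, and this is not the library's actual code.

```python
import networkx as nx

# Toy graph with arbitrary node names
G_named = nx.Graph([(35, 40), (40, 114)])

H = G_named.copy()
n = H.number_of_nodes()
# Add self-loops by position 0..n-1 rather than by node name
H.add_edges_from((i, i) for i in range(n))

# The loops land on brand-new nodes 0, 1, 2 instead of on 35, 40, 114
print(sorted(H.nodes()))
print(sorted(nx.selfloop_edges(H)))
print(H.has_edge(35, 35))
```

In this toy case the graph silently grows from 3 to 6 nodes and none of the original nodes gets a self-loop, which would explain why the library checks the naming first.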
There was no way around it, so I went back to the original data and generated a new graph:
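import networkx as nx
import numpy as np
import pandas as pd

# c1 is the cited paper and c2 is the citing paper, so each edge points c2 -> c1
df1 = pd.read_csv("cora.cites", header=None, names=["c1", "c2"], sep="\t")
arr = np.array(df1)[:, [1, 0]]  # swap columns so rows read (c2, c1); only an undirected graph is needed for now
edgeList = arr.tolist()
noteSet = set(df1["c1"]) | set(df1["c2"])  # union of the two columns = all node names
n_num = len(noteSet)  # number of nodes (assumed definition of n_num)
# BigClam needs node names 0..n_num-1, so the nodes have to be renamed
noteSet_bc = [index for index in range(n_num)]
numeric_indices = np.array(noteSet_bc)
node_indices = np.array(sorted(noteSet))
edgeList_bc = np.copy(arr)
for key, value in zip(node_indices, numeric_indices):
    # compare against the untouched arr so an already-renamed entry is never renamed twice
    edgeList_bc[arr == key] = value
# build the new graph for BigClam
G_bc = nx.Graph()
G_bc.add_nodes_from(noteSet_bc)
G_bc.add_edges_from(edgeList_bc.tolist())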
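As an aside, if the graph with the original node names already exists, NetworkX can do the renaming itself via `nx.convert_node_labels_to_integers`. The following is only a sketch on a made-up toy graph; the attribute name `orig_name` is my own choice, and I have not verified this route against the full Cora data.

```python
import networkx as nx

# Toy graph standing in for a graph built with the original node names
G = nx.Graph()
G.add_edges_from([(35, 40), (40, 114), (114, 35), (117, 128)])

# Rename nodes to 0..n-1 in sorted order and keep the old name as a node attribute
G_int = nx.convert_node_labels_to_integers(
    G, first_label=0, ordering="sorted", label_attribute="orig_name"
)

# Dictionary from new integer name back to the original name
new_to_old = nx.get_node_attributes(G_int, "orig_name")
print(sorted(G_int.nodes()))
print(new_to_old)
```

With `ordering="sorted"` the new name of each node is its position in the sorted list of old names, which is the same mapping as the loop over node_indices and numeric_indices.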
Running the BigClam code again on G_bc, it goes through.
model = BigClam()
model.fit(G_bc)
Big_ms = model.get_memberships()
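Since the model only ever sees the renamed nodes, the result usually has to be translated back to the original names. Below is a minimal sketch, assuming get_memberships() returns a dictionary from node index to community index; the `memberships_bc` values are made up and merely stand in for Big_ms, and the five node names are a toy sample.

```python
from collections import defaultdict

# Sorted original node names; position i is the name of renamed node i
node_indices = [35, 40, 114, 117, 128]

# Made-up stand-in for Big_ms: renamed node index -> community index
memberships_bc = {0: 1, 1: 1, 2: 0, 3: 2, 4: 2}

# Translate the keys back to the original node names
memberships = {node_indices[i]: c for i, c in memberships_bc.items()}

# Group the original node names by community
communities = defaultdict(list)
for name, community in memberships.items():
    communities[community].append(name)

print(memberships)
print(dict(communities))
```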