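Background
Neo4j provides the GDS (Graph Data Science) plugin, which implements many common graph algorithms directly on Neo4j. Our business case required three similarity algorithms, and GDS covers a good range of them. A similarity algorithm essentially returns triples of the form (from, to, similarity). However, calling the algorithm function directly yields results in which the same from/to combination appears more than once, as the following example shows.
First, the Cypher that creates the graph: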
CREATE (a1: `TypeA`)
CREATE (a2: `TypeA`)
CREATE (b1: `TypeB`)
CREATE (b2: `TypeB`)
CREATE (b3: `TypeB`)
CREATE (a1)-[:Like]->(b1)
CREATE (a1)-[:Like]->(b2)
CREATE (a2)-[:Like]->(b3)
CREATE (a2)-[:Like]->(b1)
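The resulting graph structure is as follows:
Compute the Jaccard similarity between all nodes of type TypeA, using the set of TypeB nodes that each TypeA node is connected to as the input set. The Cypher query is: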
MATCH (n: `TypeA`)-[:Like]->(m: `TypeB`)
WITH id(n) AS leftNodeId, collect(id(m)) AS leftVector
MATCH (n: `TypeA`)-[:Like]->(m: `TypeB`)
WHERE id(n) <> leftNodeId
WITH leftNodeId, leftVector, id(n) AS rightNodeId, collect(id(m)) AS rightVector
WITH leftNodeId AS from, rightNodeId AS to, gds.alpha.similarity.jaccard(leftVector, rightVector) AS similarity
RETURN from, to, similarity
ORDER BY similarity DESC LIMIT 10
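To see why every pair shows up twice, the pairing logic of the query above can be mimicked outside Neo4j. The following is a minimal Python sketch, not GDS code: the `likes` dictionary and the `jaccard` helper are illustrative assumptions that reproduce the sample graph, and node names stand in for the internal ids that `id(n)` would return.

```python
from itertools import permutations

# Sample graph from the CREATE statements: each TypeA node -> set of TypeB nodes it likes.
# Node names are used here instead of Neo4j's internal ids (an assumption for readability).
likes = {
    "a1": {"b1", "b2"},
    "a2": {"b3", "b1"},
}


def jaccard(left, right):
    """Jaccard similarity: |intersection| / |union| (0.0 if both sets are empty)."""
    union = left | right
    return len(left & right) / len(union) if union else 0.0


# Like the query (WHERE id(n) <> leftNodeId), pair every node with every
# *other* node, which produces both orderings of each pair.
rows = [
    (src, dst, jaccard(likes[src], likes[dst]))
    for src, dst in permutations(likes, 2)
]

for row in rows:
    print(row)
```

Both (a1, a2) and (a2, a1) come out with the same similarity of 1/3, since the two nodes share one neighbour (b1) out of three distinct neighbours in total.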
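Query result:
As the result shows, each from/to node pair appears twice. Similarity is always computed per node pair, and the pattern-matching part at the start of the query already yields both the (A, B) and the (B, A) ordering, so the algorithm computes the same value once for each ordering. If you want to rank by similarity and cap the number of returned rows, the LIMIT in the query above would also have to be adjusted to account for the duplicates. The fix is to combine the from and to fields into a single sorted list and then apply DISTINCT before returning. The Cypher statement is: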
MATCH (n: `TypeA`)-[:Like]->(m: `TypeB`)
WITH id(n) AS leftNodeId, collect(id(m)) AS leftVector
MATCH (n: `TypeA`)-[:Like]->(m: `TypeB`)
WHERE id(n) <> leftNodeId
WITH leftNodeId, leftVector, id(n) AS rightNodeId, collect(id(m)) AS rightVector
WITH leftNodeId, rightNodeId, gds.alpha.similarity.jaccard(leftVector, rightVector) AS similarity
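WITH [leftNodeId, rightNodeId] AS nodePair, similarity
// Put the smaller node id first so that (A, B) and (B, A) become the same list
WITH
  CASE
    WHEN nodePair[0] >= nodePair[1] THEN [nodePair[1], nodePair[0]]
    ELSE nodePair
  END AS nodePair, similarity
// Identical (nodePair, similarity) rows now collapse into one
WITH DISTINCT nodePair, similarity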
RETURN nodePair[0] AS from, nodePair[1] AS to, similarity
ORDER BY similarity DESC LIMIT 10
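The node ordering and the DISTINCT deduplication are done by the following part of the query. Because Jaccard similarity is symmetric, both orderings of a pair carry the same similarity value, so they collapse into a single row once the pair is sorted: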
WITH
  CASE
    WHEN nodePair[0] >= nodePair[1] THEN [nodePair[1], nodePair[0]]
    ELSE nodePair
  END AS nodePair, similarity
WITH DISTINCT nodePair, similarity
Final result:
Only one row remains for the node pair, so the deduplication works.
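The sort-then-DISTINCT idea does not depend on Cypher. As a rough illustration, here is a minimal Python sketch of the same steps. The `dedupe_pairs` helper and the ids 0 and 1 are hypothetical stand-ins for the query's rows and Neo4j's internal node ids, not part of GDS.

```python
# Rows as the un-deduplicated query would return them: (from, to, similarity).
# The ids 0 and 1 are made-up stand-ins for Neo4j's internal node ids.
rows = [
    (0, 1, 1 / 3),
    (1, 0, 1 / 3),
]


def dedupe_pairs(rows):
    """Keep one row per (unordered node pair, similarity), in first-seen order."""
    seen = set()
    result = []
    for src, dst, similarity in rows:
        # Canonical form: smaller id first, mirroring the CASE expression.
        pair = (min(src, dst), max(src, dst))
        key = (pair, similarity)  # same key as WITH DISTINCT nodePair, similarity
        if key not in seen:
            seen.add(key)
            result.append((pair[0], pair[1], similarity))
    return result


# Equivalent of ORDER BY similarity DESC LIMIT 10, applied after deduplication.
unique_rows = sorted(dedupe_pairs(rows), key=lambda r: r[2], reverse=True)[:10]
print(unique_rows)
```

Since deduplication happens before the limit is applied, the limit counts distinct pairs. One possible alternative, not used in the original query, would be to generate each pair only once up front by replacing the `<>` comparison in the WHERE clause with `<`, which would make the sorting and DISTINCT steps unnecessary.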