15. Graph Algorithms for Machine Learning: Link Prediction in Practice

emacs5lisp

Published 2025-11-08 09:14:02

License: CC 4.0 BY-SA

Column: Graph Algorithms in Practice: The Power of Connected Data
Tags: Neo4j, graph algorithms, link prediction

Article link: https://blog.youkuaiyun.com/emacs5lisp/article/details/154593820

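1. Connecting to the Neo4j Database

First, we create a connection to the Neo4j database with the following code:

from py2neo import Graph  # the Graph class comes from the py2neo driver
graph = Graph("bolt://localhost:7687", auth=("neo4j", "neo"))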

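2. Importing Data into Neo4j

Next, we load the data into Neo4j and create a balanced split for training and testing. The steps are as follows:
1. Download the ZIP file for version 10 of the dataset, unzip it, and place the contents in Neo4j's import folder. We should have the following files:
- dblp-ref-0.json
- dblp-ref-1.json
- dblp-ref-2.json
- dblp-ref-3.json
2. Add the following properties to the Neo4j settings file so that the APOC library can process these files: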

apoc.import.file.enabled=true
apoc.import.file.use_neo4j_config=true

Create constraints to ensure that we don't create duplicate articles or authors:

CREATE CONSTRAINT ON (article:Article)
ASSERT article.index IS UNIQUE;
CREATE CONSTRAINT ON (author:Author)
ASSERT author.name IS UNIQUE;

Run the following query to import the data from the JSON files:

CALL apoc.periodic.iterate(
  'UNWIND ["dblp-ref-0.json","dblp-ref-1.json",
           "dblp-ref-2.json","dblp-ref-3.json"] AS file
   CALL apoc.load.json("file:///" + file)
   YIELD value
   WHERE value.venue IN ["Lecture Notes in Computer Science",
                         "Communications of The ACM",
                         "international conference on software engineering",
                         "advances in computing and communications"]
   return value',
  'MERGE (a:Article {index:value.id})
   ON CREATE SET a += apoc.map.clean(value,["id","authors","references"],[0])
   WITH a,value.authors as authors
   UNWIND authors as author
   MERGE (b:Author{name:author})
   MERGE (b)<-[:AUTHOR]-(a)'
, {batchSize: 10000, iterateList: true});
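The venue filter in the first part of the import query can be mirrored in plain Python to see which records would be kept. This is only a sketch: it assumes each input line holds one JSON object, and the two sample records are invented for illustration.

```python
import json

# Venues kept by the import query above
VENUES = {
    "Lecture Notes in Computer Science",
    "Communications of The ACM",
    "international conference on software engineering",
    "advances in computing and communications",
}


def keep_record(line):
    """Parse one JSON line; return the record if its venue is wanted, else None."""
    record = json.loads(line)
    return record if record.get("venue") in VENUES else None


# Hypothetical sample lines, invented for illustration only
sample_lines = [
    '{"id": "a1", "venue": "Communications of The ACM", "authors": ["A", "B"]}',
    '{"id": "a2", "venue": "Some Other Venue", "authors": ["C"]}',
]

kept = [r for r in map(keep_record, sample_lines) if r is not None]
print([r["id"] for r in kept])
```

Only the first record survives the filter, because the venue of the second is not in the list.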

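3. Creating the Co-Authorship Graph

To predict future collaborations between authors, we create a co-authorship graph. The following Neo4j Cypher query creates a CO_AUTHOR relationship between every pair of authors who have collaborated on a paper: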

MATCH (a1)<-[:AUTHOR]-(paper)-[:AUTHOR]->(a2:Author)
WITH a1, a2, paper
ORDER BY a1, paper.year
WITH a1, a2, collect(paper)[0].year AS year, count(*) AS collaborations
MERGE (a1)-[coauthor:CO_AUTHOR {year: year}]-(a2)
SET coauthor.collaborations = collaborations;

The year property that the query sets on each CO_AUTHOR relationship is the earliest year in which the two authors collaborated.
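To make the aggregation concrete, here is a small pure-Python sketch of the same idea: for every pair of authors, record the earliest collaboration year and the number of shared papers. The paper list is hypothetical, and the helper name co_author_edges is introduced only for this illustration.

```python
from itertools import combinations

# Hypothetical papers, invented for illustration only
papers = [
    {"year": 2004, "authors": ["Ann", "Bob"]},
    {"year": 2001, "authors": ["Ann", "Bob", "Cy"]},
    {"year": 2007, "authors": ["Bob", "Cy"]},
]


def co_author_edges(papers):
    """Map each unordered author pair to (first collaboration year, paper count)."""
    edges = {}
    for paper in papers:
        for pair in combinations(sorted(set(paper["authors"])), 2):
            first_year, count = edges.get(pair, (paper["year"], 0))
            edges[pair] = (min(first_year, paper["year"]), count + 1)
    return edges


edges = co_author_edges(papers)
for pair, (year, collaborations) in sorted(edges.items()):
    print(pair, year, collaborations)
```

Unlike the Cypher query, which relies on ordering before collect(), this sketch keeps a running minimum, so the order of the papers does not matter.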

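4. Creating Balanced Training and Test Datasets

In a link prediction problem, we want to predict the creation of future links. Because the articles have dates, we can use those dates to split the data. The steps are as follows:
1. Find the publication years of the articles. The following query counts the articles per year: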

query = """
MATCH (article:Article)
RETURN article.year AS year, count(*) AS count
ORDER BY year
"""
by_year = graph.run(query).to_data_frame()

Visualize the result as a bar chart:

plt.style.use('fivethirtyeight')
ax = by_year.plot(kind='bar', x='year', y='count', legend=None, figsize=(15,8))
ax.xaxis.set_label_text("")
plt.tight_layout()
plt.show()

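We choose 2006 as the split year. Pairs whose first collaboration happened before 2006 get a CO_AUTHOR_EARLY relationship, and pairs whose first collaboration happened in 2006 or later get a CO_AUTHOR_LATE relationship: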

MATCH (a1)<-[:AUTHOR]-(paper)-[:AUTHOR]->(a2:Author)
WITH a1, a2, paper
ORDER BY a1, paper.year
WITH a1, a2, collect(paper)[0].year AS year, count(*) AS collaborations
WHERE year < 2006
MERGE (a1)-[coauthor:CO_AUTHOR_EARLY {year: year}]-(a2)
SET coauthor.collaborations = collaborations;

MATCH (a1)<-[:AUTHOR]-(paper)-[:AUTHOR]->(a2:Author)
WITH a1, a2, paper
ORDER BY a1, paper.year
WITH a1, a2, collect(paper)[0].year AS year, count(*) AS collaborations
WHERE year >= 2006
MERGE (a1)-[coauthor:CO_AUTHOR_LATE {year: year}]-(a2)
SET coauthor.collaborations = collaborations;

Check the number of CO_AUTHOR_EARLY and CO_AUTHOR_LATE relationships:

MATCH ()-[:CO_AUTHOR_EARLY]->()
RETURN count(*) AS count

MATCH ()-[:CO_AUTHOR_LATE]->()
RETURN count(*) AS count
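The time-based split can be illustrated on a handful of pairs in plain Python. This is a minimal sketch with invented data; only the split year of 2006 and the two comparisons (before 2006, 2006 or later) come from the queries above.

```python
SPLIT_YEAR = 2006  # split year chosen in the text

# Hypothetical mapping of author pair -> first collaboration year
first_collaboration = {
    ("Ann", "Bob"): 2001,
    ("Bob", "Cy"): 2006,
    ("Cy", "Dee"): 2010,
}

# Mirrors WHERE year < 2006 and WHERE year >= 2006
early = {pair for pair, year in first_collaboration.items() if year < SPLIT_YEAR}
late = {pair for pair, year in first_collaboration.items() if year >= SPLIT_YEAR}

print(sorted(early))
print(sorted(late))
```

Every pair lands in exactly one of the two sets, which is what keeps the test data from leaking into training.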

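To deal with class imbalance, we build negative examples (pairs of authors who are two to three hops apart but have no direct relationship) and then down-sample them so that both classes have the same size: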

def down_sample(df):
    copy = df.copy()
    zero = Counter(copy.label.values)[0]
    un = Counter(copy.label.values)[1]
    n = zero - un
    copy = copy.drop(copy[copy.label == 0].sample(n=n, random_state=1).index)
    return copy.sample(frac=1)

train_existing_links = graph.run("""
MATCH (author:Author)-[:CO_AUTHOR_EARLY]->(other:Author)
RETURN id(author) AS node1, id(other) AS node2, 1 AS label
""").to_data_frame()
train_missing_links = graph.run("""
MATCH (author:Author)
WHERE (author)-[:CO_AUTHOR_EARLY]-()
MATCH (author)-[:CO_AUTHOR_EARLY*2..3]-(other)
WHERE not((author)-[:CO_AUTHOR_EARLY]-(other))
RETURN id(author) AS node1, id(other) AS node2, 0 AS label
""").to_data_frame()
train_missing_links = train_missing_links.drop_duplicates()
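# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent (assumes pandas is imported as pd)
training_df = pd.concat([train_missing_links, train_existing_links], ignore_index=True)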
training_df['label'] = training_df['label'].astype('category')
training_df = down_sample(training_df)
training_data = spark.createDataFrame(training_df)

test_existing_links = graph.run("""
MATCH (author:Author)-[:CO_AUTHOR_LATE]->(other:Author)
RETURN id(author) AS node1, id(other) AS node2, 1 AS label
""").to_data_frame()
test_missing_links = graph.run("""
MATCH (author:Author)
WHERE (author)-[:CO_AUTHOR_LATE]-()
MATCH (author)-[:CO_AUTHOR*2..3]-(other)
WHERE not((author)-[:CO_AUTHOR]-(other))
RETURN id(author) AS node1, id(other) AS node2, 0 AS label
""").to_data_frame()
test_missing_links = test_missing_links.drop_duplicates()
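# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent (assumes pandas is imported as pd)
test_df = pd.concat([test_missing_links, test_existing_links], ignore_index=True)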
test_df['label'] = test_df['label'].astype('category')
test_df = down_sample(test_df)
test_data = spark.createDataFrame(test_df)
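The negative-sampling and down-sampling logic above can be sketched without Neo4j or pandas. The graph below is hypothetical, and the sketch approximates the Cypher pattern by collecting pairs whose shortest distance is two or three hops; it returns unordered pairs, whereas the query returns each pair in both directions before de-duplication.

```python
import random

# Hypothetical undirected co-author graph: a simple chain A-B-C-D-E
graph_adj = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}


def missing_links(adj, min_hops=2, max_hops=3):
    """Return unordered pairs whose shortest distance is min_hops..max_hops."""
    pairs = set()
    for source in adj:
        distance = {source: 0}
        frontier = [source]
        # Breadth-first search, stopping after max_hops levels
        for hops in range(1, max_hops + 1):
            next_frontier = []
            for node in frontier:
                for neighbor in adj[node]:
                    if neighbor not in distance:
                        distance[neighbor] = hops
                        next_frontier.append(neighbor)
            frontier = next_frontier
        for node, hops in distance.items():
            if hops >= min_hops:
                pairs.add(tuple(sorted((source, node))))
    return pairs


def down_sample(positives, negatives, seed=1):
    """Randomly drop negatives until both classes have the same size, then shuffle."""
    rng = random.Random(seed)
    kept = rng.sample(sorted(negatives), k=min(len(positives), len(negatives)))
    examples = [(pair, 1) for pair in sorted(positives)] + [(pair, 0) for pair in kept]
    rng.shuffle(examples)
    return examples


positives = {tuple(sorted((a, b))) for a in graph_adj for b in graph_adj[a]}
negatives = missing_links(graph_adj)
examples = down_sample(positives, negatives)
print(len(positives), len(negatives), len(examples))
```

In this toy graph there are four existing links and five candidate negatives, so one negative is dropped to balance the classes.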

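5. Approaches to Predicting Missing Links

We build graph features to train a binary classifier, based on the hypothesis that the following factors make a future collaboration more likely:
- More co-authors in common
- Potential triadic relationships between the authors
- Authors with more relationships
- Authors in the same community
- Authors in the same, tighter community

Create the machine learning pipeline: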

def create_pipeline(fields):
    assembler = VectorAssembler(inputCols=fields, outputCol="features")
    rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                                numTrees=30, maxDepth=10)
    return Pipeline(stages=[assembler, rf])

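6. Predicting Links: Basic Graph Features

We create a simple model that predicts whether two authors will collaborate in the future, based on features extracted from common authors, preferential attachment, and the total union of neighbors.
1. Compute the features for the training and test sets: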

def apply_graphy_training_features(data):
    query = """
    UNWIND $pairs AS pair
    MATCH (p1) WHERE id(p1) = pair.node1
    MATCH (p2) WHERE id(p2) = pair.node2
    RETURN pair.node1 AS node1,
           pair.node2 AS node2,
           size([(p1)-[:CO_AUTHOR_EARLY]-(a)-
                      [:CO_AUTHOR_EARLY]-(p2) | a]) AS commonAuthors,
           size((p1)-[:CO_AUTHOR_EARLY]-()) * size((p2)-
                     [:CO_AUTHOR_EARLY]-()) AS prefAttachment,
           size(apoc.coll.toSet(
             [(p1)-[:CO_AUTHOR_EARLY]-(a) | id(a)] +
                   [(p2)-[:CO_AUTHOR_EARLY]-(a) | id(a)]
           )) AS totalNeighbors
    """
    pairs = [{"node1": row["node1"], "node2": row["node2"]}
                                 for row in data.collect()]
    features = spark.createDataFrame(graph.run(query,
                                 {"pairs": pairs}).to_data_frame())
    return data.join(features, ["node1", "node2"])

def apply_graphy_test_features(data):
    query = """
    UNWIND $pairs AS pair
    MATCH (p1) WHERE id(p1) = pair.node1
    MATCH (p2) WHERE id(p2) = pair.node2
    RETURN pair.node1 AS node1,
           pair.node2 AS node2,
           size([(p1)-[:CO_AUTHOR]-(a)-[:CO_AUTHOR]-(p2) | a]) AS commonAuthors,
           size((p1)-[:CO_AUTHOR]-()) * size((p2)-[:CO_AUTHOR]-())
                                   AS prefAttachment,
           size(apoc.coll.toSet(
             [(p1)-[:CO_AUTHOR]-(a) | id(a)] + [(p2)-[:CO_AUTHOR]-(a) | id(a)]
           )) AS totalNeighbors
    """
    pairs = [{"node1": row["node1"], "node2": row["node2"]}
                       for row in data.collect()]
    features = spark.createDataFrame(graph.run(query,
                       {"pairs": pairs}).to_data_frame())
    return data.join(features, ["node1", "node2"])

training_data = apply_graphy_training_features(training_data)
test_data = apply_graphy_test_features(test_data)
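The three features computed by the Cypher queries above reduce to simple set operations on each node's neighbors. The following sketch shows this on a hypothetical adjacency map; the function name link_features is introduced only for illustration.

```python
# Hypothetical undirected co-author graph, invented for illustration only
adj = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B"},
}


def link_features(adj, p1, p2):
    """Compute the three basic link prediction features for a pair of nodes."""
    n1, n2 = adj[p1], adj[p2]
    return {
        "commonAuthors": len(n1 & n2),       # neighbors shared by both nodes
        "prefAttachment": len(n1) * len(n2),  # product of the two degrees
        "totalNeighbors": len(n1 | n2),       # size of the union of neighbors
    }


print(link_features(adj, "A", "D"))
```

For the pair A and D, the only shared neighbor is B, the degrees are 2 and 1, and the union of neighbors is {B, C}.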

Train the model, using only the commonAuthors feature to start with:

def train_model(fields, training_data):
    pipeline = create_pipeline(fields)
    model = pipeline.fit(training_data)
    return model

basic_model = train_model(["commonAuthors"], training_data)

Probe the model with a few sample commonAuthors values to see what it predicts:

eval_df = spark.createDataFrame(
    [(0,), (1,), (2,), (10,), (100,)],
    ['commonAuthors'])
(basic_model.transform(eval_df)
 .select("commonAuthors", "probability", "prediction")
 .show(truncate=False))

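7. Metrics for Evaluating the Model

| Metric | Formula | Description |
| ---- | ---- | ---- |
| Accuracy | (TruePositives + TrueNegatives) / TotalPredictions | Proportion of predictions the model gets right |
| Precision | TruePositives / (TruePositives + FalsePositives) | Proportion of positive identifications that are correct |
| Recall | TruePositives / (TruePositives + FalseNegatives) | Proportion of actual positives that are identified correctly |
| False positive rate | FalsePositives / (FalsePositives + TrueNegatives) | Proportion of actual negatives that are incorrectly identified as positive |
| Receiver operating characteristic (ROC) curve | X-Y chart | Plot of recall against the false positive rate; the area under the curve (AUC) measures performance |

Use the following function to compute these metrics: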

def evaluate_model(model, test_data):
    predictions = model.transform(test_data)
    tp = predictions[(predictions.label == 1) &
                     (predictions.prediction == 1)].count()
    fp = predictions[(predictions.label == 0) &
                     (predictions.prediction == 1)].count()
    fn = predictions[(predictions.label == 1) &
                     (predictions.prediction == 0)].count()
    recall = float(tp) / (tp + fn)
    precision = float(tp) / (tp + fp)
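    # Note: BinaryClassificationEvaluator defaults to areaUnderROC, so the
    # value stored under "accuracy" is an AUC score, not the accuracy formula.
    accuracy = BinaryClassificationEvaluator().evaluate(predictions)
    labels = [row["label"] for row in predictions.select("label").collect()]
    preds = [row["probability"][1]
             for row in predictions.select("probability").collect()]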
    fpr, tpr, threshold = roc_curve(labels, preds)
    roc_auc = auc(fpr, tpr)
    return { "fpr": fpr, "tpr": tpr, "roc_auc": roc_auc, "accuracy": accuracy,
             "recall": recall, "precision": precision }

basic_results = evaluate_model(basic_model, test_data)
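The formulas in the metrics table can be checked by hand on a small confusion matrix. The counts below are hypothetical and are not results from this dataset; the sketch simply applies the four formulas as written in the table.

```python
def classification_metrics(tp, fp, tn, fn):
    """Apply the formulas from the metrics table to confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }


# Hypothetical counts, invented for illustration only
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, 85 of 100 predictions are correct, 40 of the 50 predicted positives are real, and 40 of the 45 actual positives are found.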

8. Visualizing the ROC Curve

def create_roc_plot():
    plt.style.use('classic')
    fig = plt.figure(figsize=(13, 8))
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.rc('axes', prop_cycle=(cycler('color',
                   ['r', 'g', 'b', 'c', 'm', 'y', 'k'])))
    plt.plot([0, 1], [0, 1], linestyle='--', label='Random score (AUC = 0.50)')
    return plt, fig

def add_curve(plt, title, fpr, tpr, roc):
    plt.plot(fpr, tpr, label=f"{title} (AUC = {roc:0.2})")

plt, fig = create_roc_plot()
add_curve(plt, "Common Authors",
          basic_results["fpr"], basic_results["tpr"], basic_results["roc_auc"])
plt.legend(loc='lower right')
plt.show()
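To show what roc_curve and auc compute, here is a dependency-free sketch that sweeps a threshold over the predicted scores and integrates the resulting curve with the trapezoid rule. It is not scikit-learn's implementation, and the labels and scores are invented for illustration.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists obtained by lowering the threshold score by score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    ranked = sorted(zip(scores, labels), reverse=True)
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / negatives)
        tpr.append(tp / positives)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear ROC curve."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))


# Hypothetical labels and predicted probabilities
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr, tpr = roc_points(labels, scores)
roc_auc = trapezoid_auc(fpr, tpr)
print(round(roc_auc, 4))
```

The AUC here equals the fraction of positive-negative pairs in which the positive example receives the higher score, which is 8 out of 9 for this toy data.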

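9. Improving Predictions with More Graph Features

Train a new model that adds the preferential attachment and total union of neighbors features: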

fields = ["commonAuthors", "prefAttachment", "totalNeighbors"]
graphy_model = train_model(fields, training_data)
graphy_results = evaluate_model(graphy_model, test_data)

Visualize the feature importances:

def plot_feature_importance(fields, feature_importances):
    df = pd.DataFrame({"Feature": fields, "Importance": feature_importances})
    df = df.sort_values("Importance", ascending=False)
    ax = df.plot(kind='bar', x='Feature', y='Importance', legend=None)
    ax.xaxis.set_label_text("")
    plt.tight_layout()
    plt.show()

rf_model = graphy_model.stages[-1]
plot_feature_importance(fields, rf_model.featureImportances)

Visualize one of the decision trees:

from spark_tree_plotting import export_graphviz
dot_string = export_graphviz(rf_model.trees[0],
    featureNames=fields, categoryNames=[], classNames=["True", "False"],
    filled=True, roundedCorners=True, roundLeaves=True)
with open("/tmp/rf.dot", "w") as file:
    file.write(dot_string)

Run the following command in a terminal to generate the visualization file:

dot -Tpdf /tmp/rf.dot -o /tmp/rf.pdf

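Summary

With the steps above, we have gone through the whole process from data import to model training and evaluation. The model that uses basic graph features has some predictive power, and adding more graph features noticeably improves its accuracy and recall. We can also visualize the feature importances and a decision tree to understand how the model works. In the future, more graph features could be added to improve the predictions further.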

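Mermaid flowchart

graph LR
    A[Connecting to the Neo4j Database] --> B[Importing Data into Neo4j]
    B --> C[Creating the Co-Authorship Graph]
    C --> D[Creating Balanced Training and Test Datasets]
    D --> E[Approaches to Predicting Missing Links]
    E --> F[Predicting Links: Basic Graph Features]
    F --> G[Metrics for Evaluating the Model]
    G --> H[Visualizing the ROC Curve]
    H --> I[Improving Predictions with More Graph Features]
    I --> J[Summary]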

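10. A Closer Look at the Model with More Graph Features

We have already trained graphy_model, which uses the commonAuthors, prefAttachment, and totalNeighbors features. Now we analyze its predictions and the behavior of its features in more detail.

First, we look at the model's evaluation metrics again: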

display_results(graphy_results)

The resulting metrics are as follows:
| Metric | Score |
| ---- | ---- |
| Accuracy | 0.978351 |
| Recall | 0.924226 |
| Precision | 0.943795 |
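The display_results helper called above is not defined in this article. A minimal sketch of what it might look like is given below, assuming it receives the dictionary returned by evaluate_model and shows only the scalar scores; the example dictionary contains made-up values.

```python
def display_results(results):
    """Print the scalar metrics of an evaluate_model-style dict, skipping curve data."""
    skipped = {"fpr", "tpr", "roc_auc"}
    rows = [(name, score) for name, score in results.items() if name not in skipped]
    for name, score in rows:
        print(f"{name:<10} {score:.6f}")
    return rows


# Hypothetical result dict with made-up values, for illustration only
example = {
    "fpr": [0.0, 1.0], "tpr": [0.0, 1.0], "roc_auc": 0.5,
    "accuracy": 0.9, "recall": 0.8, "precision": 0.7,
}
rows = display_results(example)
```

Returning the rows as well as printing them makes the helper easy to reuse when comparing several models.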

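These metrics suggest that, compared with basic_model, which uses only the commonAuthors feature, graphy_model achieves markedly higher accuracy and recall, though slightly lower precision. In other words, the model is better at identifying author pairs that really do collaborate, at the cost of a few more false positives.

Next, we compare the ROC curves of basic_model and graphy_model: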

plt, fig = create_roc_plot()
add_curve(plt, "Common Authors",
          basic_results["fpr"], basic_results["tpr"],
          basic_results["roc_auc"])
add_curve(plt, "Graphy",
          graphy_results["fpr"], graphy_results["tpr"],
          graphy_results["roc_auc"])
plt.legend(loc='lower right')
plt.show()

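Comparing the ROC curves makes the performance difference between the two models easy to see. The curve for graphy_model lies closer to the top-left corner, which indicates better overall performance.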

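11. Feature Importance Analysis

We plotted the feature importances as a bar chart earlier. Now we look more closely at how these features affect the model's predictions.
In the bar chart, commonAuthors is far more important than prefAttachment and totalNeighbors. This suggests that the number of shared co-authors is a key factor when predicting collaborations between authors.

To understand the influence of the features in more depth, we can do the following:
1. Analyze the correlation between each feature and the label. We can compute the Pearson correlation coefficient between each feature and the label column to understand their linear relationship.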

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# Select the feature columns and the label column
features_and_label = training_data.select("commonAuthors", "prefAttachment", "totalNeighbors", "label")

# Assemble the columns into one vector and compute the correlation matrix
assembler = VectorAssembler(inputCols=["commonAuthors", "prefAttachment", "totalNeighbors", "label"], outputCol="features")
df = assembler.transform(features_and_label)
corr_matrix = Correlation.corr(df, "features").collect()[0][0].toArray()

# Print the correlation coefficients
print("Correlation Matrix:")
print(corr_matrix)

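2. Run feature selection experiments. We can remove features one at a time and observe how the model's performance changes. For example, remove the totalNeighbors feature, then retrain and re-evaluate the model.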

new_fields = ["commonAuthors", "prefAttachment"]
new_graphy_model = train_model(new_fields, training_data)
new_graphy_results = evaluate_model(new_graphy_model, test_data)
display_results(new_graphy_results)
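The Pearson coefficient mentioned in step 1 can also be computed by hand, which helps when interpreting the correlation matrix. The sketch below uses invented feature values and labels, not data from this article.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


# Hypothetical commonAuthors values and labels, invented for illustration only
common_authors = [0, 1, 2, 3, 4]
labels = [0, 0, 1, 1, 1]
r = pearson(common_authors, labels)
print(round(r, 4))
```

A value close to 1 indicates a strong positive linear relationship between the feature and the label; a value near 0 indicates little linear relationship, although a tree-based model may still exploit a non-linear one.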

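12. Reading the Decision Tree in Detail

We visualized a decision tree earlier. Now we walk through how it arrives at a prediction.
Suppose a pair of authors has the following features:
| commonAuthors | prefAttachment | totalNeighbors |
| ---- | ---- | ---- |
| 10 | 12 | 5 |

The decision tree reaches its prediction as follows:
1. Start at node 0, which splits on commonAuthors at a threshold of 1.5. Here commonAuthors is 10, which is greater than 1.5, so we follow the False branch to node 2.
2. Node 2 splits on commonAuthors at a threshold of 2.5. Because 10 is greater than 2.5, we again follow the False branch, this time to node 6.
3. Node 6 checks whether prefAttachment is less than 15.5. Here prefAttachment is 12, which is less than 15.5, so we move to node 9.
4. Node 9 is a leaf node whose Prediction value is True, so the tree predicts that this pair of authors will collaborate.
5. Finally, the random forest combines the predictions of its decision trees and makes the final prediction by majority vote.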
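The walk through the tree can be written out as nested conditions. This is only a sketch: the three thresholds and the True leaf at node 9 come from the steps above, while every other leaf value is a placeholder invented for illustration, and the splits are assumed to take the form "feature <= threshold".

```python
from collections import Counter


def tree_predict(features):
    """Follow the path described in the text; unlabeled leaves are placeholders."""
    if features["commonAuthors"] <= 1.5:      # node 0
        return False                          # placeholder leaf (not from the text)
    if features["commonAuthors"] <= 2.5:      # node 2
        return False                          # placeholder leaf (not from the text)
    if features["prefAttachment"] <= 15.5:    # node 6
        return True                           # node 9 in the text
    return True                               # placeholder leaf (not from the text)


def majority_vote(votes):
    """Return the most common prediction among the trees' votes."""
    return Counter(votes).most_common(1)[0][0]


pair = {"commonAuthors": 10, "prefAttachment": 12, "totalNeighbors": 5}
print(tree_predict(pair))
print(majority_vote([True, True, False]))
```

The second call illustrates step 5 with three hypothetical tree votes; the forest in this article has 30 trees.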

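13. Directions for Model Optimization

Our model already performs well, but there is still room for improvement:
1. Add more features: consider other graph features, such as node centrality measures (degree centrality, betweenness centrality, and so on) or how often each author publishes.
2. Tune the model parameters: use a method such as grid search to tune the random forest parameters, for example numTrees and maxDepth, and find the best combination.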

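from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Build the pipeline once so the grid can refer to its random forest stage
pipeline = create_pipeline(fields)
rf = pipeline.getStages()[-1]

# Define the parameter grid
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [20, 30, 40]) \
    .addGrid(rf.maxDepth, [8, 10, 12]) \
    .build()

# Create the cross-validator
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=3)

# Train the model
cvModel = crossval.fit(training_data)

# Get the best model
best_model = cvModel.bestModel

# Evaluate the best model
best_results = evaluate_model(best_model, test_data)
display_results(best_results)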

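3. Try other machine learning algorithms: besides random forests, we can try other classifiers, such as logistic regression or support vector machines, and compare their performance.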
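Before running the grid search from item 2, it can help to estimate how much work it involves. The sketch below enumerates the same grid in plain Python; the dictionary layout is an illustration and not part of the Spark API.

```python
from itertools import product

# Same candidate values as the parameter grid in item 2
grid = {"numTrees": [20, 30, 40], "maxDepth": [8, 10, 12]}
num_folds = 3

# Every combination of one value per parameter
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(candidates))              # number of parameter settings
print(len(candidates) * num_folds)  # model fits performed during cross-validation
```

The number of fits grows multiplicatively with each added parameter, so it is usually worth keeping the grid small at first.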

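14. Real-World Applications

Link prediction models are used in many practical settings, for example:
1. Social networks: predicting whether users will connect in the future, as in friend recommendation.
2. Academic collaboration: predicting whether scholars will co-author papers in the future, as input for assembling research teams.
3. Finance: predicting whether companies will form business relationships, which can help banks assess loan risk.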

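15. Summary and Outlook

In this article we completed a full link prediction task, covering data import, graph construction, dataset splitting, feature extraction, and model training and evaluation. The model that uses basic graph features has some predictive power, and adding more graph features noticeably improves its accuracy and recall. By visualizing the feature importances and a decision tree, we also gained a deeper understanding of how the model works.

Going forward, we can explore more graph features and machine learning algorithms to further improve the model. The model can also be applied to more real-world scenarios to validate its effectiveness and practicality.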

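Mermaid flowchart

graph LR
    A[A Closer Look at the Model with More Graph Features] --> B[Feature Importance Analysis]
    B --> C[Reading the Decision Tree in Detail]
    C --> D[Directions for Model Optimization]
    D --> E[Real-World Applications]
    E --> F[Summary and Outlook]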