12. Graph Algorithms in Practice


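1. Overview of Graph Algorithms

As we work with graph data and learn more about how different algorithms behave on specific datasets, our approach to graph analysis evolves. This section walks through several examples of large-scale graph analysis using datasets from Yelp and the US Department of Transportation.

1.1 Algorithms Used

  • PageRank: finds influential Yelp reviewers and then correlates their ratings for specific hotels.
  • Betweenness Centrality: uncovers reviewers connected to multiple groups and extracts their preferences.
  • Label Propagation: combined with a projection, creates supercategories of similar Yelp businesses.
  • Degree Centrality: quickly identifies airport hubs in the US transport dataset.
  • Strongly Connected Components: looks at clusters of airport routes in the US.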

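2. Yelp Data Overview

Yelp helps people find local businesses based on reviews, preferences, and recommendations. As of the end of 2018, more than 180 million reviews had been written on the platform. Since 2013, Yelp has run the Yelp Dataset Challenge, which encourages people to explore and research its open dataset.

2.1 Dataset Contents

  • More than 7 million reviews and tips.
  • More than 1.5 million users and 280,000 pictures.
  • More than 188,000 businesses with 1.4 million attributes.
  • 10 metropolitan areas.

2.2 The Yelp Social Network

Besides writing and reading reviews of businesses, Yelp users form a social network. Users can send friend requests to other users they come across while browsing Yelp.com, or they can connect their address books or Facebook graphs.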

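3. Data Import and Graph Model

3.1 Data Import

There are several ways to import data into Neo4j, including the Import tool, the LOAD CSV command, and the Neo4j drivers. The Yelp dataset requires a one-off import of a large amount of data, so the Import tool is the best choice.

3.2 Graph Model

The Yelp data is represented in the graph model as follows:
- Nodes labeled User have FRIENDS relationships with other User nodes.
- Users write Reviews and tips about Businesses.
- All metadata is stored as node properties, except for business categories, which are represented by separate Category nodes.
- For location data, the City, Area, and Country attributes are extracted into a subgraph.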

graph LR
    classDef process fill:#E5F6FF,stroke:#73A6FF,stroke-width:2px;
    User(User):::process -->|FRIENDS| User(User):::process
    User(User):::process -->|WROTE| Review(Review):::process
    Review(Review):::process -->|REVIEWS| Business(Business):::process
    Business(Business):::process -->|IN_CATEGORY| Category(Category):::process
    Business(Business):::process -->|IN_CITY| City(City):::process
    City(City):::process -->|IN_AREA| Area(Area):::process
    Area(Area):::process -->|IN_COUNTRY| Country(Country):::process
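
The model above can also be written down as a set of (start label, relationship type, end label) triples. The following is a minimal sketch, not part of the original analysis: `SCHEMA` and `is_valid_path` are hypothetical helper names, and the triples are copied from the diagram. It checks whether a path pattern you intend to query follows the relationships in the model.

```python
# Relationship types from the graph model, as (start label, type, end label).
SCHEMA = {
    ("User", "FRIENDS", "User"),
    ("User", "WROTE", "Review"),
    ("Review", "REVIEWS", "Business"),
    ("Business", "IN_CATEGORY", "Category"),
    ("Business", "IN_CITY", "City"),
    ("City", "IN_AREA", "Area"),
    ("Area", "IN_COUNTRY", "Country"),
}


def is_valid_path(labels, rel_types):
    """Return True if every hop labels[i] -[rel_types[i]]-> labels[i+1] is in SCHEMA."""
    if len(labels) != len(rel_types) + 1:
        return False
    hops = zip(labels, rel_types, labels[1:])
    return all(hop in SCHEMA for hop in hops)


# A user's review of a business in a city follows the model ...
valid = is_valid_path(["User", "Review", "Business", "City"],
                      ["WROTE", "REVIEWS", "IN_CITY"])
# ... but a user never points directly at a business.
invalid = is_valid_path(["User", "Business"], ["WROTE"])
print(valid, invalid)
```

A check like this is only a convenience for catching typos in hand-written patterns; Neo4j itself does not enforce such a schema.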

4. A Quick Overview of the Yelp Data

4.1 Installing the Python Libraries

pip install neo4j-driver tabulate pandas matplotlib

4.2 Importing the Libraries

from neo4j.v1 import GraphDatabase
import pandas as pd
from tabulate import tabulate
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt

4.3 Creating a Neo4j Driver Instance

driver = GraphDatabase.driver("bolt://localhost", auth=("neo4j", "neo"))

4.4 Computing Node Cardinalities

result = {"label": [], "count": []}
with driver.session() as session:
    labels = [row["label"] for row in session.run("CALL db.labels()")]
    for label in labels:
        query = f"MATCH (:`{label}`) RETURN count(*) as count"
        count = session.run(query).single()["count"]
        result["label"].append(label)
        result["count"].append(count)
df = pd.DataFrame(data=result)
print(tabulate(df.sort_values("count"), headers='keys',
                              tablefmt='psql', showindex=False))
label     count
Country   17
Area      54
City      1093
Category  1293
Business  174567
User      1326101
Review    5261669

4.5 Computing Relationship Cardinalities

result = {"relType": [], "count": []}
with driver.session() as session:
    rel_types = [row["relationshipType"]
                 for row in session.run("CALL db.relationshipTypes()")]
    for rel_type in rel_types:
        query = f"MATCH ()-[:`{rel_type}`]->() RETURN count(*) as count"
        count = session.run(query).single()["count"]
        result["relType"].append(rel_type)
        result["count"].append(count)
df = pd.DataFrame(data=result)
print(tabulate(df.sort_values("count"), headers='keys',
                              tablefmt='psql', showindex=False))
relType      count
IN_COUNTRY   54
IN_AREA      1154
IN_CITY      174566
IN_CATEGORY  667527
WROTE        5261669
REVIEWS      5261669
FRIENDS      10645356

4.6 Counting Hotel Businesses and Reviews

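query = """
MATCH (category:Category {name: "Hotels"})
RETURN size((category)<-[:IN_CATEGORY]-()) AS businesses,
       size((:Review)-[:REVIEWS]->(:Business)-[:IN_CATEGORY]->
                                  (category)) AS reviews
"""
# Run the query and print the two counts
with driver.session() as session:
    df = pd.DataFrame([dict(record) for record in session.run(query)])
print(tabulate(df, headers='keys', tablefmt='psql', showindex=False))
businesses  reviews
2683        183759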

5. Travel Planning Application

5.1 Finding the Most-Reviewed Hotels

# Find the 10 hotels with the most reviews
query = """
MATCH (review:Review)-[:REVIEWS]->(business:Business),
      (business)-[:IN_CATEGORY]->(category:Category {name: $category}),
      (business)-[:IN_CITY]->(:City {name: $city})
RETURN business.name AS business, collect(review.stars) AS allReviews
ORDER BY size(allReviews) DESC
LIMIT 10
"""
fig = plt.figure()
fig.set_size_inches(10.5, 14.5)
fig.subplots_adjust(hspace=0.4, wspace=0.4)
with driver.session() as session:
    params = { "city": "Las Vegas", "category": "Hotels"}
    result = session.run(query, params)
    for index, row in enumerate(result):
        business = row["business"]
        stars = pd.Series(row["allReviews"])
        total = stars.count()
        average_stars = stars.mean().round(2)
        # Calculate the star distribution
        stars_histogram = stars.value_counts().sort_index()
        stars_histogram /= float(stars_histogram.sum())
        # Plot a bar chart showing the distribution of star ratings
        ax = fig.add_subplot(5, 2, index+1)
        stars_histogram.plot(kind="bar", legend=None, color="darkblue",
                             title=f"{business}\nAve: {average_stars}, Total: {total}")
plt.tight_layout()
plt.show()
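
The normalization step inside the loop above can be tried without a database. The following is a small sketch on made-up ratings (the `stars` values are hypothetical, not taken from the Yelp data) showing how `value_counts` and the division by the total turn raw star ratings into the shares that each bar represents.

```python
import pandas as pd

# Hypothetical star ratings for one hotel (not real Yelp data)
stars = pd.Series([5, 4, 5, 3, 5, 1, 4, 5])

# Count how often each rating occurs, ordered by star value
counts = stars.value_counts().sort_index()

# Convert counts to shares so hotels with different review totals are comparable
shares = counts / float(counts.sum())
print(shares)
```

Because every histogram sums to 1, a hotel with thousands of reviews and one with a few hundred can be compared on the same axis.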

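5.2 Finding Influential Hotel Reviewers

To decide which reviews to show in the app, we rank reviews by how influential the reviewer is on Yelp. We run the PageRank algorithm over a projected graph of users who have reviewed at least three hotels.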

CALL algo.pageRank(
  'MATCH (u:User)-[:WROTE]->()-[:REVIEWS]->()-[:IN_CATEGORY]->
                               (:Category {name: $category})
   WITH u, count(*) AS reviews
   WHERE reviews >= $cutOff
   RETURN id(u) AS id',
  'MATCH (u1:User)-[:WROTE]->()-[:REVIEWS]->()-[:IN_CATEGORY]->
                                (:Category {name: $category})
   MATCH (u1)-[:FRIENDS]->(u2)
   RETURN id(u1) AS source, id(u2) AS target',
  {graph: "cypher", write: true, writeProperty: "hotelPageRank",
   params: {category: "Hotels", cutOff: 3}}
)
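
To make the scoring idea concrete, here is a minimal pure-Python sketch of PageRank by repeated iteration. It is not the Neo4j implementation: it assumes the non-normalized formulation PR(n) = (1 - d) + d * sum(PR(m) / out_degree(m)) with damping d = 0.85, which would be consistent with the 0.15 floor seen in the score distribution in section 5.3. The function name `pagerank` and the toy edge list are hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Non-normalized PageRank over a list of (source, target) edges."""
    nodes = {node for edge in edges for node in edge}
    out_degree = {node: 0 for node in nodes}
    incoming = {node: [] for node in nodes}
    for source, target in edges:
        out_degree[source] += 1
        incoming[target].append(source)

    # Every node starts at the baseline score (1 - damping)
    scores = {node: 1.0 - damping for node in nodes}
    for _ in range(iterations):
        # Each node collects an equal share of the score of every node pointing at it
        scores = {
            node: (1.0 - damping) + damping * sum(
                scores[m] / out_degree[m] for m in incoming[node])
            for node in nodes
        }
    return scores


# Hypothetical toy friendship graph: three users point at "hub", which points back at "a"
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")]
scores = pagerank(edges)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(name, round(score, 3))
```

In this toy graph, users with no incoming FRIENDS relationships stay at the baseline score, while `hub` and the user it points to rise well above it.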

5.3 Viewing the Distribution of PageRank Values

MATCH (u:User)
WHERE exists(u.hotelPageRank)
RETURN count(u.hotelPageRank) AS count,
       avg(u.hotelPageRank) AS ave,
       percentileDisc(u.hotelPageRank, 0.5) AS `50%`,
       percentileDisc(u.hotelPageRank, 0.75) AS `75%`,
       percentileDisc(u.hotelPageRank, 0.90) AS `90%`,
       percentileDisc(u.hotelPageRank, 0.95) AS `95%`,
       percentileDisc(u.hotelPageRank, 0.99) AS `99%`,
       percentileDisc(u.hotelPageRank, 0.999) AS `99.9%`,
       percentileDisc(u.hotelPageRank, 0.9999) AS `99.99%`,
       percentileDisc(u.hotelPageRank, 0.99999) AS `99.999%`,
       percentileDisc(u.hotelPageRank, 1) AS `100%`
count    ave        50%   75%   90%       95%       99%       99.9%     99.99%    99.999%   100%
1326101  0.1614898  0.15  0.15  0.157497  0.181875  0.330081  1.649511  6.825738  15.27376  22.98046
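
If the scores have already been pulled into pandas, a similar summary can be computed client-side. The following is a rough analogue under stated assumptions, not a reimplementation of `percentileDisc`: `interpolation="nearest"` returns an observed value rather than an interpolated one, but its tie-breaking may differ from Neo4j's. The `scores` values are made up for illustration.

```python
import pandas as pd

# Hypothetical PageRank scores: most users sit at the 0.15 baseline, a few are far above it
scores = pd.Series([0.15] * 8 + [0.5, 3.0])

# Discrete-style percentiles: each result is one of the observed values
summary = {q: scores.quantile(q, interpolation="nearest") for q in (0.5, 0.9, 1.0)}
mean = scores.mean()

print("mean:", round(mean, 2))
for q, value in summary.items():
    print(f"{q:.0%}: {value}")
```

A mean well above the median, as in this toy series and in the table above, is a quick sign that a small number of users hold most of the influence.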

5.4 Finding the Most Influential Users

MATCH (u:User)
WHERE u.hotelPageRank >  1.64951
WITH u ORDER BY u.hotelPageRank DESC
LIMIT 10
RETURN u.name AS name,
       u.hotelPageRank AS pageRank,
       size((u)-[:WROTE]->()-[:REVIEWS]->()-[:IN_CATEGORY]->
            (:Category {name: "Hotels"})) AS hotelReviews,
       size((u)-[:WROTE]->()) AS totalReviews,
       size((u)-[:FRIENDS]-()) AS friends
name     pageRank            hotelReviews  totalReviews  friends
Phil     17.361242           15            134           8154
Philip   16.871013           21            620           9634
Carol    12.416060999999997  6             119           6218
Misti    12.239516000000004  19            730           6230
Joseph   12.003887499999998  5             32            6596
Michael  11.460049           13            51            6572
J        11.431505999999997  103           1322          6498
Abby     11.376136999999998  9             82            7922
Erica    10.993773           6             15            7071
Randy    10.748785999999999  21            125           7846

For the app, we choose to highlight hotel reviews from Phil, Philip, and J, which gives us the best mix of influential reviewers and number of reviews.

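6. Travel Business Consulting

6.1 Viewing Ratings for a Specific Hotel

As part of our consulting service, hotels subscribe to be alerted when an influential visitor writes about their stay, so that they can take any necessary action. First, we look at ratings of the Bellagio, sorted by the most influential reviewers.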

query = """
MATCH (b:Business {name: $hotel})
MATCH (b)<-[:REVIEWS]-(review)<-[:WROTE]-(user)
WHERE exists(user.hotelPageRank)
RETURN user.name AS name,
       user.hotelPageRank AS pageRank,
       review.stars AS stars
"""
with driver.session() as session:
    params = { "hotel": "Bellagio Hotel" }
    df = pd.DataFrame([dict(record) for record in session.run(query, params)])
    df = df.round(2)
    df = df[["name", "pageRank", "stars"]]
top_reviews = df.sort_values(by=["pageRank"], ascending=False).head(10)
print(tabulate(top_reviews, headers='keys', tablefmt='psql', showindex=False))
name       pageRank            stars
Misti      12.239516000000004  5
Michael    11.460049           4
J          11.431505999999997  5
Erica      10.993773           4
Christine  10.740770499999998  4
Jeremy     9.576763499999998   5
Connie     9.118103499999998   5
Joyce      7.621449000000001   4
Henry      7.299146            5
Flora      6.7570075           4

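These results suggest that the Bellagio's customer service team is doing well: the 10 most influential reviewers all rated the hotel four or five stars. The hotel may want to encourage these people to visit again and share their experiences.

6.2 Finding Influential Guests Who Gave Low Ratings

We can run the following code to find the guests with the highest PageRank who rated their experience at the Bellagio below four stars.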

query = """
MATCH (b:Business {name: $hotel})
MATCH (b)<-[:REVIEWS]-(review)<-[:WROTE]-(user)
WHERE exists(user.hotelPageRank) AND review.stars < $goodRating
RETURN user.name AS name,
       user.hotelPageRank AS pageRank,
       review.stars AS stars
"""
with driver.session() as session:
    params = { "hotel": "Bellagio Hotel", "goodRating": 4 }
    df = pd.DataFrame([dict(record) for record in session.run(query, params)])
    df = df.round(2)
    df = df[["name", "pageRank", "stars"]]
top_reviews = df.sort_values(by=["pageRank"], ascending=False).head(10)
print(tabulate(top_reviews, headers='keys', tablefmt='psql', showindex=False))
name      pageRank  stars
Chris     5.84      3
Lorrie    4.95      2
Dani      3.47      1
Victor    3.35      3
Francine  2.93      3
Rex       2.79      2
Jon       2.55      3
Rachel    2.47      3
Leslie    2.46      2
Benay     2.46      3

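Chris and Lorrie, the highest-ranked users who gave the Bellagio a low rating, are among the top 1,000 most influential users, so personal outreach may be warranted. In addition, because many reviewers write during their stay, real-time alerts about influencers could lead to more positive interactions.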
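
The alerting rule described above could be sketched as a simple predicate. Everything here is hypothetical: the function name `needs_outreach`, its parameters, and the choice of cutoff. The default cutoff reuses the 99.9th-percentile PageRank value from section 5.4 as an assumption about what counts as "influential", and the rating threshold mirrors the `goodRating` parameter used in the query.

```python
def needs_outreach(page_rank, stars, influence_cutoff=1.64951, good_rating=4):
    """Flag a review when an influential user rates the hotel below the good-rating bar.

    page_rank may be None for users who were not part of the PageRank projection.
    """
    if page_rank is None:
        return False
    return page_rank > influence_cutoff and stars < good_rating


# Values taken from the tables above
print(needs_outreach(5.84, 3))    # Chris: influential and below four stars
print(needs_outreach(12.24, 5))   # Misti: influential but satisfied
print(needs_outreach(None, 1))    # a user without a hotelPageRank property
```

In a real service the cutoff would need tuning against how many alerts the hotel's staff can actually act on.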

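6.3 Cross-Promotion for the Bellagio

Having helped the Bellagio find influential reviewers, we are now asked to help identify other businesses for cross-promotion with the help of well-connected customers. We can use the Betweenness Centrality algorithm to find Bellagio reviewers who are not only well connected across the whole Yelp network but who may also act as a bridge between different groups.

6.3.1 Tagging Users in a Specific City

We are only interested in influencers in Las Vegas, so we first tag those users.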

MATCH (u:User)
WHERE exists((u)-[:WROTE]->()-[:REVIEWS]->()-[:IN_CITY]->
                              (:City {name: "Las Vegas"}))
SET u:LasVegas

6.3.2 Running the Betweenness Centrality Algorithm

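Running Betweenness Centrality over the Las Vegas users takes a long time, so we use the RA-Brandes variant instead. This algorithm calculates a betweenness score by sampling nodes and only exploring shortest paths to a certain depth. After some experimentation, we settled on a few non-default parameters: shortest paths of up to 4 hops (maxDepth of 4) and a sample of 20% of the nodes (probability of 0.2).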

CALL algo.betweenness.sampled('LasVegas', 'FRIENDS',
  {write: true, writeProperty: "between", maxDepth: 4, probability: 0.2}
)
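
The idea behind the sampled variant can be illustrated in plain Python. This is a sketch under stated assumptions, not the Neo4j implementation: it runs Brandes-style accumulation from a random subset of source nodes (each chosen with the given `probability`) and stops expanding paths beyond `max_depth` hops. The function name, the `seed` parameter, and the toy graph are hypothetical.

```python
import random
from collections import deque


def sampled_betweenness(adjacency, probability=1.0, max_depth=None, seed=None):
    """Approximate betweenness from a random sample of source nodes.

    adjacency maps each node to a list of neighbors (list both directions
    for an undirected graph).
    """
    rng = random.Random(seed)
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        if rng.random() >= probability:
            continue  # this node was not sampled as a source

        # Breadth-first search that counts shortest paths from the source
        distance = {source: 0}
        path_count = {node: 0 for node in adjacency}
        path_count[source] = 1
        predecessors = {node: [] for node in adjacency}
        visit_order = []
        queue = deque([source])
        while queue:
            node = queue.popleft()
            visit_order.append(node)
            if max_depth is not None and distance[node] >= max_depth:
                continue  # do not explore beyond the depth limit
            for neighbor in adjacency[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[node] + 1:
                    path_count[neighbor] += path_count[node]
                    predecessors[neighbor].append(node)

        # Walk back from the farthest nodes, crediting nodes on shortest paths
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(visit_order):
            for pred in predecessors[node]:
                share = path_count[pred] / path_count[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                scores[node] += dependency[node]
    return scores


# Hypothetical friendship chain a - b - c: only b lies between two other users
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
exact = sampled_betweenness(graph)               # every node used as a source
shallow = sampled_betweenness(graph, max_depth=1)
print(exact, shallow)
```

With every node sampled and no depth limit this reduces to ordinary Brandes betweenness, counting each pair in both directions. Lowering `probability` or `max_depth` trades accuracy for speed, which is the trade-off made in the call above.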

6.3.3 Viewing the Distribution of Betweenness Scores

Before using these scores in our queries, we look at how they are distributed.

MATCH (u:User)
WHERE exists(u.between)
RETURN count(u.between) AS count,
       avg(u.between) AS ave,
       toInteger(percentileDisc(u.between, 0.5)) AS `50%`,
       toInteger(percentileDisc(u.between, 0.75)) AS `75%`,
       toInteger(percentileDisc(u.between, 0.90)) AS `90%`,
       toInteger(percentileDisc(u.between, 0.95)) AS `95%`,
       toInteger(percentileDisc(u.between, 0.99)) AS `99%`,
       toInteger(percentileDisc(u.between, 0.999)) AS `99.9%`,
       toInteger(percentileDisc(u.between, 0.9999)) AS `99.99%`,
       toInteger(percentileDisc(u.between, 0.99999)) AS `99.999%`,
       toInteger(percentileDisc(u.between, 1)) AS `100%`
count   ave          50%  75%    90%     95%      99%      99.9%     99.99%     99.999%    100%
506028  320538.6014  0    10005  318944  1001655  4436409  34854988  214080923  621434012  1998032952

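Half of the users have a score of 0, meaning they are not well connected at all. The top 1% (the 99% column) lie on at least 4 million shortest paths among our set of 500,000 users. Taken together, most users are poorly connected, but a few exert a lot of control over information. This is typical behavior for small-world networks.

6.3.4 Finding the Superconnectors

We can find the superconnectors with the following query.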

MATCH(u:User)-[:WROTE]->()-[:REVIEWS]->(:Business {name:"Bellagio Hotel"})
WHERE exists(u.between)
RETURN u.name AS user,
       toInteger(u.between) AS betweenness,
       u.hotelPageRank AS pageRank,
       size((u)-[:WROTE]->()-[:REVIEWS]->()-[:IN_CATEGORY]->
                             (:Category {name: "Hotels"}))
       AS hotelReviews
ORDER BY u.between DESC
LIMIT 10
user       betweenness  pageRank            hotelReviews
Misti      841707563    12.239516000000004  19
Christine  236269693    10.740770499999998  16
Erica      235806844    10.993773           6
Mike       215534452    NULL                2
J          192155233    11.431505999999997  103
Michael    161335816    5.105143            31
Jeremy     160312436    9.576763499999998   6
Michael    139960910    11.460049           13
Chris      136697785    5.838922499999999   5
Connie     133372418    9.118103499999998   7

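We see some of the same people here that we saw in the PageRank query, with Mike being an interesting exception. He was excluded from the PageRank calculation because he has not reviewed enough hotels (the cutoff was three), but he appears to be quite well connected in the world of Las Vegas Yelp users.

6.3.5 Finding the Restaurants That Connectors Like

To reach a wider variety of customers, we look at the other preferences of these "connectors" to see what we should promote. Many of these users have also reviewed restaurants, so we can run the following query to find out which ones they like best.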

// Find the top 50 users who have reviewed the Bellagio
MATCH (u:User)-[:WROTE]->()-[:REVIEWS]->(:Business {name:"Bellagio Hotel"})
WHERE u.between > 4436409
WITH u ORDER BY u.between DESC LIMIT 50
// Find the restaurants those users have reviewed in Las Vegas
MATCH (u)-[:WROTE]->(review)-[:REVIEWS]-(business)
WHERE (business)-[:IN_CATEGORY]->(:Category {name: "Restaurants"})
AND   (business)-[:IN_CITY]->(:City {name: "Las Vegas"})
// Only include restaurants that have more than 3 reviews by these users
WITH business, avg(review.stars) AS averageReview, count(*) AS numberOfReviews
// Assumed completion: the original listing stops after the WITH clause above
WHERE numberOfReviews > 3
RETURN business.name AS business, averageReview, numberOfReviews
ORDER BY averageReview DESC, numberOfReviews DESC
LIMIT 10
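
The aggregation step in this query (average rating and review count per restaurant, keeping only restaurants reviewed often enough by the connectors) can be mirrored in pandas once the matching rows are on the client. This is a sketch on made-up rows: the business names, ratings, and the `min_reviews` threshold are hypothetical, chosen only to show the shape of the computation.

```python
import pandas as pd

# Hypothetical (business, stars) rows for reviews written by the top connectors
reviews = pd.DataFrame({
    "business": ["A", "A", "A", "A", "B", "B", "C", "C", "C", "C", "C"],
    "stars":    [5, 4, 5, 5, 5, 5, 3, 4, 4, 5, 4],
})
min_reviews = 3  # keep restaurants with more than this many reviews

# Average rating and review count per restaurant
summary = (reviews.groupby("business")["stars"]
           .agg(averageReview="mean", numberOfReviews="count")
           .reset_index())

# Drop restaurants with too few reviews, then rank the rest
summary = summary[summary["numberOfReviews"] > min_reviews]
summary = summary.sort_values(["averageReview", "numberOfReviews"], ascending=False)
print(summary)
```

Restaurant B has a perfect average but only two reviews, so it is filtered out. This is the reason for requiring a minimum number of reviews before trusting an average.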

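With these steps we complete the analysis of the Yelp data, providing useful information for both the travel planning app and the travel business consulting service. The whole workflow can be summarized in the following mermaid flowchart: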

graph LR
    classDef process fill:#E5F6FF,stroke:#73A6FF,stroke-width:2px;
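    A(Data import):::process --> B(Data overview):::process
    B --> C(Travel planning app):::process
    C --> C1(Find the most-reviewed hotels):::process
    C --> C2(Find influential reviewers):::process
    B --> D(Travel business consulting):::process
    D --> D1(View ratings for a specific hotel):::process
    D --> D2(Find influential guests with low ratings):::process
    D --> D3(Cross-promotion analysis):::process
    D3 --> D31(Tag users in a specific city):::process
    D31 --> D32(Run Betweenness Centrality):::process
    D32 --> D33(View the score distribution):::process
    D33 --> D34(Find superconnectors):::process
    D34 --> D35(Find restaurants the connectors like):::process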

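In summary, analyzing the Yelp data with graph algorithms gives solid support for travel planning and travel business consulting, helping businesses better understand their customers and make better-informed decisions.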
