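Graph Algorithms in Practice
1. Overview of Graph Algorithms
The approach we take to graph analysis evolves as we become more familiar with how different algorithms behave on specific datasets. The examples here show how to carry out large-scale graph data analysis using datasets from Yelp and the US Department of Transportation.
1.1 Algorithms Used
- PageRank: finds influential Yelp reviewers and correlates their ratings for specific hotels.
- Betweenness Centrality: uncovers reviewers connected to multiple groups and extracts their preferences.
- Label Propagation: combined with a projection, creates supercategories of similar Yelp businesses.
- Degree Centrality: quickly identifies airport hubs in the US transport dataset.
- Strongly Connected Components: looks at clusters of airport routes in the US.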
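2. Overview of the Yelp Data
Yelp helps people find local businesses based on reviews, preferences, and recommendations. By the end of 2018, more than 180 million reviews had been written on the platform. Since 2013, Yelp has run the Yelp Dataset Challenge, which encourages people to explore and research its open dataset.
2.1 Dataset Contents
- More than 7 million reviews and tips.
- More than 1.5 million users and 280,000 pictures.
- More than 188,000 businesses with 1.4 million attributes.
- 10 metropolitan areas.
2.2 The Yelp Social Network
As well as writing and reading business reviews, Yelp users form a social network. Users can send friend requests to other users they come across while browsing Yelp.com, or they can connect their address books or Facebook graphs.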
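3. Data Import and Graph Model
3.1 Data Import
There are several ways to import data into Neo4j, including the import tool, the LOAD CSV command, and the Neo4j drivers. For the Yelp dataset, the import tool is the best choice because a large amount of data needs to be loaded in a single pass.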
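The import tool reads CSV files whose headers mark the ID, start, and end columns. As a minimal sketch, assuming the user records arrive as one JSON object per line with `user_id`, `name`, and `friends` fields (the field names, the sample records, and the `users_to_import_csv` helper are assumptions for illustration, not taken from the text above), the conversion could look like this:

```python
import csv
import io
import json


def users_to_import_csv(json_lines, nodes_out, rels_out):
    """Write import-tool style CSVs for User nodes and FRIENDS relationships."""
    node_writer = csv.writer(nodes_out, lineterminator="\n")
    rel_writer = csv.writer(rels_out, lineterminator="\n")
    # Header rows tell the import tool which column is the node ID,
    # and which columns are the start and end of each relationship.
    node_writer.writerow(["user_id:ID(User)", "name", ":LABEL"])
    rel_writer.writerow([":START_ID(User)", ":END_ID(User)", ":TYPE"])
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        user = json.loads(line)
        node_writer.writerow([user["user_id"], user["name"], "User"])
        friends = user.get("friends") or []
        # Assumption: friends may be a list or a comma-separated string.
        if isinstance(friends, str):
            friends = [f.strip() for f in friends.split(",") if f.strip()]
        for friend_id in friends:
            rel_writer.writerow([user["user_id"], friend_id, "FRIENDS"])


# Hypothetical sample records, not real Yelp data.
sample = [
    '{"user_id": "u1", "name": "Ann", "friends": ["u2", "u3"]}',
    '{"user_id": "u2", "name": "Bob", "friends": "u1, u3"}',
]
nodes_csv, rels_csv = io.StringIO(), io.StringIO()
users_to_import_csv(sample, nodes_csv, rels_csv)
print(nodes_csv.getvalue())
print(rels_csv.getvalue())
```

The resulting files could then be passed to the import tool as node and relationship inputs. Friend IDs that never appear as users (such as `u3` here) would need to be filtered out or skipped at import time.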
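3.2 Graph Model
The Yelp data is represented in the graph model as follows:
- Nodes labeled User have FRIENDS relationships with other User nodes.
- Users write reviews (Review) and tips about businesses (Business).
- All metadata is stored as node properties, except for business categories, which are represented by separate Category nodes.
- For location data, the City, Area, and Country attributes are extracted into a subgraph.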
graph LR
classDef process fill:#E5F6FF,stroke:#73A6FF,stroke-width:2px;
User(User):::process -->|FRIENDS| User(User):::process
User(User):::process -->|WROTE| Review(Review):::process
Review(Review):::process -->|REVIEWS| Business(Business):::process
Business(Business):::process -->|IN_CATEGORY| Category(Category):::process
Business(Business):::process -->|IN_CITY| City(City):::process
City(City):::process -->|IN_AREA| Area(Area):::process
Area(Area):::process -->|IN_COUNTRY| Country(Country):::process
4. A Quick Overview of the Yelp Data
4.1 Installing the Python Libraries
pip install neo4j-driver tabulate pandas matplotlib
4.2 Importing the Libraries
from neo4j.v1 import GraphDatabase
import pandas as pd
from tabulate import tabulate
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
4.3 Creating the Neo4j Driver Instance
driver = GraphDatabase.driver("bolt://localhost", auth=("neo4j", "neo"))
4.4 Calculating Node Cardinalities
result = {"label": [], "count": []}
with driver.session() as session:
    # Collect every node label, then count the nodes carrying each one
    labels = [row["label"] for row in session.run("CALL db.labels()")]
    for label in labels:
        query = f"MATCH (:`{label}`) RETURN count(*) as count"
        count = session.run(query).single()["count"]
        result["label"].append(label)
        result["count"].append(count)
df = pd.DataFrame(data=result)
print(tabulate(df.sort_values("count"), headers='keys',
               tablefmt='psql', showindex=False))
| label | count |
|---|---|
| Country | 17 |
| Area | 54 |
| City | 1093 |
| Category | 1293 |
| Business | 174567 |
| User | 1326101 |
| Review | 5261669 |
4.5 Calculating Relationship Cardinalities
result = {"relType": [], "count": []}
with driver.session() as session:
    # Collect every relationship type, then count the relationships of each one
    rel_types = [row["relationshipType"]
                 for row in session.run("CALL db.relationshipTypes()")]
    for rel_type in rel_types:
        query = f"MATCH ()-[:`{rel_type}`]->() RETURN count(*) as count"
        count = session.run(query).single()["count"]
        result["relType"].append(rel_type)
        result["count"].append(count)
df = pd.DataFrame(data=result)
print(tabulate(df.sort_values("count"), headers='keys',
               tablefmt='psql', showindex=False))
| relType | count |
|---|---|
| IN_COUNTRY | 54 |
| IN_AREA | 1154 |
| IN_CITY | 174566 |
| IN_CATEGORY | 667527 |
| WROTE | 5261669 |
| REVIEWS | 5261669 |
| FRIENDS | 10645356 |
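Both counting loops build Cypher text by placing a label or relationship type between backticks. As a small sketch of how that string building could be made robust to names that themselves contain a backtick (the helper names `quote_identifier`, `node_count_query`, and `relationship_count_query` are hypothetical, and the odd label is made up), embedded backticks can be doubled before quoting:

```python
def quote_identifier(name):
    """Backtick-quote a label or relationship type, doubling embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def node_count_query(label):
    """Build the per-label counting query used for node cardinalities."""
    return f"MATCH (:{quote_identifier(label)}) RETURN count(*) AS count"


def relationship_count_query(rel_type):
    """Build the per-type counting query used for relationship cardinalities."""
    return f"MATCH ()-[:{quote_identifier(rel_type)}]->() RETURN count(*) AS count"


print(node_count_query("User"))
print(relationship_count_query("IN_CATEGORY"))
print(node_count_query("Odd`Label"))  # made-up label containing a backtick
```

The queries produced for ordinary names such as `User` are identical to the ones in the loops, so the helpers could be dropped in without changing the reported counts.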
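4.6 Counting Hotel Businesses and Reviews
query = """
MATCH (category:Category {name: "Hotels"})
RETURN size((category)<-[:IN_CATEGORY]-()) AS businesses,
       size((:Review)-[:REVIEWS]->(:Business)-[:IN_CATEGORY]->
            (category)) AS reviews
"""
with driver.session() as session:
    # Run the query and print the single result row as a table
    df = pd.DataFrame([dict(record) for record in session.run(query)])
    print(tabulate(df, headers='keys', tablefmt='psql', showindex=False))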
| businesses | reviews |
|---|---|
| 2683 | 183759 |
5. Travel Planning App
5.1 Finding the Most-Reviewed Hotels
# Find the 10 hotels with the most reviews
query = """
MATCH (review:Review)-[:REVIEWS]->(business:Business),
      (business)-[:IN_CATEGORY]->(category:Category {name: $category}),
      (business)-[:IN_CITY]->(:City {name: $city})
RETURN business.name AS business, collect(review.stars) AS allReviews
ORDER BY size(allReviews) DESC
LIMIT 10
"""
fig = plt.figure()
fig.set_size_inches(10.5, 14.5)
fig.subplots_adjust(hspace=0.4, wspace=0.4)
with driver.session() as session:
    params = {"city": "Las Vegas", "category": "Hotels"}
    result = session.run(query, params)
    for index, row in enumerate(result):
        business = row["business"]
        stars = pd.Series(row["allReviews"])
        total = stars.count()
        average_stars = stars.mean().round(2)
        # Calculate the star distribution
        stars_histogram = stars.value_counts().sort_index()
        stars_histogram /= float(stars_histogram.sum())
        # Plot a bar chart showing the distribution of star ratings
        ax = fig.add_subplot(5, 2, index+1)
        stars_histogram.plot(kind="bar", legend=None, color="darkblue", ax=ax,
                             title=f"{business}\nAve: {average_stars}, Total: {total}")
plt.tight_layout()
plt.show()
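The star distribution step normalizes the count of each rating by the total number of reviews. A dependency-free sketch of the same calculation, using a made-up list of ratings and a hypothetical `star_distribution` helper, might look like this:

```python
from collections import Counter


def star_distribution(stars):
    """Return {star value: fraction of reviews}, ordered by star value."""
    if not stars:
        return {}
    counts = Counter(stars)
    total = sum(counts.values())
    return {star: counts[star] / total for star in sorted(counts)}


# Made-up ratings for a single hotel.
reviews = [5, 4, 5, 3, 5, 4, 1, 5]
dist = star_distribution(reviews)
average = round(sum(reviews) / len(reviews), 2)
print(dist)
print(average)
```

Because every bar is a fraction of that hotel's own total, hotels with very different review counts can be compared on the same scale.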
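5.2 Finding Influential Hotel Reviewers
To decide which reviews to show in the app, we rank reviews by how influential the reviewer is on Yelp. We run the PageRank algorithm over a projected graph of all users who have reviewed at least three hotels.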
CALL algo.pageRank(
'MATCH (u:User)-[:WROTE]->()-[:REVIEWS]->()-[:IN_CATEGORY]->
(:Category {name: $category})
WITH u, count(*) AS reviews
WHERE reviews >= $cutOff
RETURN id(u) AS id',
'MATCH (u1:User)-[:WROTE]->()-[:REVIEWS]->()-[:IN_CATEGORY]->
(:Category {name: $category})
MATCH (u1)-[:FRIENDS]->(u2)
RETURN id(u1) AS source, id(u2) AS target',
{graph: "cypher", write: true, writeProperty: "hotelPageRank",
params: {category: "Hotels", cutOff: 3}}
)
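To make the scoring more concrete, here is a minimal pure-Python sketch of PageRank by repeated iteration. It assumes the common non-normalized formulation, score = (1 - d) + d * sum(score(source) / outDegree(source)), with a damping factor of 0.85 and 20 iterations. Those settings, the `pagerank` helper, and the tiny friendship graph are assumptions for illustration, not a description of the library's internals.

```python
def pagerank(nodes, edges, damping=0.85, iterations=20):
    """Iteratively compute non-normalized PageRank scores for a directed graph."""
    out_degree = {node: 0 for node in nodes}
    incoming = {node: [] for node in nodes}
    for source, target in edges:
        out_degree[source] += 1
        incoming[target].append(source)

    # Every node starts from the base score it would have with no incoming links.
    scores = {node: 1.0 - damping for node in nodes}
    for _ in range(iterations):
        # Build the next round of scores from the previous round only.
        scores = {
            node: (1.0 - damping)
            + damping * sum(scores[s] / out_degree[s] for s in incoming[node])
            for node in nodes
        }
    return scores


# Made-up projection: (source, target) pairs like the FRIENDS rows returned above.
nodes = ["a", "b", "c", "d"]
edges = [("a", "c"), ("b", "c"), ("c", "a")]
scores = pagerank(nodes, edges)
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 4))
```

Under this formulation, a user that nobody links to keeps the base score of 1 - 0.85 = 0.15, which would be consistent with a large share of users sitting at exactly 0.15.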
5.3 Looking at the Distribution of PageRank Values
MATCH (u:User)
WHERE exists(u.hotelPageRank)
RETURN count(u.hotelPageRank) AS count,
avg(u.hotelPageRank) AS ave,
percentileDisc(u.hotelPageRank, 0.5) AS `50%`,
percentileDisc(u.hotelPageRank, 0.75) AS `75%`,
percentileDisc(u.hotelPageRank, 0.90) AS `90%`,
percentileDisc(u.hotelPageRank, 0.95) AS `95%`,
percentileDisc(u.hotelPageRank, 0.99) AS `99%`,
percentileDisc(u.hotelPageRank, 0.999) AS `99.9%`,
percentileDisc(u.hotelPageRank, 0.9999) AS `99.99%`,
percentileDisc(u.hotelPageRank, 0.99999) AS `99.999%`,
percentileDisc(u.hotelPageRank, 1) AS `100%`
| count | ave | 50% | 75% | 90% | 95% | 99% | 99.9% | 99.99% | 99.999% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|
| 1326101 | 0.1614898 | 0.15 | 0.15 | 0.157497 | 0.181875 | 0.330081 | 1.649511 | 6.825738 | 15.27376 | 22.98046 |
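Each percentile column reports an actual value from the data rather than an interpolated one. The sketch below shows one standard way to compute such a discrete percentile, the nearest-rank method. Treat it as an illustration under that assumption: the exact rounding rule used by `percentileDisc` may differ, and the `percentile_disc` helper and sample scores are made up.

```python
import math


def percentile_disc(values, percentile):
    """Nearest-rank percentile: always returns one of the input values."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= percentile <= 1:
        raise ValueError("percentile must be between 0 and 1")
    ordered = sorted(values)
    # Rank is 1-based; the 0th percentile maps to the smallest value.
    rank = max(1, math.ceil(percentile * len(ordered)))
    return ordered[rank - 1]


# Made-up scores with a long tail, similar in shape to the PageRank results.
sample_scores = [0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.18, 0.33, 1.65, 22.98]
for p in (0.5, 0.75, 0.9, 1):
    print(p, percentile_disc(sample_scores, p))
```

Reading the percentiles side by side makes the skew obvious: most values sit at the minimum, and only the last few percentiles move away from it.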
5.4 Finding the Most Influential Users
MATCH (u:User)
WHERE u.hotelPageRank > 1.64951
WITH u ORDER BY u.hotelPageRank DESC
LIMIT 10
RETURN u.name AS name,
u.hotelPageRank AS pageRank,
size((u)-[:WROTE]->()-[:REVIEWS]->()-[:IN_CATEGORY]->
(:Category {name: "Hotels"})) AS hotelReviews,
size((u)-[:WROTE]->()) AS totalReviews,
size((u)-[:FRIENDS]-()) AS friends
| name | pageRank | hotelReviews | totalReviews | friends |
|---|---|---|---|---|
| Phil | 17.361242 | 15 | 134 | 8154 |
| Philip | 16.871013 | 21 | 620 | 9634 |
| Carol | 12.416060999999997 | 6 | 119 | 6218 |
| Misti | 12.239516000000004 | 19 | 730 | 6230 |
| Joseph | 12.003887499999998 | 5 | 32 | 6596 |
| Michael | 11.460049 | 13 | 51 | 6572 |
| J | 11.431505999999997 | 103 | 1322 | 6498 |
| Abby | 11.376136999999998 | 9 | 82 | 7922 |
| Erica | 10.993773 | 6 | 15 | 7071 |
| Randy | 10.748785999999999 | 21 | 125 | 7846 |
For the app, we choose to highlight the hotel reviews from Phil, Philip, and J, which gives us the best mix of influential reviewers and number of reviews.
6. Travel Business Consulting
6.1 Reviewing the Ratings of a Specific Hotel
As part of our consulting service, hotels subscribe to be alerted when an influential visitor writes about their stay, so that they can take any necessary action. First, we look at the ratings of the Bellagio, sorted by the most influential reviewers.
query = """
MATCH (b:Business {name: $hotel})
MATCH (b)<-[:REVIEWS]-(review)<-[:WROTE]-(user)
WHERE exists(user.hotelPageRank)
RETURN user.name AS name,
       user.hotelPageRank AS pageRank,
       review.stars AS stars
"""
with driver.session() as session:
    params = {"hotel": "Bellagio Hotel"}
    df = pd.DataFrame([dict(record) for record in session.run(query, params)])
    df = df.round(2)
    df = df[["name", "pageRank", "stars"]]
# Keep the 10 reviews written by the users with the highest PageRank
top_reviews = df.sort_values(by=["pageRank"], ascending=False).head(10)
print(tabulate(top_reviews, headers='keys', tablefmt='psql', showindex=False))
| name | pageRank | stars |
|---|---|---|
| Misti | 12.239516000000004 | 5 |
| Michael | 11.460049 | 4 |
| J | 11.431505999999997 | 5 |
| Erica | 10.993773 | 4 |
| Christine | 10.740770499999998 | 4 |
| Jeremy | 9.576763499999998 | 5 |
| Connie | 9.118103499999998 | 5 |
| Joyce | 7.621449000000001 | 4 |
| Henry | 7.299146 | 5 |
| Flora | 6.7570075 | 4 |
These results show that the Bellagio's customer service team is doing well: the top 10 influential reviewers all gave the hotel good ratings. The hotel may want to encourage these people to visit again and share their experiences.
6.2 Finding Influential Guests Who Gave Lower Ratings
We can run the following code to find the guests with the highest PageRank who rated their Bellagio experience below four stars.
query = """
MATCH (b:Business {name: $hotel})
MATCH (b)<-[:REVIEWS]-(review)<-[:WROTE]-(user)
WHERE exists(user.hotelPageRank) AND review.stars < $goodRating
RETURN user.name AS name,
       user.hotelPageRank AS pageRank,
       review.stars AS stars
"""
with driver.session() as session:
    params = {"hotel": "Bellagio Hotel", "goodRating": 4}
    df = pd.DataFrame([dict(record) for record in session.run(query, params)])
    df = df.round(2)
    df = df[["name", "pageRank", "stars"]]
# Keep the 10 low-rated reviews written by the users with the highest PageRank
top_reviews = df.sort_values(by=["pageRank"], ascending=False).head(10)
print(tabulate(top_reviews, headers='keys', tablefmt='psql', showindex=False))
| name | pageRank | stars |
|---|---|---|
| Chris | 5.84 | 3 |
| Lorrie | 4.95 | 2 |
| Dani | 3.47 | 1 |
| Victor | 3.35 | 3 |
| Francine | 2.93 | 3 |
| Rex | 2.79 | 2 |
| Jon | 2.55 | 3 |
| Rachel | 2.47 | 3 |
| Leslie | 2.46 | 2 |
| Benay | 2.46 | 3 |
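Chris and Lorrie, the highest-ranked users who gave the Bellagio lower ratings, are among the top 1,000 most influential users, so personal outreach is probably warranted. In addition, because many reviewers write during their stay, real-time alerts about influencers may facilitate more positive interactions.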
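A real-time alert of this kind boils down to a simple rule applied to each incoming review. The sketch below is one hypothetical way to express it: the `ReviewEvent` type, the `needs_outreach` function, and the reviewer "Sam" are invented for illustration, and the default cutoff of 1.649511 is simply the 99.9th percentile PageRank value from section 5.3, used here as an assumed threshold.

```python
from dataclasses import dataclass


@dataclass
class ReviewEvent:
    """A newly written review, joined with the reviewer's stored PageRank."""
    user_name: str
    page_rank: float
    stars: int


def needs_outreach(event, influence_cutoff=1.649511, good_rating=4):
    """Flag reviews by influential users that fall below the good-rating bar."""
    return event.page_rank >= influence_cutoff and event.stars < good_rating


events = [
    ReviewEvent("Chris", 5.84, 3),   # influential, low rating
    ReviewEvent("Misti", 12.24, 5),  # influential, good rating
    ReviewEvent("Sam", 0.15, 1),     # hypothetical user with little influence
]
alerts = [event.user_name for event in events if needs_outreach(event)]
print(alerts)
```

Both thresholds are parameters, so a hotel could tighten or loosen the rule without touching the rest of the pipeline.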
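6.3 Bellagio Cross-Promotion
Having helped the Bellagio find influential reviewers, we are now asked by the hotel to identify other businesses it could cross-promote with through well-connected customers. We can use the Betweenness Centrality algorithm to find Bellagio reviewers who are not only well connected across the whole Yelp network but may also act as bridges between different groups.
6.3.1 Tagging Users in a Specific City
We are only interested in influencers in Las Vegas, so we first tag those users.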
MATCH (u:User)
WHERE exists((u)-[:WROTE]->()-[:REVIEWS]->()-[:IN_CITY]->
(:City {name: "Las Vegas"}))
SET u:LasVegas
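6.3.2 Running the Betweenness Centrality Algorithm
Because running Betweenness Centrality on the Las Vegas users would take a long time, we use the RA-Brandes variant instead. This algorithm calculates a betweenness score by sampling nodes and exploring shortest paths only to a certain depth. After some experimentation, we settled on a few parameters that differ from the defaults: shortest paths of up to 4 hops (maxDepth of 4) and a sample of 20% of the nodes (probability of 0.2).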
CALL algo.betweenness.sampled('LasVegas', 'FRIENDS',
{write: true, writeProperty: "between", maxDepth: 4, probability: 0.2}
)
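The idea of sampling source nodes and limiting search depth can be sketched in plain Python on top of Brandes' algorithm for unweighted graphs. This is an illustrative approximation, not the library's implementation: the `betweenness` function, its `seed` parameter, and the path graph are assumptions, and details such as how scores are scaled after sampling may differ in practice.

```python
import random
from collections import deque


def betweenness(adjacency, probability=1.0, max_depth=None, seed=None):
    """Brandes-style betweenness with optional source sampling and depth limit."""
    rng = random.Random(seed)
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        if rng.random() >= probability:
            continue  # this source was not sampled
        dist = {source: 0}    # hops from the source
        sigma = {source: 1}   # number of shortest paths from the source
        preds = {source: []}  # predecessors on shortest paths
        order = []            # nodes in the order the BFS visits them
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            if max_depth is not None and dist[node] >= max_depth:
                continue  # do not explore beyond the depth limit
            for nxt in adjacency[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    sigma[nxt] = 0
                    preds[nxt] = []
                    queue.append(nxt)
                if dist[nxt] == dist[node] + 1:
                    sigma[nxt] += sigma[node]
                    preds[nxt].append(node)
        # Accumulate dependencies from the farthest nodes back to the source.
        delta = {node: 0.0 for node in order}
        for node in reversed(order):
            for pred in preds[node]:
                delta[pred] += sigma[pred] / sigma[node] * (1.0 + delta[node])
            if node != source:
                scores[node] += delta[node]
    return scores


# Made-up undirected path graph a - b - c - d - e.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
exact = betweenness(path)
limited = betweenness(path, max_depth=2)
print(exact)
print(limited)
```

With `probability=1.0` and no depth limit this reduces to exact betweenness, with each undirected pair counted once per direction. Lowering either parameter trades accuracy for speed, which is the trade-off made by choosing a probability of 0.2 and a maxDepth of 4.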
6.3.3 Looking at the Distribution of Betweenness Scores
Before using these scores in our queries, we look at how they are distributed.
MATCH (u:User)
WHERE exists(u.between)
RETURN count(u.between) AS count,
avg(u.between) AS ave,
toInteger(percentileDisc(u.between, 0.5)) AS `50%`,
toInteger(percentileDisc(u.between, 0.75)) AS `75%`,
toInteger(percentileDisc(u.between, 0.90)) AS `90%`,
toInteger(percentileDisc(u.between, 0.95)) AS `95%`,
toInteger(percentileDisc(u.between, 0.99)) AS `99%`,
toInteger(percentileDisc(u.between, 0.999)) AS `99.9%`,
toInteger(percentileDisc(u.between, 0.9999)) AS `99.99%`,
toInteger(percentileDisc(u.between, 0.99999)) AS `99.999%`,
toInteger(percentileDisc(u.between, 1)) AS `100%`
| count | ave | 50% | 75% | 90% | 95% | 99% | 99.9% | 99.99% | 99.999% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|
| 506028 | 320538.6014 | 0 | 10005 | 318944 | 1001655 | 4436409 | 34854988 | 214080923 | 621434012 | 1998032952 |
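Half of the users have a score of 0, meaning they are not well connected at all. The top 1% (the 99% column) lie on at least 4 million shortest paths among our set of 500,000 users. Taken together, this tells us that most users are poorly connected, but a few exert a great deal of control over information, which is typical behavior for small-world networks.
6.3.4 Finding the Super-Connectors
We can find the super-connectors with the following query.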
MATCH(u:User)-[:WROTE]->()-[:REVIEWS]->(:Business {name:"Bellagio Hotel"})
WHERE exists(u.between)
RETURN u.name AS user,
toInteger(u.between) AS betweenness,
u.hotelPageRank AS pageRank,
size((u)-[:WROTE]->()-[:REVIEWS]->()-[:IN_CATEGORY]->
(:Category {name: "Hotels"}))
AS hotelReviews
ORDER BY u.between DESC
LIMIT 10
| user | betweenness | pageRank | hotelReviews |
|---|---|---|---|
| Misti | 841707563 | 12.239516000000004 | 19 |
| Christine | 236269693 | 10.740770499999998 | 16 |
| Erica | 235806844 | 10.993773 | 6 |
| Mike | 215534452 | NULL | 2 |
| J | 192155233 | 11.431505999999997 | 103 |
| Michael | 161335816 | 5.105143 | 31 |
| Jeremy | 160312436 | 9.576763499999998 | 6 |
| Michael | 139960910 | 11.460049 | 13 |
| Chris | 136697785 | 5.838922499999999 | 5 |
| Connie | 133372418 | 9.118103499999998 | 7 |
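We see some of the same people here who appeared in the PageRank query, with Mike being an interesting exception. He was excluded from the PageRank calculation because he has not reviewed enough hotels (three was the cutoff), but he seems to be quite well connected in the world of Las Vegas Yelp users.
6.3.5 Finding the Connectors' Favorite Restaurants
To reach a wider variety of customers, we look at the other preferences these "connectors" display to see what we should promote. Many of these users have also reviewed restaurants, so we can run the following query to find out which ones they like best.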
// Find the top 50 users who have reviewed the Bellagio
MATCH (u:User)-[:WROTE]->()-[:REVIEWS]->(:Business {name:"Bellagio Hotel"})
WHERE u.between > 4436409
WITH u ORDER BY u.between DESC LIMIT 50
// Find the restaurants those users have reviewed in Las Vegas
MATCH (u)-[:WROTE]->(review)-[:REVIEWS]-(business)
WHERE (business)-[:IN_CATEGORY]->(:Category {name: "Restaurants"})
AND   (business)-[:IN_CITY]->(:City {name: "Las Vegas"})
// Only include restaurants that have more than 3 reviews by these users
WITH business, avg(review.stars) AS averageReview, count(*) AS numberOfReviews
WHERE numberOfReviews > 3
// The listing stops at the WITH clause; the filter above and the RETURN below
// are an assumed completion so that the query can run
RETURN business.name AS business, averageReview, numberOfReviews
ORDER BY averageReview DESC, numberOfReviews DESC
LIMIT 10
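The aggregation step of this query (average stars and review count per business, keeping only businesses with more than three reviews from the selected users) can be mirrored in plain Python. In the sketch below the `top_businesses` helper, the sort order (highest average first, then most reviews), the limit of 10, and the sample data are all assumptions for illustration.

```python
from collections import defaultdict


def top_businesses(reviews, min_reviews=3, limit=10):
    """Average the stars per business and keep those with more than min_reviews."""
    stars_by_business = defaultdict(list)
    for business, stars in reviews:
        stars_by_business[business].append(stars)
    rows = [
        (business, sum(stars) / len(stars), len(stars))
        for business, stars in stars_by_business.items()
        if len(stars) > min_reviews
    ]
    # Assumed ordering: best average first, ties broken by review count.
    rows.sort(key=lambda row: (-row[1], -row[2]))
    return rows[:limit]


# Made-up (business, stars) pairs written by the selected users.
sample_reviews = (
    [("Restaurant A", 5)] * 4
    + [("Restaurant B", s) for s in (4, 5, 5, 4, 5)]
    + [("Restaurant C", 5)] * 2
)
favorites = top_businesses(sample_reviews)
for name, average, count in favorites:
    print(name, round(average, 2), count)
```

Restaurant C is dropped despite its perfect average because only two of the selected users reviewed it, which is the purpose of the review-count filter.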
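With the steps above, we can complete the analysis of the Yelp data and provide valuable information for both the travel planning app and the travel business consultancy. The whole analysis workflow can be represented by the following mermaid flowchart:
graph LR
classDef process fill:#E5F6FF,stroke:#73A6FF,stroke-width:2px;
A(Data import):::process --> B(Data overview):::process
B --> C(Travel planning app):::process
C --> C1(Find the most-reviewed hotels):::process
C --> C2(Find influential reviewers):::process
B --> D(Travel business consulting):::process
D --> D1(Review ratings of a specific hotel):::process
D --> D2(Find influential guests with low ratings):::process
D --> D3(Cross-promotion analysis):::process
D3 --> D31(Tag users in a specific city):::process
D31 --> D32(Run Betweenness Centrality):::process
D32 --> D33(Review the score distribution):::process
D33 --> D34(Find the super-connectors):::process
D34 --> D35(Find the connectors' favorite restaurants):::process
In summary, an in-depth analysis of the Yelp data lets us use graph algorithms to support travel planning and travel business consulting, helping businesses better understand customer needs and make better-informed decisions.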