在文章 Hash Join 中介绍了 Hash Join的原理及实现方法,在Python实现中作者使用了defaultdict,本文是对这篇文章的一些补充,另外,实现的时候没有使用defaultdict:
#1. 表在Python中的存放方式
# 可以使用一个元组(tuple)表示一行,使用列表(list)表示一张表,如下所示:
tableA = []
row1 = (27, "Jonah")
tableA.append(row1)
tableA.append((18, "Alan"))
tableA.append((28, "Glory"))
tableA.append((18, "Popeye"))
tableA.append((28, "Alan"))
tableB = [("Jonah", "Whales"),
("Jonah", "Spiders"),
("Alan", "Ghosts"),
("Alan", "Zombies"),
("Glory", "Buffy")]
# 模拟SQL: select * from tableA where Name="Alan"
for row in tableA :
if row[1] == "Alan":
print(row)
# multimap 表的存放方式: 该表要支持基于哈希的查找,因此直接使用字典(dict)即可,我们的目标是构造一个如下的哈希表:
MB = {
"Jonah" : [("Jonah","Whales"), ("Jonah","Spiders")],
"Alan" : [("Alan","Ghosts"), ("Alan","Zombies")],
"Glory" : [("Glory","Buffy")]
}
# 先构造一个字典,该字典的key包含了tableB的Character列的所有值,value均为空集合
# {'Jonah': [], 'Alan': [], 'Glory': []}
MB = {}
for row in tableB:
MB[row[0]]=[]
print(MB)
# 构建hash表
for row in tableB:
MB[row[0]].append(row)
print(MB)
#Hash Join
print("**********")
tableC=[]
for row_a in tableA:
if row_a[1] not in MB:
continue
else:
for row_mb in MB[row_a[1]]:
row_c = row_a + row_mb
print(row_c)
tableC.append(row_c)
输出中最后部分如下:
(27, 'Jonah', 'Jonah', 'Whales')
(27, 'Jonah', 'Jonah', 'Spiders')
(18, 'Alan', 'Alan', 'Ghosts')
(18, 'Alan', 'Alan', 'Zombies')
(28, 'Glory', 'Glory', 'Buffy')
(28, 'Alan', 'Alan', 'Ghosts')
(28, 'Alan', 'Alan', 'Zombies')