Nested loops, Hash join and Sort Merge joins – difference?

本文详细介绍了SQL中的三种主要连接方式:嵌套循环连接、哈希连接及排序合并连接。每种连接方式都包括其工作原理、适用场景以及优缺点。这对于理解数据库查询优化器如何选择最有效的连接算法至关重要。

from:http://oracle-online-help.blogspot.com/2007/03/nested-loops-hash-join-and-sort-merge.html

 

Nested loop (loop over loop)

In this algorithm, an outer loop is formed which consists of few entries and then for each entry, and inner loop is processed.

Ex:

Select tab1.*, tab2.* from tabl, tab2 where tabl.col1=tab2.col2;

It is processed like:

For i in (select * from tab1) loop
For j in (select * from tab2 where col2=i.col1) loop
Display results;
End loop;
End loop;

The Steps involved in doing nested loop are:

a) <!--[endif]-->Identify outer (driving) table

<!--[if !supportLists]--> b) <!--[endif]-->Assign inner (driven) table to outer table.

<!--[if !supportLists]--> c) <!--[endif]-->For every row of outer table, access the rows of inner table.

In execution plan it is seen like this:

 

NESTED LOOPS
outer_loop
inner_loop

When optimizer uses nested loops?

Optimizer uses nested loop when we are joining tables containing small number of rows with an efficient driving condition. It is important to have an index on column of inner join table as this table is probed every time for a new value from outer table.

Optimizer may not use nested loop in case:

<!--[endif]-->

  1. No of rows of both the table is quite high
  2. Inner query always results in same set of records
  3. The access path of inner table is independent of data coming from outer table.

<!--[endif]-->

Note: You will see more use of nested loop when using FIRST_ROWS optimizer mode as it works on model of showing instantaneous results to user as they are fetched. There is no need for selecting caching any data before it is returned to user. In case of hash join it is needed and is explained below.

 

Hash join

Hash joins are used when the joining large tables. The optimizer uses smaller of the 2 tables to build a hash table in memory and the scans the large tables and compares the hash value (of rows from large table) with this hash table to find the joined rows.

The algorithm of hash join is divided in two parts

<!--[if !supportLists]--><!--[endif]-->

  1. Build a in-memory hash table on smaller of the two tables.
  2. Probe this hash table with hash value for each row second table

<!--[endif]-->

 

In simpler terms it works like

Build phase

For each row RW1 in small (left/build) table loop
Calculate hash value on RW1 join key
Insert RW1 in appropriate hash bucket.
End loop;

 

Probe Phase

For each row RW2 in big (right/probe) table loop
Calculate the hash value on RW2 join key
For each row RW1 in hash table loop
If RW1 joins with RW2
Return RW1, RW2
End loop;
End loop;

 

When optimizer uses hash join?

 

Optimizer uses has join while joining big tables or big fraction of small tables.

Unlike nested loop, the output of hash join result is not instantaneous as hash joining is blocked on building up hash table.

 

Note: You may see more hash joins used with ALL_ROWS optimizer mode, because it works on model of showing results after all the rows of at least one of the tables are hashed in hash table.

Sort merge join

Sort merge join is used to join two independent data sources. They perform better than nested loop when the volume of data is big in tables but not as good as hash joins in general.

They perform better than hash join when the join condition columns are already sorted or there is no sorting required.

The full operation is done in two parts:

  • <!--[if !supportLists]-->Sort join operation

get first row RW1 from input1
get first row RW2 from input2.


  • Merge join operation

while not at the end of either input loop
if RW1 joins with RW2
get next row R2 from input 2
return (RW1, RW2)
else if RW1 < style=""> get next row RW1 from input 1
else
get next row RW2 from input 2
end loop

Note: If the data is already sorted, first step is avoided.

Important point to understand is, unlike nested loop where driven (inner) table is read as many number of times as the input from outer table, in sort merge join each of the tables involved are accessed at most once. So they prove to be better than nested loop when the data set is large.

When optimizer uses Sort merge join?

<!--[if !supportLists]-->a) <!--[endif]-->When the join condition is an inequality condition (like <, <=, >=). This is because hash join cannot be used for inequality conditions and if the data set is large, nested loop is definitely not an option.

<!--[if !supportLists]-->b) <!--[endif]-->If sorting is anyways required due to some other attribute (other than join) like “order by”, optimizer prefers sort merge join over hash join as it is cheaper.

 

Note: Sort merge join can be seen with both ALL_ROWS and FIRST_ROWS optimizer hint because it works on a model of first sorting both the data sources and then start returning the results. So if the data set is large and you have FIRST_ROWS as optimizer goal, optimizer may prefer sort merge join over nested loop because of large data. And if you have ALL_ROWS as optimizer goal and if any inequality condition is used the SQL, optimizer may use sort-merge join over hash join

<think> 我们正在讨论SQL Server中优化器选择的联接方式(Hash JoinNested Loops等)。需要详细解释如何观察以及这些联接方式的适用场景,最后提出相关问题。 注意:必须用中文回答,并且在最后输出“§§相关问题§§”以及3-5个相关问题。 回答结构: 1. 引言:简要介绍联接方式的重要性。 2. 观察方法:如何获取执行计划并识别联接方式。 3. 各种联接方式的特点和适用场景。 4. 影响优化器选择联接方式的要素。 5. 实战案例。 6. 注意事项。 7. 相关问题。 注意:由于之前的问题都是关于SQL Server性能优化的,这里继续在同一背景下讨论。 </think> ### SQL Server 联接方式优化分析(Hash Join vs Nested Loops) 在 SQL Server 查询优化中,观察优化器选择的联接方式(Hash JoinNested LoopsMerge Join)是性能调优的关键环节。以下是完整的分析方法: --- ### **核心联接方式识别方法** #### 1. 通过执行计划图形界面查看 - **操作符图标识别**: - 🔄 **Nested Loops**:嵌套循环图标(小圆环嵌套) - 💧 **Hash Join**:水滴形图标(哈希桶) - ↔️ **Merge Join**:双箭头合并图标 - **属性窗口关键指标**: ```output |--Hash Match(Inner Join, HASH:([t1].ID)=([t2].ID)) |--Nested Loops(Inner Join, OUTER REFERENCES:([t1].DeptID)) ``` #### 2. T-SQL 获取执行计划细节 ```sql SET SHOWPLAN_TEXT ON; -- 获取文本执行计划 GO SELECT * FROM Orders o INNER JOIN Customers c ON o.CustomerID = c.CustomerID; GO SET SHOWPLAN_TEXT OFF; ``` **输出关键行**: ```output |--Hash Match(Inner Join, HASH:([c].[CustomerID])=([o].[CustomerID])) ``` --- ### **三种联接方式特性对比** | 特性 | Nested Loops | Hash Join | Merge Join | |------|-------------|-----------|-----------| | **适用场景** | 小表驱动大表<br>索引支持好 | 中大型表等值连接<br>无索引支持 | 预排序数据集<br>大规模等值连接 | | **内存消耗** | 低 | 高(需建哈希表) | 中等 | | **执行效率** | O(n*m) | O(n+m) | O(n+m) | | **最佳数据量** | 外层表<1000行 | 内层表>10000行 | 两表均>100000行 | | **索引依赖** | 必需驱动表索引 | 不依赖索引 | 需要已排序输入 | --- ### **优化器选择逻辑深度解析** #### 影响决策的关键因素 1. **统计信息准确性** - 过时统计信息可能导致误选 Hash Join(需 `UPDATE STATISTICS`) 2. **内存授予预估** - 不足的内存会使 Hash Join 转为磁盘操作(`tempdb` 瓶颈) 3. **索引覆盖情况** ```sql -- 有索引:倾向 Nested Loops CREATE INDEX IX_CustomerID ON Customers(CustomerID) INCLUDE (Name) ``` 4. **查询提示强制** ```sql SELECT * FROM Table1 INNER LOOP JOIN Table2 -- 强制使用Nested Loops ``` #### 性能风险预警信号 - ⚠️ **大数据量 Nested Loops**:超长执行时间 - ️ **内存溢出警告**:`WARNING: Hash spilling to tempdb` - ⚠️ **排序溢出**:`Sort operator used disk space` --- ### **实战优化案例** #### 场景:订单-客户表联接效率低下 ```sql -- 原始查询(执行计划显示Hash Join) SELECT o.OrderID, c.Name FROM Orders o JOIN Customers c ON o.CustomerID = c.CustomerID WHERE o.OrderDate > '2023-01-01' -- 优化方案1:创建覆盖索引 CREATE INDEX IX_OrderDate ON Orders(CustomerID, OrderDate) -- 优化方案2:强制更优联接方式(需测试验证) SELECT /*+ LOOP JOIN */ o.OrderID, c.Name FROM Orders o JOIN Customers c ON o.CustomerID = c.CustomerID OPTION (RECOMPILE) ``` --- ### **高级调试技巧** 1. **实时开销分析** ```sql SET STATISTICS TIME, IO ON -- 查看实际I/O和CPU消耗 ``` 2. **内存授予监控** ```sql SELECT * FROM sys.dm_exec_query_memory_grants WHERE session_id = @@SPID ``` 3. **执行计划对比** 使用 SSMS "Compare Showplan" 工具对比不同联接方式的代价 --- ### **注意事项** 1. **参数嗅探陷阱** 不同参数可能导致联接方式突变(使用 `OPTION (RECOMPILE)`) 2. **临时表优化** 对中间结果集创建索引可改变联接策略: ```sql SELECT * INTO #Temp FROM ... CREATE CLUSTERED INDEX IX_Temp ON #Temp(ID) ``` 3. **批处理模式影响** 列存储索引会优先选择 Batch Mode Hash Join
评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值