left join/inner join以后数据条数比原数据表多的原因及解决办法

最新推荐文章于 2025-01-17 10:52:05 发布

原创最新推荐文章于 2025-01-17 10:52:05 发布 · 4.1k 阅读

1 ·

CC 4.0 BY-SA版权

文章标签：

#mysql

mysql 专栏收录该内容

21 篇文章

订阅专栏

本文探讨了使用left join/inner join后数据条数增多的现象，分析了其背后的原因，并提出了解决这种问题的优化策略。通过案例展示了在不聚合情况下，查询效率和数据量的关系，以及如何提升查询速度。

页面一：

select vendor, application,applicationversion,filename ,fileversion, lastuse from `ASIA`.applications
where INSTR(vendor, "") and  INSTR(application , "") and  INSTR(filename, "") and lastuse< 1597741978 and lastuse> 1595149978
GROUP BY vendor ,application ,applicationversion ,filename
ORDER BY vendor desc LIMIT 10 OFFSET 0;                           # 没有join表

页面二：

select vendor ,application ,applicationversion ,fileversion,filename ,lastuse ,host.deviceid , hostname from `ASIA`.applications as app
left join  `ASIA`.hostinfo as host on app.deviceid=host.deviceid
where INSTR(hostname, "") and INSTR(vendor, "") and  INSTR(application , "") and  INSTR(applicationversion, "") and  INSTR(filename, "")
GROUP BY vendor ,application ,applicationversion ,filename
ORDER BY vendor desc LIMIT 10 OFFSET 0;              # 有join表

问题：第一个页面比第二个页面快，页面二查询太慢了

分析：

页面1聚合：耗时0.19s

select vendor, application,applicationversion,filename ,fileversion, lastuse from `ASIA`.applications
                            where INSTR(vendor, "") and  INSTR(application , "") and  INSTR(filename, "") and lastuse< 1597741978 and lastuse> 1595149978
                            GROUP BY vendor ,application ,applicationversion ,filename
                            ORDER BY vendor desc LIMIT 10 OFFSET 0;

#条数

select count(*) from `ASIA`.applications
where INSTR(vendor, "") and  INSTR(application , "") and  INSTR(filename, "") and lastuse< 1597741978 and lastuse> 1595149978;    
+----------+
| count(*) |
+----------+
|    28726 |
+----------+
1 row in set (0.05 sec)

页面1不聚合：耗时0.07s

select vendor, application,applicationversion,filename ,fileversion, lastuse from `ASIA`.applications
                            where INSTR(vendor, "") and  INSTR(application , "") and  INSTR(filename, "") and lastuse< 1597741978 and lastuse> 1595149978
                            ORDER BY vendor desc LIMIT 10 OFFSET 0;

页面2聚合：耗时8.69s ，聚合后耗时突然猛增， 因为left join后聚合的数据条数为1897744，远远高于应用库applications的条数49174

select vendor ,application ,applicationversion ,fileversion,filename ,lastuse ,host.deviceid , hostname from `ASIA`.applications as app
            left join  `ASIA`.hostinfo as host on app.deviceid=host.deviceid
            where INSTR(hostname, "") and INSTR(vendor, "") and  INSTR(application , "") and  INSTR(applicationversion, "") and  INSTR(filename, "")
            GROUP BY vendor ,application ,applicationversion ,filename
            ORDER BY vendor desc LIMIT 10 OFFSET 0;

#条数

select count(*) from `ASIA`.applications as app
            join  `ASIA`.hostinfo as host on app.deviceid=host.deviceid
            where INSTR(hostname, "") and INSTR(vendor, "") and  INSTR(application , "") and  INSTR(applicationversion, "") and  INSTR(filename, "");
+----------+
| count(*) |
+----------+
|  1897744 |
+----------+
1 row in set (2.10 sec)

页面2不聚合：耗时0.15s

select vendor ,application ,applicationversion ,fileversion,filename ,lastuse ,host.deviceid , hostname from `ASIA`.applications as app
            left join  `ASIA`.hostinfo as host on app.deviceid=host.deviceid
            where INSTR(hostname, "") and INSTR(vendor, "") and  INSTR(application , "") and  INSTR(applicationversion, "") and  INSTR(filename, "")
            ORDER BY vendor desc LIMIT 10 OFFSET 0;

#应用库中条数49174

mysql> select count(*) from `ASIA`.applications;
+----------+
| count(*) |
+----------+
|    49174 |
+----------+
1 row in set (0.03 sec)

页面2优化：

SELECT vendor ,application ,applicationversion ,fileversion,filename ,lastuse ,host.deviceid , hostname FROM

( select deviceid, vendor ,application ,applicationversion ,fileversion,filename ,lastuse from `ASIA`.applications where INSTR(vendor, "") and INSTR(application , "") and INSTR(applicationversion, "") and INSTR(filename, "") GROUP BY vendor ,application ,applicationversion ,filename

) as app

LEFT JOIN ( select deviceid,hostname from `ASIA`.hostinfo where INSTR(hostname, "") group by deviceid) as host

ON app.deviceid=host.deviceid

ORDER BY vendor desc LIMIT 10 OFFSET 0;

left join后数据条数比原数据表多的原因及解决方法

举例：

A 表（1500行）是用户id和对应年龄，B 表（5000行）是用户id 和对应的爱好（如果有多个爱好就会出现多行数据），现在想通过用户id来链接这两张表，使用 left join（A，B）或者 inner join，结果连完发现生成了一张 3000行的表C（ 但理论上行数应该小于等于表 A）

问题原因：一对多

在 R 中发生这种问题，很大原因是两张表拿来连接的字段（比如用户id），在第二张表里面出现了多次，或者在第一张表里也有重复值

在 SQL 中发生这种问题，还有一种可能性是因为表 B 行数比表 A 多，所以会导致表 C 行数高于表 A

解决办法：

1、当两个表数据都很大时，我们用left join进行查询速度就会明显感觉到有点慢了，这是最好的解决办法就是 尽量给每个表都加上where条件去过滤一下数据

2、可以明确的是多出来的这些行是重复值，不是什么错误值，所以不需要太慌， 无论是在R 还是 SQL中，发生这种问题都可以直接去重解决

在R中的代码，group_by 配合 summarize 去重

在SQL中的代码，group_by 去重

SELECT * FROM ( SELECT * FROM A GROUP BY id ) AS A1

LEFT JOIN ( SELECT * FROM B GROUP BY id ) AS B1

ON A1.id = B1.id;