MySQL优化器特性(一)IN和Exists(semijoin)子查询优化策略

俊达2年前技术文章1239

这篇文章中的SQL和执行计划在mysql 8.0.31环境下进行测试。


测试的表结构和数据:


表结构

mysql> show create table tp\G
*************************** 1. row ***************************
       Table: tp
Create Table: CREATE TABLE `tp` (
  `a` int DEFAULT NULL,
  `b` int DEFAULT NULL,
  `c` int DEFAULT NULL,
  KEY `idx_a` (`a`)
) ENGINE=InnoDB
1 row in set (0.00 sec)

mysql> show create table ty\G
*************************** 1. row ***************************
       Table: ty
Create Table: CREATE TABLE `ty` (
  `a` int DEFAULT NULL,
  `b` int DEFAULT NULL,
  `c` int DEFAULT NULL,
  `d` int DEFAULT NULL,
  UNIQUE KEY `uk_cb` (`c`,`b`),
  KEY `idx_abc` (`a`,`b`,`c`)
) ENGINE=InnoDB
1 row in set (0.00 sec)


测试数据

mysql> select * from tp;
+------+------+------+
| a    | b    | c    |
+------+------+------+
|    1 |    1 |    1 |
|    2 |    2 |    2 |
|    3 |    3 |    3 |
|    4 |    4 |    4 |
|    5 |    5 |    5 |
| NULL |    0 |    0 |
+------+------+------+
6 rows in set (0.00 sec)

mysql> select * from ty;
+------+------+------+------+
| a    | b    | c    | d    |
+------+------+------+------+
|    1 |    1 |    1 |    1 |
|    2 |    2 |    2 |    2 |
|    3 |    3 |    3 |    3 |
|    2 |    4 |    4 |    2 |
+------+------+------+------+



子查询

以下是一个简单的子查询:

mysql>  select * from tp where a in (select d from ty where a = 1);
+------+------+------+
| a    | b    | c    |
+------+------+------+
|    1 |    1 |    1 |
+------+------+------+
1 row in set (0.00 sec)


如果没有开启semijoin优化,执行计划是这样的:

set optimizer_switch='semijoin=off';

mysql> explain select * from tp where a in (select d from ty where a = 1);
+----+--------------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------------+
| id | select_type        | table | partitions | type | possible_keys | key     | key_len | ref   | rows | filtered | Extra       |
+----+--------------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------------+
|  1 | PRIMARY            | tp    | NULL       | ALL  | NULL          | NULL    | NULL    | NULL  |    6 |   100.00 | Using where |
|  2 | DEPENDENT SUBQUERY | ty    | NULL       | ref  | idx_abc       | idx_abc | 5       | const |    1 |    25.00 | Using where |
+----+--------------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------------+


mysql> explain format=tree select * from tp where a in (select d from ty where a = 1);
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                                                                                                                                                     |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Filter: <in_optimizer>(tp.a,<exists>(select #2))  (cost=0.85 rows=6)
    -> Table scan on tp  (cost=0.85 rows=6)
    -> Select #2 (subquery in condition; dependent)
        -> Limit: 1 row(s)  (cost=0.28 rows=0.2)
            -> Filter: (<cache>(tp.a) = ty.d)  (cost=0.28 rows=0.2)
                -> Index lookup on ty using idx_abc (a=1)  (cost=0.28 rows=1)


上述执行计划,以tp为驱动表,对tp表的每一条记录,执行子查询(select from ty where a=2 and d = tp.a limit 1)。如果tp表的记录数很大,则SQL执行代价会比较高。


改写SQL

类似上面的SQL,我们可以将子查询改写为普通的关联查询:

mysql> select tp.* from ty join tp on ty.d = tp.a where ty.a=1;
+------+------+------+
| a    | b    | c    |
+------+------+------+
|    1 |    1 |    1 |
+------+------+------+
1 row in set (0.00 sec)


我们来看一下改写后的执行计划:

mysql> explain select tp.* from ty join tp on ty.d = tp.a where ty.a=1;
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key     | key_len | ref       | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+-------------+
|  1 | SIMPLE      | ty    | NULL       | ref  | idx_abc       | idx_abc | 5       | const     |    1 |   100.00 | Using where |
|  1 | SIMPLE      | tp    | NULL       | ref  | idx_a         | idx_a   | 5       | test.ty.d |    1 |   100.00 | NULL        |
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+-------------+


mysql> explain format=tree select tp.* from ty join tp on ty.d = tp.a where ty.a=1;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                       |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Nested loop inner join  (cost=0.70 rows=1)
    -> Filter: (ty.d is not null)  (cost=0.35 rows=1)
        -> Index lookup on ty using idx_abc (a=1)  (cost=0.35 rows=1)
    -> Index lookup on tp using idx_a (a=ty.d)  (cost=0.35 rows=1)



查询改写之后,可以使用ty表作为驱动表。在关联tp表时,使用了idx_a,避免了tp表的全表扫描。

但是,上面的改写还存在一个问题,当子查询的结果中存在重复记录时,改写后的关联查询结果也会返回重复的数据:

mysql> select * from tp where a in (select d from ty where a = 2);
+------+------+------+
| a    | b    | c    |
+------+------+------+
|    2 |    2 |    2 |
+------+------+------+
1 row in set (0.00 sec)


改写后的SQL,返回的数据有重复:

mysql> select tp.* from ty join tp on ty.d = tp.a where ty.a=2;
+------+------+------+
| a    | b    | c    |
+------+------+------+
|    2 |    2 |    2 |
|    2 |    2 |    2 |
+------+------+------+
2 rows in set (0.00 sec)

这种情况下,可以使用distinct对结果集进行去重:

mysql> select distinct tp.* from ty join tp on ty.d = tp.a where ty.a=2;
+------+------+------+
| a    | b    | c    |
+------+------+------+
|    2 |    2 |    2 |
+------+------+------+
1 row in set (0.00 sec)


当然去重会带来额外的成本。我们来看一下执行计划:


mysql> explain select distinct tp.* from ty join tp on ty.d = tp.a where ty.a=2;
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key     | key_len | ref       | rows | filtered | Extra                        |
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+------------------------------+
|  1 | SIMPLE      | ty    | NULL       | ref  | idx_abc       | idx_abc | 5       | const     |    2 |   100.00 | Using where; Using temporary |
|  1 | SIMPLE      | tp    | NULL       | ref  | idx_a         | idx_a   | 5       | test.ty.d |    1 |   100.00 | NULL                         |
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+------------------------------+
2 rows in set, 1 warning (0.00 sec)

mysql> explain format=tree select distinct tp.* from ty join tp on ty.d = tp.a where ty.a=2;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                                                                                                                                                                                  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Table scan on <temporary>  (cost=2.86..4.12 rows=2)
    -> Temporary table with deduplication  (cost=1.60..1.60 rows=2)
        -> Nested loop inner join  (cost=1.40 rows=2)
            -> Filter: (ty.d is not null)  (cost=0.70 rows=2)
                -> Index lookup on ty using idx_abc (a=2)  (cost=0.70 rows=2)
            -> Index lookup on tp using idx_a (a=ty.d)  (cost=0.30 rows=1)


优化器的子查询优化策略

对于上述的子查询,MySQL 8.0存在多种策略,优化器会根据具体情况,自动选择合适的优化方法,不需要手动改写SQL。这些优化策略分别是pullout, duplicate weedout, first match, loose scan, and materialization。

  • Pullout:将子查询提到外层,改写成表关联。和我们手动改写SQL类似。

  • Duplicate weedout:如果子查询中的数据可能存在重复,优化器会自动加上去重。

  • First Match:只需要匹配到第一条记录。

  • Loose Scan:子查询利用索引的有序性进去去重。

  • Materialization:将子查询的结果存储在临时表,临时表和父表进行关联。

下面通过具体的例子来分析上述几种优化策略。


Pullout

将子查询转换成普通表关联,给优化器更多的选择。

由于子查询有唯一索引(uk_bc),保证了无重复数据,转换成关联查询后,不需要再额外进行去重。

mysql> explain select * from tp where a in (select b from ty where c = 1);
+----+-------------+-------+------------+------+---------------+-------+---------+-----------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key   | key_len | ref       | rows | filtered | Extra                    |
+----+-------------+-------+------------+------+---------------+-------+---------+-----------+------+----------+--------------------------+
|  1 | SIMPLE      | ty    | NULL       | ref  | uk_cb         | uk_cb | 5       | const     |    1 |   100.00 | Using where; Using index |
|  1 | SIMPLE      | tp    | NULL       | ref  | idx_a         | idx_a | 5       | test.ty.b |    1 |   100.00 | NULL                     |
+----+-------------+-------+------------+------+---------------+-------+---------+-----------+------+----------+--------------------------+
2 rows in set, 1 warning (0.00 sec)

mysql> explain format=tree select * from tp where a in (select b from ty where c = 1);
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                              |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Nested loop inner join  (cost=0.70 rows=1)
    -> Filter: (ty.b is not null)  (cost=0.35 rows=1)
        -> Covering index lookup on ty using uk_cb (c=1)  (cost=0.35 rows=1)
    -> Index lookup on tp using idx_a (a=ty.b)  (cost=0.35 rows=1)


Duplicate Weedout

如果子查询中select的字段可能存在重复值,则查询转换后,可能会出现重复数据,需要对结果数据进行去重处理。

mysql> explain format=tree select * from tp where a in (select d from ty where a in (1,3));
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                                                                                                                                                                                    |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Remove duplicate tp rows using temporary table (weedout)  (cost=2.11 rows=2)
    -> Nested loop inner join  (cost=2.11 rows=2)
        -> Filter: (ty.d is not null)  (cost=1.41 rows=2)
            -> Index range scan on ty using idx_abc over (a = 1) OR (a = 3), with index condition: (ty.a in (1,3))  (cost=1.41 rows=2)
        -> Index lookup on tp using idx_a (a=ty.d)  (cost=0.30 rows=1)



mysql> explain select * from tp where a in (select d from ty where a in (1,3));
+----+-------------+-------+------------+-------+---------------+---------+---------+-----------+------+----------+-----------------------------------------------------+
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref       | rows | filtered | Extra                                               |
+----+-------------+-------+------------+-------+---------------+---------+---------+-----------+------+----------+-----------------------------------------------------+
|  1 | SIMPLE      | ty    | NULL       | range | idx_abc       | idx_abc | 5       | NULL      |    2 |   100.00 | Using index condition; Using where; Start temporary |
|  1 | SIMPLE      | tp    | NULL       | ref   | idx_a         | idx_a   | 5       | test.ty.d |    1 |   100.00 | End temporary                                       |
+----+-------------+-------+------------+-------+---------------+---------+---------+-----------+------+----------+-----------------------------------------------------+


执行计划中Extra列显示的“Start temporary“,”End temporary“就是使用临时表进行数据去重。


First match

mysql> explain select * from tp where c in (select c from ty);
+----+-------------+-------+------------+------+---------------+-------+---------+-----------+------+----------+-----------------------------+
| id | select_type | table | partitions | type | possible_keys | key   | key_len | ref       | rows | filtered | Extra                       |
+----+-------------+-------+------------+------+---------------+-------+---------+-----------+------+----------+-----------------------------+
|  1 | SIMPLE      | tp    | NULL       | ALL  | NULL          | NULL  | NULL    | NULL      |    6 |   100.00 | Using where                 |
|  1 | SIMPLE      | ty    | NULL       | ref  | uk_cb         | uk_cb | 5       | test.tp.c |    1 |   100.00 | Using index; FirstMatch(tp) |



mysql> explain format=tree select * from tp where c in (select c from ty);
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Nested loop semijoin  (cost=2.95 rows=6)
    -> Filter: (tp.c is not null)  (cost=0.85 rows=6)
        -> Table scan on tp  (cost=0.85 rows=6)
    -> Covering index lookup on ty using uk_cb (c=tp.c)  (cost=0.27 rows=1)


关联ty表时,使用First Match,也就是关联到一条记录就返回。


LooseScan

子查询中,可以利用索引的有序性对select列表中的字段进行去重,无需借助临时表。(where条件中的等值查询,加上select字段,是表中某个索引的前缀)。


mysql> explain  select * from tp where a in (select d from ty where a in (2));
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key     | key_len | ref       | rows | filtered | Extra                        |
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+------------------------------+
|  1 | SIMPLE      | ty    | NULL       | ref  | idx_abc       | idx_abc | 5       | const     |    2 |   100.00 | Using where; Start temporary |
|  1 | SIMPLE      | tp    | NULL       | ref  | idx_a         | idx_a   | 5       | test.ty.d |    1 |   100.00 | End temporary                |
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+------------------------------+
2 rows in set, 1 warning (0.00 sec)

mysql> explain select * from tp where a in ( select b from ty where a = 1 );
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+-------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key     | key_len | ref       | rows | filtered | Extra                               |
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+-------------------------------------+
|  1 | SIMPLE      | ty    | NULL       | ref  | idx_abc       | idx_abc | 5       | const     |    1 |   100.00 | Using where; Using index; LooseScan |
|  1 | SIMPLE      | tp    | NULL       | ref  | idx_a         | idx_a   | 5       | test.ty.b |    1 |   100.00 | NULL                                |
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+-------------------------------------+
2 rows in set, 1 warning (0.00 sec)

mysql> explain format=tree select * from tp where a in ( select b from ty where a = 1 );
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                                                                                                                  |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Nested loop inner join  (cost=0.70 rows=1)
    -> Remove duplicates from input sorted on idx_abc  (cost=0.35 rows=1)
        -> Filter: (ty.b is not null)  (cost=0.35 rows=1)
            -> Covering index lookup on ty using idx_abc (a=1)  (cost=0.35 rows=1)
    -> Index lookup on tp using idx_a (a=ty.b)  (cost=0.35 rows=1)


下列SQL可以使用loose scan:

select * from tp where a in ( select a from ty );

select * from tp where a in ( select b from ty where a = 1 )

select * from tp where (b,c) in ( select b,c from ty where a = 1 )

select * from tp where (b,c) in ( select a,b from ty where a in(1,2) )


而下列SQL都不可以使用loose scan

Materialize with deduplication

mysql> explain  select * from tp where a in (select d from ty where a in (2));
+----+--------------+-------------+------------+------+---------------+---------+---------+---------------+------+----------+-------------+
| id | select_type  | table       | partitions | type | possible_keys | key     | key_len | ref           | rows | filtered | Extra       |
+----+--------------+-------------+------------+------+---------------+---------+---------+---------------+------+----------+-------------+
|  1 | SIMPLE       | <subquery2> | NULL       | ALL  | NULL          | NULL    | NULL    | NULL          | NULL |   100.00 | Using where |
|  1 | SIMPLE       | tp          | NULL       | ref  | idx_a         | idx_a   | 5       | <subquery2>.d |    1 |   100.00 | NULL        |
|  2 | MATERIALIZED | ty          | NULL       | ref  | idx_abc       | idx_abc | 5       | const         |    2 |   100.00 | NULL        |
+----+--------------+-------------+------------+------+---------------+---------+---------+---------------+------+----------+-------------+

mysql> explain format=tree select * from tp where a in (select d from ty where a in (2));
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Nested loop inner join  (cost=1.10 rows=2)
    -> Filter: (`<subquery2>`.d is not null)  (cost=0.65..0.40 rows=2)
        -> Table scan on <subquery2>  (cost=2.16..3.42 rows=2)
            -> Materialize with deduplication  (cost=0.90..0.90 rows=2)
                -> Filter: (ty.d is not null)  (cost=0.70 rows=2)
                    -> Index lookup on ty using idx_abc (a=2)  (cost=0.70 rows=2)
    -> Index lookup on tp using idx_a (a=`<subquery2>`.d)  (cost=0.60 rows=1)


1、先将子查询物化(Materialize),生产一个临时表。生成临时表的时候同时对数据进行去重。

2、关联tp表。


子查询优化相关的优化器参数

optimizer_switch可以控制上述几种优化策略是否启用。这些优化策略默认开启。

optimizer_switch

描述

semijoin

semijoin优化策略总开关。

loosescan

是否开启loose scan策略

firstmatch

是否开启first match优化策略

duplicateweedout

是否开启duplicate weedout优化策略

materialization

是否开启materialization优化策略


除了使用optimizer_switch参数,也可以通过查询hint(SEMIJOIN, NO_SEMIJOIN)来启用或禁用semijoin优化策略。


Hint的具体使用方法这里不详细展开。可以通过官方文档查看具体使用方法: https://dev.mysql.com/doc/refman/8.0/en/optimizer-hints.html#optimizer-hints-subquery




相关文章

MySQL优化器特性(四)表关联之BNL(Block Nested Loop)和Hash Join

MySQL优化器特性(四)表关联之BNL(Block Nested Loop)和Hash Join

什么是BNLMySQL表关联时,如果关联条件上没有合适的索引,则join时,对于驱动表的每一条记录,都需要全表扫描被驱动表。如果驱动表有多条数据,则需要多次全表扫描被驱动表,查询性能很差。对于这种情况...

 MySQL优化器特性(九)行数评估

MySQL优化器特性(九)行数评估

查询的行数在成本计算中起了很重要的作用:1、row_evaluate_cost和行数直接相关2、需要访问多少索引页面,和行数直接相关。根据页面大小和平均索引条目长度计算每个索引页面的记录数,根据记录数...

MySQL优化器特性(五)单表访问路径

数据库的访问路径(access path)是指从表中获取数据的方式,一般可以通过扫描表或通过索引获取数据。想熟练掌握SQL优化技能,首先需要熟悉单表访问路径。本文先简单介绍MySQL支持的各种单表访问...

MySQL优化器特性(八)索引范围扫描成本计算

MySQL优化器特性(八)索引范围扫描成本计算

range执行计划中的range表示索引范围扫描。索引范围扫描的执行过程大致如下:1、根据where条件中索引字段的条件,定位到索引结构中的第一条满足条件的记录。2、根据索引中记录的rowid,到表中...

MySQL优化器特性(三)表关联之BKA(Batched Key Access)优化

MySQL优化器特性(三)表关联之BKA(Batched Key Access)优化

单表range查询时,可以使用MRR优化,先对rowid进行排序,然后再回表查询数据。在表关联的时候,也可以使用类似的优化方法,先根据关联条件取出被关联表的rowid,将rowid缓存在join bu...

MySQL优化器特性(六)表扫描成本计算

全表扫描成本使用optimizer_trace,或者使用explain format=tree, 或者explain format=json,可以查看查询的costmysql> exp...

发表评论    

◎欢迎参与讨论,请在这里发表您的看法、交流您的观点。