MySQL优化器特性（一）IN和Exists(semijoin)子查询优化策略

俊达2年前 (2023-04-24)技术文章2017

这篇文章中的SQL和执行计划在mysql 8.0.31环境下进行测试。

测试的表结构和数据：

表结构

mysql> show create table tp\G
*************************** 1. row ***************************
       Table: tp
Create Table: CREATE TABLE `tp` (
  `a` int DEFAULT NULL,
  `b` int DEFAULT NULL,
  `c` int DEFAULT NULL,
  KEY `idx_a` (`a`)
) ENGINE=InnoDB
1 row in set (0.00 sec)

mysql> show create table ty\G
*************************** 1. row ***************************
       Table: ty
Create Table: CREATE TABLE `ty` (
  `a` int DEFAULT NULL,
  `b` int DEFAULT NULL,
  `c` int DEFAULT NULL,
  `d` int DEFAULT NULL,
  UNIQUE KEY `uk_cb` (`c`,`b`),
  KEY `idx_abc` (`a`,`b`,`c`)
) ENGINE=InnoDB
1 row in set (0.00 sec)

测试数据

mysql> select * from tp;
+------+------+------+
| a    | b    | c    |
+------+------+------+
|    1 |    1 |    1 |
|    2 |    2 |    2 |
|    3 |    3 |    3 |
|    4 |    4 |    4 |
|    5 |    5 |    5 |
| NULL |    0 |    0 |
+------+------+------+
6 rows in set (0.00 sec)

mysql> select * from ty;
+------+------+------+------+
| a    | b    | c    | d    |
+------+------+------+------+
|    1 |    1 |    1 |    1 |
|    2 |    2 |    2 |    2 |
|    3 |    3 |    3 |    3 |
|    2 |    4 |    4 |    2 |
+------+------+------+------+

子查询

以下是一个简单的子查询：

mysql>  select * from tp where a in (select d from ty where a = 1);
+------+------+------+
| a    | b    | c    |
+------+------+------+
|    1 |    1 |    1 |
+------+------+------+
1 row in set (0.00 sec)

如果没有开启semijoin优化，执行计划是这样的：

set optimizer_switch='semijoin=off';

mysql> explain select * from tp where a in (select d from ty where a = 1);
+----+--------------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------------+
| id | select_type        | table | partitions | type | possible_keys | key     | key_len | ref   | rows | filtered | Extra       |
+----+--------------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------------+
|  1 | PRIMARY            | tp    | NULL       | ALL  | NULL          | NULL    | NULL    | NULL  |    6 |   100.00 | Using where |
|  2 | DEPENDENT SUBQUERY | ty    | NULL       | ref  | idx_abc       | idx_abc | 5       | const |    1 |    25.00 | Using where |
+----+--------------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------------+


mysql> explain format=tree select * from tp where a in (select d from ty where a = 1);
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                                                                                                                                                     |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Filter: <in_optimizer>(tp.a,<exists>(select #2))  (cost=0.85 rows=6)
    -> Table scan on tp  (cost=0.85 rows=6)
    -> Select #2 (subquery in condition; dependent)
        -> Limit: 1 row(s)  (cost=0.28 rows=0.2)
            -> Filter: (<cache>(tp.a) = ty.d)  (cost=0.28 rows=0.2)
                -> Index lookup on ty using idx_abc (a=1)  (cost=0.28 rows=1)

上述执行计划，以tp为驱动表，对tp表的每一条记录，执行子查询（select from ty where a=2 and d = tp.a limit 1)。如果tp表的记录数很大，则SQL执行代价会比较高。

改写SQL

类似上面的SQL，我们可以将子查询改写为普通的关联查询：

mysql> select tp.* from ty join tp on ty.d = tp.a where ty.a=1;
+------+------+------+
| a    | b    | c    |
+------+------+------+
|    1 |    1 |    1 |
+------+------+------+
1 row in set (0.00 sec)

我们来看一下改写后的执行计划：

mysql> explain select tp.* from ty join tp on ty.d = tp.a where ty.a=1;
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key     | key_len | ref       | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+-------------+
|  1 | SIMPLE      | ty    | NULL       | ref  | idx_abc       | idx_abc | 5       | const     |    1 |   100.00 | Using where |
|  1 | SIMPLE      | tp    | NULL       | ref  | idx_a         | idx_a   | 5       | test.ty.d |    1 |   100.00 | NULL        |
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+-------------+


mysql> explain format=tree select tp.* from ty join tp on ty.d = tp.a where ty.a=1;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                       |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Nested loop inner join  (cost=0.70 rows=1)
    -> Filter: (ty.d is not null)  (cost=0.35 rows=1)
        -> Index lookup on ty using idx_abc (a=1)  (cost=0.35 rows=1)
    -> Index lookup on tp using idx_a (a=ty.d)  (cost=0.35 rows=1)

查询改写之后，可以使用ty表作为驱动表。在关联tp表时，使用了idx_a，避免了tp表的全表扫描。

但是，上面的改写还存在一个问题，当子查询的结果中存在重复记录时，改写后的关联查询结果也会返回重复的数据：

mysql> select * from tp where a in (select d from ty where a = 2);
+------+------+------+
| a    | b    | c    |
+------+------+------+
|    2 |    2 |    2 |
+------+------+------+
1 row in set (0.00 sec)

改写后的SQL，返回的数据有重复：

mysql> select tp.* from ty join tp on ty.d = tp.a where ty.a=2;
+------+------+------+
| a    | b    | c    |
+------+------+------+
|    2 |    2 |    2 |
|    2 |    2 |    2 |
+------+------+------+
2 rows in set (0.00 sec)

这种情况下，可以使用distinct对结果集进行去重：

mysql> select distinct tp.* from ty join tp on ty.d = tp.a where ty.a=2;
+------+------+------+
| a    | b    | c    |
+------+------+------+
|    2 |    2 |    2 |
+------+------+------+
1 row in set (0.00 sec)

当然去重会带来额外的成本。我们来看一下执行计划：

mysql> explain select distinct tp.* from ty join tp on ty.d = tp.a where ty.a=2;
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key     | key_len | ref       | rows | filtered | Extra                        |
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+------------------------------+
|  1 | SIMPLE      | ty    | NULL       | ref  | idx_abc       | idx_abc | 5       | const     |    2 |   100.00 | Using where; Using temporary |
|  1 | SIMPLE      | tp    | NULL       | ref  | idx_a         | idx_a   | 5       | test.ty.d |    1 |   100.00 | NULL                         |
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+------------------------------+
2 rows in set, 1 warning (0.00 sec)

mysql> explain format=tree select distinct tp.* from ty join tp on ty.d = tp.a where ty.a=2;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                                                                                                                                                                                  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Table scan on <temporary>  (cost=2.86..4.12 rows=2)
    -> Temporary table with deduplication  (cost=1.60..1.60 rows=2)
        -> Nested loop inner join  (cost=1.40 rows=2)
            -> Filter: (ty.d is not null)  (cost=0.70 rows=2)
                -> Index lookup on ty using idx_abc (a=2)  (cost=0.70 rows=2)
            -> Index lookup on tp using idx_a (a=ty.d)  (cost=0.30 rows=1)

优化器的子查询优化策略

对于上述的子查询，MySQL 8.0存在多种策略，优化器会根据具体情况，自动选择合适的优化方法，不需要手动改写SQL。这些优化策略分别是pullout, duplicate weedout, first match, loose scan, and materialization。

Pullout：将子查询提到外层，改写成表关联。和我们手动改写SQL类似。
Duplicate weedout：如果子查询中的数据可能存在重复，优化器会自动加上去重。
First Match：只需要匹配到第一条记录。
Loose Scan：子查询利用索引的有序性进去去重。
Materialization：将子查询的结果存储在临时表，临时表和父表进行关联。

下面通过具体的例子来分析上述几种优化策略。

Pullout

将子查询转换成普通表关联，给优化器更多的选择。

由于子查询有唯一索引(uk_bc)，保证了无重复数据，转换成关联查询后，不需要再额外进行去重。

mysql> explain select * from tp where a in (select b from ty where c = 1);
+----+-------------+-------+------------+------+---------------+-------+---------+-----------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key   | key_len | ref       | rows | filtered | Extra                    |
+----+-------------+-------+------------+------+---------------+-------+---------+-----------+------+----------+--------------------------+
|  1 | SIMPLE      | ty    | NULL       | ref  | uk_cb         | uk_cb | 5       | const     |    1 |   100.00 | Using where; Using index |
|  1 | SIMPLE      | tp    | NULL       | ref  | idx_a         | idx_a | 5       | test.ty.b |    1 |   100.00 | NULL                     |
+----+-------------+-------+------------+------+---------------+-------+---------+-----------+------+----------+--------------------------+
2 rows in set, 1 warning (0.00 sec)

mysql> explain format=tree select * from tp where a in (select b from ty where c = 1);
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                              |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Nested loop inner join  (cost=0.70 rows=1)
    -> Filter: (ty.b is not null)  (cost=0.35 rows=1)
        -> Covering index lookup on ty using uk_cb (c=1)  (cost=0.35 rows=1)
    -> Index lookup on tp using idx_a (a=ty.b)  (cost=0.35 rows=1)

Duplicate Weedout

如果子查询中select的字段可能存在重复值，则查询转换后，可能会出现重复数据，需要对结果数据进行去重处理。

mysql> explain format=tree select * from tp where a in (select d from ty where a in (1,3));
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                                                                                                                                                                                    |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Remove duplicate tp rows using temporary table (weedout)  (cost=2.11 rows=2)
    -> Nested loop inner join  (cost=2.11 rows=2)
        -> Filter: (ty.d is not null)  (cost=1.41 rows=2)
            -> Index range scan on ty using idx_abc over (a = 1) OR (a = 3), with index condition: (ty.a in (1,3))  (cost=1.41 rows=2)
        -> Index lookup on tp using idx_a (a=ty.d)  (cost=0.30 rows=1)



mysql> explain select * from tp where a in (select d from ty where a in (1,3));
+----+-------------+-------+------------+-------+---------------+---------+---------+-----------+------+----------+-----------------------------------------------------+
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref       | rows | filtered | Extra                                               |
+----+-------------+-------+------------+-------+---------------+---------+---------+-----------+------+----------+-----------------------------------------------------+
|  1 | SIMPLE      | ty    | NULL       | range | idx_abc       | idx_abc | 5       | NULL      |    2 |   100.00 | Using index condition; Using where; Start temporary |
|  1 | SIMPLE      | tp    | NULL       | ref   | idx_a         | idx_a   | 5       | test.ty.d |    1 |   100.00 | End temporary                                       |
+----+-------------+-------+------------+-------+---------------+---------+---------+-----------+------+----------+-----------------------------------------------------+

执行计划中Extra列显示的“Start temporary“，”End temporary“就是使用临时表进行数据去重。

First match

mysql> explain select * from tp where c in (select c from ty);
+----+-------------+-------+------------+------+---------------+-------+---------+-----------+------+----------+-----------------------------+
| id | select_type | table | partitions | type | possible_keys | key   | key_len | ref       | rows | filtered | Extra                       |
+----+-------------+-------+------------+------+---------------+-------+---------+-----------+------+----------+-----------------------------+
|  1 | SIMPLE      | tp    | NULL       | ALL  | NULL          | NULL  | NULL    | NULL      |    6 |   100.00 | Using where                 |
|  1 | SIMPLE      | ty    | NULL       | ref  | uk_cb         | uk_cb | 5       | test.tp.c |    1 |   100.00 | Using index; FirstMatch(tp) |



mysql> explain format=tree select * from tp where c in (select c from ty);
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Nested loop semijoin  (cost=2.95 rows=6)
    -> Filter: (tp.c is not null)  (cost=0.85 rows=6)
        -> Table scan on tp  (cost=0.85 rows=6)
    -> Covering index lookup on ty using uk_cb (c=tp.c)  (cost=0.27 rows=1)

关联ty表时，使用First Match，也就是关联到一条记录就返回。

LooseScan

子查询中，可以利用索引的有序性对select列表中的字段进行去重，无需借助临时表。(where条件中的等值查询，加上select字段，是表中某个索引的前缀）。

mysql> explain  select * from tp where a in (select d from ty where a in (2));
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key     | key_len | ref       | rows | filtered | Extra                        |
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+------------------------------+
|  1 | SIMPLE      | ty    | NULL       | ref  | idx_abc       | idx_abc | 5       | const     |    2 |   100.00 | Using where; Start temporary |
|  1 | SIMPLE      | tp    | NULL       | ref  | idx_a         | idx_a   | 5       | test.ty.d |    1 |   100.00 | End temporary                |
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+------------------------------+
2 rows in set, 1 warning (0.00 sec)

mysql> explain select * from tp where a in ( select b from ty where a = 1 );
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+-------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key     | key_len | ref       | rows | filtered | Extra                               |
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+-------------------------------------+
|  1 | SIMPLE      | ty    | NULL       | ref  | idx_abc       | idx_abc | 5       | const     |    1 |   100.00 | Using where; Using index; LooseScan |
|  1 | SIMPLE      | tp    | NULL       | ref  | idx_a         | idx_a   | 5       | test.ty.b |    1 |   100.00 | NULL                                |
+----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+-------------------------------------+
2 rows in set, 1 warning (0.00 sec)

mysql> explain format=tree select * from tp where a in ( select b from ty where a = 1 );
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                                                                                                                  |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Nested loop inner join  (cost=0.70 rows=1)
    -> Remove duplicates from input sorted on idx_abc  (cost=0.35 rows=1)
        -> Filter: (ty.b is not null)  (cost=0.35 rows=1)
            -> Covering index lookup on ty using idx_abc (a=1)  (cost=0.35 rows=1)
    -> Index lookup on tp using idx_a (a=ty.b)  (cost=0.35 rows=1)

下列SQL可以使用loose scan：

select * from tp where a in ( select a from ty );

select * from tp where a in ( select b from ty where a = 1 )

select * from tp where (b,c) in ( select b,c from ty where a = 1 )

select * from tp where (b,c) in ( select a,b from ty where a in(1,2) )

而下列SQL都不可以使用loose scan

Materialize with deduplication

mysql> explain  select * from tp where a in (select d from ty where a in (2));
+----+--------------+-------------+------------+------+---------------+---------+---------+---------------+------+----------+-------------+
| id | select_type  | table       | partitions | type | possible_keys | key     | key_len | ref           | rows | filtered | Extra       |
+----+--------------+-------------+------------+------+---------------+---------+---------+---------------+------+----------+-------------+
|  1 | SIMPLE       | <subquery2> | NULL       | ALL  | NULL          | NULL    | NULL    | NULL          | NULL |   100.00 | Using where |
|  1 | SIMPLE       | tp          | NULL       | ref  | idx_a         | idx_a   | 5       | <subquery2>.d |    1 |   100.00 | NULL        |
|  2 | MATERIALIZED | ty          | NULL       | ref  | idx_abc       | idx_abc | 5       | const         |    2 |   100.00 | NULL        |
+----+--------------+-------------+------------+------+---------------+---------+---------+---------------+------+----------+-------------+

mysql> explain format=tree select * from tp where a in (select d from ty where a in (2));
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Nested loop inner join  (cost=1.10 rows=2)
    -> Filter: (`<subquery2>`.d is not null)  (cost=0.65..0.40 rows=2)
        -> Table scan on <subquery2>  (cost=2.16..3.42 rows=2)
            -> Materialize with deduplication  (cost=0.90..0.90 rows=2)
                -> Filter: (ty.d is not null)  (cost=0.70 rows=2)
                    -> Index lookup on ty using idx_abc (a=2)  (cost=0.70 rows=2)
    -> Index lookup on tp using idx_a (a=`<subquery2>`.d)  (cost=0.60 rows=1)

1、先将子查询物化(Materialize)，生产一个临时表。生成临时表的时候同时对数据进行去重。

2、关联tp表。

子查询优化相关的优化器参数

optimizer_switch可以控制上述几种优化策略是否启用。这些优化策略默认开启。

optimizer_switch	描述
semijoin	semijoin优化策略总开关。
loosescan	是否开启loose scan策略
firstmatch	是否开启first match优化策略
duplicateweedout	是否开启duplicate weedout优化策略
materialization	是否开启materialization优化策略