MySQL优化器特性(一)IN和Exists(semijoin)子查询优化策略
这篇文章中的SQL和执行计划在mysql 8.0.31环境下进行测试。
测试的表结构和数据:
表结构
mysql> show create table tp\G *************************** 1. row *************************** Table: tp Create Table: CREATE TABLE `tp` ( `a` int DEFAULT NULL, `b` int DEFAULT NULL, `c` int DEFAULT NULL, KEY `idx_a` (`a`) ) ENGINE=InnoDB 1 row in set (0.00 sec) mysql> show create table ty\G *************************** 1. row *************************** Table: ty Create Table: CREATE TABLE `ty` ( `a` int DEFAULT NULL, `b` int DEFAULT NULL, `c` int DEFAULT NULL, `d` int DEFAULT NULL, UNIQUE KEY `uk_cb` (`c`,`b`), KEY `idx_abc` (`a`,`b`,`c`) ) ENGINE=InnoDB 1 row in set (0.00 sec)
测试数据
mysql> select * from tp; +------+------+------+ | a | b | c | +------+------+------+ | 1 | 1 | 1 | | 2 | 2 | 2 | | 3 | 3 | 3 | | 4 | 4 | 4 | | 5 | 5 | 5 | | NULL | 0 | 0 | +------+------+------+ 6 rows in set (0.00 sec) mysql> select * from ty; +------+------+------+------+ | a | b | c | d | +------+------+------+------+ | 1 | 1 | 1 | 1 | | 2 | 2 | 2 | 2 | | 3 | 3 | 3 | 3 | | 2 | 4 | 4 | 2 | +------+------+------+------+
子查询
以下是一个简单的子查询:
mysql> select * from tp where a in (select d from ty where a = 1); +------+------+------+ | a | b | c | +------+------+------+ | 1 | 1 | 1 | +------+------+------+ 1 row in set (0.00 sec)
如果没有开启semijoin优化,执行计划是这样的:
set optimizer_switch='semijoin=off'; mysql> explain select * from tp where a in (select d from ty where a = 1); +----+--------------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+--------------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------------+ | 1 | PRIMARY | tp | NULL | ALL | NULL | NULL | NULL | NULL | 6 | 100.00 | Using where | | 2 | DEPENDENT SUBQUERY | ty | NULL | ref | idx_abc | idx_abc | 5 | const | 1 | 25.00 | Using where | +----+--------------------+-------+------------+------+---------------+---------+---------+-------+------+----------+-------------+ mysql> explain format=tree select * from tp where a in (select d from ty where a = 1); +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | -> Filter: <in_optimizer>(tp.a,<exists>(select #2)) (cost=0.85 rows=6) -> Table scan on tp (cost=0.85 rows=6) -> Select #2 (subquery in condition; dependent) -> Limit: 1 row(s) (cost=0.28 rows=0.2) -> Filter: (<cache>(tp.a) = ty.d) (cost=0.28 rows=0.2) -> Index lookup on ty using idx_abc (a=1) (cost=0.28 rows=1)
上述执行计划,以tp为驱动表,对tp表的每一条记录,执行子查询(select from ty where a=2 and d = tp.a limit 1)。如果tp表的记录数很大,则SQL执行代价会比较高。
改写SQL
类似上面的SQL,我们可以将子查询改写为普通的关联查询:
mysql> select tp.* from ty join tp on ty.d = tp.a where ty.a=1; +------+------+------+ | a | b | c | +------+------+------+ | 1 | 1 | 1 | +------+------+------+ 1 row in set (0.00 sec)
我们来看一下改写后的执行计划:
mysql> explain select tp.* from ty join tp on ty.d = tp.a where ty.a=1; +----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+-------------+ | 1 | SIMPLE | ty | NULL | ref | idx_abc | idx_abc | 5 | const | 1 | 100.00 | Using where | | 1 | SIMPLE | tp | NULL | ref | idx_a | idx_a | 5 | test.ty.d | 1 | 100.00 | NULL | +----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+-------------+ mysql> explain format=tree select tp.* from ty join tp on ty.d = tp.a where ty.a=1; +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | -> Nested loop inner join (cost=0.70 rows=1) -> Filter: (ty.d is not null) (cost=0.35 rows=1) -> Index lookup on ty using idx_abc (a=1) (cost=0.35 rows=1) -> Index lookup on tp using idx_a (a=ty.d) (cost=0.35 rows=1)
查询改写之后,可以使用ty表作为驱动表。在关联tp表时,使用了idx_a,避免了tp表的全表扫描。
但是,上面的改写还存在一个问题,当子查询的结果中存在重复记录时,改写后的关联查询结果也会返回重复的数据:
mysql> select * from tp where a in (select d from ty where a = 2); +------+------+------+ | a | b | c | +------+------+------+ | 2 | 2 | 2 | +------+------+------+ 1 row in set (0.00 sec)
改写后的SQL,返回的数据有重复:
mysql> select tp.* from ty join tp on ty.d = tp.a where ty.a=2; +------+------+------+ | a | b | c | +------+------+------+ | 2 | 2 | 2 | | 2 | 2 | 2 | +------+------+------+ 2 rows in set (0.00 sec)
这种情况下,可以使用distinct对结果集进行去重:
mysql> select distinct tp.* from ty join tp on ty.d = tp.a where ty.a=2; +------+------+------+ | a | b | c | +------+------+------+ | 2 | 2 | 2 | +------+------+------+ 1 row in set (0.00 sec)
当然去重会带来额外的成本。我们来看一下执行计划:
mysql> explain select distinct tp.* from ty join tp on ty.d = tp.a where ty.a=2; +----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+------------------------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+------------------------------+ | 1 | SIMPLE | ty | NULL | ref | idx_abc | idx_abc | 5 | const | 2 | 100.00 | Using where; Using temporary | | 1 | SIMPLE | tp | NULL | ref | idx_a | idx_a | 5 | test.ty.d | 1 | 100.00 | NULL | +----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+------------------------------+ 2 rows in set, 1 warning (0.00 sec) mysql> explain format=tree select distinct tp.* from ty join tp on ty.d = tp.a where ty.a=2; +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | -> Table scan on <temporary> (cost=2.86..4.12 rows=2) -> Temporary table with deduplication (cost=1.60..1.60 rows=2) -> Nested loop inner join (cost=1.40 rows=2) -> Filter: (ty.d is not null) (cost=0.70 rows=2) -> Index lookup on ty using idx_abc (a=2) (cost=0.70 rows=2) -> Index lookup on tp using idx_a (a=ty.d) (cost=0.30 rows=1)
优化器的子查询优化策略
对于上述的子查询,MySQL 8.0存在多种策略,优化器会根据具体情况,自动选择合适的优化方法,不需要手动改写SQL。这些优化策略分别是pullout, duplicate weedout, first match, loose scan, and materialization。
Pullout:将子查询提到外层,改写成表关联。和我们手动改写SQL类似。
Duplicate weedout:如果子查询中的数据可能存在重复,优化器会自动加上去重。
First Match:只需要匹配到第一条记录。
Loose Scan:子查询利用索引的有序性进去去重。
Materialization:将子查询的结果存储在临时表,临时表和父表进行关联。
下面通过具体的例子来分析上述几种优化策略。
Pullout
将子查询转换成普通表关联,给优化器更多的选择。
由于子查询有唯一索引(uk_bc),保证了无重复数据,转换成关联查询后,不需要再额外进行去重。
mysql> explain select * from tp where a in (select b from ty where c = 1); +----+-------------+-------+------------+------+---------------+-------+---------+-----------+------+----------+--------------------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+---------------+-------+---------+-----------+------+----------+--------------------------+ | 1 | SIMPLE | ty | NULL | ref | uk_cb | uk_cb | 5 | const | 1 | 100.00 | Using where; Using index | | 1 | SIMPLE | tp | NULL | ref | idx_a | idx_a | 5 | test.ty.b | 1 | 100.00 | NULL | +----+-------------+-------+------------+------+---------------+-------+---------+-----------+------+----------+--------------------------+ 2 rows in set, 1 warning (0.00 sec) mysql> explain format=tree select * from tp where a in (select b from ty where c = 1); +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | -> Nested loop inner join (cost=0.70 rows=1) -> Filter: (ty.b is not null) (cost=0.35 rows=1) -> Covering index lookup on ty using uk_cb (c=1) (cost=0.35 rows=1) -> Index lookup on tp using idx_a (a=ty.b) (cost=0.35 rows=1)
Duplicate Weedout
如果子查询中select的字段可能存在重复值,则查询转换后,可能会出现重复数据,需要对结果数据进行去重处理。
mysql> explain format=tree select * from tp where a in (select d from ty where a in (1,3)); +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | -> Remove duplicate tp rows using temporary table (weedout) (cost=2.11 rows=2) -> Nested loop inner join (cost=2.11 rows=2) -> Filter: (ty.d is not null) (cost=1.41 rows=2) -> Index range scan on ty using idx_abc over (a = 1) OR (a = 3), with index condition: (ty.a in (1,3)) (cost=1.41 rows=2) -> Index lookup on tp using idx_a (a=ty.d) (cost=0.30 rows=1) mysql> explain select * from tp where a in (select d from ty where a in (1,3)); +----+-------------+-------+------------+-------+---------------+---------+---------+-----------+------+----------+-----------------------------------------------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+-------+---------------+---------+---------+-----------+------+----------+-----------------------------------------------------+ | 1 | SIMPLE | ty | NULL | range | idx_abc | idx_abc | 5 | NULL | 2 | 100.00 | Using index condition; Using where; Start temporary | | 1 | SIMPLE | tp | NULL | ref | idx_a | idx_a | 5 | test.ty.d | 1 | 100.00 | End temporary | +----+-------------+-------+------------+-------+---------------+---------+---------+-----------+------+----------+-----------------------------------------------------+
执行计划中Extra列显示的“Start temporary“,”End temporary“就是使用临时表进行数据去重。
First match
mysql> explain select * from tp where c in (select c from ty); +----+-------------+-------+------------+------+---------------+-------+---------+-----------+------+----------+-----------------------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+---------------+-------+---------+-----------+------+----------+-----------------------------+ | 1 | SIMPLE | tp | NULL | ALL | NULL | NULL | NULL | NULL | 6 | 100.00 | Using where | | 1 | SIMPLE | ty | NULL | ref | uk_cb | uk_cb | 5 | test.tp.c | 1 | 100.00 | Using index; FirstMatch(tp) | mysql> explain format=tree select * from tp where c in (select c from ty); +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | -> Nested loop semijoin (cost=2.95 rows=6) -> Filter: (tp.c is not null) (cost=0.85 rows=6) -> Table scan on tp (cost=0.85 rows=6) -> Covering index lookup on ty using uk_cb (c=tp.c) (cost=0.27 rows=1)
关联ty表时,使用First Match,也就是关联到一条记录就返回。
LooseScan
子查询中,可以利用索引的有序性对select列表中的字段进行去重,无需借助临时表。(where条件中的等值查询,加上select字段,是表中某个索引的前缀)。
mysql> explain select * from tp where a in (select d from ty where a in (2)); +----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+------------------------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+------------------------------+ | 1 | SIMPLE | ty | NULL | ref | idx_abc | idx_abc | 5 | const | 2 | 100.00 | Using where; Start temporary | | 1 | SIMPLE | tp | NULL | ref | idx_a | idx_a | 5 | test.ty.d | 1 | 100.00 | End temporary | +----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+------------------------------+ 2 rows in set, 1 warning (0.00 sec) mysql> explain select * from tp where a in ( select b from ty where a = 1 ); +----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+-------------------------------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+-------------------------------------+ | 1 | SIMPLE | ty | NULL | ref | idx_abc | idx_abc | 5 | const | 1 | 100.00 | Using where; Using index; LooseScan | | 1 | SIMPLE | tp | NULL | ref | idx_a | idx_a | 5 | test.ty.b | 1 | 100.00 | NULL | +----+-------------+-------+------------+------+---------------+---------+---------+-----------+------+----------+-------------------------------------+ 2 rows in set, 1 warning (0.00 sec) mysql> explain format=tree select * from tp where a in ( select b from ty where a = 1 ); +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | -> Nested loop inner join (cost=0.70 rows=1) -> Remove duplicates from input sorted on idx_abc (cost=0.35 rows=1) -> Filter: (ty.b is not null) (cost=0.35 rows=1) -> Covering index lookup on ty using idx_abc (a=1) (cost=0.35 rows=1) -> Index lookup on tp using idx_a (a=ty.b) (cost=0.35 rows=1)
下列SQL可以使用loose scan:
select * from tp where a in ( select a from ty ); select * from tp where a in ( select b from ty where a = 1 ) select * from tp where (b,c) in ( select b,c from ty where a = 1 ) select * from tp where (b,c) in ( select a,b from ty where a in(1,2) )
而下列SQL都不可以使用loose scan
Materialize with deduplication
mysql> explain select * from tp where a in (select d from ty where a in (2)); +----+--------------+-------------+------------+------+---------------+---------+---------+---------------+------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+--------------+-------------+------------+------+---------------+---------+---------+---------------+------+----------+-------------+ | 1 | SIMPLE | <subquery2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | Using where | | 1 | SIMPLE | tp | NULL | ref | idx_a | idx_a | 5 | <subquery2>.d | 1 | 100.00 | NULL | | 2 | MATERIALIZED | ty | NULL | ref | idx_abc | idx_abc | 5 | const | 2 | 100.00 | NULL | +----+--------------+-------------+------------+------+---------------+---------+---------+---------------+------+----------+-------------+ mysql> explain format=tree select * from tp where a in (select d from ty where a in (2)); +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | EXPLAIN | +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | -> Nested loop inner join (cost=1.10 rows=2) -> Filter: (`<subquery2>`.d is not null) (cost=0.65..0.40 rows=2) -> Table scan on <subquery2> (cost=2.16..3.42 rows=2) -> Materialize with deduplication (cost=0.90..0.90 rows=2) -> Filter: (ty.d is not null) (cost=0.70 rows=2) -> Index lookup on ty using idx_abc (a=2) (cost=0.70 rows=2) -> Index lookup on tp using idx_a (a=`<subquery2>`.d) (cost=0.60 rows=1)
1、先将子查询物化(Materialize),生产一个临时表。生成临时表的时候同时对数据进行去重。
2、关联tp表。
子查询优化相关的优化器参数
optimizer_switch可以控制上述几种优化策略是否启用。这些优化策略默认开启。
optimizer_switch | 描述 |
semijoin | semijoin优化策略总开关。 |
loosescan | 是否开启loose scan策略 |
firstmatch | 是否开启first match优化策略 |
duplicateweedout | 是否开启duplicate weedout优化策略 |
materialization | 是否开启materialization优化策略 |
除了使用optimizer_switch参数,也可以通过查询hint(SEMIJOIN, NO_SEMIJOIN)来启用或禁用semijoin优化策略。
Hint的具体使用方法这里不详细展开。可以通过官方文档查看具体使用方法: https://dev.mysql.com/doc/refman/8.0/en/optimizer-hints.html#optimizer-hints-subquery