CTEs and Recursive CTEs appeared in MySQL 8.0; first in a Labs release and now in the official release 8.0.1. Several blogs have been published: here, here and here ; my colleague Øystein also wrote about how using a CTE made one DBT3 query run twice faster.
Today I will continue the series, focusing on these points of recursive CTEs:
- obtaining depth-first or breadth-first ordering of the result
- doing simple cycle elimination with UNION DISTINCT, to compute a transitive closure
- achieving more complex cycle elimination.
Depth-first or breadth-first
I already covered this previously (in paragraph retrieving a full tree) but let’s recap the problem and solution. We assume we have a table of people with a parent-child
relationship:
1
2
3
4
5
6
7
8
9
10
11
|
CREATE TABLE tree (person CHAR(20), parent CHAR(20)); INSERT INTO tree VALUES ('Robert I', NULL), ('Thurimbert', 'Robert I'), ('Robert II', 'Thurimbert'), ('Cancor', 'Thurimbert'), ('Landrade', 'Thurimbert'), ('Ingramm', 'Thurimbert'), ('Robert III', 'Robert II'), ('Chaudegrand', 'Landrade'), ('Ermengarde', 'Ingramm'); |
This is a part of the genealogy of some ancient French kings, around the 8th century AD. In a picture:
1
2
3
4
5
6
7
|
Robert I -> Thurimbert -+-> Robert II -> Robert III | +-> Cancor | +-> Landrade -> Chaudegrand | +-> Ingramm -> Ermengarde |
Now we want to know all descendants of Thurimbert:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
|
WITH RECURSIVE descendants AS ( SELECT person FROM tree WHERE person='Thurimbert' UNION ALL SELECT t.person FROM descendants d, tree t WHERE t.parent=d.person ) SELECT * FROM descendants; +-------------+ | person | +-------------+ | Thurimbert | | Robert II | | Cancor | | Landrade | | Ingramm | | Robert III | | Chaudegrand | | Ermengarde | +-------------+ |
Breadth-first ordering means we want to see all children (direct descendants) first, grouped together, and only then should their own children appear. This is what the query returned, but you should not rely on it (per the usual rule in SQL: “no ORDER BY specified? no predictable order!”). As direct descendants all appear at the first iteration, we just need to count iterations in a new “level” column, and sort by it:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
|
WITH RECURSIVE descendants AS ( SELECT person, 1 as level FROM tree WHERE person='Thurimbert' UNION ALL SELECT t.person, d.level+1 FROM descendants d, tree t WHERE t.parent=d.person ) SELECT * FROM descendants ORDER BY level; +-------------+-------+ | person | level | +-------------+-------+ | Thurimbert | 1 | | Robert II | 2 | | Cancor | 2 | | Landrade | 2 | | Ingramm | 2 | | Robert III | 3 | | Chaudegrand | 3 | | Ermengarde | 3 | +-------------+-------+ |
Depth-first ordering means we want to see children grouped immediately under their parent. For that we build a “path” column and sort by it:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
|
WITH RECURSIVE descendants AS ( SELECT person, CAST(person AS CHAR(500)) AS path FROM tree WHERE person='Thurimbert' UNION ALL SELECT t.person, CONCAT(d.path, ',', t.person) FROM descendants d, tree t WHERE t.parent=d.person ) SELECT * FROM descendants ORDER BY path; +-------------+---------------------------------+ | person | path | +-------------+---------------------------------+ | Thurimbert | Thurimbert | | Cancor | Thurimbert,Cancor | | Ingramm | Thurimbert,Ingramm | | Ermengarde | Thurimbert,Ingramm,Ermengarde | | Landrade | Thurimbert,Landrade | | Chaudegrand | Thurimbert,Landrade,Chaudegrand | | Robert II | Thurimbert,Robert II | | Robert III | Thurimbert,Robert II,Robert III | +-------------+---------------------------------+ |
I had already mentioned the necessity of CAST here : the type of the “path” column is determined from the initial SELECT, and it has to be long enough to fit the path of the deepest leaf in the tree: CAST ensures that.
In the SQL standard, there is an optional SEARCH BY clause to request breadth-first or depth-first ordering; as we saw above, it is easy to emulate in MySQL with a “level” or “path” column.
Computing transitive closures with simple cycle avoidance
In year 2300, the Earth is overcrowded and people are encouraged to emigrate to nearby planets by boarding on these space rockets:
1
2
3
4
5
6
|
CREATE TABLE rockets (origin CHAR(20), destination CHAR(20), trip_time INT); INSERT INTO rockets VALUES ('Earth', 'Mars', 2), ('Mars', 'Jupiter', 3), ('Jupiter', 'Saturn', 4); |
Note that the rulers of the Earth have intentionally not set up any way to return from those planets back to the Earth.
Let’s list all destinations which can be reached from the Earth: we first find the planets which can be directly reached, then we find what can be directly reached from these first planets, and so on:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
|
WITH RECURSIVE all_destinations AS ( SELECT destination AS planet FROM rockets WHERE origin='Earth' UNION ALL SELECT r.destination FROM rockets r, all_destinations d WHERE r.origin=d.planet ) SELECT * FROM all_destinations; +---------+ | planet | +---------+ | Mars | | Jupiter | | Saturn | +---------+ |
Now it is year 2400 and the population of the Earth has decreased so much that the rulers decide to bring some emigrates back, so they set up a new rocket from Saturn to Earth:
1 |
INSERT INTO rockets VALUES ('Saturn', 'Earth', 9); |
Let’s repeat our query to list all destinations which can be reached from the Earth:
1
2
3
4
5
6
7
8
9
10
11
12
13
|
WITH RECURSIVE all_destinations AS ( SELECT destination AS planet FROM rockets WHERE origin='Earth' UNION ALL SELECT r.destination FROM rockets r, all_destinations d WHERE r.origin=d.planet ) SELECT * FROM all_destinations; ^C -- query aborted ERROR 1317 (70100): Query execution was interrupted |
The query seemed to never finish, I had to interrupt it manually (note that starting from MySQL 8.0.3, the query would instead abort automatically, thanks to @@cte_max_recursion_depth). Why is it so? We have a cycle in our data: starting from the Earth, we add Mars, then Jupiter, then Saturn, then the Earth (because of the new rocket), so we’re back to the beginning – the Earth. Then we add Mars, and so on, forever.
Setting max_execution_time will surely ensure that such runaway query is automatically killed, but… how do we modify the query to work properly, i.e. not run away?
A simple solution is to not add a planet to the result if it’s already there. This elimination of duplicates is done by using UNION DISTINCT instead of UNION ALL:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
|
WITH RECURSIVE all_destinations AS ( SELECT destination AS planet FROM rockets WHERE origin='Earth' UNION DISTINCT SELECT r.destination FROM rockets r, all_destinations d WHERE r.origin=d.planet ) SELECT * FROM all_destinations; +---------+ | planet | +---------+ | Mars | | Jupiter | | Saturn | | Earth | +---------+ |
More complex cycle avoidance
So we have a nicely working query with UNION DISTINCT. Let’s extend it just a bit, to additionally show the time taken by each trip (pretty innocent idea, right?):
1
2
3
4
5
6
7
8
9
10
11
12
13
|
WITH RECURSIVE all_destinations AS ( SELECT destination AS planet, trip_time AS total_time FROM rockets WHERE origin='Earth' UNION DISTINCT SELECT r.destination, d.total_time+r.trip_time FROM rockets r, all_destinations d WHERE r.origin=d.planet ) SELECT * FROM all_destinations; ^C -- query aborted ERROR 1317 (70100): Query execution was interrupted |
What? again it never ends? Indeed, consider what the first generated rows must be:
- Initial SELECT: Mars, 2
- Iteration 1: Jupiter, 5 (2+3)
- Iteration 2: Saturn, 9 (5+4)
- Iteration 3: Earth, 18 (9+9)
- Iteration 4: Mars, 20 (18+2)
The time of the Earth->Mars->Jupiter->Saturn->Earth->Mars trip is 20, while the time of the Earth->Mars trip is 2; the two trips now form two different rows as the their differing time is part of the rows: (Mars, 2) and (Mars, 20). UNION DISTINCT will not help with cycle elimination, anymore, as it eliminates only rows for which all columns match… So (Mars, 20) is put into the result, and it leads to (Jupiter, 23) and so on.
The SQL standard has an optional CYCLE clause for that, which we can easily emulate: the solution is to build a “path” column (as in the depth/breadth-first paragraph above), and break when we see we’re adding something which is already in the path. We needn’t use DISTINCT anymore so we use ALL which avoids the overhead of (pointless) duplicate elimination:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
|
WITH RECURSIVE all_destinations AS ( SELECT destination AS planet, trip_time AS total_time, CAST(destination AS CHAR(500)) AS path FROM rockets WHERE origin='Earth' UNION ALL SELECT r.destination, d.total_time+r.trip_time, CONCAT(d.path, ',', r.destination) FROM rockets r, all_destinations d WHERE r.origin=d.planet AND FIND_IN_SET(r.destination, d.path)=0 ) SELECT * FROM all_destinations; +---------+------------+---------------------------+ | planet | total_time | path | +---------+------------+---------------------------+ | Mars | 2 | Mars | | Jupiter | 5 | Mars,Jupiter | | Saturn | 9 | Mars,Jupiter,Saturn | | Earth | 18 | Mars,Jupiter,Saturn,Earth | +---------+------------+---------------------------+ |
Note that if the “row keys” (the planet’s name, here) are allowed to contain a comma, it may be safer to encode the keys when adding them to the path, for example with TO_BASE64 or HEX, so that FIND_IN_SET works reliably.
If, in your particular use case, cycles are considered an anomaly which you want to hunt for, a variant of our query will flag them. We just need to allow the cycle to enter our result and advertise itself (with a new “is_cycle” column), but we still must prevent the query from running forever, by not iterating on rows where is_cycle=1:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
|
WITH RECURSIVE all_destinations AS ( SELECT destination AS planet, trip_time AS total_time, CAST(destination AS CHAR(500)) AS path, 0 AS is_cycle FROM rockets WHERE origin='Earth' UNION ALL SELECT r.destination, d.total_time+r.trip_time, CONCAT(d.path, ',', r.destination), FIND_IN_SET(r.destination, d.path)!=0 FROM rockets r, all_destinations d WHERE r.origin=d.planet AND is_cycle=0 ) SELECT * FROM all_destinations; +---------+------------+--------------------------------+----------+ | planet | total_time | path | is_cycle | +---------+------------+--------------------------------+----------+ | Mars | 2 | Mars | 0 | | Jupiter | 5 | Mars,Jupiter | 0 | | Saturn | 9 | Mars,Jupiter,Saturn | 0 | | Earth | 18 | Mars,Jupiter,Saturn,Earth | 0 | | Mars | 20 | Mars,Jupiter,Saturn,Earth,Mars | 1 | +---------+------------+--------------------------------+----------+ |
A “1” in the “is_cycle” column points us to the cycle.
That’s all for today. I hope the above tricks will help you write powerful queries to navigate your hierarchical or graph data. As always: Thank you for using MySQL!