Yannakakis algorithm

The Yannakakis algorithm is an algorithm in database theory for evaluating acyclic join queries (more generally, (alpha-)acyclic conjunctive queries, see GYO algorithm). It first uses semijoins to prune tuples that cannot participate in any answer and then joins the reduced relations to produce the result. The algorithm is named after Mihalis Yannakakis.^[1]

High-level description

For an acyclic query, a join tree exists and can be computed in linear time; one standard way to obtain it is via the GYO reduction. The nodes of the join tree correspond to the relation occurrences in the query (rather than necessarily to distinct base relations), and for each query variable the nodes containing that variable form a connected subtree.

After rooting the join tree arbitrarily, the algorithm keeps one working relation for each relation occurrence. It then performs two semijoin sweeps. In the bottom-up sweep, each child filters its parent: a semijoin removes parent tuples that cannot join with anything in that child subtree. In the top-down sweep, each parent filters its children, removing additional child tuples that cannot join with the already reduced remainder of the query. After both sweeps, every remaining tuple participates in at least one full join result. A final join/projection pass then produces the query output. The last and final pass is usually combined with the second top-down sweep.^[2]

Complexity

Let $|D|$ be the size of the database (i.e., the total number of tuples across all input relations), $|Q|$ the size of the query, and $|OUT|$ the number of tuples in the query output.

If the query does not project out any variables (referred to as a full conjunctive query or a join query or a conjunctive query with no existential quantifiers), then the complexity of the algorithm is $O(|Q|(|D|+|OUT|)$ . Assuming a fixed query $Q$ (a setting referred to as data complexity), this means that the algorithm's worst-case running time is asymptotically the same as reading the input and writing the output, which is a natural lower bound.

If some variables are projected out in the query, then there is an additional $|D|$ factor, making the complexity $O(|Q||D||OUT|)$ .

Connections to other problems

The algorithm has been influential in database theory and its core ideas are found in algorithms for other tasks such as enumeration^[3]^[4]^[2] and aggregate computation.^[5] An important realization is that the algorithm implicitly operates on the Boolean semiring (the elimination of a tuple corresponds to a False value in the semiring), but its correctness is maintained if we use any other semiring. For example, using the natural numbers semiring, where the operations are addition and multiplication, the same algorithm computes the total number of query answers.

History

Origin of the algorithm

The algorithm is named after Mihalis Yannakakis, author of the 1981 VLDB paper "Algorithms for acyclic database schemes".^[1] A key component of the later algorithm is a semi-join reduction phase that had already been developed in contemporaneous work by Philip A. Bernstein with Dah-Ming W. Chiu and with Nathan Goodman.^[6]^[7] These results were obtained and submitted in 1979 and were already circulating in technical report and conference form, but appeared in journal form only in 1981.^[8]^[9] Yannakakis explicitly cites Bernstein and Goodman’s technical report [BG] for the full-reduction step.^[1] This difference in publication timing contributed to the algorithm becoming associated with Yannakakis’s name. A work by Gottlob et al. (2001) is among the first to use the term "Yannakakis's algorithm", as both consider the more general setting of conjunctive queries with projections, as detailed below.^[10]

The exact connection

In the setting of equijoin queries and single-attribute semi-joins, Bernstein and Chiu (1981) showed that tree queries admit a full reducer, i.e., a sequence of semi-joins that removes all tuples that do not participate in the final join result.^[6] They also gave a linear-time tree-query membership test.^[6] Bernstein and Goodman (1981) extended these results to natural (multiattribute) semi-joins and natural-join queries, proving that for tree queries a two-pass semi-join program (first propagating reductions from the leaves of a rooted tree representation of the query to the root, and then back from the root to the leaves) is a full reducer.^[7] Later work by Gottlob et al. (2001) notes that acyclic queries coincide with tree queries (for the corresponding structural characterization, see GYO algorithm) and that "in a slightly restricted setting, a very similar semi-join based evaluation procedure was independently developed by Bernstein and Chiu [1981]."^[10] Yannakakis explicitly reuses the reduction step ("At first we compute a full reduction D′ of D as in [BG] using semijoins") and then performs the subsequent join/projection phase along a join tree.^[1] Bernstein and Chiu explicitly excluded target lists (their terminology for the set of attributes to be returned by the query), noting that semi-joins do not "project out" attributes,^[6] and Bernstein and Goodman later showed that target lists do not change the full-reducer characterization.^[7] Thus, the Bernstein papers already contain the key element of what became later known as the Yannakakis algorithm: the full reduction (or full semi-join reduction).^[6]^[7] Yannakakis presents the full two-phase procedure for evaluating acyclic joins with projection.^[1]

References

^ ^a ^b ^c ^d ^e Yannakakis, Mihalis (1981-09-09). "Algorithms for acyclic database schemes". Proceedings of the Seventh International Conference on Very Large Data Bases - Volume 7. VLDB '81. Cannes, France: VLDB Endowment. pp. 82–94.
^ ^a ^b Tziavelis, Gatterbauer, Riedewald. Toward Responsive DBMS: Optimal Join Algorithms, Enumeration, Factorization, Ranking, and Dynamic Programming. Tutorial at ICDE 2022. https://doi.org/10.1109/ICDE53745.2022.00299. Part 3: Acyclic queries & Enumeration. Slides, 20-min video, Tutorial page
^ Bagan, Guillaume; Durand, Arnaud; Grandjean, Etienne (2007). "On Acyclic Conjunctive Queries and Constant Delay Enumeration". In Duparc, Jacques; Henzinger, Thomas A. (eds.). Computer Science Logic. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer. pp. 208–222. doi:10.1007/978-3-540-74915-8_18. ISBN 978-3-540-74915-8.
^ Berkholz, Christoph; Gerhardt, Fabian; Schweikardt, Nicole (2020-02-24). "Constant delay enumeration for conjunctive queries: a tutorial". ACM SIGLOG News. 7 (1): 4–33. doi:10.1145/3385634.3385636. S2CID 211521785.
^ Abo Khamis, Mahmoud; Ngo, Hung Q.; Rudra, Atri (2016-06-15). "FAQ: Questions Asked Frequently". Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. PODS '16. New York, NY, USA: Association for Computing Machinery. pp. 13–28. doi:10.1145/2902251.2902280. ISBN 978-1-4503-4191-2.
^ ^a ^b ^c ^d ^e Philip A. Bernstein; Dah-Ming W. Chiu, "Using Semi-Joins to Solve Relational Queries", Journal of the ACM 28(1): 25–40, 1981 (received December 1978), https://doi.org/10.1145/322234.322238
^ ^a ^b ^c ^d Philip A. Bernstein; Nathan Goodman, "Power of Natural Semijoins", SIAM Journal on Computing 10(4): 751–771, 1981 (received December 1979), https://doi.org/10.1137/0210059
^ Bernstein, Philip A.; Goodman, Nathan (1979). "Full reducers for relational queries using multi-attribute semijoins". Proceedings of the Computer Networking Symposium. Gaithersburg, Maryland. pp. 206–215, (cited in Bernstein and Chiu 1981){{cite conference}}: CS1 maint: postscript (link)
^ Bernstein, Philip A.; Goodman, Nathan (1979). The theory of semi-joins (Technical report). Computer Corporation of America. TR CCA-79-27, (cited in Yannakakis 1981){{cite tech report}}: CS1 maint: postscript (link)
^ ^a ^b Gottlob, Georg; Leone, Nicola; Scarcello, Francesco (2001). "The Complexity of Acyclic Conjunctive Queries". Journal of the ACM. 48 (3): 431–498. doi:10.1145/382780.382783.