Chapter 2 - Query Processing and Optimization
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
This chapter discusses the techniques used internally by a DBMS to process, optimize, and execute high-level queries.

A query expressed in a high-level query language such as SQL must first be scanned, parsed, and validated.

The scanner identifies the query tokens, such as SQL keywords, attribute names, and relation names, that appear in the text of the query.

The parser checks the query syntax to determine whether it is formulated according to the syntax rules (rules of grammar) of the query language.

The query must also be validated by checking that all attribute and relation names are valid and semantically meaningful names in the schema of the particular database being queried.

An internal representation of the query is then created, usually as a tree data structure called a query tree. It is also possible to represent the query using a graph data structure called a query graph. The DBMS must then devise an execution strategy or query plan for retrieving the results of the query from the database files.

A query typically has many possible execution strategies, and the process of choosing a suitable one for processing a query is known as query optimization.

Query-Processing:- The process of translating high-level queries (SQL) into low-level expressions.
Query-Processing-Steps:- Parsing, translation, optimization, and execution of the query.
Introduction (5)
Figure: Schema diagram for the company relational database.

Translating SQL Queries into Relational Algebra (1)
Query-Decomposition:- Before translation, the SQL query is decomposed into query blocks.
Query-Blocks:- A query block is the basic unit that can be translated into algebraic operators and then optimized.
Translating SQL Queries into Relational Algebra (2)

    SELECT LNAME, FNAME
    FROM EMPLOYEE
    WHERE SALARY > ( SELECT MAX (SALARY)
                     FROM EMPLOYEE
                     WHERE DNO = 5);

This query retrieves the names of employees (from any department in the company) who earn a salary that is greater than the highest salary in department 5. The query includes a nested subquery and hence would be decomposed into two blocks.

Translating SQL Queries into Relational Algebra (3)
The outer block is:-

    SELECT LNAME, FNAME
    FROM EMPLOYEE
    WHERE SALARY > C

where C represents the result returned from the inner block. The inner block is:-

    SELECT MAX (SALARY)
    FROM EMPLOYEE
    WHERE DNO = 5

The two blocks are then translated into relational algebra:

    Outer block: π_{LNAME, FNAME}(σ_{SALARY > C}(EMPLOYEE))
    Inner block: ℱ_{MAX SALARY}(σ_{DNO = 5}(EMPLOYEE))

Then the query optimizer would choose an execution plan for each query block.

After that, the relational algebra expression can be represented internally as either a query tree or a query graph data structure and then optimized. Query trees and query graphs are two internal representations of a query.

Query-Tree:- A tree data structure that corresponds to a relational algebra expression. It represents the input relations of the query as leaf nodes of the tree, and represents the relational algebra operations as internal nodes.

An execution of the query tree consists of executing an internal node operation whenever its operands are available and then replacing that internal node by the relation that results from executing the operation.
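
The bottom-up execution of a query tree can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node classes `Relation`, `Select`, and `Project`, the `execute` function, and the sample EMPLOYEE rows are all hypothetical and are not part of the slides. The sketch evaluates the inner block first to obtain C, then the outer block.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Relation:
    """Leaf node: an input relation held as a list of dict rows."""
    rows: list


@dataclass
class Select:
    """Internal node (sigma): keeps the rows that satisfy the predicate."""
    predicate: Callable[[dict], bool]
    child: "Node"


@dataclass
class Project:
    """Internal node (pi): keeps the listed attributes (no duplicate elimination)."""
    attrs: tuple
    child: "Node"


Node = Union[Relation, Select, Project]


def execute(node):
    """Execute an internal node once its operand is available, bottom-up."""
    if isinstance(node, Relation):
        return node.rows
    rows = execute(node.child)  # the operand must be produced first
    if isinstance(node, Select):
        return [r for r in rows if node.predicate(r)]
    if isinstance(node, Project):
        return [{a: r[a] for a in node.attrs} for r in rows]
    raise TypeError(f"unknown node type: {type(node).__name__}")


# Hypothetical sample data, for illustration only.
EMPLOYEE = Relation([
    {"LNAME": "Smith", "FNAME": "John", "SALARY": 30000, "DNO": 5},
    {"LNAME": "Wong", "FNAME": "Franklin", "SALARY": 40000, "DNO": 5},
    {"LNAME": "Wallace", "FNAME": "Jennifer", "SALARY": 43000, "DNO": 4},
    {"LNAME": "Borg", "FNAME": "James", "SALARY": 55000, "DNO": 1},
])

# Inner block: MAX SALARY over the employees of department 5.
inner_rows = execute(Select(lambda r: r["DNO"] == 5, EMPLOYEE))
C = max(r["SALARY"] for r in inner_rows)

# Outer block: projection on LNAME, FNAME of the selection SALARY > C.
outer = Project(("LNAME", "FNAME"), Select(lambda r: r["SALARY"] > C, EMPLOYEE))
result = execute(outer)
```

With the sample rows above, the root node returns the two employees whose salary exceeds the department 5 maximum.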
Notation for Query Trees (2) Notation for Query Trees (3)
The execution terminates when the root node operation is executed and
produces the result relation for the query.
Figure (a) below shows a query tree for the following query: For every project located in ‘Stafford’, retrieve the project number, the controlling department number, and the department manager’s last name, address, and birthdate. This query corresponds to the following relational algebra expression.
Notation for Query Trees (4)
In Figure (a) above, the leaf nodes P, D, and E represent the three relations PROJECT, DEPARTMENT, and EMPLOYEE, respectively, and the internal tree nodes represent the relational algebra operations of the expression.

When this query tree is executed, the node marked (1) in Figure (a) must begin execution before node (2), because some resulting tuples of operation (1) must be available before we can begin executing operation (2). Similarly, node (2) must begin executing and producing results before node (3) can start execution, and so on. A query tree therefore represents a specific order of operations.

Notation for Query Graph (1)
Another data structure for representing a query is the query graph. Relations in the query graph are represented by relation nodes, which are displayed as single circles. Constant values are represented by constant nodes, which are displayed as double circles or ovals.

Selection and join conditions are represented by the graph edges, as shown in Figure (c) below.

Finally, the attributes to be retrieved from each relation are displayed in square brackets above each relation.
Notation for Query Graphs(2) Algorithms for External Sorting (1)
Algorithms for External Sorting (2)
The sorting algorithm requires buffer space in main memory to perform sorting and merging. It uses the sort-merge strategy.

Sort-Merge Strategy:- The algorithm starts by sorting small subfiles (runs) of the main file and then merges the sorted runs, creating larger sorted subfiles that are merged in turn.
The sort-merge algorithm requires buffer space in main memory, where the actual sorting and merging of the runs is performed.

Sorting phase:- nR = ⌈b/nB⌉
Merging phase:- dM = min(nB - 1, nR); nP = ⌈log_dM(nR)⌉

nR: number of initial runs.
b: number of file blocks.
nB: available buffer space (in blocks).
dM: degree of merging.
nP: number of passes.

Algorithms for External Sorting (3)
Sorting Phase:- For example, if the number of available main memory buffers is nB = 5 disk blocks and the size of the file is b = 1024 disk blocks, then nR = ⌈b/nB⌉ = ⌈1024/5⌉ = 205 initial runs, each of size 5 blocks except the last, which holds the remaining 4 blocks. Hence, after the sorting phase, 205 sorted runs (205 sorted subfiles of the original file) are stored as temporary subfiles on disk.

In the merging phase, the sorted runs are merged during one or more merge passes.

Merging Phase:- The degree of merging (dM) is the number of sorted subfiles that can be merged in each merge step. One buffer block is needed for each subfile being merged and one more for the merge output, so dM is the smaller of (nB - 1) and nR, and the number of merge passes is ⌈log_dM(nR)⌉.

In our example where nB = 5, dM = 4 (four-way merging), so the 205 initial sorted runs are merged 4 at a time, producing 52 larger sorted subfiles at the end of the first merge pass.
Algorithms for External Sorting (4) Algorithms for External Sorting (5)
These 52 sorted files are then merged 4 at a time into 13 sorted files,
which are then merged into 4 sorted files, and then finally into 1 fully
sorted file, which means that four passes are needed.
The performance of the sort-merge algorithm can be measured in the
number of disk block reads and writes (between the disk and main
memory) before the sorting of the whole file is completed. See Example.
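
The run counts in the example can be checked with a short Python sketch. The function name `sort_merge_plan` is hypothetical. The block-transfer estimate is an assumption added here, not a figure from the slides: it supposes that the sorting phase and every merge pass each read and write every file block exactly once.

```python
import math


def sort_merge_plan(b, nB):
    """Return (nR, dM, runs_after_each_pass) for a file of b blocks and nB buffers."""
    if nB < 3:
        # With fewer than 3 buffers the degree of merging would drop below 2.
        raise ValueError("at least 3 buffer blocks are needed to merge")
    nR = math.ceil(b / nB)      # number of initial sorted runs
    dM = min(nB - 1, nR)        # degree of merging
    runs_after_each_pass = []
    runs = nR
    while runs > 1:             # one loop iteration per merge pass
        runs = math.ceil(runs / dM)
        runs_after_each_pass.append(runs)
    return nR, dM, runs_after_each_pass


nR, dM, runs_after_each_pass = sort_merge_plan(b=1024, nB=5)
nP = len(runs_after_each_pass)  # number of merge passes

# Assumed cost model: 2*b transfers for the sorting phase plus 2*b per merge pass.
total_block_transfers = 2 * 1024 * (1 + nP)
```

For b = 1024 and nB = 5 the simulation reproduces the sequence 205, 52, 13, 4, 1 and therefore four merge passes.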
σSSN=‘12345’(EMPLOYEE) in OP1.
Implementing the SELECT Operation (3)
Implementing the SELECT Operation (contd.):-
Search Methods for Simple Selection:-
S5 Using a clustering index to retrieve multiple records:-

Whenever a single condition specifies the selection, such as OP1, OP2, or OP3, the DBMS can only check whether or not an access path exists on the attribute involved in that condition.
If an access path (such as an index, a hash key, or a sorted file) exists, the method corresponding to that access path is used; otherwise, the brute force, linear search approach of method S1 can be used.

Query optimization for a SELECT operation is needed mostly for conjunctive select conditions whenever more than one of the attributes involved in the conditions has an access path. The optimizer should choose the access path that retrieves the fewest records in the most efficient way by estimating the different costs and choosing the method with the least estimated cost.

Implementing the SELECT Operation (4)
Implementing the SELECT Operation (contd.):-
2. Search Methods for Complex Selection:-

Disjunctive Selection Conditions. Compared to a conjunctive selection condition, a disjunctive condition (where simple conditions are connected by the OR logical connective rather than by AND) is much harder to process and optimize. For example, consider:

With such a condition, little optimization can be done, because the records satisfying the disjunctive condition are the union of the records satisfying the individual conditions. Hence, if any one of the conditions does not have an access path, we are forced to use the brute force, linear search approach.
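
The difference between conjunctive and disjunctive conditions can be sketched as a small decision function. This is a simplified illustration, not an optimizer: the function `choose_selection_method`, the attribute SEX, and the cost figures are hypothetical, and a real optimizer would also compare the chosen access path against the cost of a linear search.

```python
def choose_selection_method(attributes, connective, access_path_costs):
    """Pick a search method for a selection made of simple conditions.

    attributes: attribute names used by the simple conditions
    connective: "AND" (conjunctive) or "OR" (disjunctive)
    access_path_costs: estimated cost per attribute that has an access path
    """
    usable = [a for a in attributes if a in access_path_costs]
    if connective == "AND":
        if not usable:
            return ("linear search", [])
        # One access path is enough: retrieve through the cheapest one,
        # then test the remaining conditions on each retrieved record.
        cheapest = min(usable, key=lambda a: access_path_costs[a])
        return ("access path", [cheapest])
    if connective == "OR":
        # The result is a union, so every condition needs its own access path.
        if len(usable) == len(attributes):
            return ("access path", usable)
        return ("linear search", [])
    raise ValueError("connective must be 'AND' or 'OR'")


costs = {"DNO": 3, "SALARY": 40}  # hypothetical estimated costs in blocks
conjunctive = choose_selection_method(["DNO", "SALARY"], "AND", costs)
disjunctive_all_indexed = choose_selection_method(["DNO", "SALARY"], "OR", costs)
disjunctive_one_missing = choose_selection_method(["DNO", "SALARY", "SEX"], "OR", costs)
```

The last call falls back to a linear search because the hypothetical attribute SEX has no access path.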
Implementing the JOIN Operation (1)
The JOIN operation is one of the most time-consuming operations in query processing.
The algorithms we discuss next are for a join operation of the form:-

    R ⋈_{A=B} S

where A and B are the join attributes, which should be domain-compatible attributes of R and S, respectively. The methods we discuss can be extended to more general forms of join. We illustrate three of the most common techniques for performing a join, using the following sample operations:

Implementing the JOIN Operation (2)
J1 Nested-loop join:-
For each record t in R (outer loop), retrieve every record s from S (inner loop) and test whether the two records satisfy the join condition t[A] = s[B].

J2 Single-loop join:-
If an index exists for one of the two join attributes, say attribute B of file S, retrieve each record t in R (loop over file R), and then use the access structure (such as an index or a hash key) to retrieve directly all matching records s from S that satisfy s[B] = t[A].

J3 Sort-merge join:-
If the records of R and S are physically sorted by value of the join attributes A and B, respectively, we can implement the join in the most efficient way possible: both files are scanned concurrently in order of the join attributes, matching the records that have the same values for A and B.
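
The three methods J1, J2, and J3 can be compared on in-memory data with the sketch below. It makes several assumptions: records are Python dicts, a dict built over S stands in for the index used by J2, the sort-merge version sorts its inputs first instead of assuming physically sorted files, and the attribute DNUMBER and the sample rows are hypothetical.

```python
from collections import defaultdict


def nested_loop_join(R, S, a, b):
    """J1: test every pair (t, s) against the condition t[A] = s[B]."""
    return [(t, s) for t in R for s in S if t[a] == s[b]]


def single_loop_join(R, S, a, b):
    """J2: loop over R and probe an access structure on S.B."""
    index = defaultdict(list)  # stands in for an existing index on S.B
    for s in S:
        index[s[b]].append(s)
    return [(t, s) for t in R for s in index.get(t[a], [])]


def sort_merge_join(R, S, a, b):
    """J3: scan both inputs in join-attribute order, matching equal values."""
    R = sorted(R, key=lambda t: t[a])
    S = sorted(S, key=lambda s: s[b])
    result, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][a] < S[j][b]:
            i += 1
        elif R[i][a] > S[j][b]:
            j += 1
        else:
            key = R[i][a]
            i_end = i
            while i_end < len(R) and R[i_end][a] == key:
                i_end += 1
            j_end = j
            while j_end < len(S) and S[j_end][b] == key:
                j_end += 1
            # Pair up every record of the two groups that share this value.
            result.extend((t, s) for t in R[i:i_end] for s in S[j:j_end])
            i, j = i_end, j_end
    return result


# Hypothetical sample data, for illustration only.
EMPLOYEE = [
    {"LNAME": "Smith", "DNO": 5},
    {"LNAME": "Wong", "DNO": 5},
    {"LNAME": "Borg", "DNO": 1},
    {"LNAME": "Zelaya", "DNO": 4},
]
DEPARTMENT = [
    {"DNUMBER": 5, "DNAME": "Research"},
    {"DNUMBER": 1, "DNAME": "Headquarters"},
    {"DNUMBER": 7, "DNAME": "Unstaffed"},
]


def name_pairs(joined):
    """Reduce joined record pairs to sorted (LNAME, DNAME) tuples for comparison."""
    return sorted((t["LNAME"], s["DNAME"]) for t, s in joined)


j1 = name_pairs(nested_loop_join(EMPLOYEE, DEPARTMENT, "DNO", "DNUMBER"))
j2 = name_pairs(single_loop_join(EMPLOYEE, DEPARTMENT, "DNO", "DNUMBER"))
j3 = name_pairs(sort_merge_join(EMPLOYEE, DEPARTMENT, "DNO", "DNUMBER"))
```

All three functions return the same set of matching pairs; they differ only in how many record comparisons they perform to find them.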
Query Optimization
Implementing the JOIN Operation(3)
1. Heuristic Optimization of Query Trees (2)
In general, many different relational algebra expressions, and hence many different query trees, can be equivalent; that is, they can represent the same query.
The query parser generates an initial query tree that corresponds to an SQL query, without doing any optimization.
For example: For every project located in ‘Stafford’, retrieve the project number, the controlling department number, and the department manager’s last name, address, and birthdate.
The corresponding relational algebra expression is as follows:-

1. Heuristic Optimization of Query Trees (3)
The corresponding SQL query is written as follows:-

1. The initial (canonical) query tree is as follows:-

The CARTESIAN PRODUCT of the relations specified in the FROM clause is first applied. Then the selection and join conditions of the WHERE clause are applied, followed by the projection on the SELECT clause attributes.
Such a canonical query tree represents a relational algebra expression that is very inefficient if executed directly, because of the CARTESIAN PRODUCT (×) operations.
For example, if the PROJECT, DEPARTMENT, and EMPLOYEE relations had record sizes of 100, 50, and 150 bytes and contained 100, 20, and 5,000 tuples, respectively, the result of the CARTESIAN PRODUCT would contain 100 × 20 × 5,000 = 10 million tuples of record size 100 + 50 + 150 = 300 bytes each.
Therefore, the canonical tree is never executed directly. The heuristic query optimizer transforms this initial query tree into an equivalent final query tree that is efficient to execute.

Example of Transforming a Query:-
Consider the following query Q on the database: Find the last names of employees born after 1957 who work on a project named ‘Aquarius’. This query can be specified in SQL as follows:

The initial query tree for Q is shown in Figure (a) below. Executing this tree directly first creates a very large file containing the CARTESIAN PRODUCT of the entire EMPLOYEE, WORKS_ON, and PROJECT files.
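
The size of the CARTESIAN PRODUCT in the PROJECT, DEPARTMENT, EMPLOYEE example can be worked out as follows. The tuple counts and record sizes come from the example; the variable names and the total in bytes are added here for illustration.

```python
# (number of tuples, record size in bytes) for each relation in the example
relations = {
    "PROJECT": (100, 100),
    "DEPARTMENT": (20, 50),
    "EMPLOYEE": (5000, 150),
}

product_tuples = 1
product_record_size = 0
for n_tuples, record_size in relations.values():
    product_tuples *= n_tuples          # every combination of tuples appears
    product_record_size += record_size  # each result record concatenates one of each

product_bytes = product_tuples * product_record_size
```

Ten million records of 300 bytes amount to 3,000,000,000 bytes of intermediate data, which is why selections and joins are applied before any such product is materialized.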
1. Heuristic Optimization of Query Trees(7)
1. Heuristic Optimization of Query Trees(6)
That is why the initial query tree is never executed, but is transformed
into another equivalent tree that is efficient to execute using the
following heuristic rules.
2. Using Selectivity and Cost Estimates in Query Optimization(1) 2. Using Selectivity and Cost Estimates in Query Optimization(2)
2. Using Selectivity and Cost Estimates in Query Optimization (3)
1. Access Cost to Secondary Storage:-
This is the cost of transferring (reading and writing) data blocks between secondary disk storage and main memory buffers. The cost of searching for tuples in the database relations depends on the type of access structures on that relation, such as ordering, hashing, and primary or secondary indexes.
2. Storage Cost:-
The storage cost is the cost of storing any intermediate relations that are generated by the execution strategy for the query.
3. Computation Cost:-
Computation cost is the cost of performing in-memory operations on the data buffers during query execution. Such operations include searching for and sorting records, merging records for a join, and performing computations on field values. This is also known as CPU cost.

2. Using Selectivity and Cost Estimates in Query Optimization (4)
4. Memory Usage Cost:- The cost relating to the number of memory buffers needed during query execution.
5. Communication Cost:- The cost of transferring the query and its results from the database site to the site or terminal where the query originated.
2. Using Selectivity and Cost Estimates in Query Optimization (5)
Example of Cost Estimation for SELECT Operation:-
Let us consider the relation EMPLOYEE having the following attributes:-

    EMPLOYEE (EMP-ID, DEPT-ID, POSITION, SALARY)

Let us consider the following assumptions:-
There is a hash index on primary key attribute EMP-ID.
There is a clustering index on foreign key attribute DEPT-ID.

Let us also assume that the EMPLOYEE relation has the following statistics in the system catalog:-

2. Using Selectivity and Cost Estimates in Query Optimization (6)

    nTuples(EMPLOYEE)            = 6,000
    bFactor(EMPLOYEE)            = 60
    nBlocks(EMPLOYEE)            = nTuples(EMPLOYEE) / bFactor(EMPLOYEE)
                                 = 6,000 / 60 = 100
    nDistinct_DEPT-ID(EMPLOYEE)  = 1,000
    SC_DEPT-ID(EMPLOYEE)         = nTuples(EMPLOYEE) / nDistinct_DEPT-ID(EMPLOYEE)
                                 = 6,000 / 1,000 = 6
    nDistinct_POSITION(EMPLOYEE) = 20
    SC_POSITION(EMPLOYEE)        = nTuples(EMPLOYEE) / nDistinct_POSITION(EMPLOYEE)
                                 = 6,000 / 20 = 300
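
The derived catalog statistics can be recomputed with a few lines of Python. The variable names are hypothetical. The entry for EMP-ID is an added assumption that follows from EMP-ID being the primary key, so that it has one distinct value per tuple.

```python
import math

n_tuples = 6000   # nTuples(EMPLOYEE)
b_factor = 60     # bFactor(EMPLOYEE): tuples per block

n_distinct = {
    "EMP-ID": n_tuples,  # assumption: a primary key has one distinct value per tuple
    "DEPT-ID": 1000,
    "POSITION": 20,
}

# Number of blocks occupied by the relation.
n_blocks = math.ceil(n_tuples / b_factor)

# Selection cardinality: average number of tuples matching an equality condition.
selection_cardinality = {
    attr: math.ceil(n_tuples / distinct) for attr, distinct in n_distinct.items()
}
```

A selection cardinality of 1 for EMP-ID means an equality condition on the key returns at most one tuple, while an equality condition on POSITION is expected to return about 300 tuples.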
2. Using Selectivity and Cost Estimates in Query Optimization (7)
Let us consider the following SELECT operations.

2. Using Selectivity and Cost Estimates in Query Optimization (8)
Selection-1: The selection operation contains an equality condition on the primary key EMP-ID of the relation EMPLOYEE. Because the attribute EMP-ID is hashed, we can use strategy 3 to estimate the cost as 1 block. The estimated cardinality of the result relation is SC_EMP-ID(EMPLOYEE) = 1.
2. Using Selectivity and Cost Estimates in Query Optimization (9)
Selection-4: The predicate here involves a range search on the SALARY attribute, which has a B+-tree index. Therefore, we can use strategy 6 to estimate the cost as (2 + (50/2) + (6,000/2)) = 3,027 blocks. A linear search reads only nBlocks(EMPLOYEE) = 100 blocks, so the linear search strategy is used in this case. The estimated cardinality of the result relation is SC_SALARY(EMPLOYEE) = [6,000 * (8,000 - 2,000*2) / (8,000 - 2,000)] = 4,000.

Selection-5: Let us first apply the second condition using the clustering index on DEPT-ID, which has an estimated cost of 3 blocks. While we are retrieving each tuple using the clustering index, we can check whether it satisfies the first condition (POSITION = ‘MANAGER’). We know that the estimated cardinality of the second condition is SC_DEPT-ID(EMPLOYEE) = 6. Let us call this intermediate relation S. Then the number of distinct values of POSITION in S can be estimated as [(6 + 20)/3] ≈ 9. Thus, the estimated cardinality of the result relation will be SC_POSITION(S) = 6/9, rounded up to 1, which would be correct if there is one manager for each department.

Semantic Query Optimization
Semantic Query Optimization:-
Uses constraints specified on the database schema in order to modify one query into another query that is more efficient to execute.
Consider the following SQL query:-

    SELECT E.LNAME, M.LNAME
    FROM EMPLOYEE E, EMPLOYEE M
    WHERE E.SUPERSSN = M.SSN AND E.SALARY > M.SALARY

Explanation:-
Suppose that we had a constraint on the database schema that stated that no employee can earn more than his or her direct supervisor. If the semantic query optimizer checks for the existence of this constraint, it need not execute the query at all, because it knows that the result of the query will be empty.
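
The idea behind semantic query optimization can be illustrated with a deliberately small sketch. Everything here is hypothetical: the function `employees_earning_more_than_supervisor`, the flag `constraint_declared`, and the sample rows. A real optimizer would reason over declared schema constraints rather than a Boolean flag.

```python
def employees_earning_more_than_supervisor(employees, constraint_declared):
    """Return (employee LNAME, supervisor LNAME) pairs where the employee earns more.

    constraint_declared: True if the schema guarantees that no employee
    earns more than his or her direct supervisor.
    """
    if constraint_declared:
        # The WHERE clause contradicts the constraint, so the result must be
        # empty and the relation does not need to be read at all.
        return []
    by_ssn = {e["SSN"]: e for e in employees}
    return [
        (e["LNAME"], by_ssn[e["SUPERSSN"]]["LNAME"])
        for e in employees
        if e["SUPERSSN"] in by_ssn and e["SALARY"] > by_ssn[e["SUPERSSN"]]["SALARY"]
    ]


# Hypothetical data that satisfies the constraint.
consistent = [
    {"SSN": "1", "SUPERSSN": None, "LNAME": "Borg", "SALARY": 55000},
    {"SSN": "2", "SUPERSSN": "1", "LNAME": "Wong", "SALARY": 40000},
    {"SSN": "3", "SUPERSSN": "2", "LNAME": "Smith", "SALARY": 30000},
]

skipped = employees_earning_more_than_supervisor(consistent, constraint_declared=True)
executed = employees_earning_more_than_supervisor(consistent, constraint_declared=False)
```

On data that obeys the constraint, both calls return an empty list, but the first one does so without touching the EMPLOYEE rows.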
plan.
Finally, best query evaluation plan is submitted to query