Title | 05 Query Processing - lec 5 |
---|---|
Author | Jenny Char |
Course | Data Science. Big Data and Data Variety |
Institution | University of Sydney |
Pages | 53 |
File Size | 3.1 MB |
File Type | |
Total Downloads | 92 |
Total Views | 196 |
lec 5...
DATA3404 – SCALABLE DATA MANAGEMENT Week 5: Query Processing – Introduction; External Sorting 2021 sem1 References: Garcia-Molina/Ullman/Widom – Chapter 15 (start); Ramakrishnan/Gehrke – Chapters 12, 13 and 14 (start); Kifer/Bernstein/Lewis – Chapter 10 (start) Lecturer: Alan Fekete Slides prepared by: Uwe Roehm Based on slides from: Ramakrishnan/Gehrke and Kifer/Bernstein/Lewis And also slides from Dr Bryn Jeffries
DATA3404 "Scalable Data Management"
2-1
Learning Objectives › Understanding of Role and Structure of Query Processing - From SQL to physical data access 1) Query Parsing, 2) Query Optimization, 3) Plan Execution - Expression Tree vs. Evaluation Plan
› Query Execution Algorithms - External Merge Sort - Next week: joins
SQL Commands
Weeks 5 - 7 Parser
Plan Executor
Optimizer
Operator Evaluator
Query Evaluation Engine
3
Agenda › Overview of Query Processing › Relational Algebra [review for students from ISYS2120] › Concepts of Query Processing › External Sorting
4
Overview of Query Processing
5
Database Querying with SQL “List all students (by SID) who are enrolled in Database Systems I” SELECT FROM WHERE AND
studID Enrolled E, UnitOfStudy U E.uosCode = U.uosCode U.uosName = 'Database Systems I'
04a-6
Compiling a query › SQL is declarative - It indicates what information is to be returned
› The DBMS must execute a concrete, imperative program, that describes step-by-step what to do › The DBMS must therefore begin by finding an imperative program which will calculate the result described in the declarative query - This is somewhat similar to the way a programming language can be compiled into machine instructions
› The process is done in stages - Convert the text of SQL into a structured logical form (described using Relational Algebra) - Convert the Relational Algebra logical plan into a plan with physical operations (each of which has a program already available) - Actually, there are many possible physical plans for a single SQL query - The way the DBMS chooses a good physical plan is called “query optimisation” 7
Basic Steps in Query Processing SELECT name FROM Student NATURAL JOIN Enrolled WHERE uosCode=‘DATA3404’;
query
parser and translator
PROJECT(name) NESTED LOOPS
INDEX FULL SCAN (Enrolled_idx, uosCode=‘DATA3404')
INDEX UNIQUE SCAN (Student_idx, Student.sid=Enrolled.sid)
relational algebra expression
pname
TABLE ACCESS BY INDEX ROWID (Student)
optimizer
execution plan
⋈ evaluation engine
suosCode=‘DATA3404’ Student Enrolled
Name statistics about data
data
query output
John Smith Sally Waters 8
Relational Algebra (abbreviation: RA)
9
The Big Idea › Users request information from a database using a query language › A query that extracts information can be seen as calculating a relation from combining one or more relations in the current state of the database › Relational algebra (RA) defines some basic operators that can be used to express a calculation like that - Each operator takes one or more relation instances, and produces a relation instance (the operation part of the relational data model)
› Why is it important? - Helps us understanding the precise meaning of declarative SQL queries. - Intermediate language used within DBMS (cf. chapter on db tuning)
10
The Role of Relational Algebra in a DBMS
[cf. Kifer/Bernstein/Lewis, Figure 5.1] 11
What is an Algebra? › A language based on operators and a domain of values: - basic operators - atomic operands (constants and variables) - rules to form complex expressions from operators and operands, typically using nesting with parentheses
› Example: Algebra of Arithmetic - Operators for usual arithmetic operations: + -
* /
- Operands: variables (x, y, z, …) and numerical constants - Example arithmetic expression: 100 – ((x + y) / 2)
› In databases, operators and operands are finite relations, which leads to the Relational Algebra - We refer to expression as a query and the value produced as query result 12
Relational Algebra (RA) › 1. Set Operations - Union
( È ) tuples in relation 1 or in relation 2.
- Intersection ( Ç ) tuples in relation 1, as well as in relation 2. - Difference
(-)
tuples in relation 1, but not in relation 2.
› 2. Operations that remove parts of a relation - Selection - Projection
( s ) selects a subset of rows from relation. ( p ) deletes unwanted columns from relation.
› 3. Operations that combine tuples from two relations - Cross-product ( X ) allows us to fully combine two relations. - Join ( ) to combine matching tuples from two relations.
› 4. A schema-level ‘rename’ operation - Rename
(r)
allows us to rename one field or even whole relation
› Domain: set of relations - operators take one/more relations as inputs and gives a new relation as result => RA queries by nesting of multiple operators 13
Visualisation of Relational Algebra Projection ( p )
Selection ( s )
Cross-product ( X )
=
X
Set Union ( È )
È
=
Set Minus ( - )
-
Join (
=
)
=
Set Intersection ( Ç )
Ç
=
14
Running Example sid name
uos_code
Student gender
country
sid
name
semester
creditPoints
Enrolled
Student
title
UnitOfStudy
enrolled
UnitOfStudy
gender
country
sid
uos_code
semester
1001
Ian
M
AUS
1001
COMP5138
2005-S2
1002
Ha Tschi
F
ROK
1002
COMP5702
2005-S2
1003
Grant
M
AUS
1003
COMP5138
2005-S2
1004
Simon
M
GBR
1006
COMP5318
2005-S2
1005
Jesse
F
CHN
1001
INFS6014
2004-S1
1006
Franzisca
F
GER
1003
ISYS3207
2005-S2
uos_code
title
credit Points
COMP5138
Relational DBMS
6
COMP5318
Data Mining
6
INFO6007
IT Project Management
6
SOFT1002
Algorithms
12
ISYS3207
IS Project
4
MIT Research Project
18
COMP5702
15
Set Operations › These operations take two input relations R and S - Set Union R È S - Definition: R U S = { t | t
Î
R Ú tÎS}
- Set Intersection R Ç S - Definition: R Ç S = { t | t Î R Ù t Î S } - Set Difference R - S - Definition: R - S = { t | t
Î
R Ù tÏS}
› Important constraint: R and S have the same schema - R, S have the same arity (same number of fields) - `Corresponding’ fields must have the same names and domains
16
Projection › ‘Deletes’ attributes that are not in projection list. - Schema of result contains exactly the fields in the projection list, with the same names that they had in the input relation.
› Examples: Õ name, country (Student)
Õ title (UnitOfStudy)
Student name
country
Ian
AUS
Ha Tschi
ROK
Grant
AUS
Simon
GBR
Jesse
CHN
Franzisca
GER
17
Selection › Selects rows that satisfy a selection condition. - Example:
s country=‘AUS’ (Student) Student sid
name
gender
country
1001
Ian
M
AUS
1003
Grant
M
AUS
› Result relation can be the input for another relational algebra operation! (Operator composition.) Õ name ( s country=‘AUS’ (Student) ) - Example:
Student name Ian Grant 18
Cross-Product › Defined as:
R x S = {t s | t Î R Ù s Î S}
- each tuple of R is paired with each tuple of S. - Resulting schema has one field per field of R and S, with field names `inherited’ if possible. - It might end in a conflict with two fields of the same name -> rename needed
› Sometimes also called Cartesian product › Example:
R A
B
a
1
b
2
x
C
S D
E
a b b g
10 10 20 10
a a b b
=
A
a a a a b b b b
result B C D 1 a 10 1 b 10 1 b 20 1 g 10 2 a 10 2 b 10 2 b 20 2 g 10
E a a b b a a b b 19
Joins R c S = σ c (R×S) Student Lecturer family _ name =last _name
› Conditional Join: - Example:
sid
given
family_name
gender
country
empid
first_name
last_name
room
1001
Cho
Chung
M
AUS
47112344
Vera
Chung
321
1004
Ciao
Poon
M
CHN
12345678
Simon
Poon
431
1004
Ciao
Poon
M
CHN
99004400
Josiah
Poon
482
1111
Alice
Poon
F
AUS
12345678
Simon
Poon
431
1111
Alice
Poon
F
AUS
99004400
Josiah
Poon
482
…
…
…
…
…
…
…
› Result schema same as the cross-product’s result schema. › Sometimes called theta-join. › Equi-Join: Special case where the condition c contains only equalities. 20
Natural Join › Natural Join:
R S
› Equijoin on all shared-named fields. - Result schema similar to cross-product, but only one copy of fields for which equality is specified. Enrolled
result
UnitOfStudy points
sid
uos_code
Relational DBMS
6
1001
COMP5138
Relational DBMS
6
COMP5318
Data Mining
6
1002
COMP5702
MIT Research Project
18
COMP5138
INFO6007
IT Project Mgmt.
6
1003
COMP5138
Relational DBMS
6
1006
COMP5318
SOFT1002
Algorithms
12
1006
COMP5318
Data Mining
6
1001
INFO6007
ISYS3207
IS Project
4
1001
INFO6007
IT Project Mgmt.
6
1003
ISYS3207
COMP5702
MIT Research Project
18
1003
ISYS3207
IS Project
4
sid
uos_code
uos_code
1001
COMP5138
COMP5138
1002
COMP5702
1003
title
=
title
points
21
Rename Operation › Allows to name, and therefore to refer to, the results of relational-algebra expressions. › Allows to refer to a relation by more than one name. › Notation 1:
r x (E)
- returns the expression E under the name X (rename whole relation)
› Notation 1:
r X (a1 à b1) (E)
- returns the result of expression E under the name X, and with the attributes a1 …. renamed to b1 … (rename individual attributes) - (assumes that the relational-algebra expression E has arity n)
› Example:
r Classlist( sid à student ) ( s uos_code=‘ISYS2120' (Enrolled) ) 22
Basic versus Derived Operators › We can distinguish between basic and derived RA operators › Only 6 basic operators are required to express everything else: - Union
(È)
tuples in relation 1 or in relation 2.
- Set Difference ( - ) - Selection (s) - Projection (p)
tuples in relation 1, but not in relation 2.
- Cross-product ( X ) - Rename (r)
allows us to fully combine two relations.
selects a subset of rows from relation. deletes unwanted columns from relation. allows us to rename one field to another name.
› Additional (derived) operations: - intersection, join, division: - Not essential, but (very!) useful. - Cf. Join:
R c S = σ c (R×S) 23
Relational Expressions › A basic expression in the relational algebra consists of either one of the following: - Variable that refers to a relation of the same name in the database - A constant relation
› Let E1 and E2 be relational-algebra expressions; then the following are all relational-algebra expressions: - E1 È E2 - E1 - E2 - E1 x E2 - E1 E2 - sP (E1), P is a (complex) predicate on attributes in E1 - πS (E1), S is a list consisting of some of the attributes in E1 - rX (E1), x is the new name for the result of E1 24
Exercise 1: Relational Algebra Each of the following queries can be performed with a single Relational Algebra operation. Which operator is required in each case? 1. SELECT name, age FROM Student
2. SELECT * FROM Enrolled WHERE mark BETWEEN 85 AND 100 3. SELECT * FROM Student, Enrolled
4. SELECT * FROM Enrolled E, UosOffering U WHERE U.uoscode=E.uoscode AND U.semester=E.semester AND U.year=E.year
25
Concepts of Query Processing
26
Overview of Query Translation and Execution › SQL gets translated into relational algebra, which can be shown as logical expression tree. - Operators have one or more input sets and return one (or more) output set(s). - Leaves correspond to (base) relations.
› Query Execution Plan: Tree of relational algebra operators, with choice of algorithm for each operator.
› Two main issues in query optimization (cf. week 7): -
For a given query, what plans are considered? - Algorithm to search plan space for cheapest (estimated) plan.
-
How is the cost of a plan estimated?
› Ideally:Want to find best plan. › Practically: Avoid worst plans! 27
Parsing and Translation SELECT name SQL query FROM Student NATURAL JOIN Enrolled WHERE uosCode=‘DATA3404’;
Projection
Join
Selection
Relational Algebra (RA)
pname(Student ⋈ suosCode=‘DATA3404’(Enrolled)) Execution Plan
RA Expression Tree
pname
PROJECT(name)
⋈
NESTED LOOPS
suosCode=‘DATA3404’ Student
INDEX FULL SCAN (Enrolled_idx, uosCode=‘DATA3404')
INDEX UNIQUE SCAN (Student)
Enrolled
Logical Operators
Physical Operators 28
Exercise 2: Relational Algebra Expression Trees Write a relational algebra expression, in the form of an expression tree, equivalent to: SELECT S.name, U.uosName, FROM Student S, Enrolled E, UosOffering U WHERE S.sid = E.sid AND U.uosCode = E.uosCode AND U.semester = E.semester AND U.year = E.year AND E.mark BETWEEN 85 AND 100;
29
Expression Tree vs. Query Execution Plan
Õ
balance
sbalance...