Winter 19 Assignment 3 PDF

Title	Winter 19 Assignment 3
Course	Database Systems
Institution	McGill University
Pages	6
File Size	118.2 KB
File Type	PDF
Total Downloads	105
Total Views	148

Preview

CLICK TO PREVIEW PDF

Summary

aaaaaaaaaaaaaaaaaaaaaaaaaaa...

Description

School of Computer Science, McGill University

COMP-421 Database Systems, Winter 2019 Written Assignment 3: Query Evaluation Solution Outline This is an individual assignment. You are required to work on your own to create the solution. In this assignment, we evaluate queries for a banking - credit card database. We look at three relations Account, Card, and Payment, listed below. Account (accountid:INT, name:VARCHAR(30), email:VARCHAR(50), address:VARCHAR(60), balance:FLOAT , accounttype:VARCHAR(12)) 9012101, ’Joe Rich’, [email protected] ’1180 Rue. St. Mathieu, Montreal’, 2500.0, ’checking’ Card (cardid:INT, owneraccountid:INT, paymentdue:FLOAT, cardtype:CHAR(7)) −→ owneraccountid references Account. 122202, 9012101, 250.0, ’AIRPLAN’ Payment(paymentid:INT, srcaccountid:INT, cardid:INT amount:FLOAT, date:CHAR(10)) −→ srcaccountid references Account, cardid references Card. 6223451, 9012101, 122202, 205.0, ’2019-03-12’ INT and FLOAT occupy 8 Bytes, a char has 1 Byte. Assume for each VARCHAR attribute in the Account table, the average size of a value is half the total capacity allocated for that attribute. (e.g: name is on an average 15 characters). There are 8,000 members in the bank, each with one account (no joint accounts, etc.). Half of the members have exactly two cards under their account. A quarter of the members have exactly one card. The rest have no cards. Payments are made on an average once per month per card. The database has 1 full year of data stored in it (12 months - 365 days). A payment can be made from an account that is not necessarily the account attached to the card. (ie., srcaccountid associated with a particular payment of a card need not be the owneraccountid associated with the card itself. As in, someone else paid off the bill). Account balance and Payment amount can each take on 3,000 different values, uniformly distributed between 0 and 2,999 . cardtype takes on 3 different values and accounttype takes on 5 different values, both uniformly distributed. Payments are equally likely to be made on any of the 365 days of the year. In general all database pages are 4,000 bytes with an average 75% fill factor. The only exception is for certain index leaf pages where a single index data entry will take more than 75% of a page, then the database will fill as much space in that page as is required to keep that data entry completely in that leaf page. Also the root might have any fill factor. Rids are 10 bytes each and internal page pointers are 6 bytes each. A data entriy is NOT split across multiple pages. You can assume that the root and intermediate nodes of an index are always in the memory (so you need not account for them in your memory requirements). You can also assume there are 50 free buffer pages available in any question where this information is required, but you can add a few extra ones for the sake of simplifying calculations. Tip:- Most of the questions below will require the number of records for the tables in our model. The information needed to compute these is given in the description above. Ex. 1 —

(24 Points)

A typical query could be the one below, which searches the Payment table for payments above an amount M that were made on a particular date D. 1

SELECT * FROM Payment WHERE date = D AND amount > M (a)(4 Points) Find the number of data pages to store each of the 3 tables. (b)What is the I/O cost of the above query when: (i)(4 Points) You do not use any index. (ii)(8 Points) You use an unclustered type II index on date? (iii)(8 Points) You use an unclustered type II index on amount? Answer (Ex. 1) — (a)Length of an account record = 8 + 15 + 25 + 30 + 8 + 6 = 92 Bytes. Number of pages for Account = 92 × 8000/(0.75 × 4000) = 246 Pages. Length of a card record = 8 + 8 + 8 + 7 = 31 Bytes. Number of cards = 1/2 × 8000 × 2 + 1/4 × 8000 × 1 + 1/4 × 8000 × 0 = 10, 000. Number of pages for Card = 31 × 10, 000/(0.75 × 4000) = 104 Pages. Length of a payment record = 8 + 8 + 8 + 8 + 10 = 42 Bytes. Number of payment records = Number of cards ×12 = 120, 000 Number of pages for Payment = 42 × 120, 000/(0.75 × 4000) = 1680 Pages. (b) (i)no Index: scan the relation: 1680 pages for any values of D and M. (ii)unclustered index on date: There are 365 days in a year, and hence 365 data entries. Number of rids/Data Entry = 120, 000/365 = 329. Size of a data entry = 10 + 329 × 10 = 3, 300 Bytes. This is slightly more that 75% of a page (> 3000). So we will have one data entry per leaf page. Cost of an index look up on a given date value is 1 leaf page + 329 data pages = 330. cost is the same for any values of D and M. (iii)unclustered index on amount: Number of data entries is 3000 (number of distinct values for amount). Number of rids / data entry = 120, 000/3, 000 = 40. Size of a data entry = 8 + 40 × 10 = 408 Bytes. Data entries / leaf page = 0.75 × 4000/408 = 7.35. Number of leaf pages = 3000/7.35 = 408 leaf pages. IO = (3000 − M )/3000 × 408 leaf pages + (3000 − M )/3000 × 120, 000 data pages.

Ex. 2 —

(20 Points)

Given an unclustered type II index on account id of Payment, calculate the estimated I/O and give an estimate of the number of output tuples for an: (a)(6 Points) index nested loop join between Account and Payment. (b)(6 Points) block nested loop join between Account and Payment and Account is the outer relation. (c)(8 Points) sort merge join between Account and Payment (assume relations are not already sorted on the join attribute). Hint:- Remember we used a technique in class to reduce the cost of merge-join so that it is less than the sum of the individual costs of merge sort and join. 2

Answer (Ex. 2) — We will have 120,000 tuples in the output, same as the cardinality of the Payment table. This is because every payment is related to exactly one account. (a)On an average there is 120, 000/8, 000 − 15 payment records per account. Cost of looking up payments for an account id using the index would be 1 leaf page + 15 data pages = 16. Cost of index join is = Number of account pages + Number of accounts × 16. = 246 + 8000 × 16 = 128, 246. (b)Here block size will be about 50 pages. Using the block nested join equation, we have: 246 + ⌈246/50⌉ × 1680 = 8646. (c)We have only 50 buffers, however, both the tables are more than that, hence we require external sorting with more than one pass. Pass 0 on Account will produce 5 runs, costing 246 × 2. Next, pass 0 on Payment will produce 34 runs, costing 1680 × 2. We will start pass 1 for both Account and Payment simultaneously. This is possible because they require only 5 + 1 and 34 + 1 buffers. Cost will be 246 and 1680 respectively. Pipeline the output page of each pass immediately to the join operator (and hence no write cost for Pass 1) that can then perform the join (no cost for join as the pages are already in the memory). Therefore, the total cost is 3 × (246 + 1680) = 5778

Ex. 3 —

(36 Points)

Consider the following query: SELECT P.paymentid, P.date, A.name, A.accountid, C.cardid, C.owneraccountid, P.amount FROM Account A, Card C, Payment P WHERE P.srcaccountid = A.accountid AND P.cardid = C.cardid AND A.accounttype=’checking’ AND C.cardtype=’rewards’ The relational expression below corresponds to a non-optimised representation of this query. Note that for brevity, we refer to the tables Account as A, Card as C and Payment as P: ρ(t1, ΠP.paymentid,P.date,P.amount,P.srcaccountid,P.cardid,A.name,A.accountid,A.accounttype (P × A)) ρ(t2, σ t1.srcaccountid=t1.accountid∧t1.accounttype=′ checking ′ ∧C.cardtype=′ rewards′ (t1 ⊲⊳ C )) Πpaymentid,date,name,accountid,cardid,owneraccountid,amount (t2) (a)(6 Points) Optimise this expression according to the rules of algebraic optimisation discussed in class. Your answer need not take into account data cardinalities for this sub-question, and you need not draw a tree. (b)(30 Points) Now assume the only existing index is an unclustered type II index on cardid of Payment. Write an execution plan for your optimised expression, making sure to describe the execution of each operator. Draw the execution tree for your plan. Note that it does not need to include the number of rows and byted passing through the operators. You should also give rough estimations of I/O costs for your execution plan. 3

Note:- Remember to apply the techniques we saw in class such as pipelining and ”combining” operations (projection with join, etc) when useful, to reduce the overall I/O cost.

Answer (Ex. 3) — (a)First we will do individual tables, selection followed by projection of attributes. ρ(ta, Πname,accountid (σ accounttype=′ checking ′ (Account))) ρ(tc, Πcardid,owneraccountid (σ cardtype=′ rewards′ (Card))) ρ(tp, Πpaymentid,pdate,srcaccountid,cardid,amount (P ayment)) Next you can chose one of the following approaches. (Since we are not considering the cardinalities in this particular case.) Note :- You can also use srcaccountid instead of accountid to produce the output. ρ(t, Πname,accountid,paymentid,date,amount,cardid (ta ⊲⊳accountid=srcaccountid tp) ⊲⊳ tc) Π(paymentid, date, name, accountid, cardid, owneraccountid, amount)(t) Or ... ρ(t, Πpaymentid,date,amount,srcaccountid,cardid,owneraccountid (tc ⊲⊳ tp) ⊲⊳srcaccountid=accountid ta) Π(paymentid, date, name, accountid, cardid, owneraccountid, amount)(t) (b)If we are joining Account with Payment first, we do not have a way of using the cardid index on Payment for index join. For block nested join, the cost would be 246 + ⌈(246/50) × (1/5)⌉ × 1680 = 1926 I/O. I did not do a projection of columns on account before the join because as you see it will not make any difference. Since the records fit completely in the memory without any explicit projection, we can directly proceed with the join. Sort merge join will obviously take more than 1 pass, so we are not going to consider that. Let us call this output, a temp table ”T”. It will contain 24,000 records (1/5 of 120,000 payments). It’s record length (after projecting only required attributes - paymentid,srcaccountid,date,amount,name,cardid will be 8 + 8 + 10 + 8 + 15 + 8 = 57 bytes. This will occupy 57 × 24, 000/4, 000 = 342 pages. We do not have enough memory, so this needs to be written to disk, incuring 342 I/O. Next we need to join with Card. 1/3 of cards will qualify after the selection 1/3 × 10, 000 = 3, 333 records. We can project just the cardid and owneraccountid (length 8 + 8 = 16 bytes). That will take 3, 333 × 16/4, 000 = 14pages. This can be kept in the memory. We can next do a block nested join between this and ”T”, with the projected output of Card as the outer relation. The cost for this would be: 104 + ⌈14/50⌉ × 342 = 446 I/O. Total I/O cost for the query = 1926 + 342 + 446 = 2, 714 I/O.

4

O/P to Client Πpaymentid,date,name,accountid,cardid,owneraccountid,amount ⊲⊳ (Block Nested Join, Card is outer.)

Πpaymentid,accountid,date,amount,name,cardid

Πcardid,owneraccountid

⊲⊳accountid=srcaccountid (Block Nested Join, Account is outer.)

σ cardtype=′ rewards′ Card

σ accounttype=′ checking ′

Payment

Account If we instead joined Card with Payment first, we could use the index on cardid of Payment. Each cardid will have 12 records in Payment (1 per month for a year). Given there are three types of cards, 1/3 of them will qualify. Cost of this index join would be: 104 + (1/3 × 10, 000) × 12 = 40, 104 I/O. On the other hand, using block nested join where the selected output of Card is the outer: 104 + ⌈(104 × 1/3)/50⌉ × 1680 = 1, 784 I/O. Again, I did not do a projection of columns on Card before the join because as you see it will not make any difference. Since the records fit completely in the memory without any explicit projection, we can directly proceed with the join. Sort will not be cheaper. So we can go with block nested join. Let us call this output of the join ”T”. It will have 40,000 records ( 1/3 of 120,000 payments ). If we project only the necessary attributes, (paymentid,date,srcaccountid,amount,cardid,owneraccountid), the record length will be: 8 + 10 + 8 + 8 + 8 + 8 = 50 Bytes. For a total size of 50 × 40, 000 = 2, 000, 000 Bytes. Which will take 500 pages. We need to write this to disk, so 500 I/O. 1/5 of the accounts will qualify for the selection condition on it. Which will be 1/5 × 8000 = 1, 600 records. If we just project the required columns from it (accountid and name), it will be 8 + 15 = 23 Bytes long. And hence a total of 1, 600 × 23 = 36, 800 Bytes or, 10 pages. We can do block nested with ”T” where the projected output of account is the outer table. The cost would be: 246 + ⌈10/50⌉ × 500 = 746 I/O. The total I/O cost would be 1, 784 + 500 + 746 = 3030 I/O. So we can see that joining account and payment first is the cheaper option.

5

O/P to Client Πpaymentid,date,name,accountid,cardid,owneraccountid,amount ⊲⊳srcaccountid=accountid (Block Nested Join, Account is outer.)

Πpaymentid,srcaccountid,date,amount,cardid,owneraccountid

Πname,accountid

⊲⊳ (Block Nested Join, Card is outer.)

σ accounttype=′ checking ′

σ cardtype=′ rewards′ Card

Ex. 4 —

10.

WHERE

(20 Points)

Payment

Account...