INFS1603 05 Normalization 1 rev13 PDF

Title	INFS1603 05 Normalization 1 rev13
Author	Dongxuan Sun
Course	Information Systems in Business
Institution	University of New South Wales
Pages	65
File Size	2.5 MB
File Type	PDF
Total Downloads	65
Total Views	149

Preview

CLICK TO PREVIEW PDF

Summary

Download INFS1603 05 Normalization 1 rev13 PDF

Description

© Dr. Daniel Schlagwein or the respective right holders. All rights reserved.

UNSW Business School | Information Systems and Technology Management

INFS1603 Business Databases

W5: Normalization 1 Dr. Daniel Schlagwein

Recap: W4 Learnings q Relational Database Modelling § Definition relational model § Relational model integrity § From ER diagram to relational model (from entities to schema/tables) § Mapping ER relationships in the relational model

2

Plan: W5 Learnings q Redundancy (why Normalization?) q Functional Dependencies § Inclusion (or reflexive) rule § Augmentation rule § Transitivity rule, … q Normal Forms § 1NF § 2NF § 3NF § BCNF q Denormalization

3

Logical Data Modelling q A conceptual data model (e.g., ER model) represent the conceptual view of organizational data. q A logical data model (e.g., relational model) describe organizational data in a way to could be used for implementation in a DBMS. (the logical model is still independent of any particular DBMS) We have learnt how to develop ER models (conceptual). We have learnt how to convert ER models to relational schema (logical). How can we measure how good a grouping of attributes into relational schema is?

4

Redundancy q Database designers aim to reduce redundancy (i.e., database should not store same data several times) to save space and prevent problems. q Aim for the rule(s) of one: § One type of item/entity type = (only) one relation/table § One item/entity instance = (only) one tuple/row § One fact/attribute about entity = (only) one attribute/column § Each attribute should explain (only) the entity type (relation/table) it belongs to. q To achieve these aims, we use normalization techniques…

5

Normalization “Normalization is a process for converting complex data structures (relations) into simpler, more stable data structures.” Normalization… q …is a process that is accomplished in stages. q …is a technique that is used to define “goodness” (or “badness”) of a relation. q …results in data structures that have some desirable (“good”) properties. A Normal Form… q …is a certain state of a relation. q …can be determined by applying rules regarding dependencies. q …uses a concept called functional dependency…

6

Functional Dependency

Source:(educationviews.org

7

Functional Dependency (FD) “A Functional Dependency (FD) is a restriction: it expresses the fact that some values for a relation are not possible, given the way the world works.” FDs … q …are relationships between attributes in a relation. q …are the semantics of the attributes in a relation. q …can be inferred in a systematic way by applying a set of inference rules (next slides).

8

Inference Rules This is an inference rule: A®B Read: A determines B. (if A then B; if the premise A holds, then the conclusion B holds; B can be inferred from A; A implies B) In a relation R, an attribute A (or set of attributes) determines an attribute B (or set of attributes) if the values of A uniquely determine the values of B in all cases. In other words, B is “functionally dependent” on A. Postcode®State. (“2052”®“NSW”, not “2052”®”VIC”).

9

Inclusion (Reflexive) Rule Y, X, W, Z are sets of attributes of a relation R. if Y Í X then X®Y (Í = is a subset of) Example: if StateÍPostcode,State then Postcode,State®State

10

Augmentation Rule if X®Y then WX ®WY Example: if Postcode®State then Suburb,Postcode®Suburb,State

11

Transitivity Rule if X®Y and Y®Z then X®Z Example: if Postcode®Suburb and Suburb®State then Postcode®State

12

Armstrong's Inference Rules 1. Inclusion (Reflexive) rule 2. Augmentation rule 3. Transitivity rule q Armstrong's inference rules (or Armstrong’s axioms) are a set of inference rules that can be used to infer all the FDs based on a given set of FDs. q From paper: Armstrong (1974).

13

Extended Inference Rules Additivity (Union) Rule if x®y and x®z then x®yz if Postcode®State and Postcode®Suburb then Postcode®Suburb,State Decomposition (Projective) Rule if x®yz then x®y and x®z if Postcode®Suburb,State then Postcode®Suburb and Postcode®State

14

Extended Inference Rules Pseudotransitivity Rule if x®y and wy®z then wx®z if Suburb®City and Postcode,City®State then Postcode,Suburb®State Accumulation Rule if x®yz and z®bw then x®yzbw if Postcode®Suburb,State and Suburb®City,Country then Postcode®Suburb,State,City,Country

15

FDs q F of is a set of given FDs. q F+ is the set of all implied FDs. F+ is called the closure of F. q Fmin is the minimal set (minimal cover) of FDs that is equivalent to F. This minimal set of FDs is has no redundancies. How to come from a given set F of FDs to F+ or Fmin? Use Armstrong’s inference rules!

16

Exercises

17

Normalization

Source:(wordpress.com

18

Lossless Decomposition q Our aim is to decompose relations/tables so to reduce size/redundancy. q We use inferences rules for this decomposition process. q We need to be sure that the decomposed components (tables/relations) have the lossless join property (i.e., decomposed components could be joined back together to the original table/relation).

19

Decomposition Example Which of the two decompositions of SUPPLIER relation is better? (i.e., which one could be joined back together to the original relation?)

20

Normalization q Normalization is a process for converting a relation to a standard (normal) form. q Normalization is about being able to … § …decompose a relation/table into smaller components… § …in such a way that we could recapture the precise content of the original relation/table… § …if we would join (i.e., natural join) the decomposed components. q Based on paper: Codd (1971).

21

Why Normalization? q The reasons for applying normalization are… § to minimise or eliminate redundancy (duplication of data). § to prevent data inconsistencies through update, deletion, and insertion anomalies. § to make database design consistent. q Redundancy: occurs when data about a one entity is recorded more than once in a database. q Anomalies: § Addition/insertion anomaly: failure to add new data in all the places in the database where the data needs to be added. § Deletion anomaly: failure to remove new data in all the places in the database where the data needs to be removed. § Update anomaly: failure to update new data in all the places in the database where the data needs to be updated.

22

23

Construction Company Example Scenario: database for reports for a construction company § § § §

Building project has: Project number, Name, Employees assigned to the project. Employee has: Employee number, Name, Job classification. The company charges its clients by billing the hours spent on each project. The hourly billing rate is dependent on the employee’s position.

The following slide shows a table with contents correspond to the reporting requirements but is not “normalized.”

24

Non-Normalized Table (with Layout)

25

Non-Normalized Table (without Layout)

26

Example: Table Problems q The project number is intended to be (part of) a PK, but it contains NULLs. q The table has data redundancies. q The table entries invite data inconsistencies and anomalies (addition, deletion, update anomalies).

27

1NF 28

1NF q Aim: creating a valid relation. q A relation / table is in 1NF if: § All attributes contain only atomic values (i.e., no multivalued attributes). § The PK attributes are defined and not NULL (i.e., a valid PK). q Actions to create / check 1NF: § Add the appropriate entry in at least the primary keys column(s). § Avoid / split multivalued attributes and avoid / split repeating groups of data (i.e., transform multivalued attributes in additional columns, or, better, additional rows).

29

Table in 1NF

30

Example: 1NF ALL_IN_ONE (PROJ_NUM, PROJ_NAME, EMP_NUM, EMP_NAME, JOB_CLASS, CHG_HOUR, HOURS)

31

How to Split Multivalued Attributes q This table has too much data per field (multivalued attributes)

Source:(http://www.databasedev.co.uk/1norm_form.html

32

How to Split Multivalued Attributes q Now it is splitted, but we have another problem: repeating groups of fields (esp. a 4th project would lead us to add columns, not rows!)

Source:(http://www.databasedev.co.uk/1norm_form.html

33

How to Split Multivalued Attributes q Here, the multivalued attribute has split in a more useful way!

Source:(http://www.databasedev.co.uk/1norm_form.html

34

2NF 35

2NF q Aim: remove partial dependencies (no repeating values in non-key fields). q A relation/table is in 2NF if each non-key field is functionally dependent on the entire PK (no partial dependencies). Also, § The relation/table must be in 1NF. q Action to create/check 2NF: § Draw FDs and partial dependencies diagrams. § Remove partial dependencies (attributes not functionally dependent on the entire primary key) by separating the data items into a separate relation using appropriate PKs (may need bridge/junction table). Hint: Look for values that occur multiple times in non-key fields. This tells you that you have too many fields in a single table. In a well-designed database, the only data that is duplicated is in key fields used to connect tables.

36

37

Example: 2NF PROJECT (PROJ_NUM, PROJ_NAME) EMPLOYEE (EMP_NUM, EMP_NAME, JOB_CLASS, CHG_HOUR) ASSIGN (PROJ_NUM, EMP_NUM, HOURS)

38

3NF 39

Third Normal Form (3NF) q Aim: remove non-key dependencies, data that is not dependent on other keys. q A relation/table is in 3NF if it has no transitive dependencies (no non-key attributes determined by other non-candidate-key attributes). Also, § …it is in 2NF. q Action to create/check 3NF: § Identify and remove transitive dependency.

40

Example: 2NF PROJECT (PROJ_NUM, PROJ_NAME)

EMPLOYEE (EMP_NUM, EMP_NAME, JOB_CLASS, CHG_HOUR)

ASSIGN (PROJ_NUM, EMP_NUM, HOURS)

41

Example: 3NF PROJECT (PROJ_NUM, PROJ_NAME) EMPLOYEE (EMP_NUM, EMP_NAME, JOB_CLASS) JOB (JOB_CLASS, CGH_HOUR) ASSIGN (PROJ_NUM, EMP_NUM, HOURS)

42

Example: Database in 3NF

43

Example: Database in 3NF

44

BCNF 45

Boyce-Codd Normal Form (BCNF) q Aim: higher normal forms such as BCNF do cover some specific aspects and problems with the 3NF (nonetheless, 3NF is widely considered to be “sufficient” by data based designers). q A relation/table is in BCNF if no non-key attribute determines part of the PK. Also,…it is in 3NF. q Based on paper Boyce & Codd (1974). q Sometimes called 3.5NF. q 3NF is always achievable, BCNF is not always achievable (Beeri & Bernstein 1979).

46

BCNF

F = AB®CD, C®B AB®CD hence AB®C and AB®D C®B hence AC®AB AC®AB and AB®D hence AC®D and C®B

47

3NF to BCNF

48

Denormalization q Normalization is only one of many database design goals. q Normalized (decomposed) tables require additional processing, reducing system speed. q Normalization purity is often difficult to sustain in the modern database environment. q The conflicts between design efficiency, information requirements, and processing speed are often resolved through compromises/trade-offs that include denormalization. q Denormalization is the process of attempting to optimise the performance of a database by (re-)adding redundant data or by grouping data (the reverse process of normalization).

49

Summary q q q q

50

Normalization is a table design technique aimed at minimizing data redundancies. First three normal forms (1NF, 2NF, and 3NF) are most commonly used. Normalization is an important part—but only a part—of the design process. Best practice: continue the iterative ER process until all entities and their attributes are defined and all equivalent tables are in 3NF.

Recap: W5 Learnings q Redundancy (why Normalization?) q Functional Dependencies § Inclusion (or reflexive) rule § Augmentation rule § Transitivity rule, … q Normal Forms § 1NF § 2NF § 3NF § BCNF q Denormalization

51

To Do for W5 q Work on assignment part B. q Read lecture notes and textbook. q Prepare lab, learn Oracle.

52

Homework / Exercises

53

Exercise 1 § We have a relational schema: R = (A, B, C, G, H, I). § We have a set of FDs: F = {A®B, A®C, CG®H, CG®I, B®H}. What is the closure F+ for this relational schema?

54

Your Solution to Exercise 1

55

Exercise 2 We have a set of FDs: F = {A®AC, B®ABC, D®ABC} q What is the minimal cover Fmin for F?

56

Your Solution to Exercise 2

57

Exercise 3 q Identify potential problems with the following table. q Identify the PK. q Draw dependencies diagrams and normalize the table to 3NF.

58

Your Solution to Exercise 3

59

Exercise 4 q For following table: identify the primary key, draw dependencies diagrams and normalize it to 3NF.

60

Your Solution to Exercise 4

61

Exercise 5 Draw the ER model for the following logical model: PROJECT (PROJ_NUM, PROJ_NAME) EMPLOYEE (EMP_NUM, EMP_NAME, JOB_CLASS) JOB (JOB_CLASS, CHG_HOUR) ASSIGN (PROJ_NUM, EMP_NUM, HOURS)

62

Your Solution to Exercise 5

63

Source: murketing.com

64...