DAnotes - Data and Analysis notes PDF

Title	DAnotes - Data and Analysis notes
Author	Jatin Soni
Course	Informatics 1 - Data and Analysis
Institution	The University of Edinburgh
Pages	38
File Size	1.7 MB
File Type	PDF
Total Downloads	110
Total Views	174

Preview

CLICK TO PREVIEW PDF

Summary

Data and Analysis notes...

Description

Data Analy alysis Rev evisio ision Note tess Da ta & An aly sis R ev isio n No te Partt I: SStru tructu ctured Data Par tru ctu red D ata En Entitie titie titiess aand nd R Rela ela elatio tio tions ns nship hip hipss 1. Database Design a) Requirements analysis: Understand what data is to be stored in the database and what operations on it are likely to be needed. b) Conceptual design: Develop a high-level description of data to be stored, and any constraints that might apply to the data. This is the level where we might use an ER data model. c) Logical design: Implement the conceptual design by mapping it to a specific data representation. The outcome is a logical schema. (e.g. From ER data model to a relational data model) 2. Entity-relationship model What is it? The entity-relationship (ER) model is a way to organise the description of entities (individual things) and the relationships between them. Why is it useful? It readily maps into different logical data models, such as the relational model. How is it used? As a graphical notation for visualising the structure of data, to clarify and communicate that structure. 3. Entity-relationship Diagrams ER modelling provides a graphical language for describing entities and the relationships between them in an ER diagram. Relationship

Attribute

Primary Key Entity

Primary Key

1

4. Keys Superkey: Any set of attributes whose values uniquely identify each entity instance in an entity set. Key: A minimal set of attributes whose values uniquely identify each entity instance in an entity set. Candidate Key: When there is more than one key, each forms a candidate key. Composite Key: Any key with more than one attribute. Primary Key: One of the candidate keys is selected as the primary key. 5. Key Constraints An entity E has a key constraint in relationship R if each entity instance x of E can appear in at most one relationship instance from R.

6. Participation Constraints Total Participation of entity E in relationship R means that every entity instance x in E appears in at least one relationship instance of R.

Partial Participation of entity E in relationship R means that some entity instances in E might appear in no relationship instance from R.

2

7. Weak Entities ▪ Sometimes the attributes of an entity are not enough to build a key that uniquely identifies each individual. ▪ We may be able to fix this using attributes from other, related, entities. ▪ For this the entity must have a key constraint and total participation in some identifying relationship. ▪ It is then a weak entity, with borrowed attributes coming from the identifying owner.

8. Entity Hierarchies and Inheritance ▪ Sometimes one entity will refine another: a subclass entity specializes a superclass entity. ▪ Each subclass entity inherits the attributes of the superclass entity, and may add its own attributes, too.

3

Th The e Rel Relatio atio ation nal M Mo odel 9. Relational databases Relational databases take as fundamental the idea of a relation, comprising a schema and an instance. ▪ The schema is the format of the relation. ▪ The instance of a relation is a table.

Every relational database is a linked collection of several tables like this.

10. SQL: Standard Query Language ▪ SQL is the standard language for working with relational database management systems. ▪ Substantial parts of SQL are declarative: code states what should be done, not necessarily how to do it. 11. DDL: SQL Data Definition Language The Data Definition Language is a portion of SQL used to declare the schemas for relations. Schema Types: INTEGER, FLOAT, VARCHAR(n)

4

Example: CREATE TABLE Student ( uun VARCHAR(8), name VARCHAR(20), age INTEGER, email VARCHAR(25), PRIMARY KEY (uun) ) CREATE TABLE Takes ( uun VARCHAR(8), code VARCHAR(20), mark INTEGER, PRIMARY KEY (uun, code), FOREIGN KEY (uun) REFERENCES Student, FOREIGN KEY (code) REFERENCES Course )

Fro From m EER R Diag Diagra ra rams ms to R Relat elat elation ion ional al Mo Mode de dels ls 12. From ER to Relational Models ▪ ER modelling gives a high-level conceptual design for a database system, showing what things are to be recorded and how they are connected. ▪ The relational model supports a more explicit logical design, closer to implementation. ▪ The translation will be approximate: some constraints expressed in an ER diagram might not naturally fit into relational schemas. 13. Mapping an Entity CREATE TABLE Student ( uun VARCHAR(8), name VARCHAR(20), age INTEGER, email VARCHAR(25), PRIMARY KEY (uun) )

5

14. Mapping a Relationship

CREATE TABLE Takes ( uun VARCHAR(8), year INTEGER, code VARCHAR(20), mark INTEGER, PRIMARY KEY (uun, code, year), FOREIGN KEY (uun) REFERENCES Student, FOREIGN KEY (code, year) REFERENCES Course ) 15. Mapping Key Constraints

Method 1: CREATE TABLE Joins ( uun VARCHAR(8), time VARCHAR(20), place VARCHAR(32), PRIMARY KEY (uun), FOREIGN KEY (uun) REFERENCES Student, FOREIGN KEY (time, place) REFERENCES InfPALSGroup )

6

Method 2: CREATE TABLE Student ( uun VARCHAR(8), age INTEGER, name VARCHAR(20), email VARCHAR(25), time VARCHAR(10), place VARCHAR(32), PRIMARY KEY (uun), FOREIGN KEY (time, place) REFERENCES InfPALSGroup ) 16. Mapping Key and Participation Constraints

CREATE TABLE Student ( uun VARCHAR(8), age INTEGER, name VARCHAR(20), email VARCHAR(25), pt VARCHAR(8) NOT NULL, PRIMARY KEY (uun), FOREIGN KEY (pt) REFERENCES Staff(staff_id) ) 17. Mapping Weak Entities and Identifying Relationships

CREATE TABLE Room ( number VARCHAR(8), building_name VARCHAR(20), capacity INTEGER, PRIMARY KEY (number, building_name), FOREIGN KEY (building_name) REFERENCES Building(name) ON DELETE CASCADE )

7

18. Mapping Entity Hierarchies and Inheritance

CREATE TABLE Student ( uun VARCHAR(8), age INTEGER, name VARCHAR(20), email VARCHAR(25), PRIMARY KEY (uun) ) CREATE TABLE FullTimeStudent ( uun VARCHAR(8), PRIMARY KEY (uun), FOREIGN KEY (uun) REFERENCES Student ) CREATE TABLE PartTimeStudent ( uun VARCHAR(8), fraction FLOAT, PRIMARY KEY (uun), FOREIGN KEY (uun) REFERENCES Student )

Rela Relatio tio tiona na nall Alg Algebr ebr ebraa 19. Relational Algebra Relational Algebra is a mathematical language for describing certain operations on the schemas and tables of a relational model. Each of these operations takes one or more tables and returns another.

8

20. Basic operations Selection: 𝜎𝑃 (𝑅) is the table of rows in 𝑅 which satisfy predicate 𝑃. e.g. 𝜎𝑎𝑔𝑒>18 (𝑆𝑡𝑢𝑑𝑒𝑛𝑡) Projection: 𝜋𝛼1 ,…,𝛼𝑛 (𝑅) is the table with tuples of fields 𝛼1 , … , 𝛼𝑛 taken from all rows of 𝑅. e.g. 𝜋𝑛𝑎𝑚𝑒,𝑎𝑔𝑒 (𝑆𝑡𝑢𝑑𝑒𝑛𝑡) Renaming: Changes the names of some or all fields in a table, giving a schema of the same arity and type. The contents of the rows are unchanged. e.g. 𝜌𝑒𝑚𝑎𝑖𝑙→𝑎𝑑𝑑𝑟𝑒𝑠𝑠 (𝑆𝑡𝑢𝑑𝑒𝑛𝑡) Union: 𝑅1 ∪ 𝑅2 contains every tuple that appears in either 𝑅1 or 𝑅2 .

Difference: 𝑅1 − 𝑅2 contains every tuple that appears in 𝑅1 but not in 𝑅2 . Cross Product: 𝑅1 × 𝑅2 combines every row of 𝑅1 with every row of𝑅2 .

Intersection: 𝑅1 ∩ 𝑅2 contains every tuple that appears in 𝑅1 and also in 𝑅2 . 21. Derived operations

𝑅1 ∩ 𝑅2 = 𝑅1 − (𝑅1 − 𝑅2 )

Join: 𝑅 ⋈𝑃 𝑆 of relations 𝑅 and 𝑆 with predicate 𝑃 defined on the fields of both relations. 𝑅 ⋈𝑃 𝑆 = 𝜎𝑃 (𝑅 × 𝑆)

Equijoin: Any join where the predicate only involves checking that some fields in 𝑅 are equal to some other fields in 𝑆.

e.g. 𝑆𝑡𝑢𝑑𝑒𝑛𝑡 ⋈𝑆𝑡𝑢𝑑𝑒𝑛𝑡.𝑚𝑛=𝑇𝑎𝑘𝑒𝑠.𝑚𝑛 𝑇𝑎𝑘𝑒𝑠 is an equijoin between these tables on the two 𝑚𝑛 fields.

Natural join: 𝑅 ⋈ 𝑆 of tables 𝑅 and 𝑆 is the equijoin requiring equalities between any fields in the two tables that share the same name, followed by a projection to remove duplicate columns. e.g. 𝑆𝑡𝑢𝑑𝑒𝑛𝑡 ⋈ 𝑇𝑎𝑘𝑒𝑠 = 𝜋𝑚𝑛,𝑛𝑎𝑚𝑒,𝑎𝑔𝑒,𝑒𝑚𝑎𝑖𝑙,𝑐𝑜𝑑𝑒,𝑚𝑎𝑟𝑘 (𝜎𝑆𝑡𝑢𝑑𝑒𝑛𝑡.𝑚𝑛=𝑇𝑎𝑘𝑒𝑠.𝑚𝑛 (𝑆𝑡𝑢𝑑𝑒𝑛𝑡 × 𝑇𝑎𝑘𝑒𝑠))

9

Tu Tuple ple R Rela ela elatio tio tional nal Ca Calcu lcu lculus lus 22. Queries in tuple relational calculus Queries in TRC all have the general form:

{ 𝑇 | 𝑃(𝑇) }

where 𝑇 is a tuple variable and 𝑃(𝑇) is a predicate, a logical formula.

The result of the query is the set of all possible tuple values for 𝑇 such that 𝑃(𝑇) is true.

Examples:

All records for students more than 19 years old:

{ 𝑆 | 𝑆 𝜖 𝑆𝑡𝑢𝑑𝑒𝑛𝑡 ∧ 𝑆. 𝑎𝑔𝑒 > 19 }

All records for students taking math1:

{ 𝑆 | 𝑆 𝜖 𝑆𝑡𝑢𝑑𝑒𝑛𝑡 ∧ ∃𝑇 . (𝑇 𝜖 𝑇𝑎𝑘𝑒𝑠 ∧ 𝑇. 𝑐𝑜𝑑𝑒 = math1 ∧ 𝑆. 𝑢𝑢𝑛 = 𝑇. 𝑢𝑢𝑛) }

{ 𝑇 | ∃𝑆 . 𝑆 𝜖 𝑆𝑡𝑢𝑑𝑒𝑛𝑡 ∧ 𝑆. 𝑎𝑔𝑒 > 19 ∧ 𝑇. 𝑛𝑎𝑚𝑒 = 𝑆. 𝑛𝑎𝑚𝑒 ∧ 𝑇. 𝑎𝑔𝑒 = 𝑆. 𝑎𝑔𝑒 }

Names and ages of all students over 19:

SQ SQLL Dat Dataa Ma Manip nip nipula ula ulatio tio tion n La Lang ng nguag uag uage e & SSQL QL Q Queries ueries 23. Inserting Data into a Table

INSERT INTO Student (uun, name, age, email) VALUES (‘s1428751’, ‘Bob’, 19, ‘[email protected]’)

10

24. Update and Delete Rows in a Table UPDATE Student SET name = ‘Bobby’ WHERE uun = ‘s1428571’ DELETE FROM Student WHERE name = ‘Bobby’ 25. Writing SQL Queries SELECT field-list FROM table-list [ WHERE qualification ] Example 1 (Variations of the same simple query) SELECT * FROM Student WHERE age > 19 SELECT uun, name, age, email FROM Student WHERE age > 19 SELECT Student.uun, Student.name, Student.age, Student.email FROM Student WHERE Student.age > 19 SELECT S.uun, S.name, S.age, S.email FROM Student AS S WHERE S.age > 19 SELECT S.uun, S.name, S.age, S.email FROM Student S WHERE S.age > 19 Example 2 Multiset Result:

Set Result:

SELECT age FROM Student

SELECT DISTINCT age FROM Student

11

Example 3 Find the names and email addresses of all students taking Mathematics 1.

SELECT Student.name, Student.email FROM Student, Takes, Course WHERE Student.uun = Takes.uun AND Takes.code = Course.code AND Course.title = ‘Mathematics 1’ Example 4 (Conjunction query) Find the names of all students who are taking both Informatics 1 and Mathematics 1 SELECT S.name FROM Student S, Takes T1, Course C1, Takes T2, Course C2 WHERE S.uun = T1.uun AND T1.code = C1.code AND S.uun = T2.uun AND T2.code = C2.code AND C1.title = ‘Informatics 1’ AND C2.title = ‘Mathematics 1’

SELECT S.name FROM Student S, Takes T, Course C WHERE S.uun = T.uun AND T.code = C.code AND C.title = ‘Informatics 1’ INTERSECT SELECT S.name FROM Student S, Takes T, Course C WHERE S.uun = T.uun AND T.code = C.code AND C.title = ‘Mathematics 1’

12

Example 5 (Disjunction Query) Find the names of all students who are taking either Informatics 1 or Mathematics 1 SELECT S.name FROM Student S, Takes T, Course WHERE S.uun = T.uun AND T.code = C.code AND (C.title = ‘Informatics 1’ OR C.title = ‘Mathematics 1’)

SELECT S.name FROM Student S, Takes T, Course C WHERE S.uun = T.uun AND T.code = C.code AND C.title = ‘Informatics 1’ UNION SELECT S.name FROM Student S, Takes T, Course C WHERE S.uun = T.uun AND T.code = C.code AND C.title = ‘Mathematics 1’ Example 6 (Difference Query) Find the names of all students who are taking Informatics 1 but not Mathematics 1 SELECT S.name FROM Student S, Takes T, Course C WHERE S.uun = T.uun AND T.code = C.code AND C.title = ‘Informatics 1’ EXCEPT SELECT S.name FROM Student S, Takes T, Course C WHERE S.uun = T.uun AND T.code = C.code AND C.title = ‘Mathematics 1’

13

Example 7 (Comparison Query) Find the names of all students who are taking Informatics 1 but not Mathematics 1 SELECT S1.name AS “Higher”, S2.name AS “Lower” FROM Student S1, Takes T1, Student S2, Takes S2, Takes T2, Course C WHERE S1.uun = T1.uun AND T1.code = C.code AND S2.uun = T2.uun AND T2.code = C.code AND C.title = ‘Informatics 1’ AND T1.mark > T2.mark 26. Aggregates: Operations on multiple values COUNT(val) SUM(val) AVG(val) MAX(val) MIN(val) SUM(DISTINCT val)

The number of values in the val field The total of all values in the val field The mean of all values in the val field The greatest value in the val field The least value in the va l field The total of the distinct values in the val field

Example Find the number of students taking Informatics 1, their mean mark and the highest mark. SELECT COUNT(DISTINCT T.uun) AS “Number”, AVG(T.mark) AS “Mean Mark”, MAX(T.mark) AS “Highest” FROM Student S, Takes T, Course C WHERE S.uun = T.uun AND T.code = C.code AND C.title = ‘Informatics 1’ 27. Executing SQL Queries SQL SELECT queries are very close to a programming-language form for the expressions of the tuple relational calculus, describing the information desired but not dictating how it should be computed. To do that, we need something more like relational algebra. A single SELECT statement combines the operations of join, selection and projection. This immediately suggests one possible strategy: ▪ Compute the complete cross product of all the FROM tables; ▪ Select all the rows which match the WHERE condition; ▪ Project out only the columns named on the SELECT line. In reality, database engines prepare multiple query plans and estimate their cost (in memory space, disk I/O time) before choosing one to execute.

14

28. ACID properties of a transaction A transaction is a single coherent operation on a database. The features that characterise a reliable implementation of transactions are standardly initialized as the ACID properties. A-

Atomicity

C-

Consistency

I -

Isolation

D-

Durability

Atomicity: All-or-nothing: a transaction either runs to completion, or fails and leaves the database unchanged. Consistency: Applying a transaction in a valid state of the database will always give a valid result state. Isolation: Concurrent transactions have the same effect as sequential ones: the outcome is as if they were done in order. Durability: Once a transaction is committed, it will not be rolled back. 29. Who writes SQL? ▪ ▪ ▪ ▪

Hand-written by a programmer Generated by some interactive visual tool Generated by an application to fetch an answer for a user Generated by one program to request information from another.

Most SQL is written by programs, not directly by programmers.

Par mis tru ctu red D ata Partt II: Se Semis mistru tructu ctured Data Tr Trees ees an and d XM XMLL 30. Situations in which it is helpful to structure data in a less rigid way than relational databases ▪ When the data has no strong inherent structure; ▪ Where there is structure in the data, but it varies from item to item; ▪ When we wish to mark up (annotate) existing unstructured data (say, English text) with additional information (such as linguistic structure, or meaning); ▪ When the structure of the data changes as more data accumulates over time.

15

31. Trees ▪ A tree is a set of linked nodes, with a single root node. ▪ Each node is linked to a set of zero or more children, also nodes in the tree. ▪ Every node has exactly one parent, except for the root node which has none. ▪ A node with no children is a leaf; other nodes are internal. ▪ Two nodes with a common parent are sibling nodes. 32. XPath Node Trees

Root node: This is the root of the tree, labelled /. Element nodes: These are labelled with element names, categorising the data below them. In these example the element names are: Factbook, Country, Name, Population, Capital, State and Feature. In the XPath data model, internal nodes other than the root are always element nodes. The root node must have exactly one element node as a child, called the root element. Here the root element is Factbook. Text Nodes: Leaves of the tree storing textual information. For example, “Australia”

16

Attribute nodes: Leaves assigning a value to some attribute of an element node. In the example, we use the @ symbol to identify attributes. In the XPath data model, attribute nodes are treated differently from other node types. For example, although the parent of an attribute node is an element node, when we talk about the children of this parent node we generally don’t include the attribute nodes. 32. Understanding an XPath Data Tree The meaning of data at a text node depends on all the element nodes that appear along the path from the root of the tree to the next node, and on the values of their associated attributes. We usually write these paths with a / separator, beginning at the root. For example, /Factbook/Country/State/Feature/ and the value of the type attribute of the associated Feature element is “Lake”. This tells us that Glenrock Lagoon is a feature in a state in a country in the factbook and that the type of the feature is a lake. 33. XML: Extensible Markup Language XML is a formal language for presenting the kind of semistructured data we have just seen. It is a markup language in that it provides a way to mark up ordinary text with additional information. Example:

Australia 24,360,000 Canberra

New South Wales Glenrock Lagoon Barren Mountain Brokers Nose

(this is a comment)

Elements in XML are case sensitive. A single element may have multiple different attributes: ...

17

Str Struct uct uctur ur uring ing XML 34. Constraints on XML files A file satisfying these constraints is a well-formed XML document: ▪ ▪ ▪ ▪

The tree structure can always be extracted from textual nesting; Elements are always given with their complete name; Attributes are all named; Everything else is unstructured text.

35. Schema Languages for XML In relational databases, a schema specifies the content of a relation. A schema language for XML is any language for specifying similar kinds of structure in XML documents. There are a number of different schema languages in common use. Using a formal schema language means: ▪ Schemas are precise and unambiguous ▪ A machine can validate whether or not a document satisfies a certain schema. If a well-formed XML document D matches the format specified by schema S then we say D is valid with respect to S. One document may be valid with respect to several different schemas; it is also possible to have an XML document that is well-formed but not valid. 36. Document Type Definitions. Document Type Definition or DTD is a basic schema mechanism for XML. The DTD schema language is simple, widely used, and has been an integrated feature of XML since its inception. A DTD includes information about: ▪ Which elements can appear in a document; ▪ The attributes of those elements; ▪ The relationship between different elements such as their order, number and possible nesting.

18

Example:

Name ( #PCDATA > Population ( #PCDATA ) > Capital ( #PCDATA ) > State ( Name, Feature∗ ) > Feature ( #PCDATA ) >

#PCDATA is “parsed character data”. The process of parsing involves a stream of characters for some local structure. This is appropriate for larger amounts of texts, as appears inside an XML element or a text node as a leaf of a tree. CDATA is “character data”. This means a string of characters. This is appropriate for short pieces of text, such as an attribute of an XML element. 37. Element Declarations An element declaration has this form:

There are four possible content types a) b) c) d)

EMPTY indicating that the element has no content. ANY meaning that any content is allowed Mixed content where the element contains text, and possible also child elements. A child declaration using a regular expression.

A mixed content element declaration has one of these forms:

The first of these means that the element can contain arbitrary text as a child node, but no further element nodes. The second form allows text interspersed with any of the child element nodes named in the declaration, in any order....