Data Flow and Data Warehouse - Notes PDF

Title	Data Flow and Data Warehouse - Notes
Course	Data Management
Institution	University of Auckland
Pages	10
File Size	746 KB
File Type	PDF
Total Downloads	31
Total Views	132

Preview

CLICK TO PREVIEW PDF

Summary

Data flow and data warehouse course content notes covering eveything you need to learn for data flow and data warehouse course work...

Description

04. Data Flow and Data Warehouse Contents 1.

Introduction...................................................................................................................................................2

2.

Data flow.......................................................................................................................................................2

3.

Data warehouse.............................................................................................................................................3 3.1

Characteristics.......................................................................................................................................3

3.2

Compare data warehouse with operational/transactional databases.......................................................4

3.3

DW design and structure........................................................................................................................5

3.3.1

Multidimensional model................................................................................................................5

3.3.2

Logical DW design: star schema....................................................................................................6

3.4

DW architecture.....................................................................................................................................8

3.5

ETL........................................................................................................................................................9

4.

Example data warehouse...............................................................................................................................9

5.

Document change history............................................................................................................................10

Page 1 of 10

1.

Introduction

Organizations today are facing increasingly complex challenges in terms of management and problem solving in order to achieve their operational goals. This situation compels people in those organizations to utilize analysis tools that can better support their decisions. Business intelligence comprises a collection of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information for decision making. Business intelligence and decision-support systems provide assistance to managers at various organizational levels for analysing strategic information. These systems collect vast amounts of data and reduce them to a form that can be used to analyse organizational behaviour. This data transformation comprises a set of tasks that take the data from the sources and, through extraction, transformation, integration, and cleansing processes, store the data in a common repository called a data warehouse. Data warehouses have been developed and deployed as an integral part of decision support systems to provide an infrastructure that enables users to obtain efficient and accurate responses to complex queries. Although data warehouses are, in the end, a particular kind of databases, there are significant differences between the development of operational databases and data warehouses (see 1.2.2 for more details below). A key one is the fact that data in a warehouse are extracted from several source systems. Thus, data must be taken from these sources, transformed to fit the data warehouse model, and loaded into the data warehouse. This process is called extraction, transformation, and loading (ETL). Refernce: Vaisman, A., & Zimányi, E. (2014). Data warehouse systems. Data-Centric Systems and Applications. Springer-Verlag Berlin Heidelberg 2014.

2.

Data flow

Digital data follows a life cycle. We started the course by discussing phases of a generic life cycle. As data moves from phase to phase it must be managed. One aspect of management is the “movement” (or flow) of data itself.

Any transmission or receipt of data can be considered a data flow. Data flows exist for numerous reasons; we will only explore flows required to move data from typical (tabular) transaction formats to a structure suitable for analysis. Such a flow involves extracting transactional data from several sources, transforming it to suit the destination’s structure, then loading the transformed data into the destination. Data sources are varied. Data can be sourced from: 

(Text) Files o Flat tabular formats such as DSV are commonly used for small data dumps or exports. o JSON and XML formats are commonly used for flexibly-structured data.



Databases o Relational: manipulated using SQL or SQL-like logic. o Non-relational: often based on JSON or XML-like structures. “Streams” o From sensors, e.g. in an IoT system.



Page 2 of 10

o

“Live data feeds”, e.g. on a platform such as Twitter, or YouTube

In the context of business analytics, we are interested in managing data flows from these sources so the data can be transformed for analysis (for business value). Other contexts may require data flows for operational, legal, UX, or any number of reasons.

3.

Data warehouse

The goals of business analytics, business intelligence, business data analysis, reporting, data mining, online analytical processing, predictive analytics, etc. are the same: gain (more) value from data by making better business decisions. A data warehouse (DW) is a database that is designed for analysis. It is a read-only database meant for decision analysis consisting of time stamped operational and external data. A classic data warehouse definition characterizes a data warehouse as a collection of subject-oriented, integrated, time-varying and nonvolatile data to support management decisions. 3.1 Characteristics  1) Subject Oriented Data warehouse are organized around major subjects, such as customer, product, sales. It focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. Thus, data warehouse provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process. 

2) Integrated

Data warehouse are constructed by integrating multiple, heterogeneous data sources such as relational databases, flat files, on-line transaction records. Data cleaning and data integration techniques are applied in order to ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources. Thus, once data resides in the data warehouse, it has a single physical corporate image. Integration is the most important of all the aspects of a data warehouse.

Page 3 of 10

Source: W. H. INMON. Building the Data Warehouse. New York: Wiley, 2002. v. 3rd ed



3) Time variant

The time horizon for the data warehouse is significantly longer than that of operational systems. o o

Operational database: current value data. Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

Thus, a data warehouse keeps track of how its data has evolved over time, e.g., to know the evolution of sales over the last months or years. 

4) Nonvolatile

The DW contains a physically separate store of data transformed from the operational environment. Operational update of data does not occur in the data warehouse environment. Thus, the DW does not require transaction processing, recovery, and concurrency control mechanisms. Data warehouse requires only two operations in data accessing: o o

initial loading of data access of data.

3.2 Compare data warehouse with operational/transactional databases Since the DW is analysis-oriented, its structure and the way it operates will typically differ from databases designed for transactional data (which are transaction– or operation-oriented). A transactional database (or OLTP systems, online transactional process) is: o

Highly tuned Page 4 of 10

o

Real time Data

o o o

Detailed records Current values Accesses small amounts of data in a predictable manner

A DW (or OLAP systems, online analytical processing) is: o o o o

Flexible access Consistent timing Summarized as appropriate Historical

o

Access large amounts of data in unexpected ways

3.3 DW design and structure 3.3.1 Multidimensional model Data warehouses and OLAP systems are based on the multidimensional model, which views data in an ndimensional space, usually called a data cube. A data cube is defined by dimensions and facts. Dimensions are perspectives used to analyse the data. A dimension level represents the granularity, or level of detail, at which measures are represented for each dimension of the cube. These levels are hierarchical and allow analysts to view and summarize the data at several levels of detail. The hierarchical structure of a dimension is called the dimension schema. Instances of a dimension are called members. Dimensions also have associated attributes describing them. On the other hand, the cells of a data cube, or facts, often have associated numeric values, called measures. These measures are used to evaluate quantitatively various aspects of the analysis at hand. Each measure in a cube is associated with an aggregation function that combines several measure values into a single one. Aggregation of measures takes place when one changes the level of detail at which data in a cube are visualized. This is performed by traversing the hierarchies of the dimensions. The multidimensional model allows people to view data from multiple perspectives (dimensions) and at several levels of detail. The OLAP operations allow these perspectives and levels of detail to be materialized by exploiting the dimensions and their hierarchies, thus providing an interactive data analysis environment. Some OLAP operations include: o o o o

Roll-up operation: from a lower level to a higher level Drill-down operation: from a higher level to a lower level Sort: sorting along axis Pivot: rotate the axes of the cube without changing granularities

o

Slice and dice: obtain subcubes with the same dimensional levels.

Page 5 of 10

For example, the data cube in the above figure is based on three dimensions (Product, Time, Customer) and a measure (Quantity). We can use this cube to analyze sales figures. Sale figures are aggregated to the dimension levels (Category, Quarter, City) respectively. Seafood and Beverages are members of the Product dimension at the Category level. Thus, in the Product dimension, products are grouped in categories, indicating a hierarchical structure of this dimension. The Product dimension could contain attributes such as ProductNumber and UnitPrice, which are not shown in the figure. Each number shown in a cell of the data cube represents a measure Quantity, indicating the number of units sold by category, quarter, and customer’s city. The example above uses the aggregated function SUM. Thus, if we use the Customer hierarchy for changing the granularity level from City to Country, the sales figures for all customers in the same country will be aggregated using summation. Source: W. H. INMON. Building the Data Warehouse. New York: Wiley, 2002. v. 3rd ed 3.3.2 Logical DW design: star schema From a methodological point of view, DW must be designed in a similar way as operational databases, following the steps of conceptual, logical and physical design. However, there is still no widely accepted conceptual model for data warehouse applications. Thus, data warehouse design is usually performed at the logical level, At the logical level, one possible relational representation of the multidimensional model is based on the star schema, where there is one central fact table, and a set of dimension tables, one for each dimension. Star schema is a typical structure of DW.

Page 6 of 10

•

Facts are numeric measurements (values) that represent a specific business aspect or activity

•

Dimensions qualify characteristics that provide additional perspectives to a given fact

•

Fact and dimension tables are normally represented by physical tables in the data warehouse

•

Fact table has many-to-one (M:1) relationship to each dimension and they are subject to primary/foreign key constraints Examples of a DW star schema structure with 3 dimensions is in the image below.

Page 7 of 10

3.4 DW architecture A typical (traditional) DW architecture has several components other than the central database (which is the DW itself). The architecture consists of several tiers:  Data sources: feed data into DW from operational databases and other internal and external data sources.  Back-end tier (Data staging area): an intermediate database (staging tables and intermediate tables) where all the data integration and transformation processes are run prior to the loading of the data into the data warehouse.  Data warehouse tier: the DW itself, also called the Enterprise Data Warehouse (EDW), and/or several data marts and a metadata repository storing information about the DW and its content.  OLAP tier: contain OLAP server; provide a multidimensional view of the data, regardless of the actual way in which data are stored in the underlying system.  Front-end tier: contain client tools for users to exploit the contents of the DW. Data analysis and visualization tools such as OLAP tools, reporting tools, statistical tools and data mining tools.

A data mart is a subset of a data warehouse oriented to a specific business line. It is a specialized data warehouse targeted toward a particular functional or departmental area in an organization. For example, a sales department may only need sales data, while a human resources department may need demographic data and data about the employees. These departmental data warehouses are called data marts. However, these data marts are not necessarily private to a department; they may be shared with other interested parts of the

Page 8 of 10

organization. A data mart can be seen as a small, local data warehouse. Data in a data mart can be either derived from an enterprise data warehouse or collected directly from data sources. 3.5 ETL Data warehousing is a business analytics approach which involves acquiring and analysing transactional data (historical facts). To enable this analysis, the data warehouse (DW) must be populated with suitable data. Population involves a process called ETL: extract transactional data from multiple sources, then transform that data, and load the transformed data into the DW. ETL is a logical process. What physically takes place may look more like “ETTTETLLTTETL”. ETL also does not prescribe where the transformations take place.

Extracting the data from source systems (SAP, ERP, other operational systems), data from different source systems is identified and extracted. Transforming the data may involve the following tasks: o

Applying business rules (e.g., calculating new measures and dimensions)

o o o

Cleaning (e.g., mapping NULL to 0 or "Male" to "M" and "Female" to "F" etc.) Filtering (e.g., selecting only certain columns to load) Joining together data from multiple sources (e.g., lookup, merge)

Loading the data into a data warehouse or other reporting applications.

4.

Example data warehouse

A DW with the “star” structure below has been created in SQL Server. It has also been populated with some sample data. You can obtain a copy of this DW on Canvas: 04-DemoDW_Data.zip. A demonstration of how to use this DW with a simple analysis tool will be shown in class.

Page 9 of 10

5.

Document change history

v1.1 2020/10/03  Update material  References: o W. H. INMON. Building the Data Warehouse. New York: Wiley, 2002. v. 3rd ed o Vaisman, A., & Zimányi, E. (2014). Data warehouse systems. Data-Centric Systems and Applications. Springer-Verlag Berlin Heidelberg 2014. v1.0 2019/05/22  Initial release

Page 10 of 10...