Exam, answers

Course: Data Mining
Institution: Birla Institute of Technology and Science, Pilani



Description

SOLVED PAPERS OF DATA WAREHOUSING & DATA MINING (DEC-2013, JUNE-2014, DEC-2014 & JUNE-2015)

For Solved Question Papers of UGC-NET/GATE/SET/PGCET in Computer Science, visit http://victory4sure.weebly.com/

DATA WAREHOUSING & DATA MINING: SOLVED PAPER DEC-2013

1a. What is an operational data store (ODS)? Explain with a neat diagram. (08 Marks)
Ans: ODS (OPERATIONAL DATA STORE)
• An ODS is defined as a subject-oriented, integrated, volatile, current-valued data store, containing only detailed corporate data.
→ ODS is subject-oriented, i.e. it is organized around the main data-subjects of a company.
→ ODS is integrated, i.e. it is a collection of data from a variety of systems.
→ ODS is volatile, i.e. data changes frequently as new information refreshes the ODS.
→ ODS is current-valued, i.e. it is up-to-date & reflects the current status of information.
→ ODS is detailed, i.e. it is detailed enough to serve the needs of managers.
ODS DESIGN & IMPLEMENTATION
• The extraction of information from source databases should be efficient.
• The quality of the data should be maintained (Figure 8.1).
• Suitable checks are required to ensure the quality of data after each refresh.
• The ODS is required to
→ satisfy integrity constraints, e.g. existential integrity and referential integrity.
→ take appropriate actions to deal with null values.
• The ODS is a read-only database, i.e. users shouldn't be allowed to update its information.
• Populating an ODS involves an acquisition process of extracting, transforming & loading data from source systems. This process is called ETL (Extraction, Transformation and Loading); a minimal sketch follows this answer.
• Before an ODS can go online, the following 2 tasks must be completed:
i) Checking for anomalies &
ii) Testing for performance.

• Why should an ODS be separate from the operational databases?
Ans: Because, from time to time, complex queries are likely to degrade the performance of OLTP systems, and the OLTP systems have to provide a quick response to operational users. A business cannot afford to have response times suffer while a manager is running a complex query.
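As a sketch of the acquisition (ETL) process described above, the following Python snippet runs a tiny extract-transform-load pass against an in-memory SQLite database. The table and column names are purely hypothetical, chosen only to illustrate the three stages:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.execute("CREATE TABLE ods_customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "  alice ", "pilani"), (2, None, "DELHI")])

# Extraction: pull raw rows out of the (hypothetical) source system.
rows = conn.execute("SELECT id, name, city FROM customers").fetchall()

# Transformation: trim whitespace, standardize case, make NULLs explicit.
cleaned = [(cid,
            (name or "unknown").strip().title(),
            (city or "unknown").strip().title())
           for cid, name, city in rows]

# Loading: refresh the read-only ODS table with current-valued data.
conn.executemany("INSERT INTO ods_customers VALUES (?, ?, ?)", cleaned)
print(conn.execute("SELECT * FROM ods_customers").fetchall())
# -> [(1, 'Alice', 'Pilani'), (2, 'Unknown', 'Delhi')]
```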


1b. What is ETL? Explain the steps in ETL. (07 Marks)
Ans: ETL (EXTRACTION, TRANSFORMATION & LOADING)
• The ETL process consists of
→ data extraction from source systems,
→ data transformation, which includes data cleaning, &
→ data loading into the ODS or the data warehouse.
• Data cleaning deals with detecting & removing errors/inconsistencies from the data.
• Most often, the data is sourced from a variety of systems.
PROBLEMS TO BE SOLVED FOR BUILDING AN INTEGRATED DATABASE
1) Instance Identity Problem
• The same customer may be represented slightly differently in different source systems.
2) Data Errors
• Different types of data errors include:
i) There may be some missing attribute values.
ii) There may be duplicate records.
3) Record Linkage Problem
• This deals with the problem of linking information from different databases that relates to the same customer.
4) Semantic Integration Problem
• This deals with the integration of information found in heterogeneous OLTP & legacy sources. For example,
→ some of the sources may be relational,
→ some sources may be text documents, &
→ some data may be character strings or integers.
5) Data Integrity Problem
• This deals with issues like i) referential integrity, ii) null values & iii) domain of values.
STEPS IN DATA CLEANING (a minimal sketch follows this answer)
1) Parsing
• This involves
→ identifying the various components of the source files and
→ establishing the relationships between i) components of the source files & ii) fields in the target files.
• For example: identifying the various components of a person's name and address.
2) Correcting
• Correcting the identified components is based on sophisticated techniques using mathematical algorithms.
• Correcting may involve use of other related information that may be available in the company.
3) Standardizing
• Business rules of the company are used to transform the data to a standard form.
• For example, there might be rules on how a name and address are to be represented.
4) Matching
• Much of the data extracted from a number of source systems is likely to be related. Such data needs to be matched.
5) Consolidating
• All corrected, standardized and matched data can now be consolidated to build a single version of the company data.


1c. What are the guidelines for implementing a data warehouse? (05 Marks)
Ans: DW IMPLEMENTATION GUIDELINES
Build Incrementally
• Firstly, a data mart is built.
• Then, data marts for a number of other sections of the company are built.
• Then, the company data warehouse is implemented in an iterative manner.
• Finally, all data marts extract information from the data warehouse.
Need a Champion
• The project must have a champion who is willing to carry out considerable research into the following: i) expected costs & ii) benefits of the project.
• The project requires inputs from many departments in the company.
• Therefore, the project must be driven by someone who is capable of interacting with people in the company.
Senior Management Support
• The project calls for a sustained commitment from senior management due to
i) the resource-intensive nature of the project &
ii) the time the project can take to implement.
Ensure Quality
• The data warehouse should be loaded with i) only cleaned data & ii) only quality data.
Corporate Strategy
• The project must fit with i) corporate strategy & ii) business objectives.
Business Plan
• All stakeholders must have a clear understanding of i) the project plan, ii) financial costs & iii) expected benefits.
Training
• The users must be trained to i) use the data warehouse & ii) understand the capabilities of the data warehouse.
Adaptability
• The project should have built-in adaptability, so that changes may be made to the DW as & when required.
Joint Management
• The project must be managed by both i) IT professionals of the software company & ii) business professionals of the company.

2a. Distinguish between OLTP and OLAP. (04 Marks)
Ans:
• OLTP systems support day-to-day operations; OLAP systems support decision-making & analysis.
• OLTP data is current & detailed; OLAP data is historical & summarized.
• OLTP users are clerks & operational staff; OLAP users are managers & analysts.
• OLTP workloads are short, simple read/write transactions touching a few records; OLAP workloads are complex, mostly read-only queries touching millions of records.


2b. Explain the operations on the data cube with suitable examples. (08 Marks)
Ans: ROLL-UP
• This is like zooming out on the data cube (Figure 2.1a).
• This is required when the user needs further abstraction or less detail.
• Initially, the location hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.

Figure 2.1a: Roll-up operation

DRILL-DOWN
• This is like zooming in on the data (Figure 2.1b).
• This is the reverse of roll-up.
• This is an appropriate operation
→ when the user needs further details,
→ when the user wants to partition more finely, or
→ when the user wants to focus on particular values of certain dimensions.
• This adds more detail to the data.
• Initially, the time hierarchy was "day < month < quarter < year".
• On drilling down, the time dimension is descended from the level of quarter to the level of month.

Figure 2.1b: Drill-down operation

PIVOT (OR ROTATE)
• This is used when the user wishes to re-orient the view of the data cube (Figure 2.1c).
• This may involve
→ swapping the rows and columns, or
→ moving one of the row dimensions into the column dimension.

Figure 2.1c: Pivot operation

SLICE & DICE
• These are operations for browsing the data in the cube.
• These operations provide the ability to look at information from different viewpoints.
• A slice is a subset of the cube corresponding to a single value for one or more members of the dimensions (Figure 2.1d).
• A dice operation is done by performing a selection on two or more dimensions (Figure 2.1e). A small sketch of all these cube operations follows Figure 2.1e.

Figure 2.1d: Slice operation

Figure 2.1e: Dice operation
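The cube operations above can be sketched on a toy data set with pandas; this is illustrative only, with a relational DataFrame standing in for a real OLAP engine and all dimension values and sales figures hypothetical:

```python
import pandas as pd

# A tiny sales cube: dimensions = (city, state, quarter); measure = sales.
df = pd.DataFrame({
    "city":    ["Pilani", "Jaipur", "Pilani", "Mumbai"],
    "state":   ["RJ", "RJ", "RJ", "MH"],
    "quarter": ["Q1", "Q1", "Q2", "Q1"],
    "sales":   [10, 20, 15, 40],
})

# Roll-up: aggregate from the city level up to the state level.
print(df.groupby("state")["sales"].sum())

# Drill-down goes the other way, adding detail, e.g. (state, quarter).
print(df.groupby(["state", "quarter"])["sales"].sum())

# Slice: fix a single value of one dimension (quarter = Q1).
print(df[df["quarter"] == "Q1"])

# Dice: select on two or more dimensions at once.
print(df[(df["quarter"] == "Q1") & (df["state"] == "RJ")])

# Pivot: re-orient the view, swapping row and column dimensions.
print(df.pivot_table(values="sales", index="state", columns="quarter",
                     aggfunc="sum"))
```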


2c. Write short notes on: (08 Marks)
i) ROLAP ii) MOLAP iii) FASMI iv) DATA CUBE
Ans: (i) For the answer, refer Solved Paper June-2014 Q.No.2b.
(ii) For the answer, refer Solved Paper June-2014 Q.No.2b.
(iii) For the answer, refer Solved Paper June-2015 Q.No.2a.
(iv) For the answer, refer Solved Paper Dec-2014 Q.No.2a.

3a. Discuss the tasks of data mining with suitable examples. (10 Marks)
Ans: DATA MINING
• Data mining is the process of automatically discovering useful information in large data repositories.
DATA-MINING TASKS
1) Predictive Modeling
• This refers to the task of building a model for the target variable as a function of the explanatory variables.
• The goal is to learn a model that minimizes the error between
i) the predicted values of the target variable and
ii) the true values of the target variable (Figure 3.1).
• There are 2 types (a small sketch follows Figure 3.1):
i) Classification is used for discrete target variables.
Ex: predicting whether a web user will make a purchase at an online bookstore is a classification task.
ii) Regression is used for continuous target variables.
Ex: forecasting the future price of a stock is a regression task.

Figure 3.1: Four core tasks of data-mining
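A minimal sketch of the classification case, assuming scikit-learn is available; the features (pages viewed, minutes on site) and the training data are hypothetical:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy classification task: predict from (pages_viewed, minutes_on_site)
# whether a web user makes a purchase. Discrete target -> classification.
X = [[1, 2], [8, 15], [2, 3], [10, 20], [1, 1], [9, 12]]
y = [0, 1, 0, 1, 0, 1]

model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[7, 14]]))  # likely [1]: predicted to purchase
```

Regression works the same way with a continuous target (e.g. a stock price) and a regressor in place of the classifier.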


2) Association Analysis
• This is used to find groups of data that have related functionality.
• The goal is to extract the most interesting patterns in an efficient manner.
• Ex: market-basket analysis. We may discover the rule {Diapers} -> {Milk}, which suggests that customers who buy diapers also tend to buy milk.
• Useful applications:
i) finding groups of genes that have related functionality;
ii) identifying web pages that are accessed together.
3) Cluster Analysis
• This seeks to find groups of closely related observations, so that observations belonging to the same cluster are more similar to each other than to observations belonging to other clusters.
• Useful applications:
i) grouping sets of related customers;
ii) finding areas of the ocean that have a significant impact on Earth's climate.
• For example, in the collection of news articles in Table 1.2,
→ the first 4 rows speak about the economy &
→ the last 2 rows speak about the health sector.

4) Anomaly Detection
• This is the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are known as anomalies.
• The goal is to
i) discover the real anomalies &
ii) avoid falsely labeling normal objects as anomalous.
• Useful applications: i) detection of fraud & ii) network intrusions. (A small sketch follows.)
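As an illustrative sketch, one simple approach (among many) is a z-score detector that flags observations far from the mean in units of the standard deviation; the data and the threshold of 2 are arbitrary choices for the example:

```python
import numpy as np

x = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 25.0])   # 25.0 is the anomaly
z = np.abs((x - x.mean()) / x.std())                # distance from the mean
print(x[z > 2])                                     # -> [25.]
```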


3b. Explain briefly any five data pre-processing approaches. (10 Marks)
Ans: DATA PRE-PROCESSING
• Data pre-processing is a data-mining technique that involves transforming raw data into an understandable format.
Q: Why is data pre-processing required?
• Data is often collected for unspecified applications.
• Data may have quality problems that need to be addressed before applying a DM technique, for example: 1) noise & outliers, 2) missing values & 3) duplicate data.
• Therefore, pre-processing may be needed to make data more suitable for data mining.
DATA PRE-PROCESSING APPROACHES
1. Aggregation
2. Dimensionality reduction
3. Variable transformation
4. Sampling
5. Feature subset selection
6. Discretization & binarization
7. Feature creation
1) AGGREGATION
• This refers to combining 2 or more attributes into a single attribute. For example, merging daily sales figures to obtain monthly sales figures.
• Purpose:
1) Data reduction: smaller data sets require less processing time & less memory.
2) Aggregation can act as a change of scale by providing a high-level view of the data instead of a low-level view, e.g. cities aggregated into districts, states, countries, etc.
3) More "stable" data: aggregated data tends to have less variability.
• Disadvantage: the potential loss of interesting details.
2) DIMENSIONALITY REDUCTION
• Key benefit: many DM algorithms work better if the dimensionality is lower.
Curse of Dimensionality
• Data analysis becomes much harder as the dimensionality of the data increases.
• As a result, we get i) reduced classification accuracy & ii) poor-quality clusters.
Purpose
• Avoid the curse of dimensionality.
• May help to i) eliminate irrelevant features & ii) reduce noise.
• Allow the data to be more easily visualized.
• Reduce the amount of time and memory required by DM algorithms.
3) VARIABLE TRANSFORMATION
• This refers to a transformation that is applied to all the values of a variable. Ex: converting a floating-point value to an absolute value.
• The two types are:
1) Simple Functions
• A simple mathematical function is applied to each value individually.
• For example, if x is a variable, then transformations may be e^x, 1/x or log(x).
2) Normalization (or Standardization)
• The goal is to make an entire set of values have a particular property.
• If x̄ is the mean of the attribute values and s_x is their standard deviation, then the transformation x' = (x - x̄) / s_x creates a new variable that has a mean of 0 and a standard deviation of 1. (A small sketch follows.)
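A one-line check of the standardization formula above, using NumPy (note that np.std computes the population standard deviation by default):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
x_std = (x - x.mean()) / x.std()          # x' = (x - mean) / std
print(x_std.mean(), x_std.std())          # -> 0.0 1.0 (up to rounding)
```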

4) SAMPLING
• This is a method used for selecting a subset of the data objects to be analyzed.
• This is used for i) preliminary investigation of the data & ii) final data analysis.
• Q: Why sampling? Ans: Obtaining & processing the entire set of "data of interest" is too expensive or time-consuming.
• Three sampling methods (a small sketch of sampling and binning follows this answer):
i) Simple Random Sampling
• There is an equal probability of selecting any particular object.
• There are 2 types:
a) Sampling without replacement: as each object is selected, it is removed from the population.
b) Sampling with replacement: objects are not removed from the population as they are selected for the sample, so the same object can be picked more than once.
ii) Stratified Sampling
• This starts with pre-specified groups of objects.
• Equal numbers of objects are drawn from each group.
iii) Progressive Sampling
• This method starts with a small sample, and then increases the sample size until a sample of sufficient size has been obtained.
5) FEATURE SUBSET SELECTION
• To reduce the dimensionality, use only a subset of the features.
• Two types of features can be eliminated:
1) Redundant features duplicate much or all of the information contained in one or more other attributes. Ex: the price of a product and the amount of sales tax paid.
2) Irrelevant features contain almost no useful information for the DM task at hand. Ex: a student's USN is irrelevant to the task of predicting the student's marks.
• Three techniques:
1) Embedded approaches: feature selection occurs naturally as part of the DM algorithm.
2) Filter approaches: features are selected before the DM algorithm is run.
3) Wrapper approaches: use the DM algorithm as a black box to find the best subset of attributes.
6) DISCRETIZATION AND BINARIZATION
• Classification algorithms require the data to be in the form of categorical attributes.
• Association-analysis algorithms require the data to be in the form of binary attributes.
• Transforming a continuous attribute into a categorical attribute is called discretization, and transforming continuous & discrete attributes into binary attributes is called binarization.
• The discretization process involves 2 subtasks:
i) deciding how many categories to have and
ii) determining how to map the values of the continuous attribute to these categories.
7) FEATURE CREATION
• This creates new attributes that can capture the important information in a data set much more efficiently than the original attributes.
• Three general methods:
1) Feature extraction: creation of a new set of features from the original raw data.
2) Mapping data to a new space: a totally different view of the data can reveal important and interesting features.
3) Feature construction: combining features to get better features than the original ones.
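A minimal NumPy sketch of simple random sampling (with and without replacement) and of discretization by binning; the population, sample size and bin edges are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(100)

# Simple random sampling without replacement: each object drawn at most once.
s1 = rng.choice(data, size=10, replace=False)

# Sampling with replacement: the same object can be picked more than once.
s2 = rng.choice(data, size=10, replace=True)

# Discretization: bin a continuous attribute into 3 categories.
bins = np.digitize(data, bins=[33, 66])   # 0 = low, 1 = mid, 2 = high
print(s1, s2, np.unique(bins))
```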

4a. Develop the Apriori algorithm for generating frequent itemsets. (08 Marks)
Ans: APRIORI ALGORITHM FOR GENERATING FREQUENT ITEMSETS
• Let Ck = the set of candidate k-itemsets, and Fk = the set of frequent k-itemsets.
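The algorithm body can be summarized as the following Python sketch of the standard Apriori generate-and-count loop; the numbered comments are keyed to the steps referenced in the explanation that follows. Representing itemsets as frozensets and generating candidates by a self-join of F(k-1) with subset pruning are implementation choices for this sketch, not mandated by the answer:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Frequent-itemset generation; minsup is an absolute support count."""
    # Steps 1 & 2: a single pass over the data set gives the frequent
    # 1-itemsets F1.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Fk = {c for c, n in counts.items() if n >= minsup}
    frequent, k = set(Fk), 2
    while Fk:
        # Step 5 (apriori-gen): join F(k-1) with itself, then prune any
        # candidate that has an infrequent (k-1)-subset.
        Ck = {a | b for a in Fk for b in Fk if len(a | b) == k}
        Ck = {c for c in Ck
              if all(frozenset(s) in Fk for s in combinations(c, k - 1))}
        # Steps 6-10: an additional pass over the data set; subset(Ck, t)
        # is the set of candidates contained in transaction t.
        support = {c: 0 for c in Ck}
        for t in transactions:
            items = set(t)
            for c in Ck:
                if c <= items:
                    support[c] += 1
        # Step 12: eliminate candidates whose support count is below minsup.
        Fk = {c for c, n in support.items() if n >= minsup}
        frequent |= Fk
        k += 1   # the loop terminates when no new frequent itemsets appear
    return frequent

# With minsup = 2: {a}, {b}, {c}, {a,b} and {a,c} come out frequent.
T = [{'a', 'b'}, {'a', 'b', 'c'}, {'a', 'c'}, {'b'}]
print(sorted(tuple(sorted(s)) for s in apriori(T, minsup=2)))
```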

• The algorithm initially makes a single pass over the data set to determine the support of each item. After this step, the set of all frequent 1-itemsets, F1, will be known (steps 1 & 2).
• Next, the algorithm iteratively generates new candidate k-itemsets using the frequent (k-1)-itemsets found in the previous iteration (step 5). Candidate generation is implemented using a function called apriori-gen.
• To count the support of the candidates, the algorithm needs to make an additional pass over the data set (steps 6-10). The subset function is used to determine all the candidate itemsets in Ck that are contained in each transaction t.
• After counting their supports, the algorithm eliminates all candidate itemsets whose support counts are less than minsup (step 12).
• The algorithm terminates when no new frequent itemsets are generated.

4b. What is association analysis? (04 Marks)
Ans: ASSOCIATION ANALYSIS
• This is used to find groups of data that have related functionality.
• The goal is to extract the most interesting patterns in an efficient manner.
• Ex: market-basket analysis. We may discover the rule {Diapers} -> {Milk}, which suggests that customers who buy diapers also tend to buy milk.
• Useful applications:
i) finding groups of genes that have related functionality;
ii) identifying web pages that are accessed together.


4c. Consider the transaction data set:

Construct the FP-tree by showing the trees separately after reading each transaction. (08 Marks)
Ans:

Procedure:
1. A scan of the transaction data set T derives a list of frequent items, ⟨(a:8), (b:5), (c:3), (d:1), ...⟩, in which items are ordered in descending order of frequency.
2. Then the root of a tree is created and labeled "null". The FP-tree is constructed as follows:
(a) The scan of the first transaction leads to the construction of the first branch of the tree: ⟨(a:1), (b:1)⟩ (Figure 6.24i). The frequent items in the transaction are listed according to the order in the list of frequent items.
(b) For the third transaction (Figure 6.24iii):
→ since its (ordered) frequent-item list a, c, d, e shares a common prefix a with the existing path,
→ the count of each node along the prefix is incremented by 1, and
→ three new nodes are created: (c:1) linked as a child of (a:2), (d:1) as a child of (c:1), and (e:1) as a child of (d:1).
(c) For the seventh transaction, since its frequent-item list contains only the single item a, which shares only the node a with the existing a-prefix subtree, a's count is incremented by 1.
(d) The above process is repeated for all the transactions. (A minimal sketch of this construction follows.)
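A minimal Python sketch of this construction, using toy transactions chosen only to illustrate prefix sharing (a real FP-growth implementation also maintains node-links and a header table for mining, omitted here):

```python
class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}          # item -> FPNode

def build_fp_tree(transactions, minsup):
    # Pass 1: count items; keep the frequent ones, most frequent first.
    counts = {}
    for t in transactions:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    order = [i for i, n in sorted(counts.items(), key=lambda kv: -kv[1])
             if n >= minsup]
    root = FPNode(None, None)       # the root of the tree is labeled "null"
    # Pass 2: insert each transaction, items in descending-frequency order.
    for t in transactions:
        node = root
        for item in [i for i in order if i in t]:
            if item not in node.children:    # new branch off the prefix
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1         # shared prefixes just bump the counts
    return root

def show(node, depth=0):
    for child in node.children.values():
        print('  ' * depth + f'({child.item}:{child.count})')
        show(child, depth + 1)

# Hypothetical toy transactions, just to show prefix sharing under 'a':
T = [['a', 'b'], ['a', 'c', 'd'], ['a'], ['b', 'c']]
show(build_fp_tree(T, minsup=1))
```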



