KDD process steps

Course: DATA ANALYTICS
Institution: University of Surrey



Description

Knowledge Discovery in Databases (KDD) takes the form of a roadmap. The term KDD refers to the broad process of finding knowledge in data and emphasises the "high-level" application of particular data mining methods. The unifying goal of the KDD process is to extract knowledge from data in the context of large databases.

Step 1: Problem specification

The purpose of this phase is to move from a loosely defined problem description to a tightly defined problem specification.

The outputs should be a list of resource requirements, identification of the high- and low-level tasks to be undertaken, a data dictionary, a determination of the project's feasibility, and a travel log that will be updated during the process for each operation performed.

Inputs: The loosely defined problem description

Travel log: A travel log must be initiated and continuously updated throughout the process. Each operation performed will be time-stamped.

Preliminary database examination (a short sketch of these checks follows the list):

• Number of records
• Number of fields
• Check for missing values
• Accessibility of the database
• Noise level determination (noisy data is data with a large amount of additional meaningless information in it, called noise. The term has often been used as a synonym for corrupt data. It also includes any data that cannot be understood and interpreted correctly by machines, such as unstructured text)
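A minimal sketch of these examination checks, assuming a hypothetical customers.csv table inspected with pandas in Python (the file name and columns are assumptions, not part of the original notes):

import pandas as pd

df = pd.read_csv("customers.csv")        # accessibility of the database: can the data be read at all?

print("Number of records:", len(df))     # number of records
print("Number of fields:", df.shape[1])  # number of fields

print("Missing values per field:")
print(df.isna().sum())                   # check for missing values

# Rough noise indicator: text fields whose values only partly parse as numbers.
for col in df.select_dtypes(include="object").columns:
    parsed = pd.to_numeric(df[col], errors="coerce")
    if parsed.notna().any() and parsed.isna().any():
        print(col, ":", int(parsed.isna().sum()), "values could not be parsed as numbers")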



Database familiarisation (sketched after the list):

• Field types
• Distributions of records (through visualisation)
• Data dictionary preparation
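A short familiarisation sketch on the same hypothetical table: field types, distributions of records through visualisation, and a first cut of a data dictionary:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")            # the same hypothetical table as above

print(df.dtypes)                             # field types

# Distributions of records through visualisation: one histogram per numeric field.
df.select_dtypes(include="number").hist(bins=30, figsize=(10, 6))
plt.tight_layout()
plt.show()

# A first cut of a data dictionary: field type, an example value and the missing count per field.
data_dictionary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "example": df.iloc[0],
    "missing": df.isna().sum(),
})
print(data_dictionary)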



High level task determination (Prediction, Description, or both)

Low level task determination: Classification, Association rules, Clustering, etc.



Software and hardware requirements:

o Hardware
o Database software
o Spreadsheet software
o Software to support pre-processing operations
o Software to support the data mining algorithms

Feasibility determination (in terms of data reliability, system performance, personnel, size of database, regions of interest, low-level task feasibility and cost)

Results of this phase:

• The feasibility of the project has been determined
• A detailed problem specification has been produced

Step 2: Resourcing

The main activity within this phase is to gather all the resources that are necessary for the project. The list of resource requirements from Step 1 is taken as input. The output from this phase is an operational database, i.e. a complete and consistent database. The main resource, which is also usually the most time-consuming to gather, is the data itself.

Operations performed:

• Cost determination:
  o Personnel
  o Software and hardware
  o The cost of data purchasing/collection
• Time determination
• Schedules should be created
• All software and hardware resources should be made accessible
• Data format control: all data has been derived from a single source, so values for each attribute are probably in the same format (a simple format check is sketched after this list)
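A small sketch of such a format check, assuming a hypothetical date attribute signup_date in the same customers.csv table and a single expected date format:

import pandas as pd

df = pd.read_csv("customers.csv")    # hypothetical operational database extract

# Values for the attribute should all follow one expected format (here YYYY-MM-DD).
parsed = pd.to_datetime(df["signup_date"], format="%Y-%m-%d", errors="coerce")
bad = df.loc[parsed.isna() & df["signup_date"].notna(), "signup_date"]

if bad.empty:
    print("signup_date: all values share the expected format")
else:
    print("signup_date:", len(bad), "values deviate from the expected format")
    print(bad.unique()[:10])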

Step 3: Data cleansing

The goal is to prepare the data for subsequent phases that involve learning. However, learning is never performed within this phase. The output is a cleansed operational database of high-quality data.

Operations performed:

• Removal of errors – outlier handling
• Dealing with missing values (via removal of missing data or via missing data estimation)
• Balancing (via data deletion or via data duplication). Definition: imbalanced data sets are a special case of the classification problem where the class distribution is not uniform among the classes; typically they are composed of two classes, the majority (negative) class and the minority (positive) class
• Random sampling: partitioning the data into a training and a testing subset. The data mining algorithms to be used at a later stage will be applied to the training subset during the Data mining step. The testing subset will be used for Step 6, "Evaluation of results". (A combined sketch of these cleansing operations follows this list.)
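A combined sketch of these cleansing operations, assuming a hypothetical numeric feature income and a binary class label churn, and using scikit-learn only for the random train/test split:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers.csv")

# Outlier handling: drop records more than three standard deviations from the mean
# (records with a missing income are kept so they can be estimated below).
z = (df["income"] - df["income"].mean()) / df["income"].std()
df = df[~(z.abs() > 3)].copy()

# Missing data estimation: fill remaining gaps in the feature with the column median.
df["income"] = df["income"].fillna(df["income"].median())

# Balancing via data duplication: oversample the minority (positive) class.
majority = df[df["churn"] == 0]
minority = df[df["churn"] == 1]
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, minority_upsampled])

# Random sampling: partition the data into training and testing subsets.
train, test = train_test_split(balanced, test_size=0.3, random_state=0)

Balancing by duplication is shown here; balancing by data deletion would instead down-sample the majority class.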

Data reduction operations performed at this stage are as follows (sketched after the list):

o Dimension reduction: through correlation analysis
o Generalisation: low-level concepts are replaced with higher-level concepts
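A sketch of the two data reduction operations, applied to the training subset from the previous sketch; the 0.9 correlation threshold, the age field and the age bands are assumptions:

import pandas as pd

# `train` is the training subset produced in the cleansing sketch above.
numeric = train.select_dtypes(include="number")
corr = numeric.corr().abs()

# Dimension reduction through correlation analysis: drop one feature from every
# highly correlated pair.
to_drop = set()
cols = list(corr.columns)
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if corr.iloc[i, j] > 0.9:
            to_drop.add(cols[j])

train_reduced = train.drop(columns=list(to_drop))

# Generalisation: replace a low-level concept (exact age) with a higher-level one (age band).
train_reduced["age_band"] = pd.cut(train_reduced["age"], bins=[0, 18, 35, 60, 120],
                                   labels=["minor", "young adult", "middle-aged", "senior"])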

Step 4: Pre-processing

Techniques applied here may yield much more effective models. This is the first phase of the process within which learning may occur. The three operations below are sketched together after the list.



• Feature construction: applying a set of constructing operators to a set of existing database features to construct one or more new features

• Feature subset selection: can only be applied to the training dataset. The aim is the production of a powerful subset by reducing the number of attributes in the database

• Discretisation: can only be applied to the training dataset
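A combined sketch of the three pre-processing operations on the hypothetical training and testing subsets from the cleansing step; the field names (total_spend, visits, income, churn) and the number of bins are assumptions:

import pandas as pd

# Feature construction: combine existing fields into a new feature.
train["spend_per_visit"] = train["total_spend"] / train["visits"].clip(lower=1)
test["spend_per_visit"] = test["total_spend"] / test["visits"].clip(lower=1)

# Feature subset selection, on the training dataset only: keep the numeric
# features most correlated with the class label.
corr_with_label = train.select_dtypes(include="number").corr()["churn"].abs()
selected = list(corr_with_label.sort_values(ascending=False).index[1:6])  # top five after the label itself

# Discretisation, learned on the training dataset only and then reused for testing.
bins = pd.qcut(train["income"], q=4, retbins=True, duplicates="drop")[1]
train["income_band"] = pd.cut(train["income"], bins=bins, include_lowest=True)
test["income_band"] = pd.cut(test["income"], bins=bins, include_lowest=True)

Learning the selection and the bin edges on the training subset only avoids leaking information from the testing subset into the model.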

Step 5: Data mining

Data mining is the discovery of interesting, unexpected or valuable structures in large datasets. It consists of the application of the selected algorithms and subsequent parameter optimisation to obtain the best possible model. The initial estimation of model fit is assessed on the training data. The data mining algorithm(s) to be used need to be determined, if not already determined. The selection is made based on the project's data mining tasks. For each data mining task to be performed, there is a wide range of algorithms available, and each algorithm will typically have its own strengths and weaknesses. The Algorithm parameters and the Problem parameters of each algorithm must be set before the algorithm is executed.
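A minimal sketch of this step for a classification task, using a decision tree from scikit-learn as one of many possible algorithms; the parameter values and feature names are assumptions carried over from the earlier sketches:

from sklearn.tree import DecisionTreeClassifier

# `train` is the training subset from the cleansing step.
features = ["income", "visits", "spend_per_visit"]
X_train, y_train = train[features], train["churn"]

# Algorithm parameters must be set before the algorithm is executed; these
# particular values are assumptions, to be tuned during parameter optimisation.
model = DecisionTreeClassifier(max_depth=5, min_samples_leaf=20, random_state=0)
model.fit(X_train, y_train)

print("Initial estimation of model fit (training data):", model.score(X_train, y_train))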

This phase's primary purpose is to determine whether the discovered knowledge is worthy of closer scrutiny. If this is found to be the case, then no more work needs to be done within this phase. However, such an outcome at the first attempt is extremely rare. Reasons why discovered knowledge could be found to be unusable:

• Being overly complex
• Being at an unsuitable level of accuracy
• Being of insufficient quality

This process repeats until satisfactory results are produced.

Step 6: Evaluation of results

The input is the discovered knowledge produced during the Data mining phase. Within this phase, the testing data subset is used for the evaluation of the discovered knowledge within the following areas:



• Performance on the testing dataset (see the sketch after this list)
• Simplicity: depends on the end-user (e.g. if the discovered knowledge will be presented to domain experts, the level of simplicity would not be of major concern)
• Application area suitability (e.g. the discovered knowledge is expected to have a useful marketing application and is therefore considered suitable)
• Generality: determining whether the generality level of the discovered knowledge is acceptable
• Visualisation: a potentially useful evaluation tool
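A short sketch of evaluating performance on the testing dataset, reusing the hypothetical model, features and test subset from the earlier sketches:

from sklearn.metrics import accuracy_score, classification_report

# `model`, `test` and `features` come from the earlier sketches; the label
# name "churn" remains an assumption.
X_test, y_test = test[features], test["churn"]
predictions = model.predict(X_test)

print("Performance on testing dataset:", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))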

Step 7: Interpretation of results

Evaluation is performed by domain experts, who examine how closely the discovered knowledge/patterns match the existing knowledge and how they could be used alongside it. Some additional testing may be performed by the client on specific interesting areas of the report and on previously unknown knowledge.

Step 8: Exploitation of results

Upon the client's decision, the knowledge that has been discovered in the previous steps will be put into practice. In this case, a cost-benefit analysis will be performed in order to ensure that risk is minimised and potential benefit is maximised. The KDD process undertaken during the project may be integrated within the organisation and become automated with the use of the existing travel log, in order to support future projects to be carried out.

