ITECH1103 Question Pool

Title ITECH1103 question pool
Author Chang Wang
Course Big Data and Analytics
Institution Federation University Australia
Pages 13

Summary

ITECH1103 Big Data and Analytics summary...


Description


Table of Contents

Week 1
1 What is data redundancy, and which characteristics of the file system can lead to it?
2 What is data independence, and why is it lacking in file systems?
3 What is a DBMS and what are its functions?
4 Explain the difference between data and information. What is metadata in the context of a database system?

Week 5
1 What is Big Data?
2 Refer to the names of the characterisation/types (‘V’s) of Big Data and explain.
3 How is it different from traditional data?
4 Discuss the difference between structured and unstructured data with the help of examples.
5 Discuss various critical success factors of Big Data Analytics.

Week 6
1 What are the main data pre-processing steps? Briefly describe each step and provide relevant examples.
2 Categorical vs numeric data. Name and describe two sub-types and include an example for each.
3 Discuss various stages of the CRISP-DM (Cross-Industry Standard Process for Data Mining) process.
4 Discuss various stages of the SEMMA process.

Week 7
1 What is MapReduce?
2 What does it do?
3 How does it do it?
4 What is Hadoop?
5 Pros and cons of Hadoop
6 Parallel processing and its importance in Big Data

Week 8
1 Data warehouse
2 Data mining versus text mining
3 Web mining versus text mining
4 What is the role of natural language processing in text mining? Discuss the capabilities and limitations of NLP in the context of text mining.
5 List and discuss three prominent application areas for text mining.

Week 9
1 What is data visualization and why is it important?
2 List and discuss different benefits of data visualization.
3 List and discuss Tufte’s principles of data visualization.
4 Content-Driven Design

Week 10
1 What is the Internet of Things (IoT)?
2 Various characteristics of IoT; design characteristics of IoT
3 Discuss five (5) different system-level features of IoT with the help of examples.
4 Discuss four (4) different security challenges in IoT.

Week 11
1 Define big data governance and explain why it is important.
2 List and discuss three (3) components of data governance.
3 List and discuss, with the help of examples, three (3) benefits of big data governance.



Week 1

1 What is data redundancy, and which characteristics of the file system can lead to it?

Data redundancy exists when unnecessarily duplicated data are found in the database. For example, a customer's telephone number may be found in the customer file, in the sales agent file, and in the invoice file. Data redundancy is symptomatic of a (computer) file system, given its inability to represent and manage data relationships. Data redundancy may also be the result of poorly designed databases that allow the same data to be kept in different locations.

2 What is data independence, and why is it lacking in file systems?

File systems exhibit data dependence because file access is dependent on a file's data characteristics. Therefore, any time the file data characteristics are changed, the programs that access the data within those files must be modified. Data independence exists when changes in the data characteristics do not require changes in the programs that access those data.

3 What is a DBMS and what are its functions?

A DBMS is best described as a collection of programs that manage the database structure and that control shared access to the data in the database. Current DBMSs also store the relationships between the database components, and they take care of defining the required access paths to those components. The functions of a current-generation DBMS may be summarized as follows:

⚫ stores the definitions of data and their relationships (metadata) in a data dictionary; any changes made are automatically recorded in the data dictionary
⚫ creates and manages the complex structures required for data storage
⚫ transforms entered data to conform to those data structures
⚫ creates a security system and enforces security within that system
⚫ creates complex structures that allow multiple-user access to the data
⚫ performs backup and data recovery procedures to ensure data safety
⚫ promotes and enforces integrity rules to eliminate data integrity problems
⚫ provides access to the data via utility programs and programming language interfaces
⚫ provides end-user access to data within a computer network environment

4 Explain the difference between data and information. What is metadata in the context of a database system?

Data are raw facts. Information is data processed to reveal the meaning behind the facts. The following points summarize the distinction:

• Data constitute the building blocks of information.
• Information is produced by processing data.
• Information is used to reveal the meaning of data.
• Good, relevant, and timely information is the key to good decision making.
• Good decision making is the key to organizational survival in a global environment.

Metadata are data about data: in a database system, the definitions of the data elements and their relationships, stored by the DBMS in its data dictionary.
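As a minimal illustration of the data-versus-information distinction above, the sketch below processes raw sale records (data) into per-region totals (information). The region names and amounts are invented for the example.

```python
# Raw facts (data): individual sale records, meaningless in isolation.
sales = [
    {"region": "North", "amount": 1200.0},
    {"region": "South", "amount": 800.0},
    {"region": "North", "amount": 300.0},
]

# Processing the data reveals meaning (information): totals per region,
# which a decision maker can actually act on.
totals = {}
for sale in sales:
    totals[sale["region"]] = totals.get(sale["region"], 0.0) + sale["amount"]

print(totals)  # {'North': 1500.0, 'South': 800.0}
```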



Week 5

1 What is Big Data?

• “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”

2 Refer to the names of the characterisation/types (‘V’s) of Big Data and explain.

• High-Volume -- refers to the large magnitude of data (terabytes to petabytes)
• High-Velocity -- refers to the rate at which data are generated and the speed at which they should be analysed and acted upon; such data are generally generated from sensors, with a need for real-time analytics
• High-Variety -- refers to the structural heterogeneity in a dataset, i.e. data can be found in a variety of forms: structured, semi-structured, unstructured
• High-Veracity -- data can have uncertainty or be unreliable
• High-Variability (and complexity) -- based on variation in the data flow rates due to different sources; complexity refers to the fact that big data are generated through a myriad of sources
• High-Volatility -- how long the data will be available or stored
• High-Value -- value in terms of what information the data can provide, from data to information form

3 How is it different from traditional data?

Differences from traditional data:
• generally used a centralized database architecture, where large complex problems were solved by a single computer system and were hard to scale
• storage systems were only able to store smaller volumes of data (around a terabyte in size)
• generally structured, i.e. fixed formats or fields, or structured files complying with standard data types
• generally stored using fixed schemas; data normally static in nature, i.e. stored in a set form
• storage and exploration can be expressed in terms of relationships, as data volume is much smaller
• generally from more predictable, lower-velocity standard sources such as surveys and transactional data; unlikely to be video/audio streams
4 Discuss the difference between structured and unstructured data with the help of examples.

a. Structured data are computer-readable and usable. Data sourced from databases, spreadsheets, flat files, and other systems with fields or cells, rows, and columns are organized so that a computer can understand and use the data values. Examples include structured file formats such as CSV, JSON, and XML.

b. Unstructured data, in contrast, are just that, “unstructured,” meaning that they do not conform to data models and associated metadata. Computers have a harder time reading and understanding these data, which may be of varying lengths and content type. Unstructured data include pictures, audio recordings, and videos, although they commonly consist of blocks of text.

5 Discuss various critical success factors of Big Data Analytics.

• A clear business need -- address a business need such as solving a problem or seizing an opportunity
• Strong, committed sponsorship -- without solid sponsorship, it is difficult to succeed with any IT project
• Alignment between the business and analytics strategy -- it is important to make sure that big data analytics projects support the business strategy
• A fact-based decision-making culture -- decisions must be based on “the facts” (generated by analytics), and there should be constant experimentation to see what works best
• A strong data infrastructure -- when a strong data infrastructure is in place, applications can often be developed in days
• The right analytics tools -- data mining requires tools that incorporate algorithms and processes designed specifically to find hidden relationships in data
• Personnel with advanced analytical skills -- it is useful to consider a continuum of analytics users, anchored at one end by end users, with analysts in the middle and data scientists at the other end; each group requires different skills when it comes to working with big data, including a mixture of business, data, and analytics expertise
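The structured-versus-unstructured contrast above can be sketched in a few lines: a CSV block has a schema a program can read directly, while a free-text note requires interpretation to extract the same value. The names, phone numbers, and the crude regex pattern are all invented for the example.

```python
import csv
import io
import re

# Structured data: fixed fields; a program can look values up directly.
structured = "name,phone\nAlice,555-0100\nBob,555-0199\n"
rows = list(csv.DictReader(io.StringIO(structured)))
print(rows[0]["phone"])  # 555-0100

# Unstructured data: a free-text block with no schema; extracting the
# phone number needs interpretation (here, a crude pattern match).
unstructured = "Called Alice yesterday on 555-0100 about the late invoice."
match = re.search(r"\b\d{3}-\d{4}\b", unstructured)
print(match.group())  # 555-0100
```

The regex works only because the example text happens to contain a number in a known shape; real unstructured data is exactly what makes such extraction hard.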



Week 6

1 What are the main data pre-processing steps? Briefly describe each step and provide relevant examples.

1. Data Consolidation
• Collect data -- from different sources, e.g. surveys, sensors, etc.
• Select data -- go through the data and find what data is most relevant and useful
• Integrate data -- bring together the relevant data from the different sources

2. Data Cleaning
• Impute missing values -- employ (generally statistical) methods to replace the missing values
• Reduce noise in data -- depending on the problem domain, remove data points because they are extreme outliers, incorrect values (measurements), incorrectly tagged, etc.
• Eliminate inconsistencies or contradictions

3. Data Transformation
• Normalise data -- a number of forms/ways of scaling, e.g. min-max scaling to a set range such as 0.0–1.0:
  z_i = (x_i - min(x)) / (max(x) - min(x))
• Discretise/aggregate data -- dividing the range of a continuous attribute into intervals (binning)
• Construct new attributes -- to help the mining process

4. Data Reduction
• Reduce the number of variables -- feature reduction techniques, e.g. Principal Component Analysis (PCA)
• Reduce the number of cases
• Balance skewed data -- e.g. even out classes that have a disproportionate number of cases compared with classes that have fewer

2 Categorical vs numeric data. Name and describe two sub-types and include an example for each.

A. Numerical (quantitative) data
a) Interval --
  • Interval data lack an absolute zero point.
  • All quantitative attributes can be measured on interval scales.
  • Measurements in this category can be counted, ranked, added, or subtracted to take the difference.
  • The distances between each value on the interval scale are meaningful and equal.
  • You cannot calculate ratios.
  • Example: a measure in Fahrenheit, or in time
b) Ratio/continuous --
  • Ratio data have a defined zero point.
  • As an analyst you can say that a crime rate of 10% is twice that of 5%, or that annual sales of $2 million are 25% greater than annual sales of $1.6 million.
  • Ratio data can be transformed using logarithms, square roots, etc. to create ‘normal’ data.
  • Examples: income, height, weight, annual sales, market share, product defect rates, time to repurchase, unemployment rate, and crime rate

B. Categorical (discrete) data
a) Nominal -- a nominal value simply names something without assigning it an order in relation to other numbered objects or pieces of data, e.g. categories without order such as male/female or pass/fail.
b) Ordinal -- quantities with a natural ordering; values are ordered/ranked with meaning, but you cannot know with certainty whether the intervals between each value are equal, e.g. a scale of stars to rate a movie, from 0 (lowest) to 4 (highest).

3 Discuss various stages of the CRISP-DM (Cross-Industry Standard Process for Data Mining) process.

Business Understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.

Data Understanding
The data understanding phase starts with initial data collection and proceeds with activities that enable you to become familiar with the data, identify data quality problems, discover first insights into the data, and/or detect interesting subsets to form hypotheses regarding hidden information.

Data Preparation
The data preparation phase covers all activities needed to construct the final dataset [data that will be fed into the modelling tool(s)] from the initial raw data.

Model Building
In this phase, various modelling techniques are selected and applied, and their parameters are calibrated to optimal values.
Testing and Evaluation
Before proceeding to final deployment of the model, it is important to thoroughly evaluate it and review the steps executed to create it, to be certain the model properly achieves the business objectives. A key objective is to determine whether there is some important business issue that has not been sufficiently considered. At the end of this



phase, a decision on the use of the data mining results should be reached.

Deployment
Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use. However, even if the analyst carries out the deployment effort, it is important for the customer to understand up front what actions need to be taken in order to actually make use of the created models.

4 Discuss various stages of the SEMMA process.

SEMMA (Sample, Explore, Modify, Model, and Assess)
• Sample -- generate a representative sample of the data
• Explore -- visualisation and basic description of the data
• Modify -- select variables and transform variable representations
• Model -- use a variety of statistical and machine learning models
• Assess -- evaluate the accuracy and usefulness of the models
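The data transformation step described in Week 6 above (min-max normalisation and binning) can be sketched as follows; the age values and bin count are invented for the example.

```python
def min_max_normalise(values):
    """Min-max scaling: z_i = (x_i - min(x)) / (max(x) - min(x)), range 0.0-1.0."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def discretise(values, n_bins):
    """Binning: assign each value to one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Clamp the maximum value into the last bin rather than one past it.
    return [min(int((x - lo) / width), n_bins - 1) for x in values]

ages = [18, 25, 40, 62]
print(min_max_normalise(ages))  # [0.0, 0.159..., 0.5, 1.0]
print(discretise(ages, 2))      # [0, 0, 1, 1]
```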



Week 7

1. What is MapReduce?

MapReduce is the heart of Apache Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster, and it is the data processing layer of Hadoop. MapReduce is a programming model and an associated implementation for processing and generating large data sets.

2. What does it do?

It processes large structured and unstructured data stored in the Hadoop Distributed File System (HDFS). It processes a huge amount of data in parallel and enables automatic parallelization and distribution of large-scale computations.

3. How does it do it?

It does this by dividing the submitted job into a set of independent tasks (sub-jobs). It works by breaking the processing into two phases: Map and Reduce.

Map -- the first phase of processing, where we specify all the complex logic code. Every record/item in the original data is mapped to zero or more key-value pairs. Map is written by the user, takes an input pair, and produces a set of intermediate key/value pair...
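The Map and Reduce phases described above can be sketched in miniature with the classic word-count example; this single-process Python sketch only stands in for Hadoop's distributed implementation, and the input records are invented.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit zero or more intermediate key-value pairs per input record."""
    for word in record.lower().split():
        yield (word, 1)

def reduce_phase(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return (key, sum(values))

records = ["big data big insight", "big cluster"]

# Shuffle: group intermediate pairs by key (the Hadoop framework does
# this between the Map and Reduce phases).
groups = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        groups[key].append(value)

counts = dict(reduce_phase(k, vs) for k, vs in groups.items())
print(counts)  # {'big': 3, 'data': 1, 'insight': 1, 'cluster': 1}
```

Because each record is mapped independently and each key is reduced independently, both phases can run in parallel across many machines, which is the source of MapReduce's scalability.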

