Assignment 2 PDF

Title Assignment 2
Author Breana Payne
Course Data Structures I
Institution DePaul University
Pages 2
File Size 90.3 KB
File Type PDF

Description

CSC367-Spring 2017, Assignment 2, Page 1 of 2

Assignment 2 (60 points)
Due Date: Thursday, April 27th, 2017, by midnight

Problem 1 (5 points): Suppose the data for the analysis includes the attribute age. The age values for the data tuples are (in increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
a. Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
b. Use z-score normalization to transform the value 35 for age.
c. Use normalization by decimal scaling to transform the value 35 for age.
d. Comment on which method you would prefer to use for the given data, giving reasons why.

Problem 2 (10 points): For a given dataset X of three-dimensional samples,
X = [{1,2,0}, {3,1,4}, {2,1,5}, {0,1,6}, {2,4,3}, {4,4,2}, {5,2,1}, {7,7,7}, {0,0,0}, {3,3,3}]
a. Find the outliers using the distance-based technique if
   i. the threshold distance is 4, and the threshold fraction p of non-neighbor samples is 3
   ii. the threshold distance is 6, and the threshold fraction p of non-neighbor samples is 2
b. Find the outliers based on the 1.5*IQR criterion for each dimension separately.
c. Interpret the differences between the two techniques used for outlier detection.

Problem 3 (5 points): If your data contains missing values, discuss the basic analyses you will perform and the corresponding decisions you will take in the preprocessing phase of the data mining process.

Problem 4 (40 points): The dataset stored under cpu_problem.xls (posted under the course documents) contains 8 attributes (6 predictive attributes, 2 non-predictive) used to predict the relative CPU performance (the ninth attribute in the dataset). The description of the attributes is as follows:
v1. Vendor name: 30 (adviser, amdahl, apollo, basf, bti, burroughs, c.r.d, cambex, cdc, dec, dg, formation, four-phase, gould, honeywell, hp, ibm, ipl, magnuson, microdata, nas, ncr, nixdorf, perkin-elmer, prime, siemens, sperry, sratus, wang)
v2. Model Name: many unique symbols
v3. MYCT: machine cycle time in nanoseconds (integer)
v4. MMIN: minimum main memory in kilobytes (integer)
v5. MMAX: maximum main memory in kilobytes (integer)
v6. CACH: cache memory in kilobytes (integer)
v7. CHMIN: minimum channels in units (integer)
v8. CHMAX: maximum channels in units (integer)
v9. PRP: published relative performance (integer)

The description can also be downloaded from http://archive.ics.uci.edu/ml/machine-learning-databases/cpuperformance/machine.names along with the dataset at http://archive.ics.uci.edu/ml/machine-learning-databases/cpuperformance/machine.data

a) Import the Excel file into SPSS and make sure that the types of the variables in SPSS match the types from the description of the attributes above. If they do not, use the Variable View to make any appropriate changes. Also add labels to your variables using the description above.
b) Perform a correlation analysis. Interpret the correlation matrix and summarize the relationships among the variables based on this analysis. Are any variables strongly correlated (correlation greater than 0.8)?
c) Using attributes v3 to v8, predict the CPU relative performance (v9: PRP) using a linear regression model. Use the forward selection approach to build your regression model. From your SPSS output, report and interpret the following:
   I. the standardized coefficients of the regression line
   II. the adjusted R² (adj-R2) of the final forward selection model
   III. the variables that were selected in the regression model
d) Perform principal component analysis on the data.
   I. How many components should be extracted in order to preserve 80% of the variance in the data?
   II. How many components should be extracted so that the error in the new representation is less than 10%?
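The three normalization methods named in Problem 1 can be sketched in Python as follows. This is a minimal illustration under the usual textbook definitions, not a prescribed solution. The function names are hypothetical, and the `sample` flag is an assumption added here because the assignment does not say whether the population or the sample standard deviation is intended.

```python
import math

# Age values from Problem 1, in increasing order.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]


def min_max(value, data, new_min=0.0, new_max=1.0):
    """Rescale value linearly so that min(data) -> new_min and max(data) -> new_max."""
    lo, hi = min(data), max(data)
    return (value - lo) / (hi - lo) * (new_max - new_min) + new_min


def z_score(value, data, sample=False):
    """Express value as a number of standard deviations from the mean.

    sample=False divides by n (population); sample=True divides by n - 1.
    Which one is expected is an assumption to check against the course notes.
    """
    n = len(data)
    mean = sum(data) / n
    divisor = n - 1 if sample else n
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / divisor)
    return (value - mean) / std


def decimal_scaling(value, data):
    """Divide by 10**j, where j is the smallest integer with max(|x|) / 10**j < 1."""
    max_abs = max(abs(x) for x in data)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return value / 10 ** j


if __name__ == "__main__":
    print("min-max:", min_max(35, ages))
    print("z-score (population):", z_score(35, ages))
    print("z-score (sample):", z_score(35, ages, sample=True))
    print("decimal scaling:", decimal_scaling(35, ages))
```

Min-max and decimal scaling depend only on the extreme values of the data, while the z-score depends on the mean and spread, which is one angle to consider when answering part d.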
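For Problem 4(d), the number of components to extract is usually read off the cumulative proportion of variance explained by the eigenvalues in the PCA output. The sketch below shows that arithmetic only. The helper name `components_for_variance` and the eigenvalues are hypothetical (they are not computed from cpu_problem.xls), and treating "error less than 10%" as "at least 90% of the variance retained" is an assumption about what the assignment intends.

```python
def components_for_variance(eigenvalues, target):
    """Return the smallest k such that the k largest eigenvalues
    account for at least `target` (a fraction in (0, 1]) of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k
    return len(ordered)


# Hypothetical eigenvalues for six standardized attributes.
# These are made-up numbers for illustration, NOT results from the CPU dataset.
example_eigenvalues = [3.0, 1.9, 0.6, 0.3, 0.15, 0.05]

k_80 = components_for_variance(example_eigenvalues, 0.80)  # preserve 80% variance
k_90 = components_for_variance(example_eigenvalues, 0.90)  # error under 10% (assumed reading)

if __name__ == "__main__":
    print("components for 80% variance:", k_80)
    print("components for 90% variance:", k_90)
```

With real output, the eigenvalues would come from the "Total Variance Explained" table that SPSS produces for the six predictive attributes.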

Submission Instructions
1. Answer the problems and write your answers in a Word document. For full credit on each problem, make sure that you explain each of your answers.
2. Submit your file online at http://d2l.depaul.edu and check your submission.
3. Keep a copy of all your submissions!
4. If you have questions about the homework, email me BEFORE the deadline.
5. Late submissions are allowed with a 5%, 10%, and 15% penalty for one, two, and three days late, respectively.
6. No late work will be accepted more than three days after the due date.
