MAT 240 Module 4 Project One PDF

Title MAT 240 Module 4 Project One
Author Malykah Sillo
Course Applied Statistics
Institution Southern New Hampshire University
Pages 9

Summary

Applied stats course work module 4...


Description

Median Housing Price Prediction Model for D. M. Pan National Real Estate Company

Report: Median Housing Price Prediction Model for D. M. Pan National Real Estate Company
Maleka Monono
Southern New Hampshire University


Median Housing Price Model for D. M. Pan National Real Estate Company


Introduction

This report predicts the median listing price from the median square footage, using linear regression, graphs, tables, and a random sample to produce the analysis. Linear regression is most appropriate for describing the strength and direction of a linear relationship between two quantitative variables, for example when determining how much the square footage affects the listing price. One variable (the predictor variable) is used to make the prediction, and the other (the response variable) is the one we are trying to predict. The median square feet is our predictor and the median listing price is our response. To estimate the median listing price, we need the median square footage.

Data Collection

I obtained the sample data by using the =RAND() function in Excel. I started by inserting a column to the left of the column marked ‘Region’. Next, I entered the function =RAND() into the first cell of the inserted column. I then copied and pasted the function down to the last cell in the data set. Then, I used the sort function to randomize the data. Finally, I selected the first 50 regions. Response variable, Y = Median Listing Price. Predictor variable, X = Median Square Feet.
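The sampling steps described above can be sketched in Python. This is only an illustration of the same idea (attach a random number to each row, sort by it, keep the first 50), not the spreadsheet procedure itself. The function name, the seed, and the 1,000 placeholder region names are assumptions made for the example.

```python
import random

def draw_random_sample(regions, sample_size=50, seed=None):
    """Attach a random key to each row, sort by the key, keep the first rows."""
    rng = random.Random(seed)
    keyed = [(rng.random(), region) for region in regions]  # like a helper column of =RAND()
    keyed.sort(key=lambda pair: pair[0])                     # like sorting the sheet by that column
    return [region for _, region in keyed[:sample_size]]    # like selecting the first 50 rows

# Hypothetical stand-in for the 'Region' column; the real data set is not reproduced here
regions = [f"Region {i}" for i in range(1, 1001)]
sample = draw_random_sample(regions, sample_size=50, seed=240)
print(len(sample), len(set(sample)))
```

Because every row receives its own random key, each region has the same chance of landing in the first 50 rows, and no region can be picked twice.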


Data Analysis

To use linear regression on the data set, several requirements must be met. Checking them starts with a scatterplot (sometimes a residual plot is also needed for further inquiry). The first requirement is that the variables have a linear relationship. This can be judged from the scatterplot: if the relationship is truly linear, the points follow a straight line whose slope does not change across the range of X. Next, we check whether the errors are normally distributed. We can assess this by looking at a histogram of the residuals. If the histogram is roughly bell-shaped with no obvious pattern, the errors can be treated as normally distributed. We also look for equality of variance. If the dots on the plot do not curve or fan out into a megaphone pattern, the data have equal variance. Independence of the residuals depends on the type of data. If it is cross-sectional data, independence is “assumed”. My data set sample meets all requirements.
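As a rough illustration of the equal-variance check, the sketch below computes residuals against a given line and compares their spread for smaller and larger square footage. The six data points are invented for the example, and the spread ratio is an informal screen rather than a formal test. The slope and intercept are the ones used in the report's prediction formula.

```python
from statistics import pstdev

def residuals(xs, ys, slope, intercept):
    """Observed value minus the value predicted by the line, for each point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

def spread_ratio(xs, res):
    """Residual spread in the larger-x half divided by the spread in the smaller-x half.

    A ratio far from 1 hints at a fan or megaphone shape.
    """
    ordered = [r for _, r in sorted(zip(xs, res))]
    half = len(ordered) // 2
    return pstdev(ordered[half:]) / pstdev(ordered[:half])

# Hypothetical (square feet, listing price) pairs, invented for illustration
sqft = [1500, 1700, 1900, 2100, 2300, 2500]
price = [190000, 220000, 290000, 320000, 385000, 415000]

res = residuals(sqft, price, slope=242.88, intercept=-182801)
print([round(r) for r in res])
print(round(spread_ratio(sqft, res), 2))
```

Plotting the residuals against square footage remains the usual way to make this judgment; the ratio only summarizes what the plot would show.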

When looking at a linear regression involving two variables, the response variable is the variable being modeled or predicted, while the predictor variable is the variable used to predict the response. The response variable will respond to a change in the predictor variable. In a scatterplot, the predictor variable is on the X axis and the response variable is on the Y axis. In the case of this data set, the response variable is median listing price and the predictor variable is median square footage.


For the first histogram, median square feet has a unimodal, symmetric shape. Its single peak (modality) is in the 1,839 to 2,169 square foot range. The data has a center of 2,043, which falls within that range. The data has a small spread (variation/standard deviation). From this we can interpret that most of the data comes from a square footage around 1,839 to 2,169. The standard deviation of this data set is relatively small, 349.28…, and the data is concentrated close to the mean. This produces little variation.

The second histogram, median listing price, has a unimodal, right-skewed distribution. Its single peak (modality) is in the $234k to $374k range. The data has a center (median) of $283,575, which falls within that range. The data has a small spread (variation/standard deviation). This data set has outliers between $654k and $794k that differ greatly from the rest of the data. Keep in mind that these outliers can greatly affect the mean. From this we can interpret that most of the data is listed at $94k to $514k. The standard deviation of the data set is $145,381, which is relatively small, and the data is concentrated close to the mean, producing little variation.

Overall, we can say that the majority of the listings are between $94k and $514k with a square footage between 1,509 and 2,499, based on the shape, center, spread and outliers. Also, this data set contains little variance.


The overall characteristics of the national population and the national sample are relatively close. The histograms of median listing price for the national population and the national sample are both right skewed with outliers or gaps. The histograms of median square footage for the national population and the national sample are both unimodal and symmetrical. The only difference is that the population histograms have a greater spread, and therefore greater variation, because the population contains more data than the sample. The summary statistics of the national population and the national sample suggest that the national sample could be used to represent the national population. The differences in the measures of center/spread for the median listing price between the national population and the national sample are: Mean: sample is 5.5% more than population, Median: sample is 10.4% more than population, Standard deviation: sample is 11.3% less than population. The differences in the measures of center/spread for the median square footage between the national population and the national sample are: Mean: sample is 3.2% more than population, Median: sample is 7.5% more than population, Standard deviation: sample is 4.9% less than population. From this analysis it’s easy to see that there is not much difference between the national sample and the national population. The difference in standard deviations (for both median square footage and median listing price) confirms that the spread in the population is greater than in the sample (greater variation).
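The percentage comparisons above follow the usual relative-difference formula, (sample - population) / population x 100. A minimal sketch is shown below. The function name is an assumption, and the input numbers are placeholders chosen only to show the arithmetic; they are not the report's summary statistics.

```python
def percent_difference(sample_value, population_value):
    """Percent by which the sample statistic is above (+) or below (-) the population statistic."""
    return (sample_value - population_value) / population_value * 100

# Placeholder numbers only; they are not the report's summary statistics
print(round(percent_difference(105.5, 100.0), 1))  # 5.5
print(round(percent_difference(88.7, 100.0), 1))   # -11.3
```

A positive result corresponds to "sample is more than population" and a negative result to "sample is less than population".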

The Regression Model


A regression model can be developed for the data set. Regression models are used to describe and make predictions about the relationship between independent and dependent variables. The analysis of this data set is being used to evaluate square footage as a benchmark for listing prices on homes. To develop a regression model for this data set, you must create a regression equation, which consists of the dependent variable (Y), the independent variable (X), the coefficient (slope) and the intercept. All of these can be determined using the Data Analysis tool in Excel and choosing Regression. The scatter plot depicts a moderate positive linear correlation. It also shows 3 outliers with high leverage that could be throwing off the trendline and heavily influencing the mean of the data. Removing the three outliers doesn’t change much on my graph; it simply removes the points without noticeably changing my trendline or correlation. The summary statistics show only minor changes in the measures of center/spread once the outliers are removed. Overall, whether I keep or remove the three outliers, they do not heavily influence my data analysis.
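For readers who want to see what the regression tool computes, the sketch below fits a least-squares line by hand and refits it after dropping one point. The data are hypothetical and deliberately built so that the dropped point matters; in the report's own sample, removing the outliers changed little. The function name and all numbers are assumptions for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical data: five points exactly on y = 200x - 100000, plus one far-off point
sqft = [1500, 1800, 2000, 2200, 2500, 3400]
price = [200000, 260000, 300000, 340000, 400000, 790000]

with_outlier = fit_line(sqft, price)
without_outlier = fit_line(sqft[:-1], price[:-1])
print("with outlier:   ", with_outlier)
print("without outlier:", without_outlier)
```

Comparing the two fitted lines is one way to decide whether a high-leverage point is actually influential: if the slope and intercept barely move, keeping or dropping the point makes little practical difference.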


The R-value (the correlation coefficient, labeled Multiple R in Excel) is 0.584. I got this value by using Regression under Data Analysis in Excel. This value measures the strength of the linear correlation. It is neither close to 1 (strong positive correlation) nor close to 0 (little or no linear correlation); it’s in the middle. Therefore, the correlation is moderate, as I stated previously.
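The sketch below shows one way the correlation coefficient can be computed from paired data, and how it relates to R-squared in a one-predictor regression (R-squared is the square of R). The function name and the three-point check are assumptions for illustration; only the value 0.584 comes from the report.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Sanity check on made-up, perfectly linear data: r should be 1
print(pearson_r([1, 2, 3], [2, 4, 6]))

# Squaring the report's R-value gives the share of variability explained
r = 0.584
print(round(r ** 2, 3))  # 0.341
```

The squared value is close to the R-squared of 0.340 that Excel reports; the small gap is presumably due to rounding of R.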

The regression equation is Y = 242.88X - 182,801, where X is the median square footage and Y is the median listing price. The slope is 242.88 (dollars per square foot) and the intercept is -182,801 (dollars). Therefore, for every additional square foot, the predicted median listing price increases by about $242.88. R-squared is 0.340. I came to this value by using Regression under Data Analysis in Excel. This means that about 34% of the variability in the median listing price can be explained by the regression model. Based on my analysis I will be making two predictions. If I wanted to find the median listing price of a home with 2,222 sq ft (Y = 242.88*2222 - 182801), I would get a listing price of about $356,878. If I wanted to find the median listing price of a home with 1,967 sq ft (Y = 242.88*1967 - 182801), I would get a listing price of about $294,944.
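The two predictions can be reproduced with a few lines of Python. The slope and intercept are taken from the formula used in the predictions above (Y = 242.88*X - 182801); the function and constant names are assumptions for illustration.

```python
SLOPE = 242.88       # dollars per additional square foot, from the report's formula
INTERCEPT = -182801  # dollars, from the report's formula

def predict_price(square_feet):
    """Predicted median listing price for a given median square footage."""
    return SLOPE * square_feet + INTERCEPT

for sqft in (2222, 1967):
    print(sqft, round(predict_price(sqft), 2))
```

Both inputs sit close to the center of the sample's square footage, which is where a fitted line is most trustworthy; predictions far outside the sampled range would be extrapolation.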


Conclusions

This analysis shows that this sample can be used to represent the national population. Based on the summary statistics, histograms and scatterplot, the national population differs only slightly from the sample. The regression equation can be used to predict future listing prices of homes based on their square footage. This sample aligns perfectly with my expectations based on the scatterplot. In this sample, outliers were not a problem at all. Perhaps in a different sample with larger outliers (ones that greatly affect the summary statistics) they would affect the trendline and correlation of the scatterplot. For follow-up research, it would be interesting to look at median dollars per square foot to see if it would affect the median listing price....

