In a previous post, I showed a fairly simple regression analysis of housing prices and house size for my zip code. The zip code was used as an easy way to include location in the output. Using PostGIS and geographic data from the City of Sacramento, this post shows a regression analysis (using the R statistical programming language) based on the city’s designated neighborhoods. The raw real estate data comes from the Sacramento Bee. After describing the model, I’ll apply it to the last few months of home sales (not used in developing the model) and see how well it predicts results.
The model is based on the number of bedrooms, the size in square feet, and the neighborhood the house is in. Including the number of bathrooms did not improve the explanatory power of the model.
| Neighborhood | Estimate | Std. Error | t value | p-value | Significance |
|---|---|---|---|---|---|
| Cannon Industrial Park | −39,781 | 71,512 | −0.556 | 0.58 | |
| Central Oak Park | −4,059 | 51,335 | −0.079 | 0.94 | |
| East Del Paso Heights | −18,106 | 51,520 | −0.351 | 0.73 | |
| Florin Fruitridge Industrial Park | 1,188,225 | 87,657 | 13.555 | 0.00 | *** |
| Golf Course Terrace | 6,828 | 51,980 | 0.131 | 0.90 | |
| Midtown / Winn Park / Capital Avenue | 225,673 | 62,032 | 3.638 | 0.00 | *** |
| North City Farms | 35,728 | 53,016 | 0.674 | 0.50 | |
| North Oak Park | 26,439 | 52,270 | 0.506 | 0.61 | |
| Raley Industrial Park | 16,773 | 57,312 | 0.293 | 0.77 | |
| South City Farms | 21,527 | 55,405 | 0.389 | 0.70 | |
| South Land Park | 153,143 | 51,730 | 2.96 | 0.00 | ** |
| South Oak Park | −7,677 | 51,940 | −0.148 | 0.88 | |
| Tahoe Park East | 12,716 | 71,477 | 0.178 | 0.86 | |
| Tahoe Park South | 82,362 | 54,625 | 1.508 | 0.13 | |
| Upper Land Park | 270,586 | 71,548 | 3.782 | 0.00 | *** |
| Valley Hi / North Laguna | 21,471 | 50,755 | 0.423 | 0.67 | |
| West Del Paso Heights | −344 | 52,897 | −0.006 | 0.99 | |
| West Tahoe Park | 83,351 | 57,350 | 1.453 | 0.15 | |
Residual standard error: 71470 on 2693 degrees of freedom
Multiple R-squared: 0.5953, Adjusted R-squared: 0.5803
F-statistic: 39.61 on 100 and 2693 DF, p-value: < 2.2e-16
So what do all the numbers mean?
Size and location are the two most important factors in the data for determining house price. Not all locations have a statistically significant effect on the sales price. In fact, only 21 of the 99 locations are statistically significant, as marked by asterisks; the more asterisks, the more significant. The analysis relies on the city’s definition of neighborhoods, which differ radically in size and number of houses. This variance means that some of the statistically significant neighborhoods are significant only because a very small number of homes (fewer than 10) sold there. But some of the best-known neighborhoods (East Sacramento, Land Park, and the Pocket) both positively affect the price of a home and have a substantial number of sales.
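As a rough check on how the asterisks come about, the t value in each row of the table is the estimate divided by its standard error, and the p-value follows from the t value. The sketch below is in Python rather than R, and it rests on two assumptions: a normal approximation to the t distribution (reasonable with 2,693 degrees of freedom) and R's default significance codes. The function names are hypothetical.

```python
import math

def t_value(estimate, std_error):
    """t statistic: how many standard errors the estimate is from zero."""
    return estimate / std_error

def two_sided_p(t):
    """Two-sided p-value using a normal approximation (an assumption;
    R uses the exact t distribution)."""
    return math.erfc(abs(t) / math.sqrt(2))

def stars(p):
    """Significance codes as printed by R's default summary output."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    if p < 0.1:
        return "."
    return ""

# (estimate, standard error) copied from three rows of the table
rows = {
    "Midtown / Winn Park / Capital Avenue": (225_673, 62_032),
    "South Land Park": (153_143, 51_730),
    "Tahoe Park South": (82_362, 54_625),
}

for name, (estimate, std_error) in rows.items():
    t = t_value(estimate, std_error)
    p = two_sided_p(t)
    print(f"{name}: t = {t:.3f}, p = {p:.4f} {stars(p)}")
```

A neighborhood with a large estimate can still go unstarred if its standard error is large, which is why Tahoe Park South, at about 1.5 standard errors from zero, gets no asterisk.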
For every additional square foot of space, the price goes up by $87. That 87 is the coefficient by which the size of the house is multiplied, so for a 1400 square foot home, size adds $121,800 to the price. The standard error gives a range that the model supports. With a standard error of ±4, the size contribution for that house could vary between $116,200 (1400 × 83) and $127,400 (1400 × 91). To predict the value of another home with this model, multiply the values for bed, size, and neighborhood by their respective coefficients and add the intercept to get the mean value of the prediction.
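The size arithmetic can be checked in a few lines. This is a sketch in Python (not the R used to fit the model), using the rounded coefficient of 87 and standard error of 4 quoted in the text; the function name is hypothetical.

```python
def size_contribution(square_feet, coef=87, std_error=4):
    """Return the (low, mid, high) dollar contribution of house size,
    moving the size coefficient down and up by one standard error."""
    low = square_feet * (coef - std_error)
    mid = square_feet * coef
    high = square_feet * (coef + std_error)
    return low, mid, high

low, mid, high = size_contribution(1400)
print(f"low ${low:,}  mid ${mid:,}  high ${high:,}")
```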
One interesting part of the output is the coefficient for the number of bedrooms. A coefficient of −9,222 suggests that a home’s price could be improved simply by knocking down the walls that create bedrooms. This is clearly not the case. The regression produced a negative coefficient because the number of bedrooms is closely correlated with size, a situation called multicollinearity. While multicollinearity makes it hard to understand how each component contributes to the result, keeping the variable does improve an estimate or forecast based on the model.
Overall, the model is quite accurate. The adjusted R^2 of 0.5803 means that the model explains 58% of the variation in house prices. The remaining 42% of the variation comes from things the model does not capture, like lot size or quality.
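The adjusted figure can be reproduced from the other numbers in the regression output. A minimal sketch in Python, assuming the standard adjustment formula and taking the number of sales to be the residual degrees of freedom plus the 100 predictors plus one for the intercept (the function name is hypothetical):

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

residual_df = 2693    # from "Residual standard error ... on 2693 degrees of freedom"
n_predictors = 100    # from "F-statistic: 39.61 on 100 and 2693 DF"
n_obs = residual_df + n_predictors + 1   # 2794 sales, inferred rather than stated

adj = adjusted_r_squared(0.5953, n_obs, n_predictors)
print(round(adj, 4))  # 0.5803
```

The gap between 0.5953 and 0.5803 is the price of carrying roughly a hundred neighborhood terms.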
To demonstrate the model, I’ve picked a house, more or less at random, from the ones on sale on Yahoo’s real estate site. The house is 1482 square feet, with 3 bedrooms, and is in the East Sacramento neighborhood.
To translate these statistics into an estimate, use the formula:
Intercept Coefficient + # of bedrooms * bed coefficient + size coefficient * size + neighborhood value
11,775 + 3 * −9,222 + 1482 * 87 + 271,534 = 384,577
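The same calculation as a small function, sketched in Python with the rounded coefficients quoted in this post. The constant and function names are hypothetical, and a different neighborhood would need its own value from the coefficient table.

```python
INTERCEPT = 11_775
BED_COEF = -9_222
SIZE_COEF = 87
EAST_SACRAMENTO = 271_534   # neighborhood value used in the example

def predict_price(bedrooms, square_feet, neighborhood_value):
    """Point estimate: intercept plus each input times its coefficient."""
    return (INTERCEPT
            + bedrooms * BED_COEF
            + square_feet * SIZE_COEF
            + neighborhood_value)

print(predict_price(3, 1482, EAST_SACRAMENTO))  # 384577
```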
One important thing to understand about the output of a regression analysis is that while it is often expressed as a rather precise result, in this case $384,577, a better way to interpret the result is to use the standard errors of the coefficients to produce a range. The high end of the range is:
Intercept Coefficient + # of bedrooms * (bed coefficient + standard error) + size * (size coefficient + standard error) + (neighborhood value + standard error)
Or in numbers:
11,775 + 3 * (−9,222 + 2,365) + 1482 * (87 + 4) + (271,534 + 51,255) = 448,855
The low end of the range is calculated by subtracting the standard error from each coefficient:
11,775 + 3 * (−9,222 − 2,365) + 1482 * (87 − 4) + (271,534 − 51,255) = 320,299
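Both ends of the range can be computed together. The Python sketch below follows the approach described here: each coefficient is moved up or down by one standard error while the intercept is held fixed, since no standard error is quoted for it. The names are hypothetical, and this is a rough band rather than a formal prediction interval.

```python
INTERCEPT = 11_775   # held fixed, as in the calculations in the text

# Each term: (coefficient, standard error, value the coefficient multiplies)
TERMS = [
    (-9_222, 2_365, 3),       # bedrooms
    (87, 4, 1482),            # size in square feet
    (271_534, 51_255, 1),     # East Sacramento neighborhood
]

def price_range(intercept, terms):
    """Return (low, mid, high) by shifting every coefficient one standard
    error down, not at all, and up. Assumes all multipliers are >= 0."""
    low = intercept + sum((coef - se) * x for coef, se, x in terms)
    mid = intercept + sum(coef * x for coef, se, x in terms)
    high = intercept + sum((coef + se) * x for coef, se, x in terms)
    return low, mid, high

low, mid, high = price_range(INTERCEPT, TERMS)
print(f"low ${low:,}  mid ${mid:,}  high ${high:,}")
```

Because every term shifts by the same amount in each direction, the range is symmetric around the point estimate.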
So the model’s range is $320k to $448k, with a likely outcome of $384k. Zillow’s estimate for the property is $331,800, and it reports an asking price of $357,950. For this house, both are within the range, if on the low end. I think the Zillow Zestimate is a bit misleading, since they give only a number and not a range. In another post I will apply the model to July, August, and September data (the model data was from June 10 to June 11).