The Sacramento Bee has a home sale database for looking up the prices of home sales. I was a little disappointed that it only lists the sales and doesn't do any Zwillow-style statistical analysis, so I dumped the data into R and started playing around with it. Home price versus size versus number of bedrooms is one of the standard textbook examples for multiple regression, so I thought it would be pretty easy. It turns out that replicating the Zwillow estimate with real-world data is harder than it looks. So I'm going to start with the smallest model that gives decent results and build it up over a series of posts. Below is a regression of Price on Size for home sales in the 95834 zip code.
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -3683.955  11305.372  -0.326    0.745
Size          100.959      7.226  13.971   <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 41770 on 165 degrees of freedom
Multiple R-squared: 0.5419, Adjusted R-squared: 0.5391
F-statistic: 195.2 on 1 and 165 DF, p-value: < 2.2e-16
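Basically, the above says that each additional square foot adds about $101 to the price. The intercept of about -$3,700 is not statistically distinguishable from zero (p = 0.745), so as a rule of thumb a 2,000 square foot house should cost roughly $200,000 (about $198,000 if you include the intercept). Overall, the model indicates that size accounts for 54% of the variation in price. The residual standard error of about $41,800 means an individual sale can still land tens of thousands of dollars away from the fitted line. Adding the number of bedrooms, the number of bathrooms, or a condo vs. standalone house indicator does not add explanatory power here, even though in most textbook examples it does. Next time I'm going to try to post a model that lets me expand beyond one zip code and increase the explanatory power.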
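To make the arithmetic concrete, here is a minimal sketch of how the fitted line turns a size into a price. It is written in Python rather than R purely for illustration, and it is not the code used to fit the model. The two coefficients are copied from the regression output above; the function name `predicted_price` and the list of example sizes are my own assumptions.

```python
# Coefficients copied from the regression output above.
INTERCEPT = -3683.955  # dollars
SLOPE = 100.959        # dollars per square foot


def predicted_price(size_sqft: float) -> float:
    """Point estimate of the sale price from the fitted line (hypothetical helper)."""
    return INTERCEPT + SLOPE * size_sqft


if __name__ == "__main__":
    # Example sizes are arbitrary and chosen only for illustration.
    for size in (1000, 1500, 2000, 2500):
        print(f"{size:>5} sq ft -> ${predicted_price(size):,.0f}")
```

These are point estimates only. With a residual standard error of about $41,770, any single sale can differ from the fitted value by a wide margin.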