# loading packages
library(readxl)
library(ggplot2)
library(tidyverse)
library(stargazer)
# loading data. this is a large file
data<-data.frame(read_excel('housesample25.xlsx')) %>%
  na.omit() %>% # dropping rows with missing values
  mutate(rand=runif(n(),min=0,max=1)) # uniform random number used to select samples; no seed is set, so the samples differ on every run
My goal is to demonstrate three things:
I will demonstrate the first point by estimating regressions that sequentially add independent variables. Our goal is to explain the sales price of a house. The full model I will estimate is: \(\text{salesprice2011}=\beta_0+\beta_1\,\text{bedrooms}+\beta_2\,\text{fullbaths}+\beta_3\,\text{partialbaths}+\beta_4\,\text{age}+\beta_5\,\text{abvgroundsqft}+\epsilon\).
# estimating the regressions
reg1<-lm(salesprice2011~bedrooms, data=data)
reg2<-lm(salesprice2011~bedrooms+fullbaths+partialbaths, data=data)
reg3<-lm(salesprice2011~bedrooms+fullbaths+partialbaths+age, data=data)
reg4<-lm(salesprice2011~bedrooms+fullbaths+partialbaths+age+abvgroundsqft, data=data)
# combining the results
stargazer(reg1,reg2,reg3,reg4,
          type='text', # text output
          omit.stat = c('f'), # omitting f statistic to save space
          digits = 2) # rounding to two digits
##
## ======================================================================================================================
##                                                             Dependent variable:
##                      -------------------------------------------------------------------------------------------------
##                                                               salesprice2011
##                      (1)                      (2)                      (3)                      (4)
## ----------------------------------------------------------------------------------------------------------------------
## bedrooms             94,146.12***             19,724.24***             19,722.14***             -10,115.30***
##                      (822.88)                 (814.00)                 (812.54)                 (694.85)
##
## fullbaths                                     104,817.00***            108,195.20***            37,390.28***
##                                               (810.44)                 (862.08)                 (872.43)
##
## partialbaths                                  77,138.89***             78,665.63***             19,168.72***
##                                               (1,032.75)               (1,039.65)               (950.49)
##
## age                                                                    218.79***                -203.27***
##                                                                        (19.29)                  (15.96)
##
## abvgroundsqft                                                                                   134.47***
##                                                                                                 (0.99)
##
## Constant             -121,677.50***           -98,779.65***            -112,660.20***           -73,981.78***
##                      (2,739.88)               (2,181.98)               (2,498.41)               (2,046.79)
##
## ----------------------------------------------------------------------------------------------------------------------
## Observations         35,488                   35,488                   35,488                   35,488
## R2                   0.27                     0.55                     0.55                     0.71
## Adjusted R2          0.27                     0.55                     0.55                     0.71
## Residual Std. Error  121,093.90 (df = 35486)  94,816.12 (df = 35484)   94,646.09 (df = 35483)   76,782.93 (df = 35482)
## ======================================================================================================================
## Note:                                                                                      *p<0.1; **p<0.05; ***p<0.01
The estimates in column 1 are from a model that only includes bedrooms as an independent variable. The \(R^2\) from this regression is 0.27, which means the variable bedrooms explains 27% of the variation in sales price. On average, sales price increases by $94,146 when the number of bedrooms increases by one. This estimate is implausibly large because we are not holding constant any other housing characteristics. Homes with more bedrooms are likely larger (sqft), have more bathrooms, and are on larger lots. All of these characteristics are associated with a higher sales price, so the bedrooms coefficient picks up their effects as well.
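The mechanism in the paragraph above can be illustrated with a minimal simulation. This is only a sketch on made-up data (not the housing file): the variable names, the true coefficients, and the link between square footage and bedrooms are all assumptions chosen for illustration.

```r
# Simulated data (hypothetical): larger homes tend to have more bedrooms
set.seed(1)
n <- 5000
sqft <- rnorm(n, mean = 2000, sd = 500)
bedrooms <- round(1 + sqft / 700 + rnorm(n, sd = 0.7))
# Assumed "true" model: one more bedroom adds 5,000 holding sqft constant
price <- 50000 + 5000 * bedrooms + 120 * sqft + rnorm(n, sd = 40000)
sim <- data.frame(price, bedrooms, sqft)

short <- lm(price ~ bedrooms, data = sim)         # sqft is left in the error term
long  <- lm(price ~ bedrooms + sqft, data = sim)  # sqft is held constant

b_short <- unname(coef(short)["bedrooms"])
b_long  <- unname(coef(long)["bedrooms"])
c(short = b_short, long = b_long)
```

In this setup the bedrooms coefficient from the short regression should be far larger than the one from the long regression, because it also absorbs the effect of square footage.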
The estimates in column 2 are from a model that includes full and partial bathrooms in addition to bedrooms. Notice the impact of bedrooms is much smaller now: $19,724 per additional bedroom compared to $94,146. This coefficient changed because we are now holding constant the number of bathrooms when we add an additional bedroom. The \(R^2\) for this regression is 0.55, meaning the variables bedrooms, fullbaths, and partialbaths explain 55% of the variation in sales price. We can say fullbaths and partialbaths together explain an additional 55%-27%=28 percentage points of the variation in sales price beyond what bedrooms alone explained.
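The incremental \(R^2\) calculation can be done directly from two fitted models. The sketch below uses simulated data with hypothetical coefficients; the object names `m1`, `m2`, and `gain` are illustrative and do not appear elsewhere in these notes.

```r
# Simulated data (hypothetical): bathrooms are related to bedrooms and to price
set.seed(2)
n <- 2000
bedrooms <- rpois(n, 3) + 1
fullbaths <- pmax(1, round(bedrooms / 2 + rnorm(n, sd = 0.5)))
price <- 20000 * bedrooms + 60000 * fullbaths + rnorm(n, sd = 50000)
sim <- data.frame(price, bedrooms, fullbaths)

m1 <- lm(price ~ bedrooms, data = sim)              # restricted model
m2 <- lm(price ~ bedrooms + fullbaths, data = sim)  # adds fullbaths

r2_m1 <- summary(m1)$r.squared
r2_m2 <- summary(m2)$r.squared
gain <- r2_m2 - r2_m1   # share of variation explained beyond m1
gain

anova(m1, m2)           # F test of whether the added variable improves the fit
```

For nested OLS models like these, the \(R^2\) can never fall when a variable is added, so `gain` is always at least zero.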
Notice that the standard error for bedrooms decreased slightly when we added fullbaths and partialbaths to the model. Recall that standard errors decrease when the residual variance decreases. When we only included bedrooms in column 1, all other variables that impact sales price were in the error term. When we pull those variables out of the error term, and when those variables explain additional variation in the dependent variable, we reduce the residual variance. Notice that the residual standard error fell from 121,094 to 94,816 when we added fullbaths and partialbaths to the regression. This is how adding variables to a regression can improve the precision of our estimates for the other variables.
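The standard textbook expression for an OLS standard error, assuming homoskedastic errors, makes the two competing forces explicit. The symbols \(SST_j\) and \(R_j^2\) are notation introduced here for illustration:

\[
\widehat{se}(\hat{\beta}_j)=\frac{\hat{\sigma}}{\sqrt{SST_j\,(1-R_j^2)}}
\]

Here \(\hat{\sigma}\) is the residual standard error, \(SST_j=\sum_i (x_{ij}-\bar{x}_j)^2\) is the total variation in regressor \(j\), and \(R_j^2\) is the \(R^2\) from regressing regressor \(j\) on the other independent variables. Adding a useful variable lowers \(\hat{\sigma}\), which shrinks the standard error, but if the new variable is correlated with regressor \(j\) it also raises \(R_j^2\), which inflates it.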
Column 3 adds age to the model. Notice the \(R^2\) increased only slightly. This tells us that, holding constant bedrooms, fullbaths, and partialbaths, adding age explains very little of the remaining variation in sales price. Notice the residual standard error fell only slightly when we added age to the regression.
The coefficients for bedrooms, fullbaths, and partialbaths were fairly stable when we added age, and the standard errors for fullbaths and partialbaths increased slightly. In an extreme case, including a variable that explains no additional variation in the dependent variable can substantially increase the standard errors for the other variables, particularly when it is highly correlated with them. This is why we do not attempt to maximize the \(R^2\) by adding as many independent variables as possible. In smaller samples, including a variable that does not explain any additional variation in the dependent variable will cause the adjusted \(R^2\) to be much lower than the \(R^2\).
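The last point about small samples can be checked with a short simulation. This is a sketch under assumed values (30 observations, ten irrelevant regressors named `z1` to `z10`); none of these names or numbers come from the housing data.

```r
# Simulated small sample (hypothetical): y depends on x only
set.seed(3)
n <- 30
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
noise <- matrix(rnorm(n * 10), nrow = n)  # ten regressors unrelated to y
colnames(noise) <- paste0("z", 1:10)
sim <- data.frame(y, x, noise)

small <- lm(y ~ x, data = sim)  # only the relevant variable
big   <- lm(y ~ ., data = sim)  # x plus the ten noise variables

# Collect R-squared and adjusted R-squared for both models
fit_stats <- sapply(list(small = small, big = big), function(m) {
  s <- summary(m)
  c(r2 = s$r.squared, adj_r2 = s$adj.r.squared)
})
fit_stats
```

The \(R^2\) of the larger model cannot be lower than that of the smaller one, but the gap between \(R^2\) and adjusted \(R^2\) should be much wider for the larger model because the adjustment penalizes the extra regressors.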
Lastly, above ground square footage is included in column 4. The \(R^2\) increased to 0.71. This says the variables bedrooms, fullbaths, partialbaths, age, and abvgroundsqft explain 71% of the variation in sales price. abvgroundsqft explained an additional 16 percentage points of the variation in sales price beyond what was explained by bedrooms, fullbaths, partialbaths, and age. Notice that the residual standard error fell substantially when abvgroundsqft was included.
The coefficients for the other variables changed substantially when abvgroundsqft was added. This tells us that abvgroundsqft is highly correlated with those variables and with sales price. Some of the standard errors decreased while others increased. Standard errors in a multiple regression are partially determined by the correlation among all of the variables included in a regression. If you add a variable to a regression and you see the standard errors for the other variables increase substantially (to the point you lose statistical significance), you may have a problem. We will talk more about this later.
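One common way to gauge how much correlation among regressors inflates a standard error is the variance inflation factor, \(1/(1-R_j^2)\). The text above does not name this diagnostic, so treat the following as an illustrative sketch on simulated data; the helper `vif_by_hand` is a hypothetical function written for this example.

```r
# Simulated regressors (hypothetical): bedrooms tracks sqft, age is unrelated
set.seed(4)
n <- 1000
sqft <- rnorm(n, 2000, 500)
bedrooms <- 1 + sqft / 700 + rnorm(n, sd = 0.5)
age <- runif(n, 0, 80)
sim <- data.frame(sqft, bedrooms, age)

# Regress one regressor on all the others and convert that R-squared to a VIF
vif_by_hand <- function(var, data) {
  others <- setdiff(names(data), var)
  aux <- lm(reformulate(others, response = var), data = data)
  1 / (1 - summary(aux)$r.squared)
}

vifs <- sapply(names(sim), vif_by_hand, data = sim)
vifs
```

A value near 1 means the regressor is nearly unrelated to the others; larger values mean its standard error is inflated by that correlation.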
Every variable is statistically significant with 99% or more confidence. This is largely caused by the very large sample size. The next set of results will estimate a series of regressions that use smaller and smaller sample sizes. The following code filters the original data and produces 5%, 1%, and 0.5% random samples. I will replicate the regression in column 4 above using different sample sizes.
# selecting 5%, 1%, and 0.5% random samples.
data5<-data %>%
  filter(rand<.05) # 5% random sample
data1<-data %>%
  filter(rand<.01) # 1% random sample
data05<-data %>%
  filter(rand<.005) # 0.5% random sample
# estimating the regressions
reg5<-lm(salesprice2011~bedrooms+fullbaths+partialbaths+age+abvgroundsqft, data=data) # full sample
reg6<-lm(salesprice2011~bedrooms+fullbaths+partialbaths+age+abvgroundsqft, data=data5) # 5% sample
reg7<-lm(salesprice2011~bedrooms+fullbaths+partialbaths+age+abvgroundsqft, data=data1) # 1% sample
reg8<-lm(salesprice2011~bedrooms+fullbaths+partialbaths+age+abvgroundsqft, data=data05) # 0.5% sample
# combining the results
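stargazer(reg5,reg6,reg7,reg8,
          type='text', # text output
          omit.stat = c('f','ser'), # omitting f statistic and residual std. error to save space
          digits=2) # rounding to two digits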
##
## ================================================================================
##                                       Dependent variable:
##                ----------------------------------------------------------------
##                                         salesprice2011
##                (1)              (2)              (3)              (4)
## --------------------------------------------------------------------------------
## bedrooms       -10,115.30***    -6,044.29**      -7,802.75        -23,178.56**
##                (694.85)         (2,448.78)       (7,714.10)       (9,097.46)
##
## fullbaths      37,390.28***     31,294.70***     58,647.57***     54,114.21***
##                (872.43)         (3,237.56)       (8,908.10)       (10,501.08)
##
## partialbaths   19,168.72***     18,088.11***     10,529.40        5,440.62
##                (950.49)         (3,638.83)       (9,681.82)       (11,271.09)
##
## age            -203.27***       -255.00***       -57.15           -393.95*
##                (15.96)          (59.29)          (173.70)         (207.30)
##
## abvgroundsqft  134.47***        126.34***        123.97***        108.14***
##                (0.99)           (3.72)           (9.61)           (10.38)
##
## Constant       -73,981.78***    -62,694.39***    -100,864.40***   -13,303.24
##                (2,046.79)       (7,492.72)       (21,962.94)      (29,259.64)
##
## --------------------------------------------------------------------------------
## Observations   35,488           1,734            325              173
## R2             0.71             0.74             0.72             0.70
## Adjusted R2    0.71             0.74             0.72             0.69
## ================================================================================
## Note:                                                *p<0.1; **p<0.05; ***p<0.01
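The estimates in column 1 are the same as column 4 above. All variables are statistically significant at the 99% level or greater. As the sample size shrinks, several variables lose significance. In the 5% sample (column 2), bedrooms is significant only at the 95% level. In the 1% sample (column 3), bedrooms, partialbaths, and age are no longer statistically significant. In the 0.5% sample (column 4), partialbaths is not significant, while bedrooms and age are significant only at the 95% and 90% levels. When a coefficient is not statistically significant, we are not confident that the slope (\(\beta\)) for that variable is different from zero. Fullbaths and abvgroundsqft are still statistically significant at the 99% level or greater in every sample. Even though these \(\hat{\beta}\)’s were estimated precisely enough to be statistically different from zero, their standard errors are getting larger as we reduce the sample size.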
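The link between sample size and standard errors can be seen in a small simulation. This is a sketch on made-up data with an assumed slope of 0.5; `se_slope` is a hypothetical helper written for this example. A seed is set so the subsample is the same on every run.

```r
# Simulated data (hypothetical): a single regressor with a true slope of 0.5
set.seed(5)
n <- 20000
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
sim <- data.frame(y, x)

# Standard error of the slope from a simple regression on data frame d
se_slope <- function(d) {
  summary(lm(y ~ x, data = d))$coefficients["x", "Std. Error"]
}

full <- se_slope(sim)                        # all 20,000 rows
sub1 <- se_slope(sim[sample(n, n / 100), ])  # 1% random sample, 200 rows

ratio <- sub1 / full
ratio
```

If standard errors shrink in proportion to \(1/\sqrt{n}\), a sample 100 times smaller should produce a standard error roughly 10 times larger, so `ratio` should land somewhere near 10.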
Notice the \(R^2\) values change as we change the sample size. The 5% sample in column 2 produced the highest \(R^2\) (0.74), compared to 0.71 for the full sample in column 1. This does not mean column 2 is a better model than column 1. Because each sample contains different observations, the total amount of variation in the dependent variable also changes. This means we cannot compare the \(R^2\) across these different samples.
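To see how much the \(R^2\) of one and the same model can move from sample to sample, the sketch below repeatedly draws small random samples from simulated data and refits the regression each time. The data, the sample size of 150, and the helper `r2_draw` are all assumptions made for illustration.

```r
# Simulated data (hypothetical): the same model is true for every observation
set.seed(6)
n <- 20000
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
sim <- data.frame(y, x)

# R-squared from refitting the model on one random sample of a given size
r2_draw <- function(size) {
  d <- sim[sample(n, size), ]
  summary(lm(y ~ x, data = d))$r.squared
}

r2_small <- replicate(200, r2_draw(150))  # 200 different samples of 150 rows
range(r2_small)                           # spread of R-squared across samples
summary(lm(y ~ x, data = sim))$r.squared  # R-squared in the full data
```

The model is identical in every draw, yet the \(R^2\) differs noticeably from one sample to the next, which is why a higher \(R^2\) in one sample is not evidence of a better model.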