# setting the directory
setwd('/Users/stp48131/Desktop/ECON 465') 

# Loading the packages
library(readxl)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
# loading the data
housedata<-data.frame(read_excel('RealEstate.xlsx'))

Modeling House Prices

Our goal is to use regression to model the sales price of a home. We could use this model to predict the sales price of homes with certain characteristics. We can also use the regression model to determine how much each housing characteristic contributes to the sales price. We do this using the estimated slope coefficients.

Let’s start by estimating a regression with a single independent variable. We will assume the only variable that determines the sales price of a house is the size of the house. The regression we want to estimate is \(Price=\beta_0+\beta_1sqft+\epsilon\). \(Price\) is the dependent variable. The independent variable is \(sqft\). \(\beta_0\) and \(\beta_1\) are the intercept and slope parameters in the population. We never observe these values but estimate them using regression.

We can plot the relationship between price and square footage to visualize the relationship.

# plotting price against square footage
ggplot(housedata,aes(x=sqft, y=price)) + 
  geom_point() +
  labs(x="Square Footage",
       y="Sales Price")

As expected, there is a positive relationship between the size of a house and the sales price. The correlation between sales price and square footage is 0.59. This gives us some information about the direction and strength of the relationship, but we cannot use this value to predict or estimate the impact of square footage on house prices. We estimate the equation that best approximates this relationship using the lm() function.
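As a reminder of where a correlation number comes from, the cor() function returns the sample correlation: the covariance scaled by the two standard deviations. A minimal self-contained sketch on a small made-up price/sqft sample (not the course data; on the course data the analogous call on housedata gives the 0.59 quoted above):

```r
# sample correlation on a small made-up price/sqft sample (not the course data)
sqft  <- c(1200, 1600, 2000, 2400, 2800)
price <- c(150000, 170000, 230000, 260000, 280000)
cor(sqft, price)
# equivalent by hand: covariance divided by the product of standard deviations
cov(sqft, price) / (sd(sqft) * sd(price))
```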

# estimating the regression
reg_results<-lm(price~sqft,data=housedata)
# displaying the results
summary(reg_results)
## 
## Call:
## lm(formula = price ~ sqft, data = housedata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -77683 -30522   2576  30912  73371 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30920.276   9336.991   3.312 0.000961 ***
## sqft           85.973      3.679  23.367  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33940 on 998 degrees of freedom
## Multiple R-squared:  0.3536, Adjusted R-squared:  0.353 
## F-statistic:   546 on 1 and 998 DF,  p-value: < 2.2e-16

The “Estimate” column contains our estimates of \(\beta_0\) (\(\hat{\beta_0}\)) and \(\beta_1\) (\(\hat{\beta_1}\)). The estimated regression equation is \(\hat{Price}=30,920+86sqft\). The estimate of \(\beta_0\) is 30,920 and the estimate of \(\beta_1\) is 86.
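The estimates in the “Estimate” column can also be pulled out programmatically with the coef() function, which is handy for building predictions by hand. A minimal self-contained sketch using simulated data (the variable names and values below are illustrative, not the course data):

```r
# fit a one-variable regression on simulated data and extract the estimates
set.seed(465)
sqft  <- runif(100, 1000, 4000)                  # simulated square footage
price <- 30000 + 85 * sqft + rnorm(100, sd = 30000)  # simulated sales price
fit   <- lm(price ~ sqft)
b     <- coef(fit)            # named vector: "(Intercept)" and "sqft"
b["(Intercept)"] + b["sqft"] * 2500   # fitted price at 2,500 square feet
```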

We can add this regression line to our scatter plot to see how well it fits. If we were to extend the line to the left to the point where sqft = 0, the intercept would be 30,920. The slope of this line is 86 (\(\Delta \hat{price} / \Delta sqft\)).

ggplot(housedata,aes(x=sqft, y=price)) + 
  geom_point() +
  geom_line(aes(y=reg_results$fitted.values)) +
  labs(x="Square Footage",
       y="Sales Price")

Remember that regression is another way of estimating averages. For any given value of square footage, the regression line estimates the average sales price. Let’s estimate the sales price of a house with 2,500 square feet: \(\hat{Price}=30,920+86*2,500=\$245,920\). This does not say that all homes with 2,500 square feet will sell for $245,920. Some will sell for more and some will sell for less, but we expect the average sales price of homes with 2,500 square feet to be $245,920. Notice that at every value of square footage, the height of the regression line is approximately the average value of sales price.

We can use the predict function in R to predict the value of \(\hat{Y}\) for any value of the independent variable. The code below tells R to use the results from the previous regression, stored as the list named reg_results, for the prediction. The data.frame() function tells the predict command to make the prediction using values we specify. The following code predicts the sales price for homes with 2,500 and 2,750 square feet. Notice the prediction at 2,500 square feet is slightly different than what we estimated above; this is due to rounding error.

predict(reg_results, 
        newdata = data.frame("sqft" = c(2500,2750)), 
        type = "response")
##        1        2 
## 245853.3 267346.6

We can also estimate the average change in sales price if we increase square footage by some amount. \(\hat{\beta_1}\) is the average change in sales price when we increase square footage by one unit. If we increase square footage by one, sales price increases by $86, on average. If we were to increase square footage by 100, sales price increases by $86*100=$8,600, on average.

The generic formula for estimating changes in the dependent variable is \(\Delta \hat{Y}=\hat{\beta_1}*\Delta x\): the amount \(\hat{Y}\) changes when we increase \(x\) by one unit, multiplied by the total change in \(x\) (if \(\Delta x\) is 10, we get the one-unit change 10 times).
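The formula above is simple arithmetic in R. Using the unrounded slope estimate from the regression (85.973), a 100-square-foot increase gives roughly the $8,600 figure quoted above:

```r
# predicted change in sales price: delta_y_hat = beta1_hat * delta_x
beta1_hat  <- 85.973      # slope estimate from the regression above
delta_sqft <- 100         # increase square footage by 100
beta1_hat * delta_sqft    # about $8,597 (about $8,600 with the rounded slope)
```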

Estimating a Regression with Multiple Independent Variables

We typically estimate regressions that have multiple independent variables. Assume we want to estimate the sales price of a house and include square footage, age, whether or not the house has a pool, whether or not it has a fireplace, and whether or not the house is in a college town as independent variables. The regression model we want to estimate is now \(price=\beta_0+\beta_1sqft+\beta_2age+\beta_3pool+\beta_4fireplace+\beta_5college+\epsilon\). Square footage and age are continuous independent variables; pool, fireplace, and college are discrete. We will use the lm() function to estimate this model.

# estimating the regression
reg_results2<-lm(price~sqft+age+pool+fireplace+college, data=housedata)
# displaying the results
summary(reg_results2)
## 
## Call:
## lm(formula = price ~ sqft + age + pool + fireplace + college, 
##     data = housedata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47971 -10411    198  10438  44759 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6911.880   4289.365   1.611 0.107410    
## sqft           83.183      1.672  49.759  < 2e-16 ***
## age          -192.991     51.567  -3.743 0.000193 ***
## pool         4352.570   1205.261   3.611 0.000320 ***
## fireplace    1398.810    976.807   1.432 0.152452    
## college     60196.233    971.531  61.960  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15330 on 994 degrees of freedom
## Multiple R-squared:  0.8686, Adjusted R-squared:  0.8679 
## F-statistic:  1314 on 5 and 994 DF,  p-value: < 2.2e-16

Once again, our estimates of the \(\beta\)’s are in the “Estimate” column. \(\hat{\beta_0}\) is always in the row labeled “(Intercept)”. \(\hat{\beta_1}\) through \(\hat{\beta_5}\) are in the following rows: \(\hat{\beta_1}\) is in the row labeled “sqft”, \(\hat{\beta_2}\) in the row labeled “age”, \(\hat{\beta_3}\) in the row labeled “pool”, \(\hat{\beta_4}\) in the row labeled “fireplace”, and \(\hat{\beta_5}\) in the row labeled “college.”

We can use these estimates to write the estimated regression equation: \(\hat{price}=6912+83*sqft-193*age+4353*pool+1399*fireplace+60196*college\).

Regressions with multiple independent variables are difficult to visualize on a two-dimensional plot.

We can also use the predict function to estimate the sales price for a house with certain attributes. The code below predicts the sales price of a house that has 2,500 square feet, is 25 years old, has a pool and a fireplace, and is in a college town. We expect a house with these characteristics to sell for $275,993.

predict(reg_results2, 
        newdata = data.frame("sqft" = 2500,
                             "age"=25,
                             "pool"=1,
                             "fireplace"=1,
                             "college"=1),
        type = "response")
##        1 
## 275992.8

Plotting a Regression Line with Two Intercepts

Assume we want to estimate the following regression: \(price=\beta_0+\beta_1sqft+\beta_2college+\epsilon\).

reg_results3<-lm(price~sqft+college,data=housedata)
summary(reg_results3)
## 
## Call:
## lm(formula = price ~ sqft + college, data = housedata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47536 -10144    544  10271  44469 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5680.860   4290.153   1.324    0.186    
## sqft           83.557      1.683  49.642   <2e-16 ***
## college     60369.029    982.696  61.432   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15520 on 997 degrees of freedom
## Multiple R-squared:  0.8649, Adjusted R-squared:  0.8647 
## F-statistic:  3192 on 2 and 997 DF,  p-value: < 2.2e-16

The estimated regression equation is \(\hat{price}=5681+84*sqft+60370*college\). Since college is either 0 or 1, we can evaluate the estimated regression equation when college is 0 and again when college is 1. If we let college equal 0, the estimated regression equation becomes \(\hat{price}=5681+84*sqft\). The intercept is 5,681 and the slope is 84. If we let college equal 1, the estimated regression equation becomes \(\hat{price}=5681+84*sqft+60370*1=66051+84*sqft\). The slope is the same as before, but the intercept has increased by 60,370. This says the impact of square footage does not differ for homes in and out of college towns; homes in college towns sell for $60,370 more than homes with the same square footage that are not in a college town. Dummy variables simply shift the intercept.
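The two intercepts can be computed directly from the rounded estimates reported above:

```r
# evaluate the estimated regression equation at college = 0 and college = 1
b0 <- 5681; b1 <- 84; b2 <- 60370   # rounded estimates from reg_results3
intercept_non_college <- b0 + b2 * 0   # intercept outside a college town
intercept_college     <- b0 + b2 * 1   # intercept inside a college town
c(intercept_non_college, intercept_college)   # 5681 and 66051
```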

# plotting both regression lines; factor() treats college as a discrete variable
ggplot(housedata,aes(x=sqft,y=price)) +
  geom_point(aes(color=factor(college))) +
  geom_line(aes(y=reg_results3$fitted.values,group=college)) +
  labs(x="Square Footage",
       y="Sales Price",
       color="College Town")

Combining Regression Results

We have estimated three regressions and stored the results as reg_results, reg_results2, and reg_results3. We can combine all of these results into a single table using the stargazer() function from the stargazer package. This works best when you have a similar set of independent variables across the different regressions. The table below contains all of the relevant information we need and is also easier to read.

stargazer(reg_results,reg_results2,reg_results3, type='text')
## 
## ==================================================================================================
##                                                  Dependent variable:                              
##                     ------------------------------------------------------------------------------
##                                                         price                                     
##                               (1)                       (2)                        (3)            
## --------------------------------------------------------------------------------------------------
## sqft                       85.973***                 83.183***                  83.557***         
##                             (3.679)                   (1.672)                    (1.683)          
##                                                                                                   
## age                                                 -192.991***                                   
##                                                       (51.567)                                    
##                                                                                                   
## pool                                                4,352.570***                                  
##                                                     (1,205.261)                                   
##                                                                                                   
## fireplace                                            1,398.810                                    
##                                                      (976.807)                                    
##                                                                                                   
## college                                            60,196.230***              60,369.030***       
##                                                      (971.531)                  (982.696)         
##                                                                                                   
## Constant                 30,920.280***               6,911.880                  5,680.860         
##                           (9,336.991)               (4,289.365)                (4,290.153)        
##                                                                                                   
## --------------------------------------------------------------------------------------------------
## Observations                 1,000                     1,000                      1,000           
## R2                           0.354                     0.869                      0.865           
## Adjusted R2                  0.353                     0.868                      0.865           
## Residual Std. Error  33,938.410 (df = 998)     15,334.440 (df = 994)      15,522.320 (df = 997)   
## F Statistic         546.037*** (df = 1; 998) 1,313.837*** (df = 5; 994) 3,192.103*** (df = 2; 997)
## ==================================================================================================
## Note:                                                                  *p<0.1; **p<0.05; ***p<0.01