# loading packages
library(readxl)
library(tidyverse)
library(stargazer)
# loading data
housedata<-data.frame(read_excel('house_sample.xlsx')) %>%
  na.omit()

Dummy Variables

We have discussed how dummy variables allow the intercept of the estimated regression equation to differ across groups. For example, assume we want to estimate the following regression: \(SalesPrice=\beta_0+\beta_1sqft+\beta_2Franchise+\epsilon\), where \(SalesPrice\) is the price of a house, \(sqft\) is the square footage, and \(Franchise\) is a dummy variable equal to one if the house is listed by a franchised broker and zero otherwise. \(\beta_1\) is the average change in sales price when we increase square footage by one unit. \(\beta_2\) is the average difference in sales price for franchised and non-franchised listings holding constant square footage. This model may be appropriate if we believe all franchised firms are equal. What if one franchise is better or worse than another? What if some perform better than non-franchised firms? We can allow for this by creating dummy variables for each franchised firm.

We can use the unique() function to see how many listing firms are in our sample:

unique(housedata$listingoffice)
## [1] "OTHER"           "RE/MAX"          "CENTURY 21"      "SEMONIN"        
## [5] "COLDWELL BANKER" "KELLER WILLIAMS"

We want to create dummy variables for all firms excluding “OTHER.” The general rule is that if you have \(G\) unique categories you can include \(G-1\) dummy variables. Including a dummy variable for every category will cause perfect multicollinearity. This is called the dummy variable trap.

# creating dummy variables for the franchised firms
# I am going to add these to the housedata data frame we created above

housedata<- housedata %>%
  mutate(remax=ifelse(listingoffice=="RE/MAX",1,0),
         c21=ifelse(listingoffice=="CENTURY 21",1,0),
         semonin=ifelse(listingoffice=="SEMONIN",1,0),
         cb=ifelse(listingoffice=="COLDWELL BANKER",1,0),
         kw=ifelse(listingoffice=="KELLER WILLIAMS",1,0),)

We can use these variables to estimate the following regression: \(Sales Price=\beta_0+\beta_1remax+\beta_2c21+\beta_3semonin+\beta_4cb+\beta_5kw+\epsilon\). \(\beta_0\) is the average sales price for homes sold by “OTHER” firms. \(\beta_1\) is the average difference in sales price for RE/MAX and OTHER listings. \(\beta_2\) is the average difference in sales price for Century 21 and OTHER listings. \(\beta_3\) is the average difference in sales price for Semonin and other listings. \(\beta_4\) is the average difference in sales price for Coldwell Banker and OTHER listings. \(\beta_5\) is the average difference in sales price for Keller Williams and OTHER listings.

We can estimate the average sales price for properties listed by each firm:

housedata %>%
  group_by(listingoffice) %>%
  summarise_at(vars(salesprice2011), list('Average Sales Price' = mean))
## # A tibble: 6 x 2
##   listingoffice   `Average Sales Price`
##   <chr>                           <dbl>
## 1 CENTURY 21                    144364.
## 2 COLDWELL BANKER               163047.
## 3 KELLER WILLIAMS               178938.
## 4 OTHER                         153230.
## 5 RE/MAX                        162665.
## 6 SEMONIN                       186230.

The intercept in the following regression should be 153,230 (the averag sales price for “OTHER”). \(\hat{\beta_1}\) should be 162,665-153,230=9,435. The other coefficients can be estimated using the same process.

# estimating the regression
reg1<-lm(salesprice2011~remax+c21+semonin+cb+kw, data=housedata)
stargazer(reg1, type='text')
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                           salesprice2011       
## -----------------------------------------------
## remax                      9,435.129***        
##                             (2,067.513)        
##                                                
## c21                        -8,865.892**        
##                             (3,789.409)        
##                                                
## semonin                    33,000.700***       
##                             (2,638.286)        
##                                                
## cb                         9,817.729***        
##                             (3,313.405)        
##                                                
## kw                         25,708.610***       
##                             (4,080.292)        
##                                                
## Constant                  153,229.700***       
##                             (1,164.433)        
##                                                
## -----------------------------------------------
## Observations                   9,412           
## R2                             0.021           
## Adjusted R2                    0.020           
## Residual Std. Error   79,087.210 (df = 9406)   
## F Statistic          40.323*** (df = 5; 9406)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

Compared to listings by “other” firms, on average:

The previous regression allowed us to compare the average sales price for each franchised firm to the “other” firms. Each difference was statistically significant with at least 95% confidence. It is possible that the differences in sales price are being caused by the types of homes each firm lists. If we want to know how well one firm performs relative to the others holding the type of house constant, we can include other house characteristics in the regression. We will add square footage to the regression:

reg2<-lm(salesprice2011~remax+c21+semonin+cb+kw+abvgroundsqft, data=housedata)
stargazer(reg2, type='text')
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                           salesprice2011       
## -----------------------------------------------
## remax                       2,498.391*         
##                             (1,378.067)        
##                                                
## c21                        -5,899.354**        
##                             (2,523.206)        
##                                                
## semonin                    18,466.430***       
##                             (1,761.703)        
##                                                
## cb                          5,417.760**        
##                             (2,206.498)        
##                                                
## kw                         16,352.600***       
##                             (2,718.098)        
##                                                
## abvgroundsqft               107.154***         
##                               (0.986)          
##                                                
## Constant                  -14,726.090***       
##                             (1,728.924)        
##                                                
## -----------------------------------------------
## Observations                   9,412           
## R2                             0.566           
## Adjusted R2                    0.566           
## Residual Std. Error   52,657.730 (df = 9405)   
## F Statistic         2,044.540*** (df = 6; 9405)
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

Notice the coefficients for the firm dummy variables are closer to zero than they were before. Some of the differences are no longer statistically different from zero and conventional levels (RE/MAX and Coldwell Banker). This tell us that the large differences in average sales price was being driven by the types of homes each firm lists for sale.

We can plot the sales price against square footage. It is a little difficult to see but each firm now has a separate intercept. The intercept for “other” firms is the constant (\(\hat{\beta_0}\)). The intercept for each firm is \(\hat{\beta_0}\) plus the coefficient for that firm. For example, the intercept for Keller Williams is -14,726+16,353=1,627.

\(\Delta \hat{SalesPrice} / \Delta sqft\) =107 for all firms.

ggplot(housedata, aes(x=abvgroundsqft,y=salesprice2011, color=listingoffice)) +
  geom_point() +
  geom_line(aes(y=reg2$fitted.values, group=listingoffice)) +
  scale_color_brewer(palette = 'Paired') +
  theme_minimal() +
  labs(x='Square Footage',
       y='Sales Price') +
  scale_y_continuous(labels=function(x) format(x, scientific = FALSE))