There are a number of OLS assumptions that must be satisfied before we can be confident that our estimates are reliable and precise:

  1. The regression is linear, is correctly specified, and has an additive error term.
  2. The error term has constant variance.
  3. Observations in the error term are uncorrelated with each other.
  4. There is no perfect multicollinearity.
  5. There is no omitted variable bias.

I will demonstrate the importance of assumptions 2, 4, and 5 and what happens when they are violated.

# loading packages
library(readxl) # for loading data
library(AER) # test score data and vif()
library(estimatr) # lm_robust()
library(sandwich) # robust standard errors
library(lmtest) # coeftest()
library(ggplot2) # for plots
library(stargazer) # nice table
library(tidyverse) # cleaning data
# Wage data
wagedata<-data.frame(read_excel('Wages.xlsx'))

Assumption 2 says that the variance of the error term is constant. Plotting the dependent variable against the independent variable can help us see whether this assumption has been violated. If the variance in \(Y\) is the same for all values of \(X\), we have some confidence that this assumption is satisfied.

# scatter plot 
ggplot(wagedata,aes(x=educ,y=wage)) +
  geom_point()

Notice how the variance in wages is small for lower levels of education. Once education is greater than 5 years, the variance in wages begins to increase. We can estimate a regression and plot the fitted line to see how the residuals increase with education.

# regression that ignores heteroskedasticity
reg1<-lm(wage~educ, data=wagedata)
summary(reg1)
## 
## Call:
## lm(formula = wage ~ educ, data = wagedata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3396 -2.1501 -0.9674  1.1921 16.6085 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.90485    0.68497  -1.321    0.187    
## educ         0.54136    0.05325  10.167   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.378 on 524 degrees of freedom
## Multiple R-squared:  0.1648, Adjusted R-squared:  0.1632 
## F-statistic: 103.4 on 1 and 524 DF,  p-value: < 2.2e-16
# plot with fitted values
ggplot(wagedata,aes(x=educ,y=wage)) +
  geom_point() +
  geom_line(aes(y=reg1$fitted.values))

Notice that the residuals (the difference between the points and the line) increase with the level of education. This is heteroskedasticity. The standard errors that we estimated in reg1 are unreliable if we do not correct for this problem. We can estimate a regression with heteroskedasticity-corrected (robust) standard errors using the lm_robust() function.

# regression with robust standard errors
reg2<-lm_robust(wage~educ,data=wagedata)
summary(reg2)
## 
## Call:
## lm_robust(formula = wage ~ educ, data = wagedata)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper  DF
## (Intercept)  -0.9049     0.7287  -1.242 2.149e-01  -2.3364   0.5267 524
## educ          0.5414     0.0615   8.803 1.947e-17   0.4205   0.6622 524
## 
## Multiple R-squared:  0.1648 ,    Adjusted R-squared:  0.1632 
## F-statistic: 77.49 on 1 and 524 DF,  p-value: < 2.2e-16

Notice that the standard errors in reg2 are slightly larger than those in reg1. The reg2 standard errors are more reliable because they account for the non-constant variance. The corrected and uncorrected standard errors are similar in this example, but in some cases the standard errors change enough that the outcome of the hypothesis test that \(\beta=0\) changes: variables that are statistically significant without robust standard errors may not be once you use them, and the opposite can also happen.
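
If we want something more formal than an eyeball test, one option (an extra step I am adding here, not part of the original walkthrough) is the Breusch-Pagan test, available through the bptest() function in the lmtest package we already loaded. A small p-value is evidence of heteroskedasticity.

# optional check: Breusch-Pagan test for heteroskedasticity
# a small p-value suggests the constant-variance assumption is violated
bptest(reg1)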

We can also compute the robust standard errors using the coeftest() function and the reg1 results (results that we get when we use lm()). Notice these standard errors are identical to what we see in the lm_robust() results.

# robust standard errors using the coeftest() function
coeftest(reg1,vcov = vcovHC(reg1,type='HC2'))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.90485    0.72870 -1.2417   0.2149    
## educ         0.54136    0.06150  8.8026   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The benefit of using the coeftest() method is that the results can be stored as an object and displayed in a stargazer() table; results from the lm_robust() function, like those stored in reg2, cannot be included in a stargazer table. I will store the coeftest() results and display the non-robust and robust results side by side. Notice the coefficients are the same but the standard errors are different.

# robust standard errors
reg1robust<-coeftest(reg1,vcov = vcovHC(reg1,type='HC2'))
# displaying the non-robust and robust results
stargazer(reg1,reg1robust, type='text')
## 
## ========================================================
##                             Dependent variable:         
##                     ------------------------------------
##                               wage                      
##                               OLS            coefficient
##                                                 test    
##                               (1)                (2)    
## --------------------------------------------------------
## educ                        0.541***          0.541***  
##                             (0.053)            (0.061)  
##                                                         
## Constant                     -0.905            -0.905   
##                             (0.685)            (0.729)  
##                                                         
## --------------------------------------------------------
## Observations                  526                       
## R2                           0.165                      
## Adjusted R2                  0.163                      
## Residual Std. Error     3.378 (df = 524)                
## F Statistic         103.363*** (df = 1; 524)            
## ========================================================
## Note:                        *p<0.1; **p<0.05; ***p<0.01

Next, we will look at the consequences of multicollinearity. Perfect multicollinearity occurs when two or more independent variables are perfectly correlated. This violates assumption 4, and the model cannot be estimated. Imperfect multicollinearity can still cause problems when the correlation among the variables is high enough.
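
As a quick illustration of the "cannot be estimated" point (this small example is my own and uses the wage data we already loaded), suppose we add a regressor that is an exact linear function of educ, such as education measured in months. lm() cannot estimate a separate coefficient for the redundant variable and reports NA for it.

# illustration: I(educ*12) is a perfect linear function of educ,
# so lm() drops it and reports NA for its coefficient
coef(lm(wage~educ+I(educ*12), data=wagedata))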

I will demonstrate the consequences of imperfect multicollinearity with the test score data from the MASchools dataset. This data comes with the AER package.

# Loading the data
data(MASchools)

This dataset contains 5 variables that measure expenditures per student. expreg measures regular expenditures, expspecial measures special needs expenditures, expbil measures bilingual expenditures, expocc measures occupational expenditures, and exptot measures total expenditures. I will create a correlation matrix for total and regular expenditures since these are the largest categories. This will help us see the correlation among these variables. I am also adding the per capita income in the school district and fourth grade test scores.

# correlation matrix
cor(select(MASchools,'score4','expreg','exptot','income'))
##           score4    expreg    exptot    income
## score4 1.0000000 0.1794931 0.1089520 0.6234206
## expreg 0.1794931 1.0000000 0.9659857 0.4328565
## exptot 0.1089520 0.9659857 1.0000000 0.3894616
## income 0.6234206 0.4328565 0.3894616 1.0000000

Notice the correlation between regular and total expenditures is 0.97. If we include expreg and exptot as independent variables we will have a collinearity problem. Notice what happens when we include expreg and exptot separately and together.

# estimating the regressions
reg3<-lm(score4~income+expreg, data=MASchools)
reg4<-lm(score4~income+exptot, data=MASchools)
reg5<-lm(score4~income+expreg+exptot, data=MASchools)

# combining output
# dropping some statistics to save space
stargazer(reg3,reg4,reg5, type='text', omit.stat=c('f','ser'))
## 
## =============================================
##                    Dependent variable:       
##              --------------------------------
##                           score4             
##                 (1)        (2)        (3)    
## ---------------------------------------------
## income        1.749***   1.784***   1.686*** 
##               (0.152)    (0.147)    (0.150)  
##                                              
## expreg        -0.002*               0.010*** 
##               (0.001)               (0.004)  
##                                              
## exptot                  -0.002***  -0.011*** 
##                          (0.001)    (0.003)  
##                                              
## Constant     685.837*** 689.505*** 690.781***
##               (4.326)    (4.504)    (4.462)  
##                                              
## ---------------------------------------------
## Observations    220        220        220    
## R2             0.399      0.410      0.430   
## Adjusted R2    0.393      0.404      0.422   
## =============================================
## Note:             *p<0.1; **p<0.05; ***p<0.01

The coefficient and standard error for income are approximately the same in all three regressions because the expenditure variables are not highly correlated with income. When entered separately, the coefficients for expreg and exptot are nearly identical. When both are included in the same regression, the coefficients change sign and the standard errors are 3-4 times larger. This is what can happen when we include highly correlated independent variables in a regression.

Multicollinearity only impacts the coefficients and standard errors of the correlated variables, and its impact is worse in smaller samples. We can check whether we have a collinearity problem by estimating the variance inflation factor (VIF) after the regression that includes all of the variables.

vif(reg5)
##    income    expreg    exptot 
##  1.249466 15.850485 15.183738

Larger values indicate more collinearity. There is no objective threshold for what is considered problematic, although a VIF above 10 is a common rule of thumb. You may have a problem if the values for some variables are substantially larger than the others.
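
To see where these numbers come from (this check is my own addition), the VIF for a regressor is \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing that variable on the other independent variables. We can reproduce the VIF for expreg by hand; the object aux below is just a temporary name for this check.

# reproducing the VIF for expreg by hand
aux<-lm(expreg~income+exptot, data=MASchools)
1/(1-summary(aux)$r.squared) # should match the vif() output for expreg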

There are only two solutions for multicollinearity: increase the sample size or drop one of the correlated variables. Increasing the sample size is not always possible due to data limitations, and dropping one of the correlated variables may cause omitted variable bias. The best solution may be to do nothing.

The last assumption is that there is no omitted variable bias. Assume we want to estimate the following regression: \(Y=\beta_0+\beta_1x_1+\beta_2x_2+\epsilon\). \(x_1\) is the variable of interest and \(\beta_1\) is the coefficient of interest. \(x_2\) is a variable we must hold constant to ensure our estimate of \(\beta_1\) is unbiased.

If we are unable to estimate this regression because \(x_2\) is not observed we have to estimate this regression: \(Y=\beta_0^*+\beta_1^*x_1+\epsilon^*\). The \(^*\)’s mean the estimated slope and intercept may be different when we omit \(x_2\) from the model.

Our estimate of \(\beta_1\) will be biased (\(\beta_1^*\ne\beta_1\)) when we omit \(x_2\) from the model if: 1) \(\beta_2\ne0\) and 2) \(x_1\) and \(x_2\) are correlated with each other. Assume we can relate \(x_2\) to \(x_1\) with the following regression: \(x_2=\alpha_0+\alpha_1x_1+\epsilon\). \(x_1\) and \(x_2\) are correlated with each other if \(\alpha_1\ne0\).

If \(\beta_2\ne0\) and \(\alpha_1\ne0\), we have an omitted variable bias problem, and the size of the bias is \(\beta_1^*-\beta_1=\beta_2\times\alpha_1\). In words, the difference between our estimate and the true value equals the coefficient on the omitted variable in the population model (\(\beta_2\)) multiplied by \(\alpha_1\), the coefficient on our variable of interest in a regression of the omitted variable \(x_2\) on \(x_1\).
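
A small simulation can make this formula concrete (the simulation and the parameter values below are my own, not part of the original analysis). With \(\beta_1=2\), \(\beta_2=3\), and \(\alpha_1=0.5\), the formula predicts that omitting \(x_2\) pushes the estimated coefficient on \(x_1\) toward \(2+3\times0.5=3.5\).

# simulating the omitted variable bias formula
set.seed(1)
n<-10000
x1<-rnorm(n)
x2<-1+0.5*x1+rnorm(n) # alpha1 = 0.5
y<-1+2*x1+3*x2+rnorm(n) # beta1 = 2, beta2 = 3
coef(lm(y~x1+x2))['x1'] # close to the true value of 2
coef(lm(y~x1))['x1'] # close to 2 + 3*0.5 = 3.5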

Let’s look at an example to see how our estimates change when we omit an important variable. The regression we want to estimate is: \(wage=\beta_0+\beta_1female+\beta_2educ+\beta_3exper+\beta_4tenure+\beta_5married+\epsilon\). The variable of interest is female, and \(\beta_1\) is the average difference in wages between females and males, holding the other variables constant. The estimates for this regression are in column 1 below.

# "Correct" regression
reg6<-lm(wage~female+educ+exper+tenure+married, data=wagedata)
# Checking to see if female and education are correlated
reg7<-lm(educ~female+exper+tenure+married, data=wagedata)
# regression with education omitted
reg8<-lm(wage~female+exper+tenure+married, data=wagedata)


stargazer(reg6,reg7,reg8, type='text', omit.stat=c('f','ser'))
## 
## ==========================================
##                   Dependent variable:     
##              -----------------------------
##                wage      educ      wage   
##                 (1)       (2)       (3)   
## ------------------------------------------
## female       -1.741***  -0.310   -1.914***
##               (0.266)   (0.234)   (0.296) 
##                                           
## educ         0.556***                     
##               (0.050)                     
##                                           
## exper          0.019   -0.082*** -0.027** 
##               (0.012)   (0.010)   (0.013) 
##                                           
## tenure       0.139***   0.036*   0.159*** 
##               (0.021)   (0.019)   (0.023) 
##                                           
## married       0.559*   0.930***  1.076*** 
##               (0.286)   (0.248)   (0.314) 
##                                           
## Constant     -1.618**  13.353*** 5.802*** 
##               (0.723)   (0.248)   (0.313) 
##                                           
## ------------------------------------------
## Observations    526       526       526   
## R2             0.368     0.131     0.217  
## Adjusted R2    0.362     0.124     0.211  
## ==========================================
## Note:          *p<0.1; **p<0.05; ***p<0.01

Let’s assume the estimates in column 1 are the “correct” estimates that do not suffer from omitted variable bias. On average, females earn $1.741 less per hour than a male worker with the same level of education, experience, job tenure, and marital status. What happens to this estimate if we omit education from the model? First, it depends on how education impacts wages. Second, it depends on the difference in educational attainment between males and females.

We can see how educational attainment depends on gender using this regression: \(educ=\alpha_0+\alpha_1female+\alpha_2exper+\alpha_3tenure+\alpha_4married+\epsilon\). \(\alpha_1\) is the average difference in educational attainment between females and males, holding the other independent variables constant. Results from this regression are in column 2 above.

We are interested in the coefficient for female. On average, holding the other independent variables constant, females have 0.31 fewer years of education than males. We can now estimate the bias in the male-female wage gap that would occur if we omit education from the model. Using the omitted variable bias formula from above, the bias is \(\hat{\beta_2}\times\hat{\alpha_1}=0.556\times(-0.31)=-0.172\).

The estimates in column 3 are from a model that estimates the male-female wage gap when education is omitted. On average, holding the other independent variables constant, females earn $1.914 less per hour than males. Notice the difference is \(-1.914-(-1.741)=-0.173\), which is essentially the bias we estimated above (the small discrepancy is due to rounding).
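
We can also verify this with the stored regression objects (this check is my own addition and assumes the coefficient names educ and female shown in the table). The product of the educ coefficient from column 1 and the female coefficient from column 2 matches the change in the female coefficient between columns 1 and 3.

# bias implied by the omitted variable bias formula
coef(reg6)['educ']*coef(reg7)['female']
# actual change in the female coefficient when educ is omitted
coef(reg8)['female']-coef(reg6)['female']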

The discussion above covers the technical details behind omitted variable bias, but there is also an intuitive explanation. We know that, all else constant, higher levels of education are associated with higher wages. In this example, females have slightly less education than males. When education is omitted from the model, the female dummy variable partially captures the impact of lower levels of education as well as the impact of being female. This is why the coefficient on female is more negative (smaller) when education is omitted than when it is included.

If, on average, females had more education than males, omitting education from the model would have made the coefficient larger (less negative). This is because the female dummy variable would partially capture the impact of higher levels of education in addition to the impact of being female.

In practice, you would not be able to work through all of these steps to estimate the bias. You would only be able to estimate the regression in column 3; the regressions in columns 1 and 2 would be left as a thought exercise. You would have to rely on theory, intuition, and empirical evidence to determine the correlation between education and wages and the difference in educational attainment between males and females conditional on the other independent variables. Once you have established these relationships, you can make inferences about the size and direction of the bias.