# loading packages
library(tidyverse)
library(readxl)
library(AER)
library(stargazer)
# loading data
ivdata<-data.frame(read_excel('MROZ.xlsx'))

Assume we want to estimate the following regression: \(\ln(wage)=\beta_0+\beta_1 educ+\beta_2 age+\beta_3 exper+\beta_4 exper^2+\epsilon\)

If all of the independent variables were exogenous (uncorrelated with the error term), we could estimate the model using OLS. Here we assume that education is endogenous while age and experience are exogenous, so OLS will not produce an unbiased estimate of \(\beta_1\).

If it were possible, we would include enough independent variables to make sure education is no longer correlated with the error term. This is not possible if the variables in the error term are unobservable or impossible to measure. We can use instrumental variables in this situation.

A valid instrumental variable (\(z\)) is a variable that is correlated with the endogenous variable but is uncorrelated with \(\epsilon\) (the error term in the regression we want to estimate). In this example, we need a variable that is correlated with a person’s level of education but affects their wage only through education, so that it is uncorrelated with the unobserved determinants of wages. We will use the education level of a worker’s father as our instrument.
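
The two requirements can be written compactly. The sketch below uses the simplified case with no other regressors (an assumption made only to keep the algebra short; with controls, the same conditions apply after holding the exogenous variables fixed). Taking the covariance of both sides of the wage equation with \(z\) shows why the instrument identifies \(\beta_1\):

```latex
\[
\text{Relevance: } \operatorname{Cov}(z, educ) \neq 0,
\qquad
\text{Exogeneity: } \operatorname{Cov}(z, \epsilon) = 0
\]
\[
\operatorname{Cov}(z, \ln(wage))
  = \beta_1 \operatorname{Cov}(z, educ) + \operatorname{Cov}(z, \epsilon)
\;\;\Longrightarrow\;\;
\beta_1 = \frac{\operatorname{Cov}(z, \ln(wage))}{\operatorname{Cov}(z, educ)}
\]
```

The ratio is only defined when the relevance condition holds, and it equals \(\beta_1\) only when the exogeneity condition holds.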

Before we can use father’s education as an instrument, we have to convince ourselves and whoever is reading the paper that father’s level of education is uncorrelated with the unobserved determinants of a person’s wage. This is an assumption that cannot be proven. We can, however, test whether or not education is correlated with father’s level of education.

I will first estimate the model using OLS so we can compare the OLS estimate, which is biased if education is endogenous, with the instrumental variables estimate.

ols<-lm(log(wage) ~ educ+age+exper+expersq, data=ivdata)
stargazer(ols, type='text')
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                              log(wage)         
## -----------------------------------------------
## educ                         0.108***          
##                               (0.014)          
##                                                
## age                           0.0003           
##                               (0.005)          
##                                                
## exper                        0.042***          
##                               (0.013)          
##                                                
## expersq                      -0.001**          
##                              (0.0004)          
##                                                
## Constant                      -0.533*          
##                               (0.278)          
##                                                
## -----------------------------------------------
## Observations                    428            
## R2                             0.157           
## Adjusted R2                    0.149           
## Residual Std. Error      0.667 (df = 423)      
## F Statistic           19.669*** (df = 4; 423)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

On average, each additional year of education is associated with roughly 10.8% higher wages. If, on average, individuals with higher ability (ability is in the error term) have higher levels of education, and if those with higher levels of ability have higher wages, 10.8% is an over-estimate of the true impact of education on wages. We are giving education too much credit because education is also partially capturing the positive effect of ability on wages.
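
The direction of this bias can be illustrated with a small simulation. This is only a sketch on made-up data: the variable `ability`, the sample size, and every coefficient below are assumptions chosen for illustration and are not taken from the MROZ data.

```r
# Simulated illustration of omitted-ability bias (all numbers are invented).
set.seed(123)
n <- 5000
ability <- rnorm(n)                      # unobserved in real data
educ    <- 12 + 2 * ability + rnorm(n)   # higher ability -> more schooling
lwage   <- 0.5 + 0.07 * educ + 0.10 * ability + rnorm(n, sd = 0.5)

# Short regression: ability is left in the error term
b_short <- unname(coef(lm(lwage ~ educ))["educ"])

# Long regression: ability is controlled for (not feasible with real data)
b_long <- unname(coef(lm(lwage ~ educ + ability))["educ"])

# Compare both estimates with the return to education used in the simulation
print(round(c(short = b_short, long = b_long, truth = 0.07), 3))
```

In this design the population bias of the short regression is 0.10 × Cov(educ, ability) / Var(educ) = 0.10 × 2 / 5 = 0.04, so the short regression overstates the return to education while the long regression does not.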

We will estimate the model using the instrument discussed above to get a better estimate of the causal effect of education on wages. We use the ivreg() function to estimate this model.

Notice that ivreg() requires variables to be listed in two places. The variables on the left side of the “|” are the independent variables in the regression equation above; this part looks like what you would write for the lm() function. On the right side of the “|” we list all of the exogenous independent variables plus the instrument. Education does not show up on the right side of the “|” because it is the endogenous variable that fatheduc instruments for.

We interpret the results the same way we do in a typical regression.

# estimating the model
iv<-ivreg(log(wage)~ 
                educ+age+exper + expersq |
                age+exper+expersq+fatheduc, 
          data=ivdata)
# combining the iv and ols results 
stargazer(ols, iv, type='text')
## 
## ===================================================================
##                                        Dependent variable:         
##                                ------------------------------------
##                                             log(wage)              
##                                          OLS           instrumental
##                                                          variable  
##                                          (1)               (2)     
## -------------------------------------------------------------------
## educ                                  0.108***           0.070**   
##                                        (0.014)           (0.035)   
##                                                                    
## age                                    0.0003            -0.0002   
##                                        (0.005)           (0.005)   
##                                                                    
## exper                                 0.042***           0.044***  
##                                        (0.013)           (0.013)   
##                                                                    
## expersq                               -0.001**           -0.001**  
##                                       (0.0004)           (0.0004)  
##                                                                    
## Constant                               -0.533*            -0.051   
##                                        (0.278)           (0.494)   
##                                                                    
## -------------------------------------------------------------------
## Observations                             428               428     
## R2                                      0.157             0.143    
## Adjusted R2                             0.149             0.135    
## Residual Std. Error (df = 423)          0.667             0.673    
## F Statistic                    19.669*** (df = 4; 423)             
## ===================================================================
## Note:                                   *p<0.1; **p<0.05; ***p<0.01

Notice that the estimated impact of education on wages is smaller when we use the instrumental variables method: on average, each additional year of education increases wages by about 7%. If father’s education is a valid instrument, this estimate is no longer contaminated by the correlation between education and unobserved variables in the error term that may also be correlated with wages. The price is precision: the standard error on educ rises from 0.014 to 0.035.
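
The logic behind the instrumental variables estimate can be sketched as two-stage least squares done by hand with lm(). The example runs on simulated data so that it is self-contained; the data frame `sim`, the unobserved `ability` term, and all coefficients are assumptions for illustration, not the MROZ data or the internals of ivreg().

```r
# Sketch: two-stage least squares "by hand" on simulated data.
set.seed(42)
n <- 2000
fatheduc <- rnorm(n, mean = 9, sd = 3)   # instrument
ability  <- rnorm(n)                     # unobserved confounder
exper    <- runif(n, 0, 30)              # exogenous control
educ     <- 8 + 0.4 * fatheduc + 1.5 * ability + rnorm(n)
lwage    <- 0.2 + 0.07 * educ + 0.02 * exper + 0.2 * ability +
            rnorm(n, sd = 0.5)
sim <- data.frame(lwage, educ, exper, fatheduc)

# Stage 1: regress the endogenous variable on the instrument
# and the exogenous control, then keep the fitted values
stage1 <- lm(educ ~ exper + fatheduc, data = sim)
sim$educ_hat <- fitted(stage1)

# Stage 2: replace educ with its first-stage fitted values
stage2 <- lm(lwage ~ educ_hat + exper, data = sim)

b_2sls <- unname(coef(stage2)["educ_hat"])
b_ols  <- unname(coef(lm(lwage ~ educ + exper, data = sim))["educ"])
print(round(c(ols = b_ols, two_stage = b_2sls, truth = 0.07), 3))
```

The point estimate from the second stage is the instrumental variables estimate, but the standard errors that lm() reports for the second stage are not valid because they are computed from residuals that use the fitted values of educ rather than educ itself. A dedicated routine such as ivreg() computes them correctly, which is one reason to prefer it in practice.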

Checking for a Valid Instrument

While it is impossible to prove an instrument is valid, there are a few things we can do to give us confidence that our instrument is valid. First, we regress the endogenous variable on the instrument and all of the other exogenous variables. If the instrument is statistically significant, we have confidence that our instrument is relevant (correlated with the endogenous variable).

# checking for relevance
summary(lm(educ~age+exper+expersq+fatheduc, data=ivdata))
## 
## Call:
## lm(formula = educ ~ age + exper + expersq + fatheduc, data = ivdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0946 -1.1412 -0.0272  1.0580  6.0088 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.2130890  0.5015335  20.364  < 2e-16 ***
## age         -0.0228267  0.0099893  -2.285  0.02259 *  
## exper        0.0839520  0.0263680   3.184  0.00151 ** 
## expersq     -0.0016518  0.0008691  -1.901  0.05774 .  
## fatheduc     0.2777188  0.0209436  13.260  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.022 on 748 degrees of freedom
## Multiple R-squared:  0.2175, Adjusted R-squared:  0.2134 
## F-statistic: 51.99 on 4 and 748 DF,  p-value: < 2.2e-16

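Holding constant age and experience, father’s education is a statistically significant determinant of education. This gives us confidence that our instrument is relevant. Note that this regression uses all 753 observations in the data (748 residual degrees of freedom), while the wage regressions use only the 428 observations with a non-missing wage, so it is not exactly the first stage behind the instrumental variables estimate above.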
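
A common way to summarize relevance is the first-stage F statistic for the excluded instrument, with values above roughly 10 often used as a rule of thumb for a strong instrument. The helper below is a hypothetical sketch (the name `first_stage_F` and the simulated `demo` data are assumptions, not part of the original analysis); it takes any data frame with the same column names, so it can be run on exactly the rows used in the wage regression.

```r
# Hypothetical helper: F statistic for adding fatheduc to the first stage.
first_stage_F <- function(dat) {
  restricted   <- lm(educ ~ age + exper + expersq, data = dat)
  unrestricted <- lm(educ ~ age + exper + expersq + fatheduc, data = dat)
  # Second row of the anova table holds the F test of the added term
  anova(restricted, unrestricted)$F[2]
}

# Demonstration on simulated data with the same column names (invented values)
set.seed(1)
n <- 500
demo <- data.frame(
  age      = sample(30:60, n, replace = TRUE),
  exper    = sample(0:30, n, replace = TRUE),
  fatheduc = sample(0:17, n, replace = TRUE)
)
demo$expersq <- demo$exper^2
demo$educ    <- 9 + 0.3 * demo$fatheduc + rnorm(n, sd = 2)

F_demo <- first_stage_F(demo)
print(F_demo)
```

With the data loaded earlier, the call would be `first_stage_F(subset(ivdata, !is.na(wage)))`, which restricts the check to workers with an observed wage. With a single instrument, this F statistic equals the square of the instrument's t statistic in the first-stage regression.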

We can also add the instrument to the regression equation. Father’s education should not be a statistically significant determinant of wages if it is exogenous.

# checking for exogeneity
summary(lm(log(wage) ~ educ +age+exper+expersq+fatheduc, data=ivdata))
## 
## Call:
## lm(formula = log(wage) ~ educ + age + exper + expersq + fatheduc, 
##     data = ivdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.06069 -0.29725  0.03668  0.39669  2.34750 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.5091223  0.2784409  -1.828  0.06818 .  
## educ         0.1152067  0.0155556   7.406 7.13e-13 ***
## age          0.0000834  0.0048557   0.017  0.98630    
## exper        0.0415645  0.0131842   3.153  0.00173 ** 
## expersq     -0.0008316  0.0003996  -2.081  0.03803 *  
## fatheduc    -0.0121598  0.0101658  -1.196  0.23231    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6669 on 422 degrees of freedom
##   (325 observations deleted due to missingness)
## Multiple R-squared:  0.1597, Adjusted R-squared:  0.1497 
## F-statistic: 16.04 on 5 and 422 DF,  p-value: 1.761e-14

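Holding constant the other independent variables, father’s education is not a statistically significant determinant of wages (p = 0.23). This does not prove it is exogenous, particularly because education is still treated as exogenous in this regression and its coefficient may therefore be biased, but it does give us some confidence that it is.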