 ## Jeffrey Zax

Print publication date: 2011

Print ISBN-13: 9780804772624

Published to Stanford Scholarship Online: June 2013

DOI: 10.11126/stanford/9780804772624.001.0001


# What If the Disturbances and the Explanatory Variables Are Related?

Chapter: (p.363) Chapter 10 What If the Disturbances and the Explanatory Variables Are Related?
Source: Introductory Econometrics
Publisher: Stanford University Press
DOI: 10.11126/stanford/9780804772624.003.0010

# Abstract and Keywords

This chapter deals with the possibility that the disturbances and the explanatory variables are related. This problem is called endogeneity or simultaneity. In this case, ordinary least squares (OLS) estimators are biased and inconsistent. Worse, they are unsalvageable. The solution can be thought of as a two-step procedure consisting of two applications of OLS. For this reason, it is often called two-stage least squares. It is also called, more generally, instrumental variables. This technique provides estimators that are not unbiased, only consistent. Therefore, the success of the solution is much more sensitive to the size of the sample than it was in the previous two chapters.

# 10.0 What We Need to Know When We Finish This Chapter

This chapter deals with the possibility that the disturbances and the explanatory variables are related. This problem is called endogeneity or simultaneity. If we have it, our ordinary least squares (OLS) estimators are biased and inconsistent. Worse, they are unsalvageable. As in the previous two chapters, (p.364) the solution can be thought of as a two-step procedure consisting of two applications of OLS. For this reason, it is often called two-stage least squares. It is also called, more generally, instrumental variables. This technique provides estimators that are not unbiased, only consistent. Therefore, the success of the solution in this chapter is much more sensitive to the size of the sample than it was in the previous two chapters. Here are the essentials.

1. Section 10.2: The three most common sources of endogeneity are reverse causality, measurement error in xi, and dynamic choice.

2. Equation (10.12), section 10.2: Endogeneity means that the explanatory variable is a random variable and it's correlated with the disturbances:

$$\mathrm{COV}(x_i, \varepsilon_i) \neq 0$$

3. Equation (10.13), section 10.3: The consequence of endogeneity is that b is neither unbiased nor consistent for β:

$$E(b) = \beta + E\left(\frac{\sum_{i=1}^{n}(x_i - \bar{x})\varepsilon_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)$$

4. Equation (10.18), section 10.3: With measurement error, b converges approximately to

$$b \rightarrow \beta\,\frac{V(x_i^{*})}{V(x_i^{*}) + V(\nu_i)} < \beta$$

OLS tends to understate the true magnitude of β.

5. Equations (10.19) and (10.20), section 10.4: An instrument, or an instrumental variable, zi has, roughly speaking, the following properties, at least as the sample size approaches infinity:

$$\mathrm{COV}(z_i, x_i) \neq 0$$

and

$$\mathrm{COV}(z_i, \varepsilon_i) = 0$$

6. (p.365) Equation (10.21), section 10.4: The first step of our two-stage procedure is the OLS instrumenting equation

$$x_i = c + dz_i + f_i$$

It provides an estimator of xi, $\hat{x}_i$, that is purged of the correlation between xi and εi.

7. Equation (10.23), section 10.4: We obtain the two-stage least squares (2SLS) estimators of α and β through OLS estimation of our second-step equation,

$$y_i = a + b\,\hat{x}_i + e_i$$

8. Equation (10.24), section 10.4, and equations (10.30) and (10.32), section 10.5: The instrumental variables (IV) estimator of β is the same as the 2SLS estimator,

$$b_{IV} = \frac{\sum_{i=1}^{n}(z_i - \bar{z})(y_i - \bar{y})}{\sum_{i=1}^{n}(z_i - \bar{z})(x_i - \bar{x})}$$

9. Section 10.6: The IV estimator bIV is a consistent estimator of β.

10. Equation (10.41), section 10.6, and equation (10.45), section 10.7: The estimated variance of the IV slope estimator is

$$\widehat{V}(b_{IV}) = \frac{s^2 \sum_{i=1}^{n}(z_i - \bar{z})^2}{\left(\sum_{i=1}^{n}(z_i - \bar{z})(x_i - \bar{x})\right)^2}$$

11. Section 10.7: It's often difficult to find an appropriate instrument. The more likely it is that zi satisfies one of the assumptions in equations (10.19) and (10.20), the less likely it is that it satisfies the other. The best instruments are moderately correlated with xi, in order to satisfy equation (10.19) without invalidating equation (10.20). The Staiger-Stock rule of thumb stipulates that the F-statistic for the first-stage regression should exceed 10 in order for bIV to be useful.

12. (p.366) Equation (10.47), section 10.8: The Hausman test for endogeneity consists of the auxiliary regression

$$y_i = a + b_1 x_i + b_2 f_i + e_i$$

in which $f_i$ is the calculated error from the instrumenting equation (10.21). Endogeneity is present if the coefficient on $f_i$, b2, is statistically significant.

13. Section 10.9: It's always necessary to have a behavioral argument as to why an instrument might be appropriate. It's usually possible to offer a counterargument as to why it might not be. Great instruments are hard to come by. Even acceptable instruments often require considerable ingenuity to construct and justify.
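The regression-based Hausman test in item 12 can be sketched in a few lines. This is a simulation, not the chapter's data, and it uses the common Durbin-Wu-Hausman variant in which the first-stage residual is added to the structural equation; all parameter values and the instrument below are hypothetical.

```python
import numpy as np

# Simulate an endogenous regressor x (it contains a piece of eps) and an
# instrument z that drives x but is drawn independently of eps.
rng = np.random.default_rng(6)
n = 200_000
z = rng.normal(0.0, 1.0, n)
eps = rng.normal(0.0, 1.0, n)
x = z + 0.8 * eps + rng.normal(0.0, 1.0, n)   # endogenous by construction
y = 2.0 + 3.0 * x + eps                       # hypothetical population relationship

# First stage: OLS of x on z; keep the calculated error f (the residual).
d, c1 = np.polyfit(z, x, 1)                   # slope d, intercept c1
f = x - (c1 + d * z)

# Auxiliary regression: y on a constant, x, and f, by least squares.
X = np.column_stack([np.ones(n), x, f])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b2 = coef[2]
print(round(b2, 2))   # clearly nonzero, so the test signals endogeneity
```

With an exogenous x, b2 would hover near zero; here it does not, which is exactly the pattern the test looks for.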

# 10.1 Introduction

This chapter considers the fourth assumption that we made about the disturbances at the end of section 5.3: “The disturbance and the explanatory variable for the same observation are unrelated.” There is one very important similarity between the analysis here and those in the previous two chapters. As when the assumptions of equations (5.6) and (5.11) are violated, the most convenient remedy will be to rearrange the sample so that the formulas of equations (4.35) and (4.40) can be applied appropriately.

However, the assumption that the disturbances and explanatory variables are unrelated is the first of our assumptions that is not about the disturbances alone. Consequently, the analysis of this assumption also differs in important ways from those in chapters 8 and 9. In general, the consequences of relationships between the disturbances and the explanatory variables are more severe, and harder to overcome, than those of violations of the other assumptions.

First, we'll find that we can't fix the OLS estimates. Second, we'll learn that we have to solve the problem before we can diagnose it. Finally, the effectiveness of our solution depends much more heavily on sample size than did those proposed in the last two chapters.

# 10.2 How Could This Happen?

There may be many circumstances in which the explanatory variable and the disturbances are related in the population. The three most common are probably those that involve reverse causality, measurement error, or dynamic choice.1

(p.367) Reverse causality arises when the dependent variable is both affected by and affects the explanatory variable. This possibility may be present in many economic contexts. For example, if we have a friend who smokes, that may affect the probability that we will also smoke. However, our friend's behavior probably depends, to some extent, on whether or not we smoke. So who went first? In this situation, we often can't tell whose behavior is predetermined and whose is dependent.

This little scenario is an example of a much larger class of issues, those of peer effects. The study of peer effects addresses the question of how choices of an individual or an entity depend on parallel choices made by similar individuals or entities. Inevitably, this question provokes the concern that the individuals or entities are simultaneously affecting each other.2

Macroeconomics provides another, familiar example of reverse causality in the relationship between aggregate consumption and aggregate income. This example is a convenient context in which to demonstrate formally the possibility that xi and εi may be correlated.

Aggregate consumption is an important component of aggregate income. As a behavioral matter, individual producers determine output levels based on their understanding of the demand for their products. This means that aggregate production depends, correspondingly, on aggregate consumption. In turn, the payments for aggregate production become aggregate income. Therefore, aggregate consumption also helps determine aggregate income.3 This relationship can be represented as the population relationship

(10.1)
$$Y_i = \alpha + \beta C_i + \varepsilon_i$$

where Yi represents aggregate income and Ci represents aggregate consumption.

At the same time, individuals determine their consumption levels based on their incomes. This implies that aggregate consumption depends on aggregate income. We can represent this in the consumption function

(10.2)
$$C_i = \kappa + \lambda Y_i$$

where λ represents the marginal propensity to consume and κ, the Greek letter kappa, is a constant.

Realistically, the consumption function should also allow for the possibility of random variations in aggregate consumption. We'll ignore this for our present purpose. The deterministic representation in equation (10.2) is sufficient to demonstrate the relationship between Ci and εi that's built into the relationship between aggregate income and aggregate consumption.

Substituting equation (10.1) for Yi into equation (10.2), we obtain

$$C_i = \kappa + \lambda(\alpha + \beta C_i + \varepsilon_i)$$

(p.368) Expanding, we get

$$C_i = \kappa + \lambda\alpha + \lambda\beta C_i + \lambda\varepsilon_i$$

We combine terms in Ci to obtain

$$C_i(1 - \lambda\beta) = \kappa + \lambda\alpha + \lambda\varepsilon_i$$

Finally, we solve for Ci:

(10.3)
$$C_i = \frac{\kappa + \lambda\alpha}{1 - \lambda\beta} + \frac{\lambda}{1 - \lambda\beta}\,\varepsilon_i$$

Equation (10.3) demonstrates that, when combined, the population relationships of equations (10.1) and (10.2) imply that Ci depends on εi. Even though the structural equation for Ci, equation (10.2), is deterministic, and even though the disturbance εi enters into the system only as a component of Yi in equation (10.1), the reciprocal causality between Ci and Yi implies that εi affects Ci as well.

This implies, first, that Ci is actually a random variable. It contains a random component, (λ/(1−λβ))εi, which represents the part of aggregate consumption that is generated by the random component of aggregate income. Second, because Ci is a random variable, it is meaningful to speak of Ci as having covariances with other random variables.

Finally, because equation (10.3) demonstrates that Ci depends on εi, COV(Ci, εi) ≠ 0. Consequently, the population relationship of equation (10.1) violates the fourth assumption that we made in chapter 5: Its explanatory variable, Ci, is correlated with its disturbance, εi.
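Equation (10.3) can be checked by simulation. The parameter values, seed, and sample size below are illustrative assumptions rather than numbers from the text; the point is only that Ci, generated from the reduced form of equation (10.3), has a sample covariance with εi that is far from zero.

```python
import numpy as np

# Hypothetical parameters for Y_i = alpha + beta*C_i + eps_i and
# C_i = kappa + lambda*Y_i; lambda*beta < 1 keeps the system stable.
rng = np.random.default_rng(0)
n = 200_000
alpha, beta = 50.0, 0.5
kappa, lam = 10.0, 0.8
eps = rng.normal(0.0, 20.0, n)

# Reduced form for consumption, equation (10.3): C_i inherits a piece of eps_i.
C = (kappa + lam * alpha) / (1 - lam * beta) + (lam / (1 - lam * beta)) * eps
Y = alpha + beta * C + eps                  # income equation (10.1)

cov_C_eps = np.cov(C, eps)[0, 1]
print(round(cov_C_eps, 1))   # near (lam/(1-lam*beta))*V(eps) = 533.3, not zero
```

The printed covariance sits close to its population value of (λ/(1 − λβ))V(εi), confirming that Ci and εi are correlated.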

The second source of correlations between explanatory variables and disturbances is measurement error in the explanatory variable, or errors in variables. These occur when the explanatory variables that we actually observe are only approximations of the explanatory variables that we believe belong in the population relationship. The appendix to chapter 3 suggests an example: We use years of schooling as our explanatory variable for earnings, but the Census actually reports levels of schooling. The translation of schooling levels into schooling years is, as discussed there, open to error.

Section 1.10 suggested another example of measurement error. The regressions of chapter 1 classified individuals additively into each of the six broad racial or ethnic groups with which they might identify: black, Hispanic, American Indian or Alaskan Native, Asian, Native Hawaiian or other Pacific Islander, and the residual category of whites. However, the discussion reminded us that if the effects of these identities aren't additive, the treatment in chapter 1 isn't correct. Again, the variables measuring racial or ethnic identity would be measured with error.

(p.369) Yet another example occurs in section 7.7. The explanatory variable there measures the degree of corruption within a country. It is easy to imagine that corruption is difficult to measure. After all, it's corruption because it's illegal. Participants don't have much incentive to advertise their activities. Consequently, the absolute level isn't readily known.

There could be many different ways to approximate the level of corruption, such as surveys of businesspeople or comparisons between reported government revenues and expenditures. But the key word there is “approximate.” Any feasible method will probably end up making some mistakes, or, in the vocabulary of this section, “measuring the underlying concept with error.”

What are the consequences of measurement error? Let's return to the population relationship of equation (5.1). This time, we'll write it in terms of the true explanatory variable, $x_i^{*}$:

(10.4)
$$y_i = \alpha + \beta x_i^{*} + \varepsilon_i$$

This population relationship satisfies all of the assumptions of chapter 5.

However, what we actually observe is xi, which measures $x_i^{*}$ with random error:

(10.5)
$$x_i = x_i^{*} + \nu_i$$

The random error νi has the same properties that we conferred on εi in chapter 5. It has zero expectation, it is uncorrelated with νj for j ≠ i, and it is unrelated to $x_i^{*}$. In addition, νi is uncorrelated with εj for all pairs of values for i and j.4

The equation that we're actually going to estimate is going to relate, by necessity, yi to the observed xi rather than to the true explanatory variable $x_i^{*}$. Consequently, we need to combine equations (10.4) and (10.5) in order to find out what population relationship will underlie this estimation. Equation (10.5) implies that

(10.6)
$$x_i^{*} = x_i - \nu_i$$

Substituting equation (10.6) for $x_i^{*}$ into equation (10.4) yields

(10.7)
$$y_i = \alpha + \beta(x_i - \nu_i) + \varepsilon_i = \alpha + \beta x_i + (\varepsilon_i - \beta\nu_i)$$

The population relationship in equation (10.7) is reminiscent of that in equation (9.29). As there, this equation has two random components. Here, the first component is εi, the random element that directly affects yi. This disturbance term isn't a problem. It is uncorrelated with $x_i^{*}$ and with νi, the two parts of xi. Consequently, as we prove in exercise 10.2, it has to be uncorrelated with xi itself.

(p.370) However, the second random component in the population relationship of equation (10.7) contains the measurement error νi. Based on our assumptions about it, this may also seem pretty innocuous. It's not. Equation (10.5) states explicitly that it affects xi. This implies that xi is a random variable and that COV(xi, νi) ≠ 0. As xi appears in the deterministic part of equation (10.7) and νi appears in its unobserved part, this covariance violates the fourth assumption of chapter 5.
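A short simulation makes this covariance concrete. The mean of the true regressor and the seed are illustrative assumptions; β = 4,000 and V(νi) = 1 follow the chapter's later simulations.

```python
import numpy as np

# Equations (10.4)-(10.5): y depends on the true x*, but we observe
# x = x* + nu. The composite disturbance in equation (10.7) is eps - beta*nu.
rng = np.random.default_rng(1)
n = 200_000
alpha, beta = -20_000.0, 4_000.0
x_star = rng.normal(12.0, np.sqrt(8.5), n)   # true regressor (mean is illustrative)
nu = rng.normal(0.0, 1.0, n)                 # measurement error, V(nu) = 1
eps = rng.normal(0.0, 25_000.0, n)

x = x_star + nu                              # observed, error-ridden regressor
composite = eps - beta * nu                  # disturbance of equation (10.7)

# They share the nu term, so COV(x, eps - beta*nu) = -beta*V(nu) = -4,000
# in the population; the sample covariance lands close to that.
cov_x_comp = np.cov(x, composite)[0, 1]
print(round(cov_x_comp))
```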

The third common context in which disturbances and explanatory variables are correlated is when behavior evolves dynamically. These are situations in which pre-existing circumstances or conditions first help to determine the value of the explanatory variable and then contribute to the value of the dependent variable. In these situations, the underlying population relationships are sure to be more complicated than that of equation (5.1).

Once again, our relationship between education and earnings provides a useful, if highly stylized example. Imagine that the deterministic component of earnings, yi, depends only on underlying ability, ai, and the random component, εi:

(10.8)
$$y_i = \alpha + \beta a_i + \varepsilon_i$$

Imagine also that individual ability can't be observed.

Finally, imagine that the only purpose of education is to reveal ability. Workers with more ability acquire more education so as to distinguish themselves, in the eyes of employers, from workers with less. Consequently, education, xi, depends only on ability and a random component, νi:

(10.9)
$$x_i = \gamma + \delta a_i + \nu_i$$

Algebraically, equation (10.9) implies that

(10.10)
$$a_i = \frac{x_i - \gamma - \nu_i}{\delta}$$

In order to derive the population relationship between the two observed variables, education, xi, and earnings, yi, we have to substitute equation (10.10) into equation (10.8). The result is

$$y_i = \alpha + \beta\left(\frac{x_i - \gamma - \nu_i}{\delta}\right) + \varepsilon_i$$

Rearranging, we obtain

(10.11)
$$y_i = \left(\alpha - \frac{\beta\gamma}{\delta}\right) + \frac{\beta}{\delta}\,x_i + \left(\varepsilon_i - \frac{\beta}{\delta}\,\nu_i\right)$$

(p.371) The explanatory variable in equation (10.11) is xi. The disturbance term in that equation contains νi. Equation (10.9) tells us that xi is determined in part by, and is therefore correlated with, νi. Consequently, the fourth assumption of chapter 5 is again invalid.
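The signaling story can also be simulated. Every number below is hypothetical, including the coefficients linking ability to earnings and to education; the point is that the OLS slope of y on x lands well below the ratio of those coefficients, the structural slope in equation (10.11).

```python
import numpy as np

# Earnings depend only on unobserved ability a (as in equation (10.8));
# education x depends on ability plus noise nu (as in equation (10.9)).
rng = np.random.default_rng(2)
n = 200_000
a = rng.normal(100.0, 15.0, n)        # unobserved ability
nu = rng.normal(0.0, 2.0, n)          # random component of schooling
eps = rng.normal(0.0, 5_000.0, n)

y = -20_000.0 + 300.0 * a + eps       # earnings: ability only
x = 0.1 * a + nu                      # education: ability plus noise

# The disturbance of the regression of y on x contains a multiple of nu,
# which x itself contains, so OLS is distorted.
slope = np.polyfit(x, y, 1)[0]
print(round(slope))                   # far from the structural 300/0.1 = 3000
```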

There may be many other examples in which the behavior under examination gives rise to the suspicion that the explanatory variable is determined, in part, by the unpredictable part of the dependent variable. Regardless of the specifics, all of these examples share a common element. In each, xi has become a random variable because it contains a piece of the disturbance term.

This is why this situation is referred to as endogeneity. The explanatory variable is no longer exogenous. Its value is not determined outside of the relationships that determine the value of the dependent variable. To the contrary, these relationships determine its value as well. The observed values of both yi and xi are determined by the assignment of a disturbance or disturbances to the ith observation.

This is also why this situation is sometimes referred to as simultaneity. The explanatory variable, xi, is not predetermined. Instead, its value is determined simultaneously with that of yi, at the moment when the disturbance terms acquire their random values.

Finally, this is why we were so coy about the fourth assumption at the end of section 5.3. It didn't make sense to specify this assumption formally as COV(xi, εi) = 0. There, we were treating xi as exogenous, not random, and therefore not exactly appropriate as an argument to a population covariance.

Here, the random assignment of disturbances that has helped determine the values of the yi's since chapter 5 for the first time helps to determine the values of the xi's as well. In this case, xi is not exogenous and is a random variable. It therefore can be spoken of as having interesting population covariances. For this reason, we can now specify violations of our fourth assumption as

(10.12)
$$\mathrm{COV}(x_i, \varepsilon_i) \neq 0$$

even if we weren't comfortable describing that assumption itself as COV(xi, εi) = 0.

# 10.3 What Are the Consequences?

As we showed in equation (5.28), the OLS slope can be written as

$$b = \beta + \frac{\sum_{i=1}^{n}(x_i - \bar{x})\varepsilon_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

(p.372) Equation (5.33) implies that

(10.13)
$$E(b) = \beta + E\left(\frac{\sum_{i=1}^{n}(x_i - \bar{x})\varepsilon_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)$$

The equation following equation (5.35) shows that this can be reformulated as

(10.14)
$$E(b) = \beta + \frac{E\left(\sum_{i=1}^{n}(x_i - \bar{x})\varepsilon_i\right)}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

Equation (10.14) isn't quite right here. Equation (5.35) pulled the denominator out of the expected value of the second term to the right of the equality in equation (10.13) on the grounds that it was not random. In this chapter, we can't really do that because we've assumed that xi is a random variable. This already suggests the problem: The second term to the right of the equality in equation (10.13) is not going to just evaporate, as it did in chapter 5.

However, the real obstacle to achieving this is the numerator of the second term following the equality in equation (10.14), $E\left(\sum_{i=1}^{n}(x_i - \bar{x})\varepsilon_i\right)$. In chapter 5, we assumed that the xi's were predetermined and unrelated to the εi's. This allowed us to treat both the xi's and their average as constants, for the purposes of calculating the expected value. We were able to use equation (5.34) to rewrite this expected value in terms of the expected value of εi alone.

Under the assumption of equation (10.12), xi and its average $\bar{x}$ are both random variables. Therefore, the rules for deriving expectations require us to leave $\bar{x}$ within the expectation. Moreover, we can rewrite

(10.15)
$$E\left((x_i - \bar{x})\varepsilon_i\right) = E\left((x_i - \bar{x})\left(\varepsilon_i - E(\varepsilon_i)\right)\right)$$

because equation (5.5), which states that E(εi) = 0, is still good.

The term after the last equality in equation (10.15) should be approximately equal to

$$E\left(\left(x_i - E(x_i)\right)\left(\varepsilon_i - E(\varepsilon_i)\right)\right)$$

because, as we may have learned if we've had a course in statistics, $\bar{x}$ is an unbiased estimator for E(xi). However, as in equation (5.8),

(10.16)
$$E\left(\left(x_i - E(x_i)\right)\left(\varepsilon_i - E(\varepsilon_i)\right)\right) = \mathrm{COV}(x_i, \varepsilon_i)$$

(p.373) We've just assumed in equation (10.12) that the term to the right of the equality in equation (10.16) isn't zero. Therefore, the term to the left isn't, either. This suggests that the expectation in equation (10.15) is almost surely nonzero itself. Consequently,

$$E\left(\sum_{i=1}^{n}(x_i - \bar{x})\varepsilon_i\right)$$

almost surely does not vanish in equation (10.14).

If this expectation isn't zero, then the second term to the right of the equality in equation (10.14) doesn't equal zero. Therefore, we can't prove that E(b) = β. To the contrary, it probably doesn't. When the explanatory variable is endogenous, the OLS estimate b is a biased estimator of β.5

This is why violations of the fourth assumption in chapter 5 are so much more serious than violations of the first three. Throughout the discussions of equations (5.5), (5.6), and (5.11) in chapters 8 and 9, we have been able to rely on the fact that b is unbiased for β regardless of our assumptions. Here, under equation (10.12), we can't.

If we can't have unbiasedness, can we at least have consistency? No. Remember, from our discussion in section 5.10, consistency is the property that an estimator gets better as the sample size goes up. A consistent estimator should be perfect, so to speak, when the sample is infinite.

As we may also have learned if we've had a statistics course, $\bar{x}$ is a consistent estimator of E(xi). The only thing that happens to equation (10.15) as the sample size increases is that $\bar{x}$ is more and more surely close to E(xi). When the sample is infinite, they're equal. That just means that equation (10.15) becomes equation (10.16). In other words, the only thing we ensure as we increase the sample size is that $E\left((x_i - \bar{x})\varepsilon_i\right)$ isn't zero. When the sample is infinite, b is definitely not β.
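A quick simulation shows that growing the sample does not pull b toward β. The values β = 4,000, V(x*) = 8.5, and V(νi) = 1 mirror the chapter's measurement-error setup; the mean of x*, the seed, and the grid of sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
beta, v_star, v_nu = 4_000.0, 8.5, 1.0
target = beta * v_star / (v_star + v_nu)   # asymptotic value, about 3,578.9

for n in (1_000, 100_000, 1_000_000):
    x_star = rng.normal(12.0, np.sqrt(v_star), n)     # true regressor
    x = x_star + rng.normal(0.0, np.sqrt(v_nu), n)    # observed with error
    y = -20_000.0 + beta * x_star + rng.normal(0.0, 25_000.0, n)
    b = np.polyfit(x, y, 1)[0]                        # OLS slope of y on x
    print(n, round(b, 1))   # b settles near the target, never near 4,000
```

As n grows, b gets less noisy, but it converges to the attenuated target rather than to β.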

Let's see what this can look like in the example of measurement error from the previous section. We return, yet again, to the simulations that we began in section 5.7. Once again, we create a sample using the values α = −20,000, β = 4,000, and σ = 25,000. For all of the simulations here, n = 100,000.

In this example, $x_i^{*}$ represents the true amount of education. Its distribution is the same as that for xi that we established in chapter 5. As we found in exercise 5.15, this implies that

$Display mathematics$

and that the population variance of $x_i^{*}$ is 8.5.

(p.374) For our first simulation, we construct xi from by adding a random term νi to the latter with SD(νi) = V(νi) = 1. The variance of the measurement error is only slightly more than one-tenth as great as the variance of the value of the true explanatory variable. In other words, we've introduced a modest amount of measurement error. We then regress yi on xi rather than on .

This regression yields the following results:

(10.17)
$$\hat{y}_i = -14{,}312 + 3{,}564.5\,x_i$$

We can see that the OLS estimator of β is about 10% smaller than the true value of 4,000. The OLS estimator of α is more than 25% smaller, in absolute value, than the true value of −20,000.

Is equation (10.17) just randomly different from the population relationship or does it represent some systematic bias imparted by the presence of measurement error? One way to answer this question is to examine some additional simulations. The first row of table 10.1 reproduces equation (10.17). The subsequent rows give the results of OLS regressions on simulated samples afflicted with increasing amounts of measurement error, as indicated by the values for SD(νi) and V(νi) in the second and third columns. Otherwise, these samples are constructed identically to that of equation (10.17).

There's a striking pattern in the results of table 10.1. First, all of the estimates are pretty far from the population values. This suggests that measurement error always distorts OLS estimates. Second, values for b are always smaller than the true value for β. Third, values for a are always larger, algebraically, than the true values for α.

Perhaps most interesting, the distortions in a and b change systematically as the variance of the measurement error increases. As V(νi) goes up, values of b decline monotonically. Values of a increase monotonically.

How can we explain these results? Let's focus on b, because it estimates the most important population parameter, β. With measurement error, the value of b tends to converge on

(10.18)
$$b \rightarrow \beta\,\frac{V(x_i^{*})}{V(x_i^{*}) + V(\nu_i)} < \beta$$

in large samples, where the arrow represents “converges to.”6 What's a large sample? Well, all of the results in this chapter are asymptotic. This means that they are exactly right when the sample is infinite.

How close to infinite does the sample have to be before we can expect our estimates to look something like what equation (10.18) says that they should? That depends. Certainly, as sample sizes increase toward infinity, they should yield results that more consistently match it. But there is no set minimum for (p.375)

TABLE 10.1 OLS regressions on six simulated samples with known measurement error

| Simulation | Standard deviation of measurement error, SD(νi) | Variance of measurement error, V(νi) | OLS estimate a of α = −20,000 | OLS estimate b of β = 4,000 |
| --- | --- | --- | --- | --- |
| 1 | 1.0 | 1.0 | −14,312 | 3,564.5 |
| 2 | 1.5 | 2.25 | −9,220.2 | 3,177.6 |
| 3 | 2.0 | 4.0 | −2,987.7 | 2,694.7 |
| 4 | 2.5 | 6.25 | 1,220.2 | 2,365.0 |
| 5 | 3.0 | 9.0 | 7,071.0 | 1,926.4 |
| 6 | 3.5 | 12.25 | 10,749 | 1,633.4 |

sample size, above which this equation can be taken as reliable. We'll discuss this issue further as we encounter some examples.

Before we examine the quantitative implications of equation (10.18), let's extract some intuition. $V(x_i^{*})$ and V(νi) are both positive. Therefore, the ratio that multiplies β in equation (10.18) must be less than one. Consequently, b understates β, as indicated by the inequality.

Why is this? Remember, b is trying to estimate the true effect of the explanatory variable on the dependent variable. In the case of measurement error, however, the measured explanatory variable xi has two components. It contains the true explanatory variable, $x_i^{*}$, which really affects yi, and the measurement error νi, which doesn't. The true effect of the first component is diluted by the absence of any effect associated with the second, random component when we estimate the overall effect of the observed explanatory variable. That's why the estimated effect has to be smaller than the true effect.

How much dilution takes place? The ratio that multiplies β in equation (10.18), $V(x_i^{*})/\left(V(x_i^{*}) + V(\nu_i)\right)$, is the bias in b. It clearly declines as the variance of the measurement error νi increases. In other words, as more of the observed value of xi is attributable to random measurement error that doesn't affect yi, the effect of the true explanatory variable on yi gets harder to see. Consequently, the estimated effect of the observed xi on yi goes down.

Equation (10.18) implies that the value of b in the regression of equation (10.17) should be approximately

$$b \approx \frac{8.5}{8.5 + 1}\times 4{,}000 = 3{,}578.9$$

Sure enough, the estimate of b in that equation is almost exactly what equation (10.18) predicts that it should be! The addition of a relatively small amount of measurement error to has led to an estimate of β that is reliably biased downwards by more than 10%. (p.376)

TABLE 10.2 Theoretical and actual OLS bias due to measurement error

| Simulation | Variance of measurement error, V(νi) | Bias in OLS estimate of β | Asymptotic value of OLS estimate of β | OLS sample estimate of β, b |
| --- | --- | --- | --- | --- |
| 1 | 1.0 | 8.5/(8.5 + 1) = .8947 | 3,578.9 | 3,564.5 |
| 2 | 2.25 | 8.5/(8.5 + 2.25) = .7907 | 3,162.8 | 3,177.6 |
| 3 | 4.0 | 8.5/(8.5 + 4) = .6800 | 2,720.0 | 2,694.7 |
| 4 | 6.25 | 8.5/(8.5 + 6.25) = .5763 | 2,305.1 | 2,365.0 |
| 5 | 9.0 | 8.5/(8.5 + 9) = .4857 | 1,942.9 | 1,926.4 |
| 6 | 12.25 | 8.5/(8.5 + 12.25) = .4096 | 1,638.6 | 1,633.4 |

The third column of table 10.2 calculates the bias, $V(x_i^{*})/\left(V(x_i^{*}) + V(\nu_i)\right)$, for each of the six simulations in table 10.1. As the variance of the measurement error, given in the second column, increases, the OLS estimate of β becomes a progressively smaller fraction of the true value. In other words, the bias in b increases.

The fourth column of table 10.2 calculates the asymptotic value of b for each of these simulations, according to equation (10.18). Naturally, this declines as the OLS bias increases. The fifth column reproduces the simulated estimate of b from table 10.1.

If we weren't already familiar with the power of formulas like that of equation (10.18), we might be surprised at how closely the fifth column matches the fourth. Our samples consistently yield almost exactly what equation (10.18) tells us to expect. But we've become accustomed to seeing actual regressions that reproduce their theoretical properties quite closely. The correspondence in table 10.2 between the asymptotic and estimated values for b indicates that, at least as far as equation (10.18) is concerned, sample sizes of 100,000 are as close as we need to get to infinity.
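The third and fourth columns of table 10.2 can be recomputed in a couple of lines from equation (10.18), using β = 4,000 and V(x*) = 8.5 from the chapter's simulations:

```python
# Bias ratio and asymptotic value of b for each simulated level of V(nu).
beta, v_star = 4_000.0, 8.5
rows = []
for v_nu in (1.0, 2.25, 4.0, 6.25, 9.0, 12.25):
    bias = v_star / (v_star + v_nu)                 # ratio multiplying beta
    rows.append((v_nu, round(bias, 4), round(beta * bias, 1)))
    print(rows[-1])   # e.g. (1.0, 0.8947, 3578.9), matching table 10.2
```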

Table 10.2 demonstrates that bad measurement can yield really bad estimates. As we've said, OLS in the first simulation already understates the true returns to education by more than 10%. By simulation 5, the variance of the measurement error is slightly larger than the variance of $x_i^{*}$ itself. The consequence is that OLS understates the returns to education by more than half! In general, the measurement error in the later simulations is so large that the observed explanatory variable, xi, is largely junk. So are the values for b that we get from them: They are nowhere near the true returns to education.

The other sources of endogeneity, reverse causality and dynamic choice, can be equally dangerous. Moreover, there's one more disappointment in store. In chapters 8 and 9, we were able to rely on the ei's to both fix the OLS estimate of V(b) and recover best linear unbiased (BLU) estimators. That's because, even under the assumptions of equation (8.10) or (9.1), b and a were (p.377) consistent estimators of β and α. Consequently, the OLS errors were consistent estimators of the disturbances.

We can see what the problem is here. Under equation (10.12), b is not consistent for β. This means that ei is not a consistent estimator of εi. In fact, the OLS errors don't estimate anything useful. We can't use them to fix b.

# 10.4 What Can Be Done?

If we hadn't already gotten this far in the book, we might think that things look bleak. However, the fundamental theme of chapters 8 and 9 is that there is usually a way to overcome the obstacles that arise when the basic assumptions of chapter 5 are violated, and that way usually involves some creative reapplications of OLS. This theme is still valid here, regardless of the other contrasts between this chapter and the two that precede it.

Two familiar elements of this theme also remain valid here. First, the solution will consist, at least conceptually, of two steps, as in the weighted least squares (WLS) procedure of chapter 8 and the generalized least squares (GLS) procedure of chapter 9. In those procedures, both steps consisted of OLS regression calculations. Both steps here will also be OLS regressions. The first-step regression will produce transformed data. The second-step regression will produce usable estimates of the population parameters.

However, the solution here differs importantly from the two-step procedures of chapters 8 and 9. In the discussion of equation (10.18), we observed that the magnitude of the problem could be predicted more accurately in larger samples. Similarly, the problem can be dealt with more effectively as sample size increases.

In other words, equation (10.18) isn't the only asymptotic result here. To the contrary, all of the required properties and supporting proofs are asymptotic. That is, they are based on the presumed behavior of the sample as its size increases toward infinity. In particular, the best that we can hope for is consistent estimators. Unbiased estimators, estimators that would have α and β as their expected values regardless of sample size, do not exist.

This has two important implications for us. First, the relationship between the size of our sample and the effectiveness of the solution that we present has an additional dimension here that was not present in earlier chapters. In chapters 8 and 9, as in chapter 7, the principal benefit of larger samples would have been to reduce the standard deviations of our estimators. Here, even before we consider the precision of our estimates, we need larger samples in order to obtain the benefits of consistency. In other words, we need larger samples in order to reassure ourselves that our estimates even tend to be accurate.

(p.378) Second, we won't examine the formal proofs that establish the properties of our solution. The formal discussion of asymptotic properties requires more statistical sophistication than we have required of ourselves, as well as new notation. Therefore, we reserve them for a later course. Here, we'll work through a semiformal, semi-intuitive presentation to motivate the important results.7

The essential element in our response to equation (10.12) is an instrument, or an instrumental variable. Intuitively, an instrumental variable is a variable that is related to the explanatory variable in our population relationship, xi, but not to the disturbance, εi. A formal definition of these properties would require notation that we won't develop here. However, we can represent them, usefully if not precisely, as follows. If zi is to be an acceptable instrumental variable, then, roughly speaking,

(10.19)
$$\mathrm{COV}(z_i, x_i) \neq 0$$

and

(10.20)
$$\mathrm{COV}(z_i, \varepsilon_i) = 0,$$

at least when the sample is infinite. Moreover, these properties should be more likely to hold, the larger is the sample size.

Assuming that we had an appropriate instrumental variable, what could we do with it? Our objective must be to purge xi of its correlation with the disturbance term in the original population relationship. How could we do that?

The answer is in the auxiliary regressions that we first introduced in section 8.6. We recall from there that these regressions are not intended to reflect the structural population relationship underlying the observed data. They are intended solely as conveniences, or intermediate calculations, along the route to estimating the population relationships of interest.

Here, the first step of our two-step response to equation (10.12) is to run a first-stage equation, or instrumenting equation. This is the auxiliary OLS regression

(10.21)
$$x_i = c + dz_i + f_i$$

In equation (10.21), c represents the calculated intercept, d represents the calculated slope, and fi represents the calculated error for observation i. Accordingly, we calculate d using equation (4.40), where d replaces b, zi replaces xi, z̄ replaces x̄, xi replaces yi, and x̄ replaces ȳ. Correspondingly, we calculate c using equation (4.35), replacing b with d, ȳ with x̄, and x̄ with z̄.8
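These two calculations are easy to carry out directly. The sketch below is a minimal illustration in Python; the data-generating process is our own assumption, not the book's, and serves only to show the mechanics of the first-stage regression.

```python
# A minimal sketch of the first-stage (instrumenting) regression of
# equation (10.21), using the bivariate OLS formulas. The data-generating
# process here is illustrative, not the book's.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
z = rng.normal(size=n)                    # the instrument
x = 2.0 + 1.5 * z + rng.normal(size=n)    # explanatory variable related to z

# d plays the role of b in equation (4.40), with z as the regressor:
d = np.sum((z - z.mean()) * (x - x.mean())) / np.sum((z - z.mean()) ** 2)
# c plays the role of a in equation (4.35):
c = x.mean() - d * z.mean()

x_hat = c + d * z    # the predicted values of equation (10.22)
f = x - x_hat        # the first-stage residuals

print(round(d, 2), round(c, 2))
```

By construction of OLS, the residuals fi are exactly uncorrelated with zi in the sample, which is the property that the derivation around equation (10.29) exploits.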

The values for the slope and intercept of equation (10.21) are merely incidental to our true purpose. Our goal is to use equation (10.21) in a similar (p.379) manner to the way in which we use equation (4.1). Here, we want to form predicted values, or estimates, of xi:

(10.22)
$$\hat{x}_i = c + dz_i$$

The estimate x̂i is based solely on zi. Therefore, it consists only of the part of xi that is “reflected,” so to speak, in zi.

Somewhat more formally, by equation (10.20), zi is effectively uncorrelated with εi, at least in samples that are large enough. The estimate x̂i is just a linear function of zi. Therefore, x̂i is also effectively uncorrelated with εi if our observations are sufficiently numerous.9

This is at least a little bit miraculous. Equation (10.12) tells us that

$$\mathrm{COV}(x_i, \varepsilon_i) \neq 0.$$

Our analysis of equation (10.22) leads us to conclude that, under the right conditions,

$$\mathrm{COV}(\hat{x}_i, \varepsilon_i) = 0.$$

While xi and εi violate the fourth assumption of chapter 5, x̂i and εi satisfy it!

This naturally suggests that, when equation (10.12) holds, we should estimate the population relationship of equation (5.1) using rather than xi as the explanatory variable. The second step of our two-step procedure is exactly that. We obtain our two-stage least squares, or 2SLS, estimators of α and β through OLS estimation of our second-stage equation,10

(10.23)
$$y_i = a_{2SLS} + b_{2SLS}\hat{x}_i + e_i$$

Following equation (5.23), as derived from equation (4.37), our estimator of the coefficient β is

(10.24)
$$b_{2SLS} = \frac{\sum_{i=1}^{n}\left(\hat{x}_i - \bar{\hat{x}}\right)\left(y_i - \bar{y}\right)}{\sum_{i=1}^{n}\left(\hat{x}_i - \bar{\hat{x}}\right)^2}$$

where $\bar{\hat{x}}$ denotes the average value in the sample of x̂i. Following equation (4.35), our estimator of the constant α is

(10.25)
$$a_{2SLS} = \bar{y} - b_{2SLS}\,\bar{\hat{x}}$$
(p.380)
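Before turning to the results, it may help to see the two steps working together. The simulation below is a sketch in Python: the true parameters α = −20,000, β = 4,000, and σ = 25,000 echo the text's simulations, but the instrument and the variance choices are our own illustrative assumptions, not a reproduction of the book's design.

```python
# A sketch of the full two-step 2SLS procedure under measurement error.
# True parameters echo the chapter; the rest of the design is illustrative.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
alpha, beta = -20_000.0, 4_000.0

z = rng.normal(size=n)                         # the instrument
x_true = z + rng.normal(size=n)                # true explanatory variable
y = alpha + beta * x_true + rng.normal(scale=25_000, size=n)
x = x_true + rng.normal(scale=2.0, size=n)     # observed with measurement error

def ols(w, v):
    """Bivariate OLS slope and intercept of v regressed on w."""
    b = np.sum((w - w.mean()) * (v - v.mean())) / np.sum((w - w.mean()) ** 2)
    return b, v.mean() - b * w.mean()

b_ols, _ = ols(x, y)             # attenuated toward zero by measurement error

d, c = ols(z, x)                 # step 1: the instrumenting equation (10.21)
x_hat = c + d * z                # predicted values, equation (10.22)
b_2sls, a_2sls = ols(x_hat, y)   # step 2: the second-stage equation (10.23)

print(round(b_ols), round(b_2sls), round(a_2sls))
```

With these choices, the OLS slope sits far below β, while the second-stage slope lands close to 4,000, in the same spirit as the results of table 10.3.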

TABLE 10.3 2SLS regressions on six simulated samples with known measurement error

| Simulation | Standard deviation of measurement error, SD(νi) | Variance of measurement error, V(νi) | 2SLS estimate a of α = −20,000 | 2SLS estimate b of β = 4,000 |
| --- | --- | --- | --- | --- |
| 1 | 1.0 | 1.0 | −19,802 | 3,986.8 |
| 2 | 1.5 | 2.25 | −19,822 | 3,993.3 |
| 3 | 2.0 | 4.0 | −19,455 | 3,961.1 |
| 4 | 2.5 | 6.25 | −20,307 | 4,020.8 |
| 5 | 3.0 | 9.0 | −20,335 | 4,034.1 |
| 6 | 3.5 | 12.25 | −18,462 | 3,880.7 |

OLS estimation of equation (10.23) also yields estimators of SD(b2SLS) and SD(a2SLS) that are asymptotically correct. However, the R2 value is basically meaningless. As in exercise 5.7, the variances of the errors and of the predicted values for yi do not sum to the variance of yi.

Does this really work? In the simulated data that we used for the regression of equation (10.17), we can create a variable zi that satisfies equations (10.19) and (10.20). If we then apply equations (10.24) and (10.25), we obtain the following 2SLS equation:

(10.26)
$$\hat{y}_i = -19{,}802 + 3{,}986.8\,x_i$$

While the value for b in equation (10.17) was predictably less than the true value for β, the value for bIV in equation (10.26) is almost exactly equal to it! This seems like a huge improvement.

Table 10.3 tabulates the 2SLS results in equation (10.26) in the first row and the results of 2SLS estimations for the other five simulations of tables 10.1 and 10.2 in the subsequent rows. As in table 10.2, a striking pattern emerges. The difference is that, in the former table, OLS estimates became more and more distorted as the measurement error increased. In table 10.3, the 2SLS estimates are invariably close to the true parameter values, regardless of the amount of measurement error. This is true even in the last two simulations, where measurement error is responsible for more than half of the variation in the explanatory variable. This certainly suggests that, if an appropriate instrument, zi, is available, 2SLS estimation is an improvement over OLS.

# 10.5 Two-Stage Least Squares and Instrumental Variables

The two-stage procedure of the previous section certainly makes intuitive sense, doesn't it? However, that's not the same as having attractive formal (p.381) properties. While, as we've already agreed, we're not going to derive these properties rigorously, we still need to investigate them to the limits of our technical abilities.

For this purpose, let's rewrite the formulas for b2SLS in equation (10.24) and a2SLS in equation (10.25) in more convenient forms. The first step in this direction is to consider the average predicted value of x̂i. By the definition of equation (2.8),

$$\bar{\hat{x}} = \frac{\sum_{i=1}^{n}\hat{x}_i}{n}.$$

Substituting equation (10.22), we find that

$$\bar{\hat{x}} = \frac{\sum_{i=1}^{n}\left(c + dz_i\right)}{n}.$$

Simplifying, using the rules of summations that, by now, are familiar to us, we obtain

(10.27)
$$\bar{\hat{x}} = c + d\bar{z}$$

Equation (10.27) is very useful. Let's put it to work right away. If we take the formula for b2SLS in equation (10.24) and substitute equation (10.22) for and equation (10.27) for , we get

$Display mathematics$

Simplifying, we have

$Display mathematics$

(p.382) Next, we can factor d from the numerator and denominator and cancel:

(10.28)
$Display mathematics$

The final steps concern the denominator of equation (10.28). Following the development in equations (4.50) through (4.52), we can prove that the summation

(10.29)
$$\sum_{i=1}^{n}\left(z_i - \bar{z}\right)f_i = 0,$$

where fi is the residual from the auxiliary regression of equation (10.21).11 We can therefore add it to the denominator of the last ratio in equation (10.28) without changing its value:

$Display mathematics$

If we need reminding, the final equality comes to us as a consequence of a derivation very similar to that in equations (2.25) and (2.26).

Equation (10.22) allows us to substitute c + dzi for x̂i.

Equation (10.21) allows us to substitute xi for c + dzi + fi,

$Display mathematics$

Finally, we can substitute this last result back into the denominator of the last ratio in equation (10.28) to obtain

(10.30)
$$b_{2SLS} = \frac{\sum_{i=1}^{n}\left(z_i - \bar{z}\right)\left(y_i - \bar{y}\right)}{\sum_{i=1}^{n}\left(z_i - \bar{z}\right)\left(x_i - \bar{x}\right)}$$

(p.383) That felt like a lot of work. In fact, it was. What could possibly be the point? Comparing equations (10.28) and (10.30), the only difference is the presence of a single caret in the denominator of the former and the absence of it in the latter.

That is precisely the point. The formula for b2SLS in equation (10.30) does not contain x̂i. This demonstrates that it is not necessary to actually predict the values of xi, using equation (10.22), in order to calculate the value of b2SLS. In other words, it is possible to calculate b2SLS without actually running the auxiliary regression of equation (10.21). Instead, we can simply combine the original variables yi, xi, and zi as given in equation (10.30) to obtain our estimator of β.12 In the next section, we'll see that equation (10.30) is also the most convenient basis from which to explore the properties of b2SLS.

Similarly, equation (10.25) is not the most convenient way to calculate the 2SLS estimator of α. Following the development of the first normal equation in equation (4.17) and its implication in equation (4.34), we know that

$$\bar{x} = c + d\bar{z}.$$

Comparing this to equation (10.27), we conclude that $\bar{\hat{x}} = \bar{x}$. In other words, the average of the predicted xi's is equal to the average of the xi's themselves.

We can therefore rewrite equation (10.25) as

(10.31)
$$a_{2SLS} = \bar{y} - b_{2SLS}\,\bar{x}$$

As in equation (10.30), this expression for a2SLS does not contain x̂i. For that matter, it doesn't contain zi either. Once we have b2SLS, we can calculate a2SLS using only the averages of yi and xi from the original data.

It is very useful to examine equations (10.21) and (10.23) in order to motivate and interpret a2SLS and b2SLS. In addition, in section 10.7, we will see that equation (10.21) is important for assessing the quality of our instrument zi. However, equations (10.30) and (10.31) demonstrate that it is possible to calculate the values for these estimators directly from the original sample. It is not actually necessary to perform the auxiliary and substantive regressions of equations (10.21) and (10.23) for this purpose.

From the perspective of equations (10.30) and (10.31), the distinguishing characteristic is not the two-step process that motivated our original discussion, but the presence of the instrumental variable zi in equation (10.30). For this reason, the estimation procedure that we have explored in this and the previous section is often referred to as instrumental variables (IV) estimation. This terminology is more general, and so we will use it in preference to the 2SLS terminology introduced earlier. However, we need to be very clear that (p.384) both terms refer to the same estimators:

(10.32)
$$b_{IV} = b_{2SLS} = \frac{\sum_{i=1}^{n}\left(z_i - \bar{z}\right)\left(y_i - \bar{y}\right)}{\sum_{i=1}^{n}\left(z_i - \bar{z}\right)\left(x_i - \bar{x}\right)}$$

and

$$a_{IV} = a_{2SLS} = \bar{y} - b_{IV}\,\bar{x}.$$
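The claim that the two routes yield the same numbers is easy to verify. The sketch below, with an illustrative design of our own, computes the estimator both ways; the two agree to machine precision.

```python
# Check numerically that the two-step estimator equals the direct IV ratio.
# The data-generating process is illustrative, not the book's.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
z = rng.normal(size=n)
x = 0.5 + 0.8 * z + rng.normal(size=n)
y = 1.0 + 3.0 * x + rng.normal(size=n)

# Two-step route: first stage, predicted values, second stage.
d = np.sum((z - z.mean()) * (x - x.mean())) / np.sum((z - z.mean()) ** 2)
x_hat = (x.mean() - d * z.mean()) + d * z
b_2sls = (np.sum((x_hat - x_hat.mean()) * (y - y.mean()))
          / np.sum((x_hat - x_hat.mean()) ** 2))

# Direct route: the ratio in equation (10.32), no predicted values needed.
b_iv = (np.sum((z - z.mean()) * (y - y.mean()))
        / np.sum((z - z.mean()) * (x - x.mean())))
a_iv = y.mean() - b_iv * x.mean()

print(np.isclose(b_2sls, b_iv))
```

The agreement is exact apart from floating-point rounding, because the algebra of equations (10.27) through (10.30) holds term by term in any sample.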

# 10.6 The Properties of Instrumental Variables Estimators

Section 10.4 did a pretty good job of motivating the idea that 2SLS, or IV, estimation would “purge” the explanatory variable xi of its correlation with the disturbance εi. That's not the same as a proof. While we're not going to work through a true proof, we ought to examine the properties of bIV a little more formally.

We can begin by examining the relationship among bIV, yi, xi, zi, and εi. Substituting the population relationship of equation (5.1) into equation (10.30), we obtain

(10.33)
$Display mathematics$

We can distribute the product in the numerator of the last ratio in exactly the same way as we did in equations (5.24) through (5.28).13 This yields

(10.34)
$$b_{IV} = \beta + \frac{\sum_{i=1}^{n}\left(z_i - \bar{z}\right)\varepsilon_i}{\sum_{i=1}^{n}\left(z_i - \bar{z}\right)\left(x_i - \bar{x}\right)}$$

The expected value of this is

(10.35)
$$E\left(b_{IV}\right) = \beta + E\left(\frac{\sum_{i=1}^{n}\left(z_i - \bar{z}\right)\varepsilon_i}{\sum_{i=1}^{n}\left(z_i - \bar{z}\right)\left(x_i - \bar{x}\right)}\right)$$

(p.385) Here, we find ourselves in a situation that is very similar to the one we experienced with equations (10.13) and (10.14). If we were back in chapter 5, we would take the denominator out of the expected value and proceed. However, the theme of this chapter, starting in section 10.2, is that xi is now a random variable. Implicitly, the assumption in equation (10.19) that xi and zi have an interesting covariance makes zi a random variable as well.

We're going to proceed anyway, because the intuition that we are going to develop is quite valuable. In order to do so, however, we have to recognize two things. First, what we're doing is legitimate only if the sample is very large. Second, we're using the notation that we've relied on since chapter 5 because we're familiar with it. If we really wanted to do this correctly, we would have to introduce some new concepts and adopt some new notation, as suggested in note 7.

With these cautions in mind, let's return to the expected value of bIV. Imagine that n is large and that we can treat the denominator as if it could be removed from the expectation in equation (10.35), as indicated by equation (5.34). We could imagine getting an equation that looks very much like equation (10.14):

(10.36)
$Display mathematics$

In equation (10.14), the term that hung us up was the expectation involving the xi's and the εi's. In equations (10.15) and (10.16), we convinced ourselves that it was approximately equal to COV(xi, εi). By equation (10.12), this isn't zero.

In equation (10.36), that term is replaced by its analogue in the zi's. The argument of equations (10.15) and (10.16) implies that this is approximately equal to COV(zi, εi). Equation (10.20) says that this is zero, at least in infinite samples. Asymptotically, then, the second term to the right of the equality in equation (10.36) is zero. This implies that E(bIV) converges to β as the sample size increases.14

What about V(bIV)? Equations (10.30) and (10.32) give us

(10.37)
$$V\left(b_{IV}\right) = V\left(\frac{\sum_{i=1}^{n}\left(z_i - \bar{z}\right)\left(y_i - \bar{y}\right)}{\sum_{i=1}^{n}\left(z_i - \bar{z}\right)\left(x_i - \bar{x}\right)}\right)$$

For the last time, we ignore the fact that the denominator of equation (10.37) is a random variable because, if the sample were infinite in size and we used the right notation, it (p.386) wouldn't matter. If we're willing to do this, then equation (10.37) becomes, approximately,

(10.38)
$$V\left(b_{IV}\right) = \frac{V\left(\sum_{i=1}^{n}\left(z_i - \bar{z}\right)y_i\right)}{\left(\sum_{i=1}^{n}\left(z_i - \bar{z}\right)\left(x_i - \bar{x}\right)\right)^2}$$

Having gotten this far, we can apply equation (5.45) to the variance of the summation in equation (10.38) because we're still assuming equation (5.22), that the yi's have zero covariances.

$$V\left(b_{IV}\right) = \frac{\sum_{i=1}^{n}V\left(\left(z_i - \bar{z}\right)y_i\right)}{\left(\sum_{i=1}^{n}\left(z_i - \bar{z}\right)\left(x_i - \bar{x}\right)\right)^2}.$$

As we might easily guess at this point, the term V((zi − z̄)yi) presents some additional challenges. Remembering that our sample is large and that we're not going to be too careful about our notation, we can continue as if (zi − z̄) were a constant. This gives us, according to equation (5.43),

$$V\left(b_{IV}\right) = \frac{\sum_{i=1}^{n}\left(z_i - \bar{z}\right)^2 V\left(y_i\right)}{\left(\sum_{i=1}^{n}\left(z_i - \bar{z}\right)\left(x_i - \bar{x}\right)\right)^2}.$$

Replacing V(yi) as in equation (5.20), we get

(10.39)
$$V\left(b_{IV}\right) = \frac{\sigma^2\sum_{i=1}^{n}\left(z_i - \bar{z}\right)^2}{\left(\sum_{i=1}^{n}\left(z_i - \bar{z}\right)\left(x_i - \bar{x}\right)\right)^2}$$

At last! We have V(bIV) in terms of data from the sample and the population variance of εi. In order to construct an estimate of V(bIV) from the sample (p.387) alone, we have to estimate σ2. In the IV context, that estimate is

(10.40)
$$s^2 = \frac{\sum_{i=1}^{n}e_i^2}{n - 2},$$

where we calculate the errors using xi, not x̂i.15 Therefore, the sample estimate of V(bIV) is

(10.41)
$$\widehat{V}\left(b_{IV}\right) = \frac{s^2\sum_{i=1}^{n}\left(z_i - \bar{z}\right)^2}{\left(\sum_{i=1}^{n}\left(z_i - \bar{z}\right)\left(x_i - \bar{x}\right)\right)^2}$$

Naturally, SD(bIV) is just the positive square root:

(10.42)
$$SD\left(b_{IV}\right) = +\sqrt{\frac{s^2\sum_{i=1}^{n}\left(z_i - \bar{z}\right)^2}{\left(\sum_{i=1}^{n}\left(z_i - \bar{z}\right)\left(x_i - \bar{x}\right)\right)^2}}$$
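Equations (10.40) through (10.42) translate directly into code. The sketch below, again with an illustrative design of our own, builds the errors with the observed xi rather than the predicted values, as the text requires, and plugs s² into the variance formula.

```python
# A sketch of equations (10.40)-(10.42): estimate sigma^2 from residuals
# built with the observed x_i, then form the estimated SD(b_IV).
# The data-generating process is illustrative, not the book's.
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = 2.0 + 5.0 * x + rng.normal(scale=3.0, size=n)

szz = np.sum((z - z.mean()) ** 2)
szx = np.sum((z - z.mean()) * (x - x.mean()))

b_iv = np.sum((z - z.mean()) * (y - y.mean())) / szx
a_iv = y.mean() - b_iv * x.mean()

e = y - a_iv - b_iv * x          # errors computed with x_i, not x_hat
s2 = np.sum(e ** 2) / (n - 2)    # equation (10.40)
var_b = s2 * szz / szx ** 2      # equation (10.41)
sd_b = np.sqrt(var_b)            # equation (10.42)

print(round(b_iv, 3), round(sd_b, 4))
```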

Table 10.4 reports the values for SD(bIV) in the six simulated samples that we've been following since table 10.1. The third column reproduces the estimates of bIV from table 10.3. Their estimated standard deviations, according to equation (10.42), are given in the fourth column.

The values for SD(bIV) in table 10.4 suggest that the IV estimates are not only close to the true value β = 4,000 but are precisely estimated as well.16 At the same time, the precision seems to vary with the amount of measurement

TABLE 10.4 IV estimators with n = 100,000 and different degrees of measurement error

| Simulation | Variance of measurement error, V(νi) | IV estimate of β, bIV | IV estimate of SD(bIV) |
| --- | --- | --- | --- |
| 1 | 1.0 | 3,986.8 | 28.982 |
| 2 | 2.25 | 3,993.3 | 31.277 |
| 3 | 4.0 | 3,961.1 | 34.473 |
| 4 | 6.25 | 4,020.8 | 38.377 |
| 5 | 9.0 | 4,034.1 | 43.435 |
| 6 | 12.25 | 3,880.7 | 47.974 |

(p.388) error. Values for SD(bIV) increase monotonically with the variance of the measurement error, V(νi). We'll explore this observation further in the next section.

The remaining question for this section is consistency. We know that E(bIV) converges to β as the sample size approaches infinity. Does V(bIV) shrink to zero? We need to do some more rearranging in order to address this question.

At this point, we can see that the summation in the numerator of equation (10.39) is the numerator in the expression for the sample variance of zi. We should also be able to see that the denominator looks a lot like the numerator in the expression for the sample covariance between zi and xi, even if we haven't already worked this through in exercise 10.6. We'll prove both results and extend them in exercise 10.9. Anticipating that accomplishment, we can rewrite equation (10.39) as

(10.43)
$$V\left(b_{IV}\right) = \frac{\sigma^2}{n\,V\left(x_i\right)\left(\mathrm{CORR}\left(z_i, x_i\right)\right)^2}$$

Consistency would be very hard to see in equation (10.39) but is obvious, almost, in equation (10.43). As the sample size increases, the sample variance for xi converges to its population variance. Similarly, the sample correlation between zi and xi converges to the population correlation. So the two latter terms in the denominator of equation (10.43) are, asymptotically, positive constants. They do not give us consistency.

However, the first term in the denominator of equation (10.43) is simply n. Tautologically, it increases as the sample expands. It is this term that guarantees that the variance of bIV converges to zero when the sample is infinite. It is therefore this term that guarantees that bIV is a consistent estimator of β. In section 10.4, we said that we could get consistent, though not unbiased, estimates of β from OLS if we first purged xi of its correlation with εi. Now we've proven it, or at least come as close to a proof as our current statistical knowledge will permit.

Table 10.5 demonstrates the consistency of bIV. It presents IV estimates of β from simulations constructed along the same lines as those for table 10.1. The difference is that, here, what varies across simulations is the sample size, as given in the second column. The extent of the measurement error is fixed at the value for the third simulation in the previous tables of this chapter, SD(νi) = 2 and V(νi) = 4.

Our description of these simulations in section 10.3 gives σ = 25,000. In exercise 10.10, we establish that the population variance of xi is V(xi) = 12.5 and that the population correlation between xi and zi is CORR(xi, zi) = .6800. Finally, the second column of table 10.5 gives us n. As we can see, the fourth (p.389)

TABLE 10.5 IV estimates of β when V(νi) = 4 and sample size increases

| Simulation | Sample size | IV sample estimate of β, bIV | Asymptotic IV value of SD(bIV) | IV sample estimate of SD(bIV) |
| --- | --- | --- | --- | --- |
| 1 | 100 | 4,132.1 | 1,039.9 | 1,076.5 |
| 2 | 1,000 | 3,725.9 | 328.83 | 349.46 |
| 3 | 10,000 | 3,933.1 | 103.99 | 112.03 |
| 4 | 100,000 | 3,961.1 | 32.883 | 34.473 |
| 5 | 1,000,000 | 3,985.4 | 10.399 | 10.902 |

simulation in this table, where n = 100,000, is the same as the third simulation in each of the previous tables of this chapter.

This gives us everything we need to know to calculate the true, or population, value for V(bIV) from equation (10.43). The fourth column of table 10.5 reports the positive square root of this variance: the true, or population, value for SD(bIV). The fifth column of table 10.5 reports the sample value for SD(bIV) from equation (10.42).

The implications of consistency are immediately apparent in both the population and the sample values for SD(bIV). The values for bIV in the third column of table 10.5 differ less and less from the true value of β = 4,000 as the sample size increases. Moreover, the population and sample values for SD(bIV) in the fourth and fifth columns get markedly smaller with each tenfold increase in sample size.17 In other words, bIV gets noticeably better as n increases.
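The pattern of table 10.5 is easy to reproduce in miniature. The simulation below is our own stand-in for the chapter's design, not a reproduction of it; the point is only that the estimated SD(bIV) falls by roughly a factor of the square root of 10 with each tenfold increase in n.

```python
# A stand-in for the design behind table 10.5 (illustrative, not the
# book's): the estimated SD(b_IV) shrinks as the sample size grows.
import numpy as np

rng = np.random.default_rng(4)

def iv_fit(n):
    z = rng.normal(size=n)
    x = z + rng.normal(size=n)     # regressor related to the instrument
    y = -20_000 + 4_000 * x + rng.normal(scale=25_000, size=n)
    szz = np.sum((z - z.mean()) ** 2)
    szx = np.sum((z - z.mean()) * (x - x.mean()))
    b = np.sum((z - z.mean()) * (y - y.mean())) / szx
    e = y - (y.mean() - b * x.mean()) - b * x
    s2 = np.sum(e ** 2) / (n - 2)
    return b, np.sqrt(s2 * szz / szx ** 2)

results = []
for n in (100, 10_000, 1_000_000):
    b, sd = iv_fit(n)
    results.append((n, b, sd))
    print(n, round(b, 1), round(sd, 1))
```

Each hundredfold increase in n cuts the estimated standard deviation by about a factor of 10, exactly the asymptotic behavior that equation (10.43) predicts.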

# 10.7 What's a Good Instrument?

Equation (10.43) is worth a further look. It will help us, in this section, to understand what a good instrument must be like. In the next section, it will help us understand how we can test for the presence of endogeneity. All we need to do is to extend its derivation a little further.

First, recognize that

$$\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 \approx n\,V\left(x_i\right),$$

where the approximation is especially good in the large samples that we need to make IV estimation work. In combination with equation (5.50), this (p.390) implies that

(10.44)
$$V\left(b\right) \approx \frac{\sigma^2}{n\,V\left(x_i\right)}.$$

We can replace this approximation in equation (10.43) and rearrange to obtain, finally,

(10.45)
$$V\left(b_{IV}\right) \approx \frac{V\left(b\right)}{\left(\mathrm{CORR}\left(z_i, x_i\right)\right)^2}.$$

Equation (10.45) demonstrates that V(bIV) is at least as large as V(b). In other words, OLS is more efficient than IV estimation. Of course, in the presence of endogeneity, the attraction of this is entirely illusory. We showed in section 10.3 that, in this case, OLS is biased and inconsistent. The fact that it is precisely biased and inconsistent offers no consolation, to say nothing of satisfaction.

The interest in equation (10.45) is, rather, in what it implies about the qualities that we want in our instrument zi. For example, it indicates that bIV is as efficient as b if CORR(zi, xi) is one. The problem here is that, if this is true, then the assumption of equation (10.20) must be false. The instrument and the disturbance can't be uncorrelated if the instrument is perfectly correlated with the explanatory variable. In this case, zi is a crummy instrument, and we have to start over.

The same is probably the case if CORR(zi, xi) is less than one but large in absolute value. In this circumstance, bIV is only somewhat less efficient than b, but zi looks enough like xi that it's hard to believe it's not correlated with εi, just like xi. Once again, zi's virtues as an instrument are suspect.

The assumption of equation (10.20) gets a lot easier to accept when CORR(zi, xi) is noticeably less than one. In this case, zi differs enough from xi that it's possible to believe that COV(zi, εi) = 0, even though COV(xi, εi) ≠ 0. But it's precisely in this case that the inefficiency of bIV is substantial. If, for example, CORR(zi, xi) = .5, the variance of bIV is four times that of b!18
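The arithmetic behind that last claim is worth making explicit. This tiny check reproduces the text's example from the relationship in equation (10.45):

```python
# Equation (10.45) relates the two variances through the squared
# correlation between the instrument and the explanatory variable.
# With CORR(z, x) = .5, the IV variance is four times the OLS variance.
corr_zx = 0.5
variance_ratio = 1.0 / corr_zx ** 2   # V(b_IV) / V(b), from equation (10.45)
print(variance_ratio)                 # -> 4.0
```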

Table 10.6 illustrates these points. It presents six simulations, all with sample sizes of 100,000. The degree of measurement error in each is the same as in table 10.5, SD(νi) = 2 and V(νi) = 4.19 Therefore, the bias in the OLS estimates, as given in equation (10.18), is identical in all six. Consequently, these estimates, in the second column of table 10.6, are very similar to each other, to the asymptotic value in table 10.2, and to the OLS estimates in tables 10.1 and 10.2 with the same value for V(νi). (p.391)

TABLE 10.6 Population standard deviations for b and asymptotic standard deviations for bIV

| Simulation | OLS estimate, b | IV estimate, bIV | CORR(zi, xi) | Asymptotic SD(bIV) | Estimated SD(bIV) |
| --- | --- | --- | --- | --- | --- |
| 1 | 2,744.1 | 4,008.4 | .78001 | 28.667 | 30.026 |
| 2 | 2,736.9 | 4,023.6 | .73326 | 30.495 | 32.192 |
| 3 | 2,694.7 | 3,961.1 | .68000 | 32.883 | 34.473 |
| 4 | 2,699.0 | 4,009.5 | .62599 | 35.720 | 37.522 |
| 5 | 2,700.8 | 3,959.7 | .57470 | 38.908 | 40.810 |
| 6 | 2,718.0 | 4,030.6 | .52778 | 42.367 | 44.592 |

The third column of table 10.6 presents the estimates of bIV for each simulation. In each case, these estimates are very close to the true value of β = 4,000. In other words, the instrument in each simulation appears to have been successful.

Nevertheless, these instruments are not the same. The fourth column reports CORR(zi, xi), the correlation between the instrument zi and the explanatory variable xi. This is what varies across simulations. It is progressively lower for each successive simulation.

The value for CORR(zi, xi) played no role in equations (10.35) and (10.36), our heuristic analysis of the expected value of bIV. It was also unimportant in equation (10.43), our analysis of the consistency of bIV. That's why the variations in its value exhibited in the fourth column of table 10.6 don't affect the estimated values of bIV in the third.

However, the value for CORR(zi, xi) is central in equation (10.45), which establishes the value for V(bIV). As this correlation declines, V(bIV) must increase. Obviously, the same must be true for SD(bIV). The fifth column of table 10.6 demonstrates this increase in the asymptotic values for SD(bIV) for each simulation. The sixth column demonstrates this increase in the values for SD(bIV) that are actually estimated by each simulation.

Table 10.7 compares the population standard deviations for b and the asymptotic standard deviations for bIV in the simulations of tables 10.1 through 10.4.20 In these simulations, the standard deviations for b decline as the measurement error increases. This is simply because the variance of the true explanatory variable is the same for all six simulations. As more measurement error is added to it, the variance of the observed explanatory variable, xi, goes up.21 According to equation (10.44), this means that SD(b) has to go down.

In chapter 7, we made the claim that, all else equal, a bigger variance for the explanatory variable was a good thing. In general, that's true. It's not true here, because, in the construction of these simulations, bigger variances just mean that the explanatory variable is more heavily contaminated with measurement error. We've already seen the damage that this can do in table 10.1. (p.392)

TABLE 10.7 Population standard deviations for b and asymptotic standard deviations for bIV

| Simulation | Variance of measurement error, V(νi) | Population SD(b) | CORR(zi, xi) | Asymptotic SD(bIV) |
| --- | --- | --- | --- | --- |
| 1 | 1 | 25.649 | .89473 | 28.667 |
| 2 | 1.5 | 24.112 | .79070 | 30.495 |
| 3 | 2 | 22.361 | .68000 | 32.883 |
| 4 | 2.5 | 20.585 | .57627 | 35.720 |
| 5 | 3 | 18.898 | .48571 | 38.908 |
| 6 | 3.5 | 17.355 | .40964 | 42.367 |

TABLE 10.8 Estimated standard deviations for b and bIV in simulated data

| Simulation | Variance of measurement error, V(νi) | OLS estimate of SD(b) | IV estimate of SD(bIV) |
| --- | --- | --- | --- |
| 1 | 1 | 25.909 | 28.982 |
| 2 | 1.5 | 24.589 | 31.277 |
| 3 | 2 | 23.110 | 34.473 |
| 4 | 2.5 | 21.562 | 38.377 |
| 5 | 3 | 19.993 | 43.435 |
| 6 | 3.5 | 18.432 | 47.974 |

The fourth column of table 10.7 reports the population correlation between our error-ridden explanatory variable, xi, and our instrument, zi. In these simulations, as in those of table 10.6, this correlation also declines as the measurement error increases. According to equation (10.45), the reduction in CORR(zi, xi) should increase SD(bIV).

The final column of table 10.7 calculates the net effects of the reductions in SD(b) and CORR(zi, xi) as the measurement error increases. It combines them as indicated in equation (10.45) to yield the asymptotic standard deviation of bIV. We can see that, as the measurement error increases, the asymptotic value of SD(bIV) increases as well. The reduction in CORR(zi, xi) has a more powerful effect than does the reduction in SD(b).22

Table 10.8 presents estimates of the standard deviations of the slope estimates from our simulated samples. These estimates generally reproduce the population and asymptotic results of table 10.7. In other words, actual applications of the OLS estimation formula in equation (7.5) and the IV estimation formula in equation (10.42) yield results that are completely in line with those that we expect, based on the population relationship in equation (5.50) and the asymptotic relationship in equation (10.39).23

Tables 10.6, 10.7, and 10.8 reinforce the fundamental intuition embedded in equation (10.45): The variance of bIV gets smaller as the correlation (p.393) between zi and xi gets bigger. This creates a temptation to choose an instrument that has a large absolute value for CORR(zi, xi). The problem with this is that, as zi begins to look more and more like xi, it becomes harder and harder to maintain that zi is unrelated to εi.

However, the more likely it is that zi is uncorrelated with εi, the less likely it is that zi is correlated with xi. If CORR(zi, xi) is really small, there is good news and bad news. The good news is that equation (10.20) is easy to believe.

The bad news is that zi is a weak instrument. This has two implications. The first is that, as we have shown in equation (10.45), V(bIV) will be huge if we proceed as if zi were an acceptable instrument. The second implication is that zi differs so much from xi that what's left in xi after it's reflected in zi may not have much to do with yi. Worse, zi may give us estimates that are not even consistent!

To see this, begin with equation (10.34). Exercise 10.16 suggests that this equation can be rewritten as

$$b_{IV} \approx \beta + \frac{SD\left(\varepsilon_i\right)}{SD\left(x_i\right)}\,\frac{\mathrm{CORR}\left(z_i, \varepsilon_i\right)}{\mathrm{CORR}\left(z_i, x_i\right)},$$

at least as an approximation in very large samples. CORR(zi, εi), the numerator in the ratio

(10.46)
$$\frac{\mathrm{CORR}\left(z_i, \varepsilon_i\right)}{\mathrm{CORR}\left(z_i, x_i\right)},$$

is zero, according to the assumption in equation (10.20). Suppose that assumption isn't quite correct. Suppose the numerator is very small, but not exactly zero.

If the denominator, CORR(zi, xi), is large, this shouldn't matter much. The ratio in equation (10.46) will be small. The asymptotic difference between bIV and β will ordinarily be small, too.

However, all we've actually assumed about CORR(zi, xi) is, in equation (10.19), that it isn't zero. If it is very small as well, then the ratio in equation (10.46) could actually be quite large! In this case, the difference between bIV and β could also be large, even asymptotically.
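A quick numeric reading of equation (10.46) makes the danger vivid. The correlations below are illustrative choices of our own, not the book's; the point is how the ratio explodes as CORR(zi, xi) shrinks.

```python
# The ratio in equation (10.46) magnifies any small violation of
# equation (10.20) when the instrument is weak. Numbers illustrative.
corr_z_eps = 0.01                      # a tiny violation of equation (10.20)
ratios = [round(corr_z_eps / c, 3) for c in (0.68, 0.10, 0.02)]
print(ratios)                          # -> [0.015, 0.1, 0.5]
```

With a strong instrument the small violation is negligible; with a very weak one, the same violation translates into a large asymptotic discrepancy between bIV and β.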

Table 10.9 demonstrates the risks. It presents additional simulations that share the same design as those in table 10.6. Certainly, the OLS estimates here are almost identical to those in table 10.6 and just as far away from the true value of β.

The difference is that the correlations between zi and xi in this table are much lower than those in table 10.6. For the first three simulations, it doesn't seem to matter much with respect to bIV. Even with correlations as low as (p.394)

TABLE 10.9 Sample correlations and IV estimates with weak instruments

| Simulation | CORR(zi, xi) | OLS estimate, b | IV estimate, bIV | SD(bIV) | F-statistic |
| --- | --- | --- | --- | --- | --- |
| 7 | .11365 | 2,725.5 | 4,132.5 | 206.75 | 399.49 |
| 8 | .04798 | 2,711.8 | 3,951.3 | 491.11 | 64.73 |
| 9 | .02141 | 2,741.6 | 4,180.5 | 1,101.1 | 14.42 |
| 10 | .01359 | 2,722.0 | 3,769.9 | 1,721.1 | 4.80 |
| 11 | .00880 | 2,709.0 | 3,078.4 | 2,627.1 | 1.37 |
| 12 | .00638 | 2,709.7 | 1,803.6 | 3,648.7 | .24 |

.02, the IV estimate of β is pretty close to the true value. Of course, the standard deviations, as given by equation (10.45), are much bigger than those in table 10.6.

However, with correlations that are smaller, bIV seems to lose its accuracy. In fact, in simulation 12, where the correlation is barely .006, it's worse than b itself! The standard deviations in these cases are huge. We might wonder if all we're getting here is, perhaps predictably, imprecision.

Unfortunately, it's worse than that. With correlations between zi and xi that are this low, bIV is not consistent. In other words, as sample sizes increase, it converges to some value that is definitely not β!

How can we know if zi is too weak for consistency? The context of this chapter, where the population relationship is equation (5.1), the sole explanatory variable is endogenous, and there is only one instrument, is addressed by the Staiger-Stock rule of thumb. It stipulates that the F-statistic for the first-stage regression should exceed 10 in order for bIV to be useful.

This rule makes judgments that appear quite reasonable. The F-statistics for the first-stage regressions of equation (10.21) in table 10.6 all exceed 300! So there's no reason to doubt the results there. In table 10.9, the last column indicates that the IV estimates in simulations 7, 8, and 9 are acceptable according to the Staiger-Stock rule. The IV estimates in simulations 10, 11, and 12 are not.24
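With a single instrument, the first-stage F-statistic is just the squared t-statistic on zi, so the Staiger-Stock screen is easy to compute by hand. The sketch below uses an illustrative, moderately strong instrument of our own devising.

```python
# A sketch of the Staiger-Stock screen: run the first-stage regression of
# equation (10.21) and compute its F-statistic. With one instrument, F is
# the squared t-statistic on z. Instrument strength here is illustrative.
import numpy as np

rng = np.random.default_rng(5)
n = 1_000
z = rng.normal(size=n)
x = 0.2 * z + rng.normal(size=n)    # a moderately strong instrument

d = np.sum((z - z.mean()) * (x - x.mean())) / np.sum((z - z.mean()) ** 2)
c = x.mean() - d * z.mean()
resid = x - c - d * z
s2 = np.sum(resid ** 2) / (n - 2)
se_d = np.sqrt(s2 / np.sum((z - z.mean()) ** 2))

F = (d / se_d) ** 2                 # first-stage F with a single instrument
print(round(F, 1), "usable" if F > 10 else "too weak")
```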

The best instruments, then, are moderately correlated with xi: enough so as to support the assumption of equation (10.19) without invalidating that of equation (10.20). As we can imagine, the identification of a suitable zi and the justification of these properties is usually the most difficult part of the IV solution to endogeneity.

At the same time, if suitable zi's can be identified, there's no reason to stop at one. We've done so here because, at the moment, we can only analyze the auxiliary regression in equation (10.21) in depth if it replicates the bivariate structure that we first introduced in chapter 4. More advanced texts and courses will demonstrate that the IV strategy becomes more and more effective as the number of appropriate instruments increases.

# (p.395) 10.8 How Do We Know If We Have Endogeneity?

This chapter and the two that precede it tell an interesting, implicit story about the ways in which our assumptions of chapter 5 can be violated. The possible violations become increasingly worrisome as we progress through these chapters. One way in which this is symbolized is that the diagnostic tests for the presence of each successive violation get successively more difficult to implement.

In sections 8.1 through 8.3, the most likely violations were so innocuous that we didn't bother to talk about how we might test to see whether they were present. In the rest of chapter 8, we were able to use the White test, based only on the original OLS estimates of equation (4.4), to check whether heteroscedasticity of any sort was distorting those results. In chapter 9, with the example of first-order autocorrelation and the Durbin-Watson statistic, we found that we could test for at least some specific forms of autocorrelation, again using only the original OLS results.

The risks associated with endogeneity are more serious than those associated with the issues of chapters 8 and 9. The placement of this section indicates as much: with endogeneity, there is no way to identify its presence in the original OLS estimates, which is why this section doesn't appear earlier in the chapter. On the contrary, we can't test for endogeneity until we've corrected for it. In other words, now.

This raises an obvious question. If we've already gone to the trouble to correct for endogeneity, isn't it a little late to decide whether we need to? This question is especially compelling because the IV estimates of α and β are consistent, even if we were wrong about endogeneity in the first place. In other words, even if equation (10.12) is wrong and the assumptions of chapter 5 are correct, bIV is consistent for β.

How can we see this? Equation (10.12) is about the properties of the xi's. The discussion of consistency for bIV in the previous section relies only on the properties of the zi's. It required only equations (10.19) and (10.20). As long as they're correct, bIV is consistent for β, whether or not equation (10.12) is true.
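To see this in one line, substitute the population relationship yi = α + βxi + εi into the sample-covariance form of bIV from exercise 10.6 (a sketch in our own notation):

$$b_{IV} \;=\; \frac{\operatorname{COV}(z_i, y_i)}{\operatorname{COV}(z_i, x_i)} \;=\; \frac{\beta\,\operatorname{COV}(z_i, x_i) + \operatorname{COV}(z_i, \varepsilon_i)}{\operatorname{COV}(z_i, x_i)} \;=\; \beta + \frac{\operatorname{COV}(z_i, \varepsilon_i)}{\operatorname{COV}(z_i, x_i)}$$

Equation (10.20) drives the sample COV(zi, εi) to zero as the sample grows, while equation (10.19) keeps COV(zi, xi) away from zero, so the second term vanishes. Nothing in the argument mentions COV(xi, εi).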

It's equation (10.45) that explains why we still want to test whether endogeneity is present, even though we've already corrected for it. If equation (10.12) is wrong, bIV remains usable because it's still consistent for β. In this case, however, b is also unbiased. Moreover, as we showed in the last section, b is more efficient, and maybe a lot more efficient, than bIV. Therefore, if endogeneity is absent, we will almost surely prefer our original OLS estimator b to the IV estimator bIV.

The relevant test is called the Hausman test. Our discussion of it here will necessarily be somewhat limited because we are getting a little ahead of (p.396) ourselves, as we did briefly in chapter 8. The Hausman test consists of an auxiliary OLS regression with two explanatory variables. We've seen this twice before, in the auxiliary OLS regression of equation (8.14) that comprises the White test for heteroscedasticity, and the WLS regression of equation (8.19). As we said then, we won't provide a comprehensive analysis of this structure until chapter 11. Therefore, as with equations (8.14) and (8.19), we'll have to defer some of our discussion of the Hausman test until then.25

Here, we'll concentrate on motivating the test and establishing some intuition. The auxiliary regression for the Hausman test augments our original regression of equation (4.4) with x̂i as an additional explanatory variable:

(10.47)
yi = a + b1 xi + b2 x̂i + ei

Recalling equations (10.21) and (10.22), x̂i is just a piece of xi. If the assumptions of chapter 5 are valid, whatever x̂i has to say is already being said by xi. With both of them in the regression of equation (10.47), x̂i should be redundant. In this case, as we'll see in the next chapter, b2 should not differ significantly from zero.

However, exercise 10.4 proves that COV(x̂i, εi) = 0 if COV(zi, εi) = 0. If, at the same time, equation (10.12) is true and COV(xi, εi) ≠ 0, then x̂i has something that xi does not. In this case, x̂i is free of correlation with εi, while xi is contaminated with it.

We've already worked through the consequences of this. In section 10.3, we showed that the OLS estimate of the effect of xi on yi, b1 in the notation of equation (10.47), is distorted. In section 10.6, we showed that the OLS estimate of the effect of x̂i on yi, b2, is not. Accordingly, if equation (10.12) is true, x̂i is not redundant with xi, and b2 should differ significantly from zero.

Consequently, the Hausman test for endogeneity is actually very simple to execute and interpret. It consists of testing the statistical significance of b2 in the auxiliary regression for yi containing both x̂i and xi. We simply run the regression of equation (10.47) as we will describe in the next chapter. If b2 does not differ significantly from zero, we accept the null hypothesis that endogeneity is absent and rely on OLS estimation of the population relationship in equation (5.1). If b2 differs significantly from zero, we reject that null hypothesis and instead rely on the 2SLS, or IV, procedure of sections 10.4 through 10.6 to estimate the population relationship.
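The mechanics are easy to sketch in code. The simulation below is ours, not the book's: it builds endogeneity into a hypothetical design through a common shock u, forms x̂i from the first-stage regression on zi, and then checks the t-statistic on x̂i in the augmented regression.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Hypothetical endogenous design (illustrative numbers): u ties x to eps.
z = rng.normal(size=n)
u = rng.normal(size=n)
eps = u + rng.normal(size=n)
x = 0.5 * z + u + rng.normal(size=n)
y = 1.0 + 4.0 * x + eps

def ols(y, cols):
    """OLS of y on a constant plus the given columns; returns (coefs, t-stats)."""
    Z = np.column_stack([np.ones(len(y))] + list(cols))
    b = np.linalg.solve(Z.T @ Z, Z.T @ y)
    e = y - Z @ b
    s2 = e @ e / (len(y) - Z.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(Z.T @ Z)))
    return b, b / se

# First stage: fitted values of x from the instrument z.
bz, _ = ols(x, [z])
x_hat = bz[0] + bz[1] * z

# Hausman auxiliary regression: y on both x and x_hat.
_, t = ols(y, [x, x_hat])
t_b2 = t[2]                      # t-statistic on x_hat
print(abs(t_b2) > 1.96)          # large -> reject "no endogeneity"
```

If we instead generate x without the common shock, so that eps and x are independent, the same t-statistic typically collapses toward zero and the test correctly retains OLS.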

Table 10.10 presents the results of the Hausman test on our simulated data of tables 10.1 through 10.4, 10.7, and 10.8. In all six of these samples, the regression of equation (10.47) yields a large t-statistic for b2. Consequently, each test rejects the null hypothesis that endogeneity is absent and recommends that we estimate the population relationship of equation (5.1) (p.397)

TABLE 10.10 Hausman test for endogeneity in simulated data

| Simulation | Variance of measurement error, V(νi) | t-statistic for OLS estimate of b2 | Decision regarding null hypothesis of no endogeneity |
| --- | --- | --- | --- |
| 1 | 1 | 32.90 | Reject |
| 2 | 1.5 | 50.32 | Reject |
| 3 | 2 | 51.55 | Reject |
| 4 | 2.5 | 45.03 | Reject |
| 5 | 3 | 59.53 | Reject |
| 6 | 3.5 | 55.95 | Reject |

with the IV estimators of equations (10.30) and (10.31) instead. Given that we explicitly built endogeneity into these simulations, that we've seen the damage that it does to OLS in tables 10.1 and 10.2, and that we've seen how much more satisfying the IV estimators in tables 10.3, 10.4, and 10.5 are, the results of this test are very reassuring.

# 10.9 What Does This Look Like in Real Data?

Let's work through the IV estimation technique with our Census data on earnings and education. In the appendix to chapter 3, we discussed the errors that might arise in our translation of the Census information regarding educational attainment into our explanatory variable measuring years of schooling. That's one possible source of measurement error in the regressions that we have run using these data.

There are other sources of error embedded in the Census itself. Individuals report all of the information that the Census collects. Sometimes those reports are incomplete. Sometimes they are self-contradictory. In these cases, the Census edits the information and replaces the original data with imputations that seem more plausible.

In addition, the Census cares passionately about protecting the anonymity of everyone who responds. Sometimes, a response is so unusual that the respondent might be identifiable if the Census simply reproduced it in its publications. In these cases, the Census again replaces the original data with edits, in this case for the purpose of confidentiality rather than accuracy.26

Naturally, the Census understands that we might want to know whether we're looking at data that came directly from a respondent or that were revised. Therefore, it includes allocation codes in the Public Use Microdata Samples (PUMS). These codes identify whether the associated data item was an original response or a Census imputation.

(p.398) In the case of education, this code is a variable with two values: It has a value of zero if the associated value for educational attainment is what the respondent said originally. It has a value of one if the Census has changed that original report. This is another example of the discrete, or categorical, variables that we first met in chapter 1. We'll have more to say about them in chapter 13.

For the purpose of this chapter, though, this discussion raises two relevant issues. First, our explanatory variable, years of schooling, may be subject to measurement error. Second, the allocation code for the original Census data on educational attainment may be an acceptable instrument.

Why is the allocation code a potential instrument? First, it may be correlated with our estimate of years of schooling, as required by equation (10.19). The allocation code identifying imputations will be correlated with educational attainment if there are any systematic differences between imputed and nonimputed values of the variable for educational attainment. If those differences are preserved in our translation of educational attainment into years of schooling, then the allocation code will be correlated with our explanatory variable as well.

Second, the allocation code is probably not correlated with the random component of earnings, as required by equation (10.20). The disturbance, εi, represents characteristics that affect earnings idiosyncratically and unpredictably for each individual. It doesn't seem likely that the Census has any knowledge of these characteristics when it decides whether or not to edit information regarding educational attainment, much less that it would use such knowledge in the editing process if it did. If it doesn't, then the allocation codes and disturbances would be uncorrelated.

In section 7.6, we ran a regression of earnings on years of schooling in a sample of 179,549 observations from California. Let's continue to work with that sample here. First, what evidence does it provide regarding the suitability of the allocation code for educational attainment as an instrument for years of schooling?

The sample correlation between this allocation code and years of schooling is −.09372. This indicates that there is a weak tendency for individuals with imputed values for educational attainment to have lower values for years of schooling. This tendency is so weak that we might worry about whether the allocation code is a weak instrument. Apart from worry, though, the only thing that we can do about this is to continue with the procedure and see how it works out.

It's always harder to get a sense of whether equation (10.20) is satisfied from the sample, because authentic data never contain values for εi. However, there is some modest support for this assumption in the sample correlation between the allocation code and earnings: −.06183. This correlation is not zero, which is good. We want the allocation code to be correlated with that (p.399) part of earnings that comes from years of schooling. However, it is smaller in absolute value than the correlation between the allocation code and years of schooling itself. This is also good. We don't want the allocation code to be correlated with that part of earnings that comes from the disturbance.

Let's estimate this with the complete 2SLS procedure. The first-stage regression of years of schooling on the allocation codes, from equation (10.21), looks like this:

(10.48)
$Display mathematics$

where the parentheses contain t-statistics. The R2 value for this regression is tiny, at .0088.27 The allocation code explains less than 1% of the variation in years of schooling. The relationship between the allocation code and years of schooling is clearly not of great substantive importance.

But substantive importance was never our interest in the first-stage regression. Statistically, this relationship is very reliable. The slope estimated for the allocation code achieves a very high level of statistical significance. More important, the F-statistic for this regression is 1,590.9. This is well in excess of the threshold set by the Staiger-Stock rule of thumb. Accordingly, we can move on to the second-stage regression with confidence that our instrument isn't so weak as to cause bias itself.

That regression, from equation (10.23), is

(10.49)
$Display mathematics$

The IV estimate of the returns to schooling is approximately double the OLS estimate of 3,845.7 that we obtained in section 7.6!

At the same time, we've lost a lot of precision. The t-statistic on the predicted years of schooling, while large, is many times smaller than the OLS t-statistic of 144.3 implied by section 7.6.28 Is this sacrifice necessary? In this case, the Hausman test regression of equation (10.47) is

$Display mathematics$

The estimated slope for predicted years of schooling achieves a high level of statistical significance. Accordingly, we have to reject the null hypothesis that (p.400) endogeneity is absent. This means that, on statistical grounds, we prefer the IV estimates of equation (10.49) to the OLS estimates of section 7.6.

We need to conclude this analysis with three important observations. First, it's always necessary to have a behavioral argument as to why an instrument might be appropriate. It's usually possible to offer a counterargument as to why it might not be.

In our example, one counterargument might be that some of the Census imputation procedures involve comparing respondents whose answers need editing with other respondents who share some of the same characteristics. This comparison can be based on many variables, not just our explanatory variable, educational attainment. Therefore, these variables can help determine whether or not a variable is imputed. If some of these other variables also affect earnings, then the imputation may be correlated with the part of earnings that doesn't depend on years of schooling.

One concern raised by this example is that maybe the Census knows something about εi that is not available to us and uses it to help decide whether to edit a response or not. There may not be much that we can do about this. In more advanced courses, we'll find that there are some statistical tools to help decide whether an instrument is appropriate or not. But a lot of that decision will always ride on the quality of the logic and behavioral insight that goes into supporting or attacking the claim that equations (10.19) and (10.20) are valid.

Another concern arising from this example is that maybe there are other variables that determine earnings and that are available to us, but are contributing to the unexplained part of earnings in our analysis because we're only using years of schooling as an explanatory variable. This is a concern that has probably been lurking in our minds since the first chapter. Fortunately, we can do something about this. It's called chapter 11.

The second observation is that great instruments are hard to come by. Even acceptable instruments often require considerable ingenuity to construct.29 Our instrument here isn't great, in part because the correlation between it and years of schooling is so weak. What makes it work, to the extent that it succeeds, is the exceptionally large samples that we're working with. With this much information, we can pick up almost any relationship, no matter how weak.

Our third and last observation is that the specific estimates of this section should not be taken too seriously. Annual returns to education of almost \$8,000 are probably too big to be believed, no matter how much we might like to. The next chapter will adopt another strategy for the purpose of improving our understanding of how earnings are generated. Whether the results of the analysis that we've performed here will persist after we've followed this strategy is, for the moment, an open question.

# (p.401) 10.10 Conclusion

In section 4.2, we took a very strong position regarding the direction of causality. This chapter addresses the question of how we proceed when that position can't be supported. As is almost always the case, there is a way, and that way involves creative reapplication of the OLS approach that we first developed in chapter 4.

It should now be obvious that this theme is an important one. It is shared between this chapter and the two that precede it. An allied theme also appears in all three: This creative reapplication of OLS needs to be supported by a compelling behavioral explanation. In chapter 8, that explanation addresses the sources of heteroscedasticity. In chapter 9, it addresses the sources of autocorrelation. In this chapter, it addresses the reasons for endogeneity and the validity of the instrument.

Finally, the specific form of the solution in each of these three chapters depends on the specific content of this explanation. In chapter 8, the form of the heteroscedasticity dictated the structure of the WLS estimation. In chapter 9, the autocorrelative relationship dictated the structure of the GLS estimation. Here, the behavioral context controls what is and is not an acceptable instrument for IV estimation.

# Exercises

1. 10.1 Note 4 asserts that equation (10.5) can be understood as a special case of equation (5.1), where the constant is set to zero and the coefficient is set to one. Explain.

2. 10.2 The discussion of equation (10.7) asserts that εi is uncorrelated with xi because it is uncorrelated with and νi. Let's prove that here.

1. (a) Explain why the population covariance between two random variables must be zero if they are uncorrelated.

2. (b) Prove that . Begin by invoking equation (5.8):

$Display mathematics$

Rearrange to obtain

$Display mathematics$

(p.402) Demonstrate that the term to the left of the equality is equal to

$Display mathematics$

3. (c) Our assumptions regarding equation (10.4) assert that . Our assumptions regarding νi assert that COV(εi, νi) = 0. Use these assumptions and the result of part b to prove that COV(εi, xi) = 0.

3. 10.3 Imagine that, in equation (10.9), xi was determined exactly by ai. Would the population relationship of equation (10.11) still be contaminated with endogeneity? Why or why not? In either case, give an intuitive as well as a formal explanation.

4. 10.4 Use the result of exercise 3.7 and equation (3.10) to prove that, if the sample COV(εi, zi) = 0, then the sample COV(εi, c + dzi) = 0. Use this result to confirm that the sample COV(εi, x̂i) = 0.

5. 10.5 Use the normal equations for the auxiliary regression of equation (10.21) and follow the development from equations (4.50) through (4.52) to prove equation (10.29),

$Display mathematics$

6. 10.6 Equations (10.30) and (10.32) give bIV as

$Display mathematics$

1. (a) Follow equations (2.31) through (2.33) to transform the numerator into and the denominator into .

2. (b) What else must be done in order to transform the above formula for bIV into

$Display mathematics$

where the covariances are for the sample?

7. 10.7 Consider the demonstration that E(bIV) converges to β as the sample size increases, which begins with equations (10.33) and (10.34).

1. (a) Simplify equation (10.33) to obtain equation (10.34).

2. (p.403) (b) Follow the argument of equations (10.15) and (10.16) to demonstrate that is approximately equal to COV(zi, εi).

8. 10.8 Construct the 95% confidence intervals around the values of bIV in table 10.4. How many of these confidence intervals contain the true value β = 4,000?

9. 10.9 Recall the population expression for V(bIV) in equation (10.39):

$Display mathematics$

Complete the following derivation. Now that we've had nine chapters of practice at this sort of thing, we're going to try it with a little less help than in previous exercises.

1. (a) Use the result of exercise 10.6a to rewrite the summation in the denominator as

$Display mathematics$

2. (b) Multiply both the numerator and the denominator by 1/(n − 1)2. Show that the numerator becomes σ2V(zi)/(n − 1), where V(zi) represents the sample variance of zi. Show that the denominator becomes the square of the sample covariance between zi and xi, COV(zi, xi).

3. (c) With the replacements of part b, show that

$Display mathematics$

4. (d) Multiply the denominator by V(xi)/V(xi), where V(xi) represents the sample variance of xi. Demonstrate that

$Display mathematics$

5. (p.404) (e) Recall that all of the properties of bIV are asymptotic. Validate equation (10.43) by explaining why it's acceptable to approximate the result of part d with

$Display mathematics$

10. 10.10 Return to the calculation of the population value for SD(bIV) in table 10.5.

1. (a) In the context of measurement error, equation (10.5) defines the observed value of the explanatory variable, xi, as

$Display mathematics$

In addition, we assumed that the measurement error, νi, is uncorrelated with the true value of the explanatory variable . Explain why equation (5.45) is valid for determining V(xi). Use it to establish that

$Display mathematics$

Verify that V(xi) = 12.5 for table 10.5.

2. (b) For the purpose of simulation, we constructed our instrument, zi, for the regressions in table 10.5 using the following formula:

$Display mathematics$

where ηi is a random variable with expected value zero and variance four, uncorrelated with or νi. Explain why this is a valid instrument. Confirm that

(1)
$Display mathematics$

3. (c) This is the hard part, but we've already done a simpler version of it in exercise 10.2. The covariance between the observed explanatory variable and the instrument, COV(xi, zi), is

$Display mathematics$

According to equation (5.8), this can be rewritten as

$Display mathematics$

(p.405) Explain why this can be rewritten, yet again, as

$Display mathematics$

Expand to obtain

$Display mathematics$

What rule of expectations justifies this expansion?

4. (d) Use equation (5.4) to demonstrate that the first term following the equality is . Explain why the remaining three terms are all equal to zero. Conclude that, in this case,

$Display mathematics$

5. (e) Combine the results of parts a, b, c, and d to obtain

$Display mathematics$

Verify that, with the values of the simulation for table 10.5, CORR(xi, zi) = .6800.

11. 10.11 In table 10.5, examine the magnitude of the reduction in SD(bIV) that occurs with each tenfold increase in sample size.

1. (a) Looking at the quantities in the expression for V(bIV) in equation (10.43), which are responsible for this reduction? Why?

2. (b) How does the answer to part a compare to that we obtained for exercise 7.13?

12. 10.12 Construct the 95% confidence intervals around the values of bIV in table 10.5. How many of these confidence intervals contain the true value β = 4,000?

13. (p.406) 10.13 Equation (10.45) states that V(bIV) can be expressed as

$Display mathematics$

Use this expression to provide another explanation of why V(bIV) converges to zero as n converges to infinity.

14. 10.14 Set CORR(zi, xi) = .5 in equation (10.45).

1. (a) Demonstrate that this implies V(bIV) = 4 V(b).

2. (b) Demonstrate that this implies SD(bIV) = 2 SD(b).

3. (c) Demonstrate that the 95% confidence interval around bIV is two times as wide as the 95% confidence interval around b.

4. (d) Demonstrate that the acceptance region for the test of the null hypothesis H0 : β0 = 0 at the 5% significance level is two times as wide when the test is based on bIV as when it is based on b.

15. 10.15 Let's derive the results in table 10.7.

1. (a) Exercise 10.10a demonstrates that

$Display mathematics$

Recall that . Use the value for V(νi) to calculate V(xi) for each simulation in table 10.7. Recalling that σ = 25,000 for all simulations, use equation (10.44) to calculate the population value of SD(b) for each simulation. Compare the result to the third column of table 10.7.

2. (b) We constructed our instrument, zi, in each of these simulations using the following formula:

$Display mathematics$

where ηi is a random variable with expected value zero and variance equal to V(νi) in that simulation. Use the result of exercise 10.10e to calculate CORR(zi, xi) for each simulation. Compare the result to the fourth column of table 10.7.

3. (c) Apply equation (10.45) to the results of parts a and b to derive the asymptotic value of SD(bIV) for each simulation. Compare the result to the fifth column of table 10.7.

16. 10.16 Consider equation (10.34).

1. (a) Offer an informal explanation of why the second term to the right of the equality in equation (10.34) might be written as (p.407)

$Display mathematics$

2. (b) Demonstrate that, if the approximation in part a is valid, then the following approximation is also valid:

$Display mathematics$

17. 10.17 Recall that the Staiger-Stock rule of thumb stipulates that the F-statistic for the first-stage regression should exceed 10 in order for bIV to be useful. We're getting a little bit ahead of ourselves, but sneak a look at equation (12.30) to see the relationship between the F-statistic for the regression of equation (4.4) and the t-statistic for b in that regression. According to this relationship, the value imposed by the Staiger-Stock rule of thumb as the threshold value for the F-statistic establishes an implicit threshold value for the t-statistic. What is this value? Why?

18. 10.18 Consider the IV estimate of the effects of schooling on earnings in section 10.9.

1. (a) Examine the first-stage regression of equation (10.48). What does it imply about the comparison between the typical years of schooling for an individual whose response has not been edited and that for an individual whose response has been edited?

2. (b) The estimate of SD(bIV) in the second-stage regression of equation (10.49) is 301.87. As reported in section 10.9, the correlation between the instrument and the explanatory variable for this regression is −.09372. Finally, section 7.6 gave the estimate of SD(b) as 26.649. Are these values consistent with equation (10.45)? Demonstrate.
