High multicollinearity results from a linear relationship between your independent variables that involves a high degree of correlation but isn't completely deterministic (in other words, the correlation isn't perfect). It's much more common than its perfect counterpart and can be equally problematic when it comes to estimating an econometric model.
You can describe an approximate linear relationship, which characterizes high multicollinearity, as follows:

$X_1 = \delta_0 + \delta_2 X_2 + \delta_3 X_3 + \cdots + \delta_k X_k + u$

where the Xs are independent variables in a regression model and u represents a random error term (which is the component that differentiates high multicollinearity from perfect multicollinearity). Therefore, the difference between perfect and high multicollinearity is that some variation in one independent variable is not explained by variation in the other independent variable(s).
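As a minimal sketch of this relationship (using NumPy; the coefficients, sample size, and noise level are made-up values for illustration), the snippet below generates a second regressor as a linear function of the first plus a random error term, producing correlation that is high but not perfect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

x1 = rng.normal(50, 10, n)       # first independent variable
u = rng.normal(0, 3, n)          # random error term that keeps the relationship imperfect
x2 = 3 + 0.8 * x1 + u            # second independent variable, approximately linear in x1

print(np.corrcoef(x1, x2)[0, 1]) # high (around 0.9), but not a perfect 1
```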
The stronger the relationship between the independent variables, the more likely you are to have estimation problems with your model.
Strong linear relationships resulting in high multicollinearity can sometimes catch you by surprise, but these three situations tend to be particularly problematic:
You use variables that are lagged values of one another. For example, one independent variable is an individual’s income in the current year, and another independent variable measures an individual’s income in the previous year. These values may be completely different for some observations, but for most observations the two are closely related (this situation and the next are illustrated in the sketch after this list).
You use variables that share a common time trend component. For example, you use yearly values for GDP (gross domestic product) and the DJIA (Dow Jones Industrial Average) as independent variables in a regression model. The values for these measurements tend to increase (with occasional decreases) and generally move in the same direction over time.
You use variables that capture similar phenomena. For example, your independent variables to explain crime across cities may be unemployment rates, average income, and poverty rates. These variables aren’t likely to be perfectly correlated, but they’re probably highly correlated.
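For illustration, the following sketch (simulated data with arbitrary drift and noise parameters, not real income or GDP figures) shows how the first two situations play out: a series is strongly correlated with its own lag, and two series that share a time trend are strongly correlated with each other:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200

# Lagged values of one another: income this year vs. income last year
income = 50 + np.cumsum(rng.normal(1, 2, T))     # a slowly evolving income series
income_now = income[1:]                          # income in the current year
income_lag = income[:-1]                         # income in the previous year
print(np.corrcoef(income_now, income_lag)[0, 1]) # typically well above 0.9

# Common time trend: two series that both drift upward over time
t = np.arange(T)
gdp_like = 100 + 0.5 * t + rng.normal(0, 3, T)
index_like = 1000 + 4.0 * t + rng.normal(0, 30, T)
print(np.corrcoef(gdp_like, index_like)[0, 1])   # also very high
```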
Technically, the presence of high multicollinearity doesn’t violate any CLRM assumptions. Consequently, OLS estimates can be obtained and are BLUE (best linear unbiased estimators) with high multicollinearity.
Although OLS estimators remain BLUE in the presence of high multicollinearity, BLUE is a repeated sampling property: it describes how the estimators behave on average across many samples. In practice, you probably don’t have an opportunity to utilize multiple samples, so you want any given sample to produce sensible and reliable results.
With high multicollinearity, the OLS estimates still have the smallest variance among linear unbiased estimators, but smallest is a relative concept and doesn’t ensure that the variances are actually small. In fact, the larger variances (and standard errors) of the OLS estimators are the main reason to avoid high multicollinearity.
The typical consequences of high multicollinearity include the following:
Larger standard errors and insignificant t-statistics: The estimated variance of a coefficient in a multiple regression is

$\widehat{\mathrm{Var}}(\hat{\beta}_k) = \dfrac{\hat{\sigma}^2}{\sum_i (X_{ki} - \bar{X}_k)^2 \,(1 - R_k^2)}$

where $\hat{\sigma}^2$ is the mean squared error (MSE) and $R_k^2$ is the R-squared value from regressing $X_k$ on the other Xs. Higher multicollinearity results in a larger $R_k^2$, which increases the standard error of the coefficient. The figure illustrates the effect of multicollinearity on the variance (or standard error) of a coefficient.

Because the t-statistic associated with a coefficient is the ratio of the estimated coefficient to its standard error, $t = \hat{\beta}_k / \mathrm{se}(\hat{\beta}_k)$, high multicollinearity also tends to result in insignificant t-statistics.
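As a rough illustration (simulated data; the coefficient values and noise levels are arbitrary), the sketch below uses statsmodels to run the auxiliary regression of one regressor on the other to obtain $R_k^2$, computes the implied variance inflation factor $1/(1 - R_k^2)$, and reports the resulting standard errors:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200

x1 = rng.normal(0, 1, n)
x2 = 0.95 * x1 + rng.normal(0, 0.3, n)   # highly collinear with x1
y = 1 + 2 * x1 + 3 * x2 + rng.normal(0, 1, n)

# Auxiliary regression: x2 on x1 gives R_k^2 and the variance inflation factor 1/(1 - R_k^2)
aux = sm.OLS(x2, sm.add_constant(x1)).fit()
r2_k = aux.rsquared
print("R_k^2:", r2_k, "VIF:", 1 / (1 - r2_k))

# Standard errors in the main regression are inflated by roughly the square root of the VIF
main = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(main.bse)   # compare with a run where x2 is generated independently of x1
```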
Coefficient estimates that are sensitive to changes in specification: If the independent variables are highly collinear, the estimates must emphasize small differences in the variables in order to assign an independent effect to each of them. Adding or removing variables from the model can change the nature of the small differences and drastically change your coefficient estimates. In other words, your results aren’t robust.
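The following sketch (simulated data, arbitrary true coefficients) illustrates this sensitivity: the estimated coefficient on one regressor changes drastically depending on whether its near-duplicate is included in the model:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 100

x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.1, n)                   # nearly the same variable as x1
y = 1 + 1.0 * x1 + 1.0 * x2 + rng.normal(0, 1, n)

with_x2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
without_x2 = sm.OLS(y, sm.add_constant(x1)).fit()

print(with_x2.params)    # the x1 coefficient can land far from 1.0, possibly even negative
print(without_x2.params) # dropping x2 roughly doubles the x1 coefficient (it absorbs x2's effect)
```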
Nonsensical coefficient signs and magnitudes: With higher multicollinearity, the variance of the estimated coefficients increases, which in turn increases the chances of obtaining coefficient estimates with extreme values. Consequently, these estimates may have unbelievably large magnitudes and/or signs that counter the expected relationship between the independent and dependent variables. The figure illustrates how the sampling distribution of the estimated coefficients is affected by multicollinearity.
When two (or more) variables exhibit high multicollinearity, there’s more uncertainty as to which variable should be credited with explaining variation in the dependent variable. For this reason, a high R-squared value combined with many statistically insignificant coefficients is a common consequence of high multicollinearity.
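A sketch of this symptom (again with simulated data and arbitrary parameter values): the regression below typically produces a high R-squared and a strongly significant overall F-statistic, while the individual t-statistics on the collinear regressors are often insignificant:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 60

x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.05, n)                  # nearly collinear with x1
y = 2 + 1.5 * x1 + 1.5 * x2 + rng.normal(0, 1, n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print("R-squared:", fit.rsquared)      # high: the pair explains y well jointly
print("F p-value:", fit.f_pvalue)      # the model is jointly significant
print("t-statistics:", fit.tvalues[1:])# individual t-stats are often insignificant
```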