Collinearity

The Longley dataset is a good example of collinearity:
> g <- lm(longley.y ~ longley.x)
> summary(g)
Coefficients:
                           Value Std. Error    t value   Pr(>|t|) 
          (Intercept) -3482.2586   890.4204    -3.9108     0.0036
longley.xGNP deflator     0.0151     0.0849     0.1774     0.8631
         longley.xGNP    -0.0358     0.0335    -1.0695     0.3127
  longley.xUnemployed    -0.0202     0.0049    -4.1364     0.0025
longley.xArmed Forces    -0.0103     0.0021    -4.8220     0.0009
  longley.xPopulation    -0.0511     0.2261    -0.2261     0.8262
        longley.xYear     1.8292     0.4555     4.0159     0.0030

Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-Squared: 0.9955 
F-statistic: 330.3 on 6 and 9 degrees of freedom, the p-value is 4.984e-10 
Recall that the response is number employed. Three of the predictors have large p-values but all are variables that might be expected to affect the response. Why aren't they significant? Check the correlation matrix first (rounding to 3 digits for convenience)
> round(cor(longley.x),3)
             GNP deflator   GNP Unemployed Armed Forces Population  Year 
GNP deflator        1.000 0.992      0.621        0.465      0.979 0.991
         GNP        0.992 1.000      0.604        0.446      0.991 0.995
  Unemployed        0.621 0.604      1.000       -0.177      0.687 0.668
Armed Forces        0.465 0.446     -0.177        1.000      0.364 0.417
  Population        0.979 0.991      0.687        0.364      1.000 0.994
        Year        0.991 0.995      0.668        0.417      0.994 1.000
There are several large pairwise correlations. Now we check the eigendecomposition:
> e <- eigen( t(longley.x) %*% longley.x)
> e$val
[1] 6.665299e+07 2.090730e+05 1.053550e+05 1.803976e+04 2.455730e+01
[6] 2.015117e+00
> sqrt(e$val[1]/e$val)
[1]    1.00000   17.85504   25.15256   60.78472 1647.47771 5751.21560
There is a wide range in the eigenvalues and several condition numbers are large (kappa = 5751.) Now check out the variance inflation factors. For the first variable this is
> summary(lm( longley.x[,1] ~ longley.x[,-1]))$r.squared
[1] 0.9926217
> 1/(1-0.9926217)
[1] 135.5326
which is large - the VIF for orthogonal predictors is 1. Here's a function to compute all the VIF's in one go:
> vif <- function(x){
+ ncols <- dim(x)[2]
+ v <- numeric(ncols)
+ for(i in 1:ncols) v[i] <- 1/(1-summary(lm(x[,i] ~ x[,-i]))$r.squared)
+ v
+ }
and now try it out:
> vif(longley.x)
[1]  135.53244 1788.51348   33.61889    3.58893  399.15102  758.98060
There's definitely a lot of variance inflation! How can we get rid of this problem? One way is to throw out some of the variables. Examine the full correlation matrix above. Notice that variables 3 and 4 do not have extremely large pairwise correlations with the other variables so we should keep them and focus on the others for candidates for removal:
> cor(longley.x[,-c(3,4)])
             GNP deflator       GNP Population      Year 
GNP deflator    1.0000001 0.9915892  0.9791635 0.9911493
         GNP    0.9915892 1.0000000  0.9910901 0.9952735
  Population    0.9791635 0.9910901  1.0000001 0.9939529
        Year    0.9911493 0.9952735  0.9939529 1.0000000
These four variables are strongly correlated with each other - any one of them could do the job of representing the others - we could pick year arbitrarily:
> summary(lm(longley.y ~ longley.x[,c(3,4,6)]))

Call: lm(formula = longley.y ~ longley.x[, c(3, 4, 6)])
Residuals:
     Min      1Q  Median     3Q   Max 
 -0.5729 -0.1199 0.04087 0.1398 0.753

Coefficients:
                 Value Std. Error    t value    Pr(>|t|) 
 (Intercept) -1797.2211    68.6416   -26.1827    0.0000
  Unemployed    -0.0147     0.0017    -8.7932    0.0000
Armed Forces    -0.0077     0.0018    -4.2037	 0.0012
        Year     0.9564     0.0355    26.9215	 0.0000
Residual standard error: 0.3321 on 12 degrees of freedom
Multiple R-Squared: 0.9928 
F-statistic: 555.2 on 3 and 12 degrees of freedom, the p-value is 3.916e-13 
Compare this with the original fit - what do you see?

One final point - extreme collinearity can cause problems in computing the estimates - look what happens when we use the direct formula for beta hat:

> x <- cbind(1,longley.x)
> solve(t(x) %*% x) %*% t(x) %*% longley.y
Error in solve.qr(a): apparently singular matrix
Dumped
S+, like most statistical packages, uses a more numerically stable method for computing the estimates in lm().