|
|
Collinearity
> g <- lm(longley.y ~ longley.x)
> summary(g)
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -3482.2586 890.4204 -3.9108 0.0036
longley.xGNP deflator 0.0151 0.0849 0.1774 0.8631
longley.xGNP -0.0358 0.0335 -1.0695 0.3127
longley.xUnemployed -0.0202 0.0049 -4.1364 0.0025
longley.xArmed Forces -0.0103 0.0021 -4.8220 0.0009
longley.xPopulation -0.0511 0.2261 -0.2261 0.8262
longley.xYear 1.8292 0.4555 4.0159 0.0030
Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-Squared: 0.9955
F-statistic: 330.3 on 6 and 9 degrees of freedom, the p-value is 4.984e-10
Recall that the response is number employed. Three of the predictors
have large p-values but all are variables that might be expected to
affect the response. Why aren't they significant? Check the correlation
matrix first (rounding to 3 digits for convenience)
> round(cor(longley.x),3)
GNP deflator GNP Unemployed Armed Forces Population Year
GNP deflator 1.000 0.992 0.621 0.465 0.979 0.991
GNP 0.992 1.000 0.604 0.446 0.991 0.995
Unemployed 0.621 0.604 1.000 -0.177 0.687 0.668
Armed Forces 0.465 0.446 -0.177 1.000 0.364 0.417
Population 0.979 0.991 0.687 0.364 1.000 0.994
Year 0.991 0.995 0.668 0.417 0.994 1.000
There are several large pairwise correlations. Now we check the
eigendecomposition:
> e <- eigen( t(longley.x) %*% longley.x) > e$val [1] 6.665299e+07 2.090730e+05 1.053550e+05 1.803976e+04 2.455730e+01 [6] 2.015117e+00 > sqrt(e$val[1]/e$val) [1] 1.00000 17.85504 25.15256 60.78472 1647.47771 5751.21560There is a wide range in the eigenvalues and several condition numbers are large (kappa = 5751.) Now check out the variance inflation factors. For the first variable this is > summary(lm( longley.x[,1] ~ longley.x[,-1]))$r.squared [1] 0.9926217 > 1/(1-0.9926217) [1] 135.5326which is large - the VIF for orthogonal predictors is 1. Here's a function to compute all the VIF's in one go:
> vif <- function(x){
+ ncols <- dim(x)[2]
+ v <- numeric(ncols)
+ for(i in 1:ncols) v[i] <- 1/(1-summary(lm(x[,i] ~ x[,-i]))$r.squared)
+ v
+ }
and now try it out:
> vif(longley.x) [1] 135.53244 1788.51348 33.61889 3.58893 399.15102 758.98060There's definitely a lot of variance inflation! How can we get rid of this problem? One way is to throw out some of the variables. Examine the full correlation matrix above. Notice that variables 3 and 4 do not have extremely large pairwise correlations with the other variables so we should keep them and focus on the others for candidates for removal:
> cor(longley.x[,-c(3,4)])
GNP deflator GNP Population Year
GNP deflator 1.0000001 0.9915892 0.9791635 0.9911493
GNP 0.9915892 1.0000000 0.9910901 0.9952735
Population 0.9791635 0.9910901 1.0000001 0.9939529
Year 0.9911493 0.9952735 0.9939529 1.0000000
These four variables are strongly correlated with each other - any one
of them could do the job of representing the others - we could pick year
arbitrarily:
> summary(lm(longley.y ~ longley.x[,c(3,4,6)]))
Call: lm(formula = longley.y ~ longley.x[, c(3, 4, 6)])
Residuals:
Min 1Q Median 3Q Max
-0.5729 -0.1199 0.04087 0.1398 0.753
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -1797.2211 68.6416 -26.1827 0.0000
Unemployed -0.0147 0.0017 -8.7932 0.0000
Armed Forces -0.0077 0.0018 -4.2037 0.0012
Year 0.9564 0.0355 26.9215 0.0000
Residual standard error: 0.3321 on 12 degrees of freedom
Multiple R-Squared: 0.9928
F-statistic: 555.2 on 3 and 12 degrees of freedom, the p-value is 3.916e-13
Compare this with the original fit - what do you see? One final point - extreme collinearity can cause problems in computing the estimates - look what happens when we use the direct formula for beta hat: > x <- cbind(1,longley.x) > solve(t(x) %*% x) %*% t(x) %*% longley.y Error in solve.qr(a): apparently singular matrix DumpedS+, like most statistical packages, uses a more numerically stable method for computing the estimates in lm().
|