Criterion-based Variable Selection

Now we try some criterion-based methods for the selection of variables in the State dataset. The default for the leaps() function is the Mallow's Cp criterion:
> x <- state.x77[,-4]
> lif <- state.x77[,4]
> cp <- leaps(x,lif)
> cp
$Cp:
 [1]  16.126762  58.058327  59.225243  94.755692 102.252365 111.351303
 [7] 112.447952   9.669895  10.886509  12.475674  13.791500  17.280022
[13]  17.634186   3.739876   5.647594   9.236259  10.873257  11.169556
[19]  11.485641   2.019657   5.411182   5.430088   5.693493   5.780357
[25]   4.008739   4.012589   4.018234   7.306931   9.383251  10.427692
[31]  13.465149  14.827362   6.001965   6.007961   6.008530   9.149822
[37]   9.329010  10.401108  47.718651   8.000000

$size:
 [1] 2 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6 6 6 6 7 7 7 7 7 7
[39] 7 8

$label:
 [1] "Murder"                                                
 [2] "Illiteracy"
etc.
[39] "Population,Income,Illiteracy,HS Grad,Frost,Area"       
[40] "Population,Income,Illiteracy,Murder,HS Grad,Frost,Area"

$which:
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] 
 [1,]    F    F    F    T    F    F    F
 [2,]    F    F    T    F    F    F    F
etc.
[39,]    T    T    T    F    T    T    T
[40,]    T    T    T    T    T    T    T

$int:
[1] T
It's usual to plot the Cp data:
> motif()
> plot(cp$size,cp$Cp)
> abline(0,1)
The plot is distorted by the large Cp values - let's exclude them:
> big <- (cp$Cp > 20)
> plot(cp$size[!big],cp$Cp[!big])
> abline(0,1)
Now identify the models of interest - left mouse button to identify, middle button to quit.
> identify(cp$size[!big],cp$Cp[!big],cp$label[!big])
What models look best? How does this compare to the results from the stepwise method? Now try the adjusted R2 method:
> adjr <- leaps(x,lif,method="adjr2")
> adjr
$adjr2:
 [1]  60.1589241  33.2687683  32.5204391   9.7352295   4.9277191  -0.9073105
 [7]  -1.6105728  64.8499146  64.0531158  63.0123177  62.1505470  59.8658066
[13]  59.6338539  69.3922958  68.1157150  65.7142944  64.6188660  64.4205933
[19]  64.2090759  71.2569046  68.9369659  68.9240341  68.7438583  68.6844330
[25]  70.6112900  70.6085968  70.6046448  68.3039246  66.8513641  66.1206818
[31]  63.9957199  63.0427399  69.9326859  69.9283905  69.9279785  67.6792755
[37]  67.5509949  66.7835388  40.0695801  69.2182312

$size:
 [1] 2 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6 6 6 6 7 7 7 7 7 7
[39] 7 8

$label:
 [1] "Murder"                                                
 [2] "Illiteracy"                                            
etc.
[39] "Population,Income,Illiteracy,HS Grad,Frost,Area"       
[40] "Population,Income,Illiteracy,Murder,HS Grad,Frost,Area"

$which:
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] 
 [1,]    F    F    F    T    F    F    F
 [2,]    F    F    T    F    F    F    F
etc.
[39,]    T    T    T    F    T    T    T
[40,]    T    T    T    T    T    T    T

$int:
[1] T
Let's look at the 10 best (in the sense of having high adjusted R2)
> i_order(adjr$a)
> cbind(adjr$a[i[31:40]],adjr$label[i[31:40]])
                    [,1] 
 [1,] "68.9369659423828"
 [2,] "69.2182312011718"
 [3,] "69.3922958374023"
 [4,] "69.927978515625" 
 [5,] "69.9283905029297"
 [6,] "69.9326858520508"
 [7,] "70.6046447753906"
 [8,] "70.6085968017578"
 [9,] "70.6112899780273"
[10,] "71.2569046020508"
                                                          [,2] 
 [1,] "Income,Murder,HS Grad,Frost"                           
 [2,] "Population,Income,Illiteracy,Murder,HS Grad,Frost,Area"
 [3,] "Murder,HS Grad,Frost"                                  
 [4,] "Population,Income,Murder,HS Grad,Frost,Area"           
 [5,] "Population,Illiteracy,Murder,HS Grad,Frost,Area"       
 [6,] "Population,Income,Illiteracy,Murder,HS Grad,Frost"     
 [7,] "Population,Murder,HS Grad,Frost,Area"                  
 [8,] "Population,Illiteracy,Murder,HS Grad,Frost"            
 [9,] "Population,Income,Murder,HS Grad,Frost"                
[10,] "Population,Murder,HS Grad,Frost"
How does this compare to previous results?

Variable selection methods are sensitive to outliers and influential points. Let's check for high leverage points:

> h <- hat(x)
> dotchart(h,state.abb)
Which state sticks out? Let's try excluding it:
> l <- leaps(x[-2,],lif[-2],method="adjr2")
> max(l$a)
[1] 71.04404
> l$label[l$a == max(l$a)]
[1] "Population,Murder,HS Grad,Frost,Area"
Compare with the "best" model selected above. Transforming the predictors can also have an effect: Take a look at the variables:
> par(mfrow=c(3,3))
> apply(state.x77,2,boxplot)
Population, Illiteracy and Area are skewed - we try transforming them:
> nx <- cbind(log(x[,1]),x[,2],log(x[,3]),x[,4:6],log(x[,7]))
And now replot:
> usa() to fill out that last pane
> apply(nx,2,boxplot)
Now try the adjusted R2 method again - we make sure the columns of the new matrix have the right label names for ease of reference.
> dimnames(nx)[[2]] <- c("logpop","income","logillit","murder","grad","frost","logarea")
> a <- leaps(nx,lif,method="adjr2")
> max(a$a)
[1] 71.73392
> a$label[a$a == max(a$a)]
[1] "logpop,murder,grad,frost"
This changes the "best" model yet again. Compare the maximum adjusted R2 for all the "best" models we have seen.