|
|
|
Criterion-based Variable Selection
> x <- state.x77[,-4]
> lif <- state.x77[,4]
> cp <- leaps(x,lif)
> cp
$Cp:
[1] 16.126762 58.058327 59.225243 94.755692 102.252365 111.351303
[7] 112.447952 9.669895 10.886509 12.475674 13.791500 17.280022
[13] 17.634186 3.739876 5.647594 9.236259 10.873257 11.169556
[19] 11.485641 2.019657 5.411182 5.430088 5.693493 5.780357
[25] 4.008739 4.012589 4.018234 7.306931 9.383251 10.427692
[31] 13.465149 14.827362 6.001965 6.007961 6.008530 9.149822
[37] 9.329010 10.401108 47.718651 8.000000
$size:
[1] 2 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6 6 6 6 7 7 7 7 7 7
[39] 7 8
$label:
[1] "Murder"
[2] "Illiteracy"
etc.
[39] "Population,Income,Illiteracy,HS Grad,Frost,Area"
[40] "Population,Income,Illiteracy,Murder,HS Grad,Frost,Area"
$which:
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] F F F T F F F
[2,] F F T F F F F
etc.
[39,] T T T F T T T
[40,] T T T T T T T
$int:
[1] T
It's usual to plot the Cp data:
> motif() > plot(cp$size,cp$Cp) > abline(0,1)The plot is distorted by the large Cp values - let's exclude them: > big <- (cp$Cp > 20) > plot(cp$size[!big],cp$Cp[!big]) > abline(0,1)Now identify the models of interest - left mouse button to identify, middle button to quit. > identify(cp$size[!big],cp$Cp[!big],cp$label[!big])What models look best? How does this compare to the results from the stepwise method? Now try the adjusted R2 method:
> adjr <- leaps(x,lif,method="adjr2")
> adjr
$adjr2:
[1] 60.1589241 33.2687683 32.5204391 9.7352295 4.9277191 -0.9073105
[7] -1.6105728 64.8499146 64.0531158 63.0123177 62.1505470 59.8658066
[13] 59.6338539 69.3922958 68.1157150 65.7142944 64.6188660 64.4205933
[19] 64.2090759 71.2569046 68.9369659 68.9240341 68.7438583 68.6844330
[25] 70.6112900 70.6085968 70.6046448 68.3039246 66.8513641 66.1206818
[31] 63.9957199 63.0427399 69.9326859 69.9283905 69.9279785 67.6792755
[37] 67.5509949 66.7835388 40.0695801 69.2182312
$size:
[1] 2 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6 6 6 6 7 7 7 7 7 7
[39] 7 8
$label:
[1] "Murder"
[2] "Illiteracy"
etc.
[39] "Population,Income,Illiteracy,HS Grad,Frost,Area"
[40] "Population,Income,Illiteracy,Murder,HS Grad,Frost,Area"
$which:
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] F F F T F F F
[2,] F F T F F F F
etc.
[39,] T T T F T T T
[40,] T T T T T T T
$int:
[1] T
Let's look at the 10 best (in the sense of having high adjusted R2)
> i_order(adjr$a)
> cbind(adjr$a[i[31:40]],adjr$label[i[31:40]])
[,1]
[1,] "68.9369659423828"
[2,] "69.2182312011718"
[3,] "69.3922958374023"
[4,] "69.927978515625"
[5,] "69.9283905029297"
[6,] "69.9326858520508"
[7,] "70.6046447753906"
[8,] "70.6085968017578"
[9,] "70.6112899780273"
[10,] "71.2569046020508"
[,2]
[1,] "Income,Murder,HS Grad,Frost"
[2,] "Population,Income,Illiteracy,Murder,HS Grad,Frost,Area"
[3,] "Murder,HS Grad,Frost"
[4,] "Population,Income,Murder,HS Grad,Frost,Area"
[5,] "Population,Illiteracy,Murder,HS Grad,Frost,Area"
[6,] "Population,Income,Illiteracy,Murder,HS Grad,Frost"
[7,] "Population,Murder,HS Grad,Frost,Area"
[8,] "Population,Illiteracy,Murder,HS Grad,Frost"
[9,] "Population,Income,Murder,HS Grad,Frost"
[10,] "Population,Murder,HS Grad,Frost"
How does this compare to previous results?Variable selection methods are sensitive to outliers and influential points. Let's check for high leverage points: > h <- hat(x) > dotchart(h,state.abb)Which state sticks out? Let's try excluding it: > l <- leaps(x[-2,],lif[-2],method="adjr2") > max(l$a) [1] 71.04404 > l$label[l$a == max(l$a)] [1] "Population,Murder,HS Grad,Frost,Area"Compare with the "best" model selected above. Transforming the predictors can also have an effect: Take a look at the variables: > par(mfrow=c(3,3)) > apply(state.x77,2,boxplot)Population, Illiteracy and Area are skewed - we try transforming them: > nx <- cbind(log(x[,1]),x[,2],log(x[,3]),x[,4:6],log(x[,7]))And now replot: > usa() to fill out that last pane > apply(nx,2,boxplot)Now try the adjusted R2 method again - we make sure the columns of the new matrix have the right label names for ease of reference.
> dimnames(nx)[[2]] <- c("logpop","income","logillit","murder","grad","frost","logarea")
> a <- leaps(nx,lif,method="adjr2")
> max(a$a)
[1] 71.73392
> a$label[a$a == max(a$a)]
[1] "logpop,murder,grad,frost"
This changes the "best" model yet again. Compare the maximum
adjusted R2 for all the "best" models we have seen. |