Stepwise variable selection

We illustrate the variable selection methods on some data on the 50 states - the variables are population estimate as of july 1, 1975; per capita income (1974); illiteracy (1970, percent of population); life expectancy in years (1969-71); murder and non-negligent manslaughter rate per 100,000 population (1976); percent high-school graduates (1970); mean number of days with min temperature < 32 degrees (1931-1960) in capital or large city; and land area in square miles. We will take life expectancy as the response and the remaining variables as predictors - we extract these for convenience and fit the full model.
> x <- state.x77[,-4]
> lif <- state.x77[,4]
> summary(lm(lif ~ x))
Coefficients:
               Value Std. Error  t value Pr(>|t|) 
(Intercept)  70.9432   1.7480    40.5859   0.0000
xPopulation   0.0001   0.0000     1.7748   0.0832
    xIncome   0.0000   0.0002    -0.0892   0.9293
xIlliteracy   0.0338   0.3663     0.0923   0.9269
    xMurder  -0.3011   0.0466    -6.4590   0.0000
   xHS Grad   0.0489   0.0233     2.0979   0.0420
     xFrost  -0.0057   0.0031    -1.8246   0.0752
      xArea   0.0000   0.0000    -0.0443   0.9649

Residual standard error: 0.7448 on 42 degrees of freedom
Multiple R-Squared: 0.7362 
F-statistic: 16.74 on 7 and 42 degrees of freedom, the p-value is 2.534e-10 
Which predictors should be included - can you tell from the p-values? Looking at the coefficients, can you see what operation would be helpful? Does the murder rate decrease life expectancy - that's obvious a priori but how should these results be interpreted?

We start with the backward method - see the help page on stepwise() for details of the settings:

> b <- stepwise(x,lif,method="backward")
> b
$rss:
[1] 23.29822 23.30198 23.30804 25.37162 29.77036 34.46133 88.29900

$size:
[1] 6 5 4 3 2 1 0

$which:
      Population Income Illiteracy Murder HS Grad Frost Area 
6(-7)          T      T          T      T       T     T    F
5(-3)          T      T          F      T       T     T    F
4(-2)          T      F          F      T       T     T    F
3(-1)          F      F          F      T       T     T    F
2(-6)          F      F          F      T       T     F    F
1(-5)          F      F          F      T       F     F    F
0(-4)          F      F          F      F       F     F    F

$f.stat:
[1]  0.002332003  0.008068924  0.012999420  4.426759766  8.668602866
[6]  7.878594024 78.113178674

$method:
[1] "backward"
What is the best size model? The f-statistics are for the one-at-a-time predictor tests so we can convert these to t-statistics by taking the square root:
> sqrt(b$f.stat)
[1] 0.04829082 0.08982719 0.11401500 2.10398664 2.94424912 2.80688333 8.83816602
Let's examine the two models at the crucial stage:
> summary(lm(lif ~ x[,b$w[3,]]))
Coefficients:
                            Value Std. Error  t value Pr(>|t|) 
             (Intercept)  71.0271   0.9529    74.5415   0.0000
x[, b$w[3,  ]]Population   0.0001   0.0000     1.9960   0.0520
    x[, b$w[3,  ]]Murder  -0.3001   0.0366    -8.1987   0.0000
   x[, b$w[3,  ]]HS Grad   0.0466   0.0148     3.1417   0.0030
     x[, b$w[3,  ]]Frost  -0.0059   0.0024    -2.4550   0.0180

Residual standard error: 0.7197 on 45 degrees of freedom
Multiple R-Squared: 0.736 
F-statistic: 31.37 on 4 and 45 degrees of freedom, the p-value is 1.696e-12 
> summary(lm(lif ~ x[,b$w[4,]]))
Coefficients:
                         Value Std. Error  t value Pr(>|t|) 
          (Intercept)  71.0364   0.9833    72.2456   0.0000
 x[, b$w[4,  ]]Murder  -0.2831   0.0367    -7.7064   0.0000
x[, b$w[4,  ]]HS Grad   0.0499   0.0152     3.2859   0.0020
  x[, b$w[4,  ]]Frost  -0.0069   0.0024    -2.8240   0.0070

Residual standard error: 0.7427 on 46 degrees of freedom
Multiple R-Squared: 0.7127 
F-statistic: 38.03 on 3 and 46 degrees of freedom, the p-value is 1.634e-12 
Should we include population or not? It depends on what we want to use the model for. Also compare the p-values to those found in the full model.

Now the forward method:

> f <- stepwise(x,lif,method="forward")
> f
$rss:
[1] 34.46133 29.77036 25.37162 23.30804 23.30198 23.29822 23.29714

$size:
[1] 1 2 3 4 5 6 7

$which:
      Population Income Illiteracy Murder HS Grad Frost Area 
1(+4)          F      F          F      T       F     F    F
2(+5)          F      F          F      T       T     F    F
3(+6)          F      F          F      T       T     T    F
4(+1)          T      F          F      T       T     T    F
5(+2)          T      T          F      T       T     T    F
6(+3)          T      T          T      T       T     T    F
7(+7)          T      T          T      T       T     T    T

$f.stat:
[1] 74.988651527  7.405878382  7.975114636  3.984083789  0.011439490
[6]  0.006939274  0.001958883

$method:
[1] "forward"
The outcome here is comparable to the "backward" result. Now try the stepwise method - the default is to use 2 as the critical value for the "F-to-enter" and "F-to-delete".
> s <- stepwise(x,lif)
> s
$rss:
[1] 34.46133 29.77036 25.37162 23.30804

$size:
[1] 1 2 3 4

$which:
      Population Income Illiteracy Murder HS Grad Frost Area 
1(+4)          F      F          F      T       F     F    F
2(+5)          F      F          F      T       T     F    F
3(+6)          F      F          F      T       T     T    F
4(+1)          T      F          F      T       T     T    F

$f.stat:
[1] 74.988652  7.405878  7.975115  3.984084

$method:
[1] "efroymson"
This method settles on the four predictor model with population. Is this the "best" model?