|
|
|
Stepwise variable selection
> x <- state.x77[,-4]
> lif <- state.x77[,4]
> summary(lm(lif ~ x))
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 70.9432 1.7480 40.5859 0.0000
xPopulation 0.0001 0.0000 1.7748 0.0832
xIncome 0.0000 0.0002 -0.0892 0.9293
xIlliteracy 0.0338 0.3663 0.0923 0.9269
xMurder -0.3011 0.0466 -6.4590 0.0000
xHS Grad 0.0489 0.0233 2.0979 0.0420
xFrost -0.0057 0.0031 -1.8246 0.0752
xArea 0.0000 0.0000 -0.0443 0.9649
Residual standard error: 0.7448 on 42 degrees of freedom
Multiple R-Squared: 0.7362
F-statistic: 16.74 on 7 and 42 degrees of freedom, the p-value is 2.534e-10
Which predictors should be included - can you tell from the p-values?
Looking at the coefficients, can you see what operation would be
helpful? Does the murder rate decrease life expectancy - that's
obvious a priori but how should these results be interpreted?We start with the backward method - see the help page on stepwise() for details of the settings:
> b <- stepwise(x,lif,method="backward")
> b
$rss:
[1] 23.29822 23.30198 23.30804 25.37162 29.77036 34.46133 88.29900
$size:
[1] 6 5 4 3 2 1 0
$which:
Population Income Illiteracy Murder HS Grad Frost Area
6(-7) T T T T T T F
5(-3) T T F T T T F
4(-2) T F F T T T F
3(-1) F F F T T T F
2(-6) F F F T T F F
1(-5) F F F T F F F
0(-4) F F F F F F F
$f.stat:
[1] 0.002332003 0.008068924 0.012999420 4.426759766 8.668602866
[6] 7.878594024 78.113178674
$method:
[1] "backward"
What is the best size model? The f-statistics are for the one-at-a-time
predictor tests so we can convert these to t-statistics by taking the
square root:
> sqrt(b$f.stat) [1] 0.04829082 0.08982719 0.11401500 2.10398664 2.94424912 2.80688333 8.83816602Let's examine the two models at the crucial stage:
> summary(lm(lif ~ x[,b$w[3,]]))
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 71.0271 0.9529 74.5415 0.0000
x[, b$w[3, ]]Population 0.0001 0.0000 1.9960 0.0520
x[, b$w[3, ]]Murder -0.3001 0.0366 -8.1987 0.0000
x[, b$w[3, ]]HS Grad 0.0466 0.0148 3.1417 0.0030
x[, b$w[3, ]]Frost -0.0059 0.0024 -2.4550 0.0180
Residual standard error: 0.7197 on 45 degrees of freedom
Multiple R-Squared: 0.736
F-statistic: 31.37 on 4 and 45 degrees of freedom, the p-value is 1.696e-12
> summary(lm(lif ~ x[,b$w[4,]]))
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 71.0364 0.9833 72.2456 0.0000
x[, b$w[4, ]]Murder -0.2831 0.0367 -7.7064 0.0000
x[, b$w[4, ]]HS Grad 0.0499 0.0152 3.2859 0.0020
x[, b$w[4, ]]Frost -0.0069 0.0024 -2.8240 0.0070
Residual standard error: 0.7427 on 46 degrees of freedom
Multiple R-Squared: 0.7127
F-statistic: 38.03 on 3 and 46 degrees of freedom, the p-value is 1.634e-12
Should we include population or not? It depends on what we
want to use the model for. Also compare the p-values to those
found in the full model.Now the forward method:
> f <- stepwise(x,lif,method="forward")
> f
$rss:
[1] 34.46133 29.77036 25.37162 23.30804 23.30198 23.29822 23.29714
$size:
[1] 1 2 3 4 5 6 7
$which:
Population Income Illiteracy Murder HS Grad Frost Area
1(+4) F F F T F F F
2(+5) F F F T T F F
3(+6) F F F T T T F
4(+1) T F F T T T F
5(+2) T T F T T T F
6(+3) T T T T T T F
7(+7) T T T T T T T
$f.stat:
[1] 74.988651527 7.405878382 7.975114636 3.984083789 0.011439490
[6] 0.006939274 0.001958883
$method:
[1] "forward"
The outcome here is comparable to the "backward" result. Now try the
stepwise method - the default is to use 2 as the critical value
for the "F-to-enter" and "F-to-delete".
> s <- stepwise(x,lif)
> s
$rss:
[1] 34.46133 29.77036 25.37162 23.30804
$size:
[1] 1 2 3 4
$which:
Population Income Illiteracy Murder HS Grad Frost Area
1(+4) F F F T F F F
2(+5) F F F T T F F
3(+6) F F F T T T F
4(+1) T F F T T T F
$f.stat:
[1] 74.988652 7.405878 7.975115 3.984084
$method:
[1] "efroymson"
This method settles on the four predictor model with
population. Is this the "best" model?
|