Testing Linear Models

Let's test a variety of linear hypotheses about the savings dataset we used earlier. These data are averages over 1960-1970 (to remove business-cycle and other short-term fluctuations). Income is per-capita disposable income in U.S. dollars; growth is the percentage rate of change in per-capita disposable income; the savings rate is aggregate personal saving divided by disposable income. The percentages of the population under 15 and over 75 are also recorded. For ease of access, let's separate out each variable:
    > p15 <- saving.x[,1]
    > p75 <- saving.x[,2]
    > inc <- saving.x[,3]
    > gro <- saving.x[,4]
    > sav <- saving.x[,5]
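
If `saving.x` is not at hand, the same five variables can be rebuilt from another copy of the data. The sketch below is written for R and assumes that the built-in `LifeCycleSavings` data frame holds the same 50-country savings data, with columns `pop15`, `pop75`, `dpi`, `ddpi` and `sr` playing the roles of the five columns of `saving.x`. That correspondence is an assumption, not something stated in these notes, and results may differ from the output shown here in the last digit or two.

```r
# Assumption: LifeCycleSavings (datasets package, attached by default in R)
# is the same data as saving.x, only with differently named columns.
p15 <- LifeCycleSavings$pop15   # % of population under 15
p75 <- LifeCycleSavings$pop75   # % of population over 75
inc <- LifeCycleSavings$dpi     # per-capita disposable income
gro <- LifeCycleSavings$ddpi    # % growth rate of disposable income
sav <- LifeCycleSavings$sr      # savings rate
```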

Testing a single predictor

Suppose our objective is to model the savings rate as a function of the other four variables:
    > g <- lm(sav ~ p15 + p75 + inc + gro)
    > summary(g)

    Call: lm(formula = sav ~ p15 + p75 + inc + gro)
    Residuals:
        Min     1Q  Median    3Q   Max 
     -8.242 -2.686 -0.2491 2.428 9.751

    Coefficients:
                   Value Std. Error  t value Pr(>|t|) 
    (Intercept)  28.5666   7.3545     3.8842   0.0003
            p15  -0.4612   0.1446    -3.1886   0.0026
            p75  -1.6916   1.0836    -1.5611   0.1255
            inc  -0.0003   0.0009    -0.3617   0.7193
            gro   0.4097   0.1962     2.0882   0.0425

    Residual standard error: 3.803 on 45 degrees of freedom
    Multiple R-Squared: 0.3385 
    F-statistic: 5.756 on 4 and 45 degrees of freedom, the p-value is 0.0007902
To test the null hypothesis that β1 = 0, i.e. that p15 contributes nothing once the other three predictors are in the model, we can simply read the p-value of 0.0026 from the p15 row of the table and conclude that the null should be rejected at the usual 5% level.

Let's do the same test using the general F-testing approach. We'll need the RSS and df for the full model (which represents the alternative hypothesis):

    > sum(g$res^2)
    [1] 650.7061
    > g$df
    [1] 45
and then fit the model that represents the null:
    > g2 <- lm(sav ~ p75 + inc + gro)
and compute the RSS and the F-statistic:
    > sum(g2$res^2)
    [1] 797.7234
    > (797.7234-650.706)/(650.706/45)
    [1] 10.16708
The p-value is then
    > 1-pf(10.16708,1,45)
    [1] 0.002602461
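Because the F-statistic for dropping a single predictor is the square of that predictor's t-statistic, we can relate this to the t-based test and p-value by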
    > sqrt(10.16708)
    [1] 3.188586
    > 2*(1-pt(3.188586,45))
    [1] 0.002602461
A somewhat more convenient way to compare two nested models is
    > anova(g2,g)
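
As a rough sketch of what this comparison returns, the block below refits both models, pulls the F-statistic and p-value out of the anova table, and sets them beside the squared t-statistic for the dropped predictor. It is written for R and assumes that the built-in `LifeCycleSavings` data frame is the same savings data used in these notes (with `pop15`, `pop75`, `dpi`, `ddpi`, `sr` in place of `p15`, `p75`, `inc`, `gro`, `sav`); the object names `full`, `null` and `tab` are illustrative only.

```r
# Assumption: LifeCycleSavings matches the savings data used in these notes.
full <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
null <- lm(sr ~ pop75 + dpi + ddpi, data = LifeCycleSavings)  # drops pop15

tab <- anova(null, full)      # smaller (null) model first
f_stat <- tab[2, "F"]         # F-statistic for adding pop15 back in
p_val  <- tab[2, "Pr(>F)"]    # matching p-value

# For a single coefficient, F should equal the squared t-statistic.
t_stat <- summary(full)$coefficients["pop15", "t value"]
print(c(F = f_stat, t.squared = t_stat^2, p.value = p_val))
```

The first row of the anova table describes the null model and the second row the full model, which is why the test results are read from row 2.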

Testing all the predictors

We can also test whether any of the predictors is significant in the model. In other words, we test the null hypothesis that β1 = β2 = β3 = β4 = 0.

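We can do it directly using the F-testing formula. Under the null, the model contains only the intercept, so its RSS is the total sum of squares about the mean: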

    > sum((sav-mean(sav))^2)
    [1] 983.6282
    > ((983.6282-650.706)/4)/(650.706/45)
    [1] 5.755863
    > 1-pf(5.755863,4,45)
    [1] 0.0007902025
Do you know where all the numbers come from? Check that they match the regression summary from lm().
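
Both hand calculations in this section follow the same pattern, so they can be wrapped in a small helper. This is only a sketch: `f_test` is a hypothetical name rather than a built-in function, and the RSS and df values are copied from the output above. The null models have 46 and 49 residual df because they estimate 4 and 1 coefficients, respectively, from the same 50 observations.

```r
# Hypothetical helper: F-test for a null model nested inside a full model,
# computed from each model's residual sum of squares and residual df.
f_test <- function(rss_null, df_null, rss_full, df_full) {
  num_df <- df_null - df_full                        # number of restrictions
  f <- ((rss_null - rss_full) / num_df) / (rss_full / df_full)
  p <- pf(f, num_df, df_full, lower.tail = FALSE)    # upper-tail probability
  c(F = f, p.value = p)
}

# Dropping p15 only: null model RSS 797.7234 on 46 df.
print(f_test(797.7234, 46, 650.7061, 45))

# Dropping all four predictors: null RSS is the total sum of squares on 49 df.
print(f_test(983.6282, 49, 650.7061, 45))
```

The two calls should agree with the F-statistics and p-values computed step by step above, up to rounding of the RSS values.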

 


Question: Suppose we believe that people under 15 and people over 75 are indistinguishable in their effect on the savings rate, for example because both groups are dependents with an equivalent effect. Express this hypothesis in statistical terms and test it.