Analysis of Covariance

Qualitative predictor with more than 2 levels
The data for this example come from a 1966 paper by Cyril Burt entitled "The genetic determination of differences in intelligence: A study of monozygotic twins reared apart". The data consist of IQ scores for identical twins, one raised by foster parents, the other by the natural parents. We also know the social class of natural parents (high, middle or low). We are interested in predicting the IQ of the twin with foster parents from the IQ of the twin with the natural parents and the social class of natural parents. Let's read in and take a look at the data:
> twins <- read.table("/afs/stat.lsa.umich.edu/users/student/kutsyy/.public_html/classes/data/twin.data",header=T)
> twins
   Foster Biological Social 
 1     82         82   high
 2     80         90   high
etc.
26    107        106    low
27     98        111    low
> motif()
> plot(twins$B,twins$F,type="n",xlab="Biological IQ",ylab="Foster IQ")
> text(twins$B,twins$F,substring(as.character(twins$S),1,1))
The most general model we'll consider is the seperate lines model:
> g <- lm(Foster ~ Biological*Social, twins)
> summary(g,cor=F)
Coefficients:
                     Value Std. Error  t value Pr(>|t|) 
      (Intercept)   2.0495  11.9230     0.1719   0.8652
       Biological   0.9662   0.1255     7.6981   0.0000
          Social1   4.5383  12.2244     0.3713   0.7142
          Social2  -0.6168   9.6097    -0.0642   0.9494
BiologicalSocial1  -0.0146   0.1223    -0.1191   0.9063
BiologicalSocial2   0.0032   0.1038     0.0308   0.9758

Residual standard error: 7.921 on 21 degrees of freedom
Multiple R-Squared: 0.8041 
F-statistic: 17.24 on 5 and 21 degrees of freedom, the p-value is 8.31e-07
What do those coefficients mean? Must know the coding to determine this:
> model.matrix(g)
   (Intercept) Biological Social1 Social2 BiologicalSocial1 BiologicalSocial2 
 1           1         82      -1      -1               -82               -82
 2           1         90      -1      -1               -90               -90
etc.
26           1        106       1      -1               106              -106
27           1        111       1      -1               111              -111
This is Helmert coding which we can see in the coding matrix:
> contr.helmert(3)
  [,1] [,2] 
1   -1   -1
2    1   -1
3    0    2
This is hard to interpret. So let's change to treatment coding:
> options(contrasts=c("contr.treatment","contr.poly"))
> g <- lm(Foster ~ Biological*Social, twins)
> summary(g,cor=F)
Coefficients:
                          Value Std. Error  t value Pr(>|t|) 
           (Intercept)  -1.8720  17.8083    -0.1051   0.9173
            Biological   0.9776   0.1632     5.9902   0.0000
             Sociallow   9.0767  24.4487     0.3713   0.7142
          Socialmiddle   2.6881  31.6042     0.0851   0.9330
   BiologicalSociallow  -0.0291   0.2446    -0.1191   0.9063
BiologicalSocialmiddle  -0.0050   0.3295    -0.0152   0.9880

Residual standard error: 7.921 on 21 degrees of freedom
Multiple R-Squared: 0.8041 
F-statistic: 17.24 on 5 and 21 degrees of freedom, the p-value is 8.31e-07
What's different and what's the same compared to the last fit?

 Here's the coding matrix:

> contr.treatment(3)
  2 3 
1 0 0
2 1 0
3 0 1
Parameters can be interpreted relative to the first level (which is "high" class - alphabetically). This is easier to follow. I prefer to use the treatment coding unless there is some strong reason against it. Since I will use the treatment coding in all subsequent labs, you are advised to make this change

 Now see if the model can be simplified to the parallel lines model:

> gr <- lm(Foster ~ Biological+Social, twins)
> anova(gr,g)
Analysis of Variance Table

Response: Foster

                Terms Resid. Df      RSS               Test Df Sum of Sq 
1 Biological + Social        23 1318.403                                
2 Biological * Social        21 1317.471 +Biological:Social  2 0.9318114
      F Value     Pr(F) 
1                      
2 0.007426364 0.9926037
Yes it can. The sequential testing can be done in one go:
> anova(g)
Analysis of Variance Table

Response: Foster

Terms added sequentially (first to last)
                  Df Sum of Sq  Mean Sq  F Value     Pr(F) 
Biological         1  5231.133 5231.133 83.38233 0.0000000
Social             2   175.131   87.565  1.39576 0.2696934
Biological:Social  2     0.932    0.466  0.00743 0.9926037
Residuals         21  1317.471   62.737
We see that a further reduction to a single line model is possible:
> gr <- lm(Foster ~ Biological, twins)
Plot the regression line on the plot:
> abline(gr$coef)
Check the diagnostics:
> par(mfrow=c(2,3))
> plot(gr)
and check out the final model:
> summary(gr)
Coefficients:
             Value Std. Error t value Pr(>|t|) 
(Intercept) 9.2076 9.2999     0.9901  0.3316  
 Biological 0.9014 0.0963     9.3575  0.0000  

Residual standard error: 7.729 on 25 degrees of freedom
Multiple R-Squared: 0.7779 
F-statistic: 87.56 on 1 and 25 degrees of freedom, the p-value is 1.204e-09
One might be interested in a further simplification of this model to the line y=x (the IQ's are equal). Do you see how to do this test? (I get an F-statistic of 0.53 and a p-value of 0.595).

 Burt was interested in demonstrating the importance of heredity over environment in intelligence and this data certainly point that way. (Although it would be helpful to know the social class of the foster parents)

 However, before jumping to any conclusions, one should note that there is now considerable evidence that Cyril Burt invented some of his data on identical twins. In light of this, can you see anything in the above analysis that might lead one to suspect this data?