|
|
Analysis of Covariance> twins <- read.table("/afs/stat.lsa.umich.edu/users/student/kutsyy/.public_html/classes/data/twin.data",header=T)
> twins
Foster Biological Social
1 82 82 high
2 80 90 high
etc.
26 107 106 low
27 98 111 low
> motif()
> plot(twins$B,twins$F,type="n",xlab="Biological IQ",ylab="Foster IQ")
> text(twins$B,twins$F,substring(as.character(twins$S),1,1))
The most general model we'll consider is the seperate lines model:
> g <- lm(Foster ~ Biological*Social, twins) > summary(g,cor=F) Coefficients: Value Std. Error t value Pr(>|t|) (Intercept) 2.0495 11.9230 0.1719 0.8652 Biological 0.9662 0.1255 7.6981 0.0000 Social1 4.5383 12.2244 0.3713 0.7142 Social2 -0.6168 9.6097 -0.0642 0.9494 BiologicalSocial1 -0.0146 0.1223 -0.1191 0.9063 BiologicalSocial2 0.0032 0.1038 0.0308 0.9758 Residual standard error: 7.921 on 21 degrees of freedom Multiple R-Squared: 0.8041 F-statistic: 17.24 on 5 and 21 degrees of freedom, the p-value is 8.31e-07What do those coefficients mean? Must know the coding to determine this: > model.matrix(g) (Intercept) Biological Social1 Social2 BiologicalSocial1 BiologicalSocial2 1 1 82 -1 -1 -82 -82 2 1 90 -1 -1 -90 -90 etc. 26 1 106 1 -1 106 -106 27 1 111 1 -1 111 -111This is Helmert coding which we can see in the coding matrix: > contr.helmert(3) [,1] [,2] 1 -1 -1 2 1 -1 3 0 2This is hard to interpret. So let's change to treatment coding: > options(contrasts=c("contr.treatment","contr.poly"))
> g <- lm(Foster ~ Biological*Social, twins)
> summary(g,cor=F)
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -1.8720 17.8083 -0.1051 0.9173
Biological 0.9776 0.1632 5.9902 0.0000
Sociallow 9.0767 24.4487 0.3713 0.7142
Socialmiddle 2.6881 31.6042 0.0851 0.9330
BiologicalSociallow -0.0291 0.2446 -0.1191 0.9063
BiologicalSocialmiddle -0.0050 0.3295 -0.0152 0.9880
Residual standard error: 7.921 on 21 degrees of freedom
Multiple R-Squared: 0.8041
F-statistic: 17.24 on 5 and 21 degrees of freedom, the p-value is 8.31e-07
What's different and what's the same compared to the last fit?
Here's the coding matrix: > contr.treatment(3) 2 3 1 0 0 2 1 0 3 0 1Parameters can be interpreted relative to the first level (which is "high" class - alphabetically). This is easier to follow. I prefer to use the treatment coding unless there is some strong reason against it. Since I will use the treatment coding in all subsequent labs, you are advised to make this change Now see if the model can be simplified to the parallel lines model: > gr <- lm(Foster ~ Biological+Social, twins) > anova(gr,g) Analysis of Variance Table Response: Foster Terms Resid. Df RSS Test Df Sum of Sq 1 Biological + Social 23 1318.403 2 Biological * Social 21 1317.471 +Biological:Social 2 0.9318114 F Value Pr(F) 1 2 0.007426364 0.9926037Yes it can. The sequential testing can be done in one go: > anova(g) Analysis of Variance Table Response: Foster Terms added sequentially (first to last) Df Sum of Sq Mean Sq F Value Pr(F) Biological 1 5231.133 5231.133 83.38233 0.0000000 Social 2 175.131 87.565 1.39576 0.2696934 Biological:Social 2 0.932 0.466 0.00743 0.9926037 Residuals 21 1317.471 62.737We see that a further reduction to a single line model is possible: > gr <- lm(Foster ~ Biological, twins)Plot the regression line on the plot: > abline(gr$coef)Check the diagnostics: > par(mfrow=c(2,3)) > plot(gr)and check out the final model: > summary(gr) Coefficients: Value Std. Error t value Pr(>|t|) (Intercept) 9.2076 9.2999 0.9901 0.3316 Biological 0.9014 0.0963 9.3575 0.0000 Residual standard error: 7.729 on 25 degrees of freedom Multiple R-Squared: 0.7779 F-statistic: 87.56 on 1 and 25 degrees of freedom, the p-value is 1.204e-09One might be interested in a further simplification of this model to the line y=x (the IQ's are equal). Do you see how to do this test? (I get an F-statistic of 0.53 and a p-value of 0.595). Burt was interested in demonstrating the importance of heredity over environment in intelligence and this data certainly point that way. (Although it would be helpful to know the social class of the foster parents) However, before jumping to any conclusions, one should note that there is now considerable evidence that Cyril Burt invented some of his data on identical twins. In light of this, can you see anything in the above analysis that might lead one to suspect this data? |