Analysis of Covariance

The data for this example consist of x= nave height and y = total length in feet for English medieval cathedrals. Some are in the Romanesque (r) style and others are in the Gothic (g) style. Some cathedrals have parts in both styles and are listed twice. We wish to investigate how the length is related to height for the two styles. Read in the data and examine.
> cath <- read.table("/afs/stat.lsa.umich.edu/users/student/kutsyy/.public_html/classes/data/cathedral.data")
> cath
             style   x   y 
      Durham     r  75 502
  Canterbury     r  80 522
etc.
WinchesterG     g 103 530
 Old.St.Paul     g 103 611
   Salisbury     g  84 473
> summary(cath)
 style           x               y        
 g:16    Min.   : 45.00   Min.   :182.0  
 r: 9    1st Qu.: 67.00   1st Qu.:370.0  
         Median : 75.00   Median :425.0  
         Mean   : 74.76   Mean   :425.5  
         3rd Qu.: 83.00   3rd Qu.:519.0  
         Max.   :103.00   Max.   :611.0
Now plot the data:
> motif()
> plot(cath$x,cath$y,type="n",xlab="Nave height",ylab="Length")
> text(cath$x,cath$y,as.character(cath$s))
Now fit the seperate regression lines model. Note that y ~ x*style is equivalent.
> g <- lm(y ~ x+style+x:style, data=cath)
> summary(g)
Residuals:
    Min     1Q Median    3Q  Max 
 -172.7 -30.22  23.75 55.78 89.5

Coefficients:
                Value Std. Error   t value  Pr(>|t|) 
(Intercept)  139.4718  173.6037     0.8034    0.4307
          x    3.9728    2.3206     1.7120    0.1016
      style  102.3610  173.6037     0.5896    0.5617
    x:style   -0.8347    2.3206    -0.3597    0.7227

Residual standard error: 79.11 on 21 degrees of freedom
Multiple R-Squared: 0.5412 
F-statistic: 8.257 on 3 and 21 degrees of freedom, the p-value is 0.0008072
Because style is non-numeric, S-PLUS automatically treats it as a qualitative variables and sets up a coding - but which coding?
> model.matrix(g)
             (Intercept)   x style x:style 
      Durham           1  75     1      75
  Canterbury           1  80     1      80
etc.
Old.St.Paul           1 103    -1    -103
   Salisbury           1  84    -1     -84
We can use another coding - for example:
> dim(cath)
[1] 25  3
> dummy <- rep(0,25)
> dummy[cath$s == 'g'] <- 1
> dummy
 [1] 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Refit the model - note the necessity of using I() so that * gets interpreted as multiply.
> ga <- lm( y ~ x + dummy + I(x*dummy), data=cath)
> summary(ga)
Residuals:
    Min     1Q Median    3Q  Max 
 -172.7 -30.22  23.75 55.78 89.5

Coefficients:
                 Value Std. Error   t value  Pr(>|t|) 
 (Intercept)  241.8327  336.4711     0.7187    0.4802
           x    3.1381    4.5059     0.6964    0.4938
       dummy -204.7220  347.2075    -0.5896    0.5617
I(x * dummy)    1.6695    4.6411     0.3597    0.7227

Residual standard error: 79.11 on 21 degrees of freedom
Multiple R-Squared: 0.5412 
F-statistic: 8.257 on 3 and 21 degrees of freedom, the p-value is 0.0008072
Compare to the previous regression result - what's the same and what's different?

 We see that the model can be simplified to

> g <- lm(y ~ x+style, data=cath)
> summary(g)
Coefficients:
              Value Std. Error t value Pr(>|t|) 
(Intercept) 84.4942 80.6861     1.0472  0.3064 
          x  4.7116  1.0584     4.4516  0.0002 
      style 40.1963 16.1532     2.4884  0.0209 

Residual standard error: 77.53 on 22 degrees of freedom
Multiple R-Squared: 0.5384 
F-statistic: 12.83 on 2 and 22 degrees of freedom, the p-value is 0.0002028
Put the two parallel regression on the plot:
> abline(84.4942+40.1963,4.7116)
> abline(84.4942-40.1963,4.7116,lty=2)
Check out the diagnostics:
> par(mfrow=c(2,3))
> plot(g)
What do you see?

 Our conclusion is that for cathedrals of the same height, Romanesque ones are about 80 (=2*40.2) feet longer. For each extra foot height, both types of cathedral are about 4.7 feet longer.

 Consider the summary subsetted by style

> by(cath,cath$style,summary)
cath$style:g
 style           x               y        
 g:16    Min.   : 45.00   Min.   :182.0  
 r: 0    1st Qu.: 60.75   1st Qu.:298.8  
         Median : 73.50   Median :412.0  
         Mean   : 74.94   Mean   :397.4  
         3rd Qu.: 86.50   3rd Qu.:481.2  
         Max.   :103.00   Max.   :611.0  
------------------------------------------------------------
cath$style:r
 style           x               y       
 g:0     Min.   :64.00   Min.   :344.0  
 r:9     1st Qu.:70.00   1st Qu.:425.0  
         Median :75.00   Median :502.0  
         Mean   :74.44   Mean   :475.4  
         3rd Qu.:80.00   3rd Qu.:530.0  
         Max.   :83.00   Max.   :551.0
Notice that in this case the two groups have about the same average heights. Look at the difference in the lengths.

 With the alternative coding, we get:

> ga <- lm( y ~ x + dummy, data=cath)
> summary(ga)
Coefficients:
                Value Std. Error   t value  Pr(>|t|) 
(Intercept)  124.6905   82.9216     1.5037    0.1469
          x    4.7116    1.0584     4.4516    0.0002
      dummy  -80.3925   32.3063    -2.4884    0.0209

Residual standard error: 77.53 on 22 degrees of freedom
Multiple R-Squared: 0.5384 
F-statistic: 12.83 on 2 and 22 degrees of freedom, the p-value is 0.0002028
What does the coefficient for dummy tell us?