[R-lang] Contrast Coding in R Regressions

Sat Sep 19 20:19:26 PDT 2009

On Sep 15, 2009, at 9:16 PM, Rachel Baker wrote:

> Hi,
>
> I've recently started using R to do regressions, using the 'lmer'  
> function.  I am currently re-running some analyses that originally  
> had treatment coding, so that they now have contrast coding.  My  
> question is about how to interpret contrast coded regression outputs.
>
> One of my independent variables (nativeLanguage) has 3 levels:  
> English, Chinese, and Korean.  As this experiment was conducted in  
> English, participants in the English group were native speakers, and  
> participants in the other two groups were non-native speakers.  In  
> my original treatment-coded analysis, English was the reference  
> level.  My output for e.g. 'langCompare.lmer =  
> lmer(duration~nativeLanguage+(1|Subject), data=myData)' had lines  
> like:
>
>                         Estimate  Std. Error  t value
> nativeLanguageChinese   0.025920    0.002384   10.872
> nativeLanguageKorean   -0.004416    0.002091   -2.112
>
> As I understood it, such lines gave information about the comparison  
> between Chinese and English, and between Korean and English,  
> respectively.
>
> I contrast coded this variable with the code: 'contrasts(myData 
> $nativeLanguage) = c(-1, .5, .5)' (after ordering the levels:  
> English, Chinese, Korean).  This was in order to compare the native  
> (English) group to the non-native (Chinese and Korean) groups.   
> After this contrast coding, my output had lines like:
>
>                   Estimate  Std. Error  t value
> nativeLanguage1    0.10002    0.010113   11.242
> nativeLanguage2   -0.00046    0.639887    1.388
>
> I was wondering how to interpret this output.  My guess is that  
> nativeLanguage1 is the comparison between the native and non-native  
> groups, and nativeLanguage2 is the comparison between Chinese and  
> Korean, but I haven't been able to find any resources to confirm this.

Hi Rachel,

Your guess is correct, but the situation may be a little more
complicated than you think.  First, you need to realize that you
didn't specify a complete contrast matrix: a three-level factor needs
two contrast columns, and you supplied only one.  Here's a little code
snippet to illustrate:

 > m <- 20
 > n <- 3
 > lang <- factor(rep(c("English", "Chinese", "Korean"),m*n), levels=c("English", "Chinese", "Korean"))
 > old.contrasts <- contrasts(lang)
 > contrasts(lang) <- c(1,-.5,-.5)
 > new.contrasts <- contrasts(lang)

Now, let's take a look at the old and new contrast matrices:

 > old.contrasts
         Chinese Korean
English       0      0
Chinese       1      0
Korean        0      1
 > new.contrasts
         [,1]          [,2]
English  1.0 -5.551115e-17
Chinese -0.5 -7.071068e-01
Korean  -0.5  7.071068e-01

The value of old.contrasts derives from the fact that by default, R  
uses contr.treatment for unordered factors, with the first level of  
the factor being the baseline (which for you is English, so that the  
contrast matrix is all zeroes in the English row):

 > options()$contrasts
         unordered           ordered
"contr.treatment"      "contr.poly"

The value of new.contrasts reflects the fact that -- quoting from
?contrasts -- "If too few [entries for the contrast matrix] are
supplied, a suitable contrast matrix is created by extending value
after ensuring its columns are contrasts (orthogonal to the constant
term) and not collinear."  (The -5.551115e-17 in the English row is
just floating-point zero.)
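
To make the extension concrete, here is a minimal sketch that checks the
properties the help page describes and then shows one way to avoid relying
on the automatic extension.  The explicit second column below (Chinese vs.
Korean) and its column names are illustrative choices of mine, not
something taken from the original analysis.

```r
# Three-level factor with English first, as in the thread.
lang <- factor(c("English", "Chinese", "Korean"),
               levels = c("English", "Chinese", "Korean"))

# Supply only one column; R fills in the second one itself.
contrasts(lang) <- c(1, -0.5, -0.5)
C <- contrasts(lang)

# Each column sums to (numerically) zero, i.e. is orthogonal to the
# constant term, and the two columns are orthogonal to each other.
print(round(colSums(C), 10))
print(round(crossprod(C), 10))

# Alternative: state both columns explicitly so nothing is left to R.
# The second column and the column names are illustrative assumptions.
full <- cbind(nativeVsNonnative = c(1, -0.5, -0.5),
              ChineseVsKorean   = c(0, -0.5,  0.5))
contrasts(lang) <- full
print(contrasts(lang))
```

Writing out the full matrix also gives the coefficients readable names
instead of lang1 and lang2.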

Now let's generate some artificial data and look at how to interpret  
models fit using the old and new contrast matrices:

 > library(lme4)
 > set.seed(3)
 > beta <- c(0,0.26,-0.004)
 > speaker <- rep(1:m,3*n)
 > b <- rnorm(m,0,0.1)
 > y <- beta[lang] + b[speaker] + rnorm(3*m*n)
 > contrasts(lang) <- old.contrasts
 > print(m.old <- lmer(y ~ lang + (1 | speaker),REML=F))
[...]
Fixed effects:
             Estimate Std. Error t value
(Intercept) -0.01403    0.13236 -0.1060
langChinese  0.49860    0.18719  2.6636
langKorean  -0.11447    0.18719 -0.6115
[...]
 > contrasts(lang) <- new.contrasts
 > print(m.new <- lmer(y ~ lang + (1 | speaker),REML=F))
[...]
Fixed effects:
             Estimate Std. Error t value
(Intercept)  0.11402    0.07642   1.492
lang1       -0.12804    0.10807  -1.185
lang2       -0.43350    0.13236  -3.275
[...]

Ignoring speaker-specific effects, the predicted mean for a given  
language is the intercept plus the dot product of the language's  
contrast-matrix representation with the coefficients for the language  
factor.  Since the two models are equivalent, their predicted means  
should be the same for each language.  And they are:

 > ## compare old contrasts and new contrasts
 > ## English: old model
 > fixef(m.old)[1] + sum(old.contrasts["English",] * fixef(m.old)[2:3])
(Intercept)
-0.01402749
 > ## English: new model
 > fixef(m.new)[1] + sum(new.contrasts["English",] * fixef(m.new)[2:3])
(Intercept)
-0.01402749

The same will come out to be the case for the other two languages.
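
For all three languages at once, the same check can be written as a matrix
product.  The sketch below is self-contained and therefore makes some
assumptions: it uses plain lm() as a stand-in for lmer() (so there are no
speaker effects), and the seed and group means are made up for
illustration.

```r
set.seed(1)  # arbitrary seed for this illustration
levs <- c("English", "Chinese", "Korean")
lang <- factor(rep(levs, 60), levels = levs)
y <- c(0, 0.5, -0.1)[lang] + rnorm(length(lang))  # made-up group means

# Fit the same data under treatment coding and under the extended contrast.
old.contrasts <- contr.treatment(levs)
contrasts(lang) <- old.contrasts
m.old <- lm(y ~ lang)

contrasts(lang) <- c(1, -0.5, -0.5)
new.contrasts <- contrasts(lang)
m.new <- lm(y ~ lang)

# Predicted mean per language = intercept + (contrast row) . (coefficients),
# computed for all rows at once as a matrix product.
means.old <- drop(cbind(1, old.contrasts) %*% coef(m.old))
means.new <- drop(cbind(1, new.contrasts) %*% coef(m.new))
print(rbind(old = means.old, new = means.new))
```

With a fitted lmer model, fixef() would take the place of coef() here.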

So -- to get back to your question: what do the nativeLanguage1 and  
nativeLanguage2 coefficients mean in your new model?  First, your
contrast matrix has columns summing to 0, so the intercept can loosely
be thought of as the predicted grand mean (more precisely, the
unweighted average of the three language means).  With English coded
as 1 (as in my snippet), the coefficient for nativeLanguage1 is the
English mean minus the intercept, which also equals twice the
difference between the intercept and the average of the Chinese and
Korean means; with English coded as -1 (as in your code), the sign
flips.  The coefficient for nativeLanguage2 is the difference between
Chinese and Korean divided by the square root of two, with a sign that
depends on the column R generates.  So your guess was basically
correct.  But
it is important to recognize that these two coefficients operate on  
different scales, as reflected by the fact that the two columns of  
new.contrasts are vectors of different lengths.
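
These relationships can be checked numerically.  As before, this is only a
sketch under stated assumptions: lm() stands in for lmer(), the data are
simulated with made-up means, and English is coded as 1 as in
new.contrasts.

```r
set.seed(2)  # arbitrary seed for this illustration
levs <- c("English", "Chinese", "Korean")
lang <- factor(rep(levs, 60), levels = levs)
y <- c(0, 0.5, -0.1)[lang] + rnorm(length(lang))  # made-up group means

contrasts(lang) <- c(1, -0.5, -0.5)   # English coded 1, as in new.contrasts
C <- contrasts(lang)
b <- unname(coef(lm(y ~ lang)))       # b[1] intercept, b[2] lang1, b[3] lang2
mu <- sapply(split(y, lang), mean)    # observed group means
nonnative <- mean(mu[c("Chinese", "Korean")])

# Intercept: unweighted average of the three group means.
print(c(intercept = b[1], grand.mean = mean(mu)))

# lang1: English mean minus intercept, which equals twice
# (intercept minus the non-native average).
print(c(lang1 = b[2],
        english.minus.int = unname(mu["English"]) - b[1],
        twice.int.minus.nonnative = 2 * (b[1] - nonnative)))

# lang2: its magnitude is |Chinese - Korean| / sqrt(2); the sign depends
# on the orientation of the column R generated.
print(c(abs.lang2 = abs(b[3]),
        scaled.diff = unname(abs(mu["Chinese"] - mu["Korean"])) / sqrt(2)))

# The columns have different Euclidean lengths, hence the different scales.
print(sqrt(colSums(C^2)))
```

If you want both coefficients on the scale of a raw difference between
means, one option is to supply both contrast columns yourself, scaled
accordingly, rather than letting R extend a single column.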

> I would greatly appreciate any advice on how to interpret  
> regressions after contrast coding, or pointers to appropriate  
> resources on this topic!

So -- I wish I knew a really good reference on contrast coding.  There  
is some useful information in Chambers & Hastie 1991, Section 2.3.2,  
and in Venables & Ripley 2002, Section 6.2.  I think that Healy 2000  
("Matrices for Statistics") is a useful book that has some pertinent  
information.  But if anyone out there knows a great reference for  
contrast coding -- I'd love to hear it too!

Best

Roger

--

Roger Levy                      Email: rlevy at ling.ucsd.edu
Assistant Professor             Phone: 858-534-7219
Department of Linguistics       Fax:   858-534-4789
UC San Diego                    Web:   http://ling.ucsd.edu/~rlevy