[R-lang] Contrast Coding in R Regressions
Roger Levy
rlevy at ling.ucsd.edu
Sat Sep 19 20:19:26 PDT 2009
On Sep 15, 2009, at 9:16 PM, Rachel Baker wrote:
> Hi,
>
> I've recently started using R to do regressions, using the 'lmer'
> function. I am currently re-running some analyses that originally
> had treatment coding, so that they now have contrast coding. My
> question is about how to interpret contrast coded regression outputs.
>
> One of my independent variables (nativeLanguage) has 3 levels:
> English, Chinese, and Korean. As this experiment was conducted in
> English, participants in the English group were native speakers, and
> participants in the other two groups were non-native speakers. In
> my original treatment-coded analysis, English was the reference
> level. My output for e.g. 'langCompare.lmer =
> lmer(duration~nativeLanguage+(1|Subject), data=myData)' had lines
> like:
>
>                        Estimate  Std. Error  t value
> nativeLanguageChinese  0.025920    0.002384   10.872
> nativeLanguageKorean  -0.004416    0.002091   -2.112
>
> As I understood it, such lines gave information about the comparison
> between Chinese and English, and between Korean and English,
> respectively.
>
> I contrast coded this variable with the code:
> 'contrasts(myData$nativeLanguage) = c(-1, .5, .5)' (after ordering
> the levels: English, Chinese, Korean). This was in order to compare
> the native (English) group to the non-native (Chinese and Korean)
> groups. After this contrast coding, my output had lines like:
>
> Estimate Std. Error t value
> nativeLanguage1 0.10002 0.010113 11.242
> nativeLanguage2 -0.00046 0.639887 1.388
>
> I was wondering how to interpret this output. My guess is that
> nativeLanguage1 is the comparison between the native and non-native
> groups, and nativeLanguage2 is the comparison between Chinese and
> Korean, but I haven't been able to find any resources to confirm this.
Hi Rachel,
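Your guess is correct, but the situation may be a little more
complicated than you think. First, you need to realize that you
didn't specify a complete contrast matrix: a three-level factor needs
two contrast columns, and you supplied only one, so R filled in the
second for you. Here's a little code snippet to illustrate (note that
I use c(1,-.5,-.5), the sign-flipped version of your coding):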
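> library(lme4)  # provides lmer(), used below
> m <- 20        # number of speakers
> n <- 3         # observations per speaker per language
> lang <- factor(rep(c("English", "Chinese", "Korean"),m*n),
+                levels=c("English", "Chinese", "Korean"))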
> old.contrasts <- contrasts(lang)
> contrasts(lang) <- c(1,-.5,-.5)
> new.contrasts <- contrasts(lang)
Now, let's take a look at the old and new contrast matrices:
> old.contrasts
Chinese Korean
English 0 0
Chinese 1 0
Korean 0 1
> new.contrasts
[,1] [,2]
English 1.0 -5.551115e-17
Chinese -0.5 -7.071068e-01
Korean -0.5 7.071068e-01
The value of old.contrasts derives from the fact that by default, R
uses contr.treatment for unordered factors, with the first level of
the factor being the baseline (which for you is English, so that the
contrast matrix is all zeroes in the English row):
> options()$contrasts
unordered ordered
"contr.treatment" "contr.poly"
The value of new.contrasts reflects the fact that, quoting from
?contrasts, "If too few [entries for the contrast matrix] are
supplied, a suitable contrast matrix is created by extending value
after ensuring its columns are contrasts (orthogonal to the constant
term) and not collinear." Here R added the second column; its
-5.551115e-17 entry is just floating-point zero.
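A quick way to see what R does when it completes the matrix is sketched below. It uses base R only; the three-element factor f is a throwaway stand-in for illustration, not your data. The sketch checks that each column sums to zero, that the two columns are orthogonal to each other, and that they have different Euclidean lengths.

```r
# Throwaway three-level factor, used only to inspect the contrast matrix
f <- factor(c("English", "Chinese", "Korean"),
            levels = c("English", "Chinese", "Korean"))
contrasts(f) <- c(1, -.5, -.5)   # supply only one of the two columns
cm <- contrasts(f)

col.sums <- unname(colSums(cm))          # orthogonal to the constant: ~0
cross    <- sum(cm[, 1] * cm[, 2])       # columns orthogonal to each other: ~0
lens     <- unname(sqrt(colSums(cm^2)))  # Euclidean length of each column

print(round(col.sums, 10))
print(round(cross, 10))
print(lens)
```

The unequal column lengths matter later, when comparing the sizes of the two coefficients.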
Now let's generate some artificial data and look at how to interpret
models fit using the old and new contrast matrices:
> set.seed(3)
> beta <- c(0,0.26,-0.004)
> speaker <- rep(1:m,nlevels(lang)*n)
> b <- rnorm(m,0,0.1)
> y <- beta[lang] + b[speaker] + rnorm(3*m*n)
> contrasts(lang) <- old.contrasts
> print(m.old <- lmer(y ~ lang + (1 | speaker),REML=F))
[...]
Fixed effects:
Estimate Std. Error t value
(Intercept) -0.01403 0.13236 -0.1060
langChinese 0.49860 0.18719 2.6636
langKorean -0.11447 0.18719 -0.6115
[...]
> contrasts(lang) <- new.contrasts
> print(m.new <- lmer(y ~ lang + (1 | speaker),REML=F))
[...]
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.11402 0.07642 1.492
lang1 -0.12804 0.10807 -1.185
lang2 -0.43350 0.13236 -3.275
[...]
Ignoring speaker-specific effects, the predicted mean for a given
language is the intercept plus the dot product of the language's
contrast-matrix representation with the coefficients for the language
factor. Since the two models are equivalent, their predicted means
should be the same for each language. And they are:
> ## compare old contrasts and new contrasts
> ## English: old model
> fixef(m.old)[1] + sum(old.contrasts["English",] * fixef(m.old)[2:3])
(Intercept)
-0.01402749
> ## English: new model
> fixef(m.new)[1] + sum(new.contrasts["English",] * fixef(m.new)[2:3])
(Intercept)
-0.01402749
The same will come out to be the case for the other two languages.
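To check all three languages at once, here is a minimal sketch that puts a column of ones next to each contrast matrix and multiplies by the fixed effects. So that it runs without refitting anything, it types in the rounded estimates printed above instead of calling fixef(), which means the two sets of means agree only up to rounding error. The names old.cm, new.cm, b.old, b.new and lang.names are introduced here for illustration.

```r
lang.names <- c("English", "Chinese", "Korean")

# Contrast matrices as printed above (the -5.55e-17 entry treated as 0)
old.cm <- cbind(c(0, 1, 0), c(0, 0, 1))
new.cm <- cbind(c(1, -.5, -.5), c(0, -1, 1) / sqrt(2))

# Rounded fixed effects copied from the two model summaries
b.old <- c(-0.01403, 0.49860, -0.11447)
b.new <- c(0.11402, -0.12804, -0.43350)

# Predicted mean = intercept + (contrast row) . (coefficients)
mean.old <- drop(cbind(1, old.cm) %*% b.old)
mean.new <- drop(cbind(1, new.cm) %*% b.new)

means <- rbind(old = mean.old, new = mean.new)
colnames(means) <- lang.names
print(round(means, 4))
```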
So -- to get back to your question: what do the nativeLanguage1 and
nativeLanguage2 coefficients mean in your new model? First, your
contrast matrix has columns summing to 0, so the intercept can loosely
be thought of as the predicted grand mean (the unweighted average of
the three language means). The coefficient for nativeLanguage1 is at
once (a) the difference between the intercept and the English mean,
and (b) twice the difference between the average of the Chinese and
Korean means and the intercept. Equivalently, it is two-thirds of the
difference between the non-native average and the English mean. (With
the sign-flipped coding c(1,-.5,-.5) in my example, these differences
all change sign.) The coefficient for nativeLanguage2 is the
difference between Korean and Chinese divided by the square root of
two, assuming R completed your matrix with the same second column as
in new.contrasts. So your guess was basically correct. But it is
important to recognize that these two coefficients operate on
different scales, as reflected by the fact that the two columns of
new.contrasts have different Euclidean lengths (sqrt(1.5) versus 1).
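These relationships can be checked numerically. The sketch below takes the three language means implied by the rounded m.old estimates above, solves for the coefficients under the c(1,-.5,-.5) coding from my example, and compares them with the closed-form expressions. The last part uses a rescaled contrast matrix; resc.cm is a hypothetical name, and this rescaling is only one option, shown to illustrate how the scale of a column changes what its coefficient means.

```r
# Language means implied by the rounded m.old estimates above
mu <- c(English = -0.01403,
        Chinese = -0.01403 + 0.49860,
        Korean  = -0.01403 - 0.11447)

# Contrast matrix from the example: c(1,-.5,-.5) plus R's added column
new.cm <- cbind(c(1, -.5, -.5), c(0, -1, 1) / sqrt(2))

# Solve mu = intercept + new.cm %*% coefficients
b <- solve(cbind(1, new.cm), mu)

closed.form <- c(mean(mu),                                      # intercept
                 mu[["English"]] - mean(mu),                    # lang1
                 (mu[["Korean"]] - mu[["Chinese"]]) / sqrt(2))  # lang2
print(round(cbind(solved = b, closed.form), 5))

# Rescaled coding (illustrative): coefficients are plain differences
resc.cm <- cbind(c(2/3, -1/3, -1/3), c(0, -.5, .5))
b.resc <- solve(cbind(1, resc.cm), mu)
print(round(b.resc, 5))
# b.resc[2]: English mean minus the average of Chinese and Korean
# b.resc[3]: Korean mean minus Chinese mean
```

The solved coefficients reproduce the m.new estimates up to rounding, while the rescaled version trades R's unit-length second column for coefficients that read directly as mean differences.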
> I would greatly appreciate any advice on how to interpret
> regressions after contrast coding, or pointers to appropriate
> resources on this topic!
So -- I wish I knew a really good reference on contrast coding. There
is some useful information in Chambers & Hastie 1991, Section 2.3.2,
and in Venables & Ripley 2002, Section 6.2. I think that Healy 2000
("Matrices for Statistics") is a useful book that has some pertinent
information. But if anyone out there knows a great reference for
contrast coding -- I'd love to hear it too!
Best
Roger
--
Roger Levy Email: rlevy at ling.ucsd.edu
Assistant Professor Phone: 858-534-7219
Department of Linguistics Fax: 858-534-4789
UC San Diego Web: http://ling.ucsd.edu/~rlevy