[R-lang] Collinearity and how to decide what predictors to include

Sun Jul 25 16:32:08 PDT 2010

Dear R-lang users,

I have a set of binomial data of the stress production of second language
learners of English on three types of compound words. If they stressed the
first word of a compound word, they got 1; otherwise, they got 0. Every
participant saw all three lists of words.

Some of the potential predictors are:
(1). Age of Exposure to English (AoE);
(2). Length of Exposure to English (LE) (time of experiment - AoE);
(3). Age of Arrival (AoA) in the US;
(4). Length of Stay (LS) in the US (time of experiment - AoA);
(5). CompoundType;
(6). FamiliarityRating.

Given that some of these predictors may be correlated (AoA and LS, and AoE
and LE, FamiliarityRating and CompoundType), before I tried to fit a model,
I examined these predictors to see if they were highly correlated.
I used cor.test() to examine potential correlations. For example:

cor.test(data$AoA, data$LS)
The correlation I got was -0.44, and the p-value was very small.
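
As a rough sketch of the check described above, all four age-related
predictors can be inspected at once with a correlation matrix. The data frame
below is simulated and its values are made up; only the column names (AoE, LE,
AoA, LS) come from the post, and `age_at_test` is a hypothetical helper
variable standing in for "time of experiment". It also assumes one row per
participant.

```r
# Simulated stand-in for the real data; every value here is invented.
set.seed(1)
n <- 40                                    # hypothetical number of participants
age_at_test <- round(runif(n, 20, 40))     # hypothetical: age at time of experiment
AoE <- round(runif(n, 5, 15))              # age of exposure to English
AoA <- pmin(round(runif(n, 15, 30)), age_at_test - 1)  # age of arrival in the US

d <- data.frame(
  AoE = AoE,
  LE  = age_at_test - AoE,                 # length of exposure
  AoA = AoA,
  LS  = age_at_test - AoA                  # length of stay
)

# Pairwise Pearson correlations among all four predictors
cor_mat <- cor(d)
print(round(cor_mat, 2))

# cor.test() on a single pair, as in the post, adds a p-value
ct <- cor.test(d$AoA, d$LS)
ct$estimate
ct$p.value
```

If the real data frame has one row per trial rather than per participant, each
participant's values are repeated many times, so the p-value from cor.test()
would be based on an inflated number of observations. Reducing the data to one
row per participant first (for example with unique()) may be safer.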

On the other hand, the correlation between AoE and LE was not that strong
(0.11), though it was statistically significant. So the obvious problem is
collinearity. From what I've read online, it seems that one way to solve the
collinearity problem is to compare the correlated predictors and see which of
them can be removed, and one method for doing so is model comparison. So
should I do model comparisons like the following?

model1 = lmer(word1 ~ compoundType + AoA + (1|subject) + (1|word),
              family = "binomial", data = data)
model2 = lmer(word1 ~ compoundType + AoA + LS + (1|subject) + (1|word),
              family = "binomial", data = data)
model3 = lmer(word1 ~ compoundType + AoA + LS + AoE + (1|subject) + (1|word),
              family = "binomial", data = data)
model4 = lmer(word1 ~ compoundType + AoA + LS + AoE + LE + (1|subject) +
              (1|word), family = "binomial", data = data)
model5 = lmer(word1 ~ compoundType + AoA + LS + AoE + LE + familiarityRating +
              (1|subject) + (1|word), family = "binomial", data = data)

And then I use anova() to compare these models.

However, I am not sure if this is the right way to do it because these
models only test for main effects. I would also like to see if the predictor
"compoundType" interacts with any of the age-related predictors as well as
familiarityRating. But I am not really sure how model comparisons can be
done with 4 predictors. I've only done model comparisons with two
predictors.
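
One way the comparison could be extended to an interaction is to add one term
at a time and compare each pair of nested models. The sketch below is only an
illustration under several assumptions: the data are simulated, the lme4
package is installed, glmer() is used in place of lmer(..., family =
"binomial") because current lme4 versions expect it for binomial models, and
`cAoA` is a hypothetical centered copy of AoA.

```r
library(lme4)  # assumed to be installed; glmer() fits binomial mixed models

set.seed(2)
# Simulated stand-in data: 20 subjects crossed with 30 words (10 per type)
subjects <- data.frame(subject = factor(1:20),
                       AoA = round(runif(20, 15, 30)))
words <- data.frame(word = factor(1:30),
                    compoundType = factor(rep(c("A", "B", "C"), each = 10)))
d <- merge(subjects, words)                # no shared columns: full crossing
d$cAoA <- d$AoA - mean(d$AoA)              # hypothetical centered predictor

# Made-up response: effect of compound type plus subject and word intercepts
subj_eff <- rnorm(20, sd = 0.5)[as.integer(d$subject)]
word_eff <- rnorm(30, sd = 0.5)[as.integer(d$word)]
eta <- -0.5 + 0.8 * (d$compoundType == "B") + subj_eff + word_eff
d$word1 <- rbinom(nrow(d), 1, plogis(eta))

# Nested sequence: baseline, plus main effect, plus interaction
m_base <- glmer(word1 ~ compoundType + (1 | subject) + (1 | word),
                family = binomial, data = d)
m_main <- update(m_base, . ~ . + cAoA)
m_int  <- update(m_main, . ~ . + compoundType:cAoA)

# Likelihood-ratio tests between successive models
cmp <- anova(m_base, m_main, m_int)
print(cmp)
```

Each row of the anova() table after the first tests whether the term added at
that step improves the fit over the previous model. The same pattern should
carry over to the other predictors by swapping the term passed to update().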

Another solution seems to be centering predictors. I've centered data with
two two-level factors before, following examples from Jaeger's PowerPoint
slides, but I've never done centering with more than two factors, let alone
when the predictors to be centered are numeric (or interval).
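
For numeric predictors, centering amounts to subtracting the mean. A minimal
sketch with made-up values follows; the column names `cAoA` and `cFam` are
hypothetical, chosen only for this illustration.

```r
# Made-up values for two numeric predictors
d <- data.frame(AoA = c(18, 22, 25, 30, 21),
                familiarityRating = c(3.5, 6.0, 4.2, 5.1, 6.8))

# Center by hand: subtract the mean so that 0 corresponds to the average
d$cAoA <- d$AoA - mean(d$AoA)

# scale() does the same; scale = FALSE keeps the original units
d$cFam <- as.numeric(scale(d$familiarityRating, scale = FALSE))

# Centering shifts each variable but leaves their correlation unchanged
cor(d$AoA, d$familiarityRating)
cor(d$cAoA, d$cFam)
```

Because centering only shifts each variable, it does not reduce the
correlation between two different predictors such as AoA and LS. What it can
reduce is the correlation between a predictor and an interaction term built
from it.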

It would be great if someone could give me some suggestions for my issues.
Thank you in advance for your help!

Best,
Xiao He