[R-lang] Re: Collinearity and how to decide what predictors to include

Sun Jul 25 17:13:16 PDT 2010

Sorry for flooding your inboxes, but I sent an underspecified link just 
now.  The link to Peter's slides is here:

 http://hlplab.wordpress.com/2010/05/10/mini-womm-montreal-slides-now-available/

self-flagellating,
alex

Xiao He wrote:
> Dear R-lang users,
>
> I have a set of binomial data of the stress production of second 
> language learners of English on three types of compound words. If they 
> stressed the first word of a compound word, they got 1; otherwise, 
> they got 0. Every participant saw all three lists of words.
>
> Some of the potential predictors are: 
> (1). Age of Exposure to English (AoE);
> (2). Length of Exposure to English (LE) (time of experiment - AoE);
> (3). Age of Arrival (AoA) in the US;
> (4). Length of Stay (LS) in the US (time of experiment - AoA);
>
> (5). CompoundType;
> (6). FamiliarityRating.
>
> Given that some of these predictors may be correlated (AoA and LS, and 
> AoE and LE, FamiliarityRating and CompoundType), before I tried to fit 
> a model, I examined these predictors to see if they were highly 
> correlated. 
> I used cor.test() to examine potential correlations. For example:
>
> cor.test(data$AoA, data$LS)
> The correlation I got was -0.44, and the p-value was very small.
>
> On the other hand, the correlation between AoE and LE was not that
> strong (0.11), though it was statistically significant. So the
> obvious problem is collinearity. From what I've read online, it seems
> that one way to solve the collinearity problem is to work out which of
> the correlated predictors can be removed, and one method for doing so
> is model comparison. So should I do model comparisons like the
> following?
>
> model1 = lmer(word1 ~ compoundType + AoA + (1|subject) + (1|word),
>               family = "binomial", data = data)
> model2 = lmer(word1 ~ compoundType + AoA + LS + (1|subject) + (1|word),
>               family = "binomial", data = data)
> model3 = lmer(word1 ~ compoundType + AoA + LS + AoE +
>               (1|subject) + (1|word),
>               family = "binomial", data = data)
> model4 = lmer(word1 ~ compoundType + AoA + LS + AoE + LE +
>               (1|subject) + (1|word),
>               family = "binomial", data = data)
> model5 = lmer(word1 ~ compoundType + AoA + LS + AoE + LE +
>               familiarityRating + (1|subject) + (1|word),
>               family = "binomial", data = data)
>
> And then I use anova() to compare these models.
>
> However, I am not sure if this is the right way to do it because these 
> models only test for main effects. I would also like to see if the 
> predictor "compoundType" interacts with any of the age-related
> predictors as well as familiarityRating. But I am not really sure how
> model comparisons can be done with 4 predictors. I've only done model
> comparisons with two predictors.
>
> Another solution seems to be centering predictors. I've centered data
> with two two-level factors before, following examples from Jaeger's
> PowerPoint slides, but I've never done centering with more than two
> factors, let alone when the predictors to be centered are numeric (or
> interval).
>
> It would be great if someone could give me some suggestions for my 
> issues. Thank you in advance for your help!
>
>
> Best,
> Xiao He
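
A minimal sketch of the points raised in the question above, in R with
simulated data. Everything here is an assumption for illustration: the
data frame `d`, the column `ageAtTest`, the centered columns `cAoA` and
`cLS`, and all simulated values are made up and are not the poster's
data. Plain glm() is used so the example runs without lme4, which means
the random effects from the original models are left out. If both
lengths are computed from the same age at testing, as the definitions in
the question suggest, then AoE + LE and AoA + LS are the same quantity,
so a model containing all four predictors is not merely collinear but
rank-deficient. The sketch also shows that centering a numeric predictor
is just subtracting its mean, and that centering does not change the
correlation between two predictors (its usual benefit is reducing the
collinearity between main effects and their interaction terms).

```r
set.seed(1)

# Hypothetical data: one row per trial; all values are simulated.
n <- 400
d <- data.frame(
  ageAtTest    = round(runif(n, 20, 40)),   # age at time of experiment
  AoE          = round(runif(n, 3, 15)),    # age of exposure to English
  AoA          = round(runif(n, 16, 19)),   # age of arrival in the US
  compoundType = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
d$LE <- d$ageAtTest - d$AoE                 # length of exposure
d$LS <- d$ageAtTest - d$AoA                 # length of stay
d$word1 <- rbinom(n, 1, plogis(0.5 - 0.1 * (d$AoA - mean(d$AoA))))

# 1. Exact linear dependence: both sums equal age at test.
stopifnot(all(d$AoE + d$LE == d$AoA + d$LS))

# With all four predictors the design matrix is rank-deficient,
# so glm() reports NA for at least one coefficient.
full <- glm(word1 ~ AoA + LS + AoE + LE, family = binomial, data = d)
print(coef(full))

# 2. Centering a numeric predictor: subtract its mean.
d$cAoA <- d$AoA - mean(d$AoA)
d$cLS  <- d$LS  - mean(d$LS)

# Centering leaves the correlation between the two predictors unchanged.
print(all.equal(cor(d$AoA, d$LS), cor(d$cAoA, d$cLS)))

# 3. Nested model comparison: does compoundType interact with AoA?
m1 <- glm(word1 ~ compoundType + cAoA, family = binomial, data = d)
m2 <- glm(word1 ~ compoundType * cAoA, family = binomial, data = d)
cmp <- anova(m1, m2, test = "Chisq")        # likelihood-ratio test
print(cmp)
```

With lme4 installed, the same pattern should carry over to the mixed
models in the question by adding (1|subject) + (1|word) to each formula
and fitting with glmer(..., family = binomial); anova() on two such
nested fits likewise reports a likelihood-ratio test. Each interaction
of interest can be tested this way, one nested pair at a time.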