[R-lang] Re: Collinearity and how to decide what predictors to include

Alex Fine afine@bcs.rochester.edu
Sun Jul 25 17:04:46 PDT 2010



Xiao He wrote:
> Dear R-lang users,
>
> I have a set of binomial data on the stress production of second 
> language learners of English on three types of compound words. If they 
> stressed the first word of a compound, they got 1; otherwise, they got 
> 0. Every participant saw all three lists of words.
>
> Some of the potential predictors are: 
> (1). Age of Exposure to English (AoE);
> (2). Length of Exposure to English (LE) (time of experiment - AoE);
> (3). Age of Arrival (AoA) in the US;
> (4). Length of Stay (LS) in the US (time of experiment - AoA);
>
> (5). CompoundType;
> (6). FamiliarityRating.
>
> Given that some of these predictors may be correlated (AoA and LS, and 
> AoE and LE, FamiliarityRating and CompoundType), before I tried to fit 
> a model, I examined these predictors to see if they were highly 
> correlated. 
> I used cor.test() to examine potential correlations. For example:
>
> cor.test(data$AoA, data$LS)
> the correlation I got was -0.44, and the p-value was very small.
>
> On the other hand, the correlation between AoE and LE was not that 
> strong (0.11), though it was statistically significant. So the obvious 
> problem is collinearity. From what I've read online, it seems that one 
> way to solve the collinearity problem is to decide which of the 
> correlated predictors can be removed, and one method for doing so is 
> model comparison. So should I do model comparisons like the 
> following?
>
> model1 = lmer(word1 ~ compoundType + AoA + (1|subject) + (1|word), family="binomial", data=data)
> model2 = lmer(word1 ~ compoundType + AoA + LS + (1|subject) + (1|word), family="binomial", data=data)
> model3 = lmer(word1 ~ compoundType + AoA + LS + AoE + (1|subject) + (1|word), family="binomial", data=data)
> model4 = lmer(word1 ~ compoundType + AoA + LS + AoE + LE + (1|subject) + (1|word), family="binomial", data=data)
> model5 = lmer(word1 ~ compoundType + AoA + LS + AoE + LE + familiarityRating + (1|subject) + (1|word), family="binomial", data=data)
>
> And then I use anova() to compare these models.
>
> However, I am not sure if this is the right way to do it, because 
> these models only test for main effects. I would also like to see if 
> the predictor "compoundType" interacts with any of the age-related 
> predictors, as well as with familiarityRating. But I am not really 
> sure how model comparisons can be done with four predictors; I've 
> only done model comparisons with two predictors.
Probably the first thing to do is check out Peter Graff's slides on 
model comparison:  http://www.bcs.rochester.edu/people/fjaeger/  (look 
for ModelComparisonTutorial).  I think these slides really do a good job 
of conveying the basics, and include an adorable bear, so you can't go 
wrong.
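Concretely, nested models can be compared with anova().  Here's a 
minimal sketch on made-up data; I'm using plain glm() so the example is 
self-contained, but the same anova(model1, model2) call works on lmer 
fits:

```r
# Sketch of a nested model comparison via a likelihood-ratio test.
# Data and coefficients here are made up for illustration.
set.seed(1)
n <- 200
d <- data.frame(
  AoA = rnorm(n, 10, 3),
  LS  = rnorm(n, 8, 2)
)
# Simulate a binary outcome that depends on AoA only
d$word1 <- rbinom(n, 1, plogis(-0.5 + 0.2 * d$AoA))

m1 <- glm(word1 ~ AoA,      family = binomial, data = d)
m2 <- glm(word1 ~ AoA + LS, family = binomial, data = d)

# Does adding LS significantly improve the fit?
anova(m1, m2, test = "Chisq")
```

If the chi-squared test is non-significant, the data don't justify 
keeping the extra predictor.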

You may also look into stepwise regression, where you start with all 
predictors in the model, then remove non-significant predictors one at a 
time until only those predictors justified by the data are left:  
http://en.wikipedia.org/wiki/Stepwise_regression
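For ordinary (non-mixed) models, base R's step() will do backward 
selection by AIC automatically.  A sketch on made-up data (note this 
won't work directly on lmer fits):

```r
# Backward stepwise selection by AIC with base R's step().
# All variable names and data are made up for illustration.
set.seed(2)
n <- 200
d <- data.frame(
  AoA = rnorm(n), LS = rnorm(n), AoE = rnorm(n), LE = rnorm(n)
)
# Only AoA actually influences the simulated outcome
d$word1 <- rbinom(n, 1, plogis(0.8 * d$AoA))

full    <- glm(word1 ~ AoA + LS + AoE + LE, family = binomial, data = d)
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)  # shows which predictors AIC retained
```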

Also, I just ran across this, but I don't know anything about it:  
http://www.rensenieuwenhuis.nl/r-sessions-32/
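On your interaction question: in R's formula syntax, `a * b` expands to 
both main effects plus the interaction, so you can test an interaction 
by comparing the main-effects model against the `*` model.  A sketch 
with made-up data (again using glm() for self-containedness; the same 
formulas work in lmer):

```r
# Testing a compoundType x AoA interaction via model comparison.
# Data is simulated; names follow the original question.
set.seed(3)
n <- 300
d <- data.frame(
  compoundType = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  AoA          = rnorm(n)
)
d$word1 <- rbinom(n, 1, plogis(0.5 * d$AoA))

main  <- glm(word1 ~ compoundType + AoA, family = binomial, data = d)
inter <- glm(word1 ~ compoundType * AoA, family = binomial, data = d)

# LR test for the interaction (2 df: 3-level factor x continuous)
anova(main, inter, test = "Chisq")
```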
>
> Another solution seems to be centering predictors. I've centered data 
> with 2 two-leveled factors before, following examples from Jaeger's 
> powerpoint slides, but I've never done centering with more than 2 
> factors, let alone when the predictors to be centered are numeric (or 
> interval).
Centering a numeric predictor just means subtracting the mean of that 
predictor from every value, so something like this:

data$cAoE <- data$AoE - mean(data$AoE, na.rm=TRUE)

so now you have a new, improved, centered AoE variable.

At least based on my experience, that should substantially reduce 
collinearity, and it doesn't matter how many predictors you're 
centering.  In general I think you should just center all continuous 
predictors (I don't know of a reason to ever not do this).
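If you have several numeric predictors, you can center them all in one 
go rather than one at a time.  A small sketch (column names follow the 
original question; the data is made up):

```r
# Center several numeric predictors at once by subtracting each
# column's mean.  Data and values are made up for illustration.
d <- data.frame(AoA = c(5, 10, 15), LS = c(2, 4, 6),
                AoE = c(3, 6, 9),   LE = c(1, 2, 3))
num <- c("AoA", "LS", "AoE", "LE")

# New columns cAoA, cLS, cAoE, cLE, each with mean 0
d[paste0("c", num)] <- lapply(d[num], function(x) x - mean(x, na.rm = TRUE))

colMeans(d[paste0("c", num)])  # all (numerically) zero
```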
>
> It would be great if someone could give me some suggestions for my 
> issues. Thank you in advance for your help!
Hope that helps.  I'm sure other people on this list can add a lot to this.

alex
>
>
> Best,
> Xiao He


More information about the ling-r-lang-L mailing list