[R-lang] Re: Removing collinearity from control variables

Sun Jan 9 19:06:00 PST 2011

A quick tip to save a little typing: you can center a predictor using scale(data$Frequency, scale=FALSE). If you're keen on keeping the number of columns in your dataset to a minimum, scale() can also be applied directly to the predictors in the model formula.
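
Here is a minimal sketch of both variants. The data frame, the column names RT and Frequency, and the simulated values are assumptions for illustration only, not the ELP data.

```r
# Simulated stand-in data; column names and values are hypothetical
set.seed(1)
data <- data.frame(Frequency = rnorm(100, mean = 5, sd = 2))
data$RT <- 600 - 20 * data$Frequency + rnorm(100, sd = 30)

# Variant 1: store a centered copy as a new column.
# scale() returns a one-column matrix, so as.numeric() flattens it.
data$cFrequency <- as.numeric(scale(data$Frequency, scale = FALSE))

# Variant 2: center inside the model formula, no extra column needed
m1 <- lm(RT ~ cFrequency, data = data)
m2 <- lm(RT ~ scale(Frequency, scale = FALSE), data = data)

# The two fits should agree on the slope; only the coefficient name differs
slopes <- c(unname(coef(m1)[2]), unname(coef(m2)[2]))
```

One caveat with the in-formula variant: the coefficient is labelled with the whole scale(...) call, which can make summary tables harder to read.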

For what it's worth, I'd second the point that there's no real need to worry about collinearity if it's *just* between your controls. If it's between controls and predictors of interest (PoIs), then that's more of an issue, and there are a few different approaches for trying to take care of it (I've been residualizing against PoIs, as the direction of causality is pretty clear).
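
A rough sketch of what residualizing could look like, assuming it means regressing a control on the predictor of interest and keeping the residuals as the new control. The variable names (poi, control, y) and the simulated data are made up for illustration.

```r
set.seed(2)
n <- 200
poi <- rnorm(n)                             # hypothetical predictor of interest
control <- 0.7 * poi + rnorm(n, sd = 0.5)   # control correlated with the PoI
y <- 1 + 0.5 * poi + 0.3 * control + rnorm(n)

# Regress the control on the PoI and keep what the PoI cannot explain
control_resid <- resid(lm(control ~ poi))

cor_before <- cor(poi, control)        # substantial correlation
cor_after  <- cor(poi, control_resid)  # zero up to floating point error

# Use the residualized control in place of the original one
m <- lm(y ~ poi + control_resid)
```

Note what this does to the interpretation: the PoI coefficient in m is the same as in a model with the PoI alone, so the PoI gets credit for all the variance it shares with the control. That is only reasonable when the direction of causality really is clear.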

Ian

________________________________

From: ling-r-lang-l-bounces@mailman.ucsd.edu on behalf of Alex Fine
Sent: Mon 10/01/2011 00:52
To: Ariel M. Goldberg
Cc: ling-r-lang-l@mailman.ucsd.edu
Subject: [R-lang] Re: Removing collinearity from control variables

In general, a very simple step you can take to reduce collinearity
(mainly between a predictor and any interactions or higher-order terms
built from it) is to center your predictors, i.e. subtract the mean from
each predictor (e.g. for a continuous predictor "data$cFrequency <-
data$Frequency - mean(data$Frequency, na.rm=TRUE)"; do the little na.rm
thing in case you have NAs for that variable).
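
To see where centering makes a difference, here is a small
simulated sketch (the variables x and z are hypothetical). It compares
the correlation between a predictor and its square before and after
centering, and also checks the correlation between two distinct
predictors, which centering leaves unchanged.

```r
set.seed(3)
x <- rnorm(500, mean = 10, sd = 2)   # hypothetical uncentered predictor
z <- 0.5 * x + rnorm(500)            # a second predictor correlated with x

# Manual centering, as in the one-liner above
cx <- x - mean(x, na.rm = TRUE)
cz <- z - mean(z, na.rm = TRUE)

# Predictor vs. its own square: centering removes most of the correlation
cor_raw      <- cor(x, x^2)
cor_centered <- cor(cx, cx^2)

# Two distinct predictors: centering does not change their correlation
cor_xz_raw      <- cor(x, z)
cor_xz_centered <- cor(cx, cz)
```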

That being said, it's not really clear why you need to worry about
collinearity within your control variables in this case.  If all you
want to do is see whether some additional variable is significant after
controlling for x, y, and z, and that additional variable is itself not
collinear with x, y, and z, then collinearity within x, y, and z will
not affect the model's ability to estimate either the direction or
significance of the new variable(s).
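
A quick simulation along these lines, with made-up variables
(ctrl_x, ctrl_y, ctrl_z as nearly collinear controls, new_var as the
variable of interest, rt as the response), is one way to check this
for yourself. It is a sketch under those assumptions, not a general
proof.

```r
set.seed(4)
n <- 1000
ctrl_x <- rnorm(n)
ctrl_y <- ctrl_x + rnorm(n, sd = 0.1)   # nearly collinear with ctrl_x
ctrl_z <- ctrl_x + rnorm(n, sd = 0.1)   # nearly collinear with ctrl_x
new_var <- rnorm(n)                     # generated independently of the controls

# Simulated response; the true effect of new_var is 0.5
rt <- 1 + ctrl_x + ctrl_y + ctrl_z + 0.5 * new_var + rnorm(n)

m <- lm(rt ~ ctrl_x + ctrl_y + ctrl_z + new_var)
est <- coef(m)
se  <- summary(m)$coefficients[, "Std. Error"]

# Standard errors of the controls are inflated by their collinearity,
# while the standard error of new_var stays small
se_ratio <- se["ctrl_y"] / se["new_var"]
```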

Given all that, there's probably also no reason to bother with PCA in
this case.

hope that helps,
alex

Ariel M. Goldberg wrote:
> Dear R-langers,
>
> I am working with data from the English Lexicon Project and am using the variables described by Baayen, Feldman & Schreuder (2006) to control for the basic factors that influence reading time (frequency, length, etc.).  My goal is to determine if other variables are significant after having controlled for these factors.  I'd like to remove the collinearity from Baayen et al.'s variable set and I was wondering if you had any suggestions as to what might be the best way to do this.  I was thinking that PCA might be the best, particularly since I'm not concerned with the interpretation of variables at all.  Do you think that's a good way to go about it?
>
> Also, if PCA is good, I have a quick question.  Do I use all the principal components it creates, in order to account for 100% of the variance?  I think this makes sense since, again, I'm not trying to interpret the various components.
>
> Thanks!
> Ariel
