[R-lang] Re: What happens if I include a continuous variable?

T. Florian Jaeger tiflo@csli.stanford.edu
Fri Sep 17 08:13:09 PDT 2010


On Thu, Sep 16, 2010 at 9:00 PM, David Reitter <reitter@cmu.edu> wrote:

> On Sep 16, 2010, at 11:34 AM, Nathaniel Smith wrote:
> >
> >> One way to address collinearity is to regress out length from frequency
> >> first, e.g. creating testdata$FreqWithoutLen from something like
> >> resid(lm(Freq ~ Len)).
> >>
> >> I wonder if it would be OK to do stepwise regression, i.e. regressing
> >> out length from your response variable (lexdectime) first, and then fitting
> >> the main model.
> >
> > I think you're talking about residualization, not stepwise regression?
> > I would either regress out Length from *both* Freq and lexdectime, or
> > not regress it out at all.
>
> In the first sentence I talk about residualization, in the second I'm
> asking if a different approach would also be feasible.
>
> (1) In Florian's slides (I'm referring to the McGill lecture that he
> referenced, slides 43/44), we're taking the residuals from a model of the
> structure Freq ~ Len (applied to Roger's example).  These residuals are used
> as a predictor in the original model, as lexdectime ~ residual  (lexdectime is
> presumably an RT).
>
> (2) The alternative is not really stepwise regression in the sense that
> you'd add main effects, and then interactions etc., but perhaps something
> like this:
>
> m1 <- lm(lexdectime ~ Len)
> m2 <- lm(resid(m1) ~ Freq)
> summary(m2)
>
> ... if we are interested in the effect of Freq, after Len has been
> accounted for.
> Copying from slide 45 in the above reference, one would still say "We have
> granted Len the entire portion of the variance that cannot unambiguously
> be attributed to either Freq or Len!".
> Comments appreciated.
>

I see. Yes, this is what self-paced reading people call residualization
(residualized reading times). I would definitely prefer model comparison
(stepwise regression), as it's more principled and works for a broader set
of scenarios. Self-paced reading folks have a separate reason to use
approach (2) above: they use the RTs from all trials, including fillers,
when calculating residualized RTs.
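
As a rough sketch of what I mean by model comparison here: fit the model
with Len only, fit the model with Len and Freq, and test whether adding Freq
improves the fit. The data below are simulated stand-ins for Roger's example;
the data frame d, the effect sizes, and the seed are made up for
illustration.

```r
# Simulated stand-in data: all names and effect sizes are assumptions.
set.seed(1)
n    <- 200
Len  <- rnorm(n)
Freq <- -0.6 * Len + rnorm(n, sd = 0.8)   # Freq and Len are collinear
lexdectime <- 600 + 20 * Len - 30 * Freq + rnorm(n, sd = 50)
d <- data.frame(lexdectime, Len, Freq)

m.len  <- lm(lexdectime ~ Len, data = d)          # Len only
m.full <- lm(lexdectime ~ Len + Freq, data = d)   # Len plus Freq

# F-test on the nested models: does Freq help once Len is in the model?
cmp <- anova(m.len, m.full)
print(cmp)
```

The comparison credits Freq only with the variance it explains beyond Len,
which is the same division of credit that approach (2) is aiming for.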

My feeling is further that approach (2) will not work when you have
additional variables in m1, as the residuals will be the result of all those
predictors, not just the one of interest. But I am not entirely sure. I
assume that as long as we are talking about a linear model this might not
matter (as soon as the right-hand side of m1 is subjected to a non-linear
transformation, the relation to model comparison becomes much more
complicated).
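
One way to probe the linear case is a small simulation that puts an extra
predictor into m1 and compares the Freq coefficient across three fits: the
full model, approach (2), and the version Nathaniel suggested where Len is
regressed out of both sides. Everything here is simulated, and Neigh is a
hypothetical extra control predictor. If I have the algebra right (this is
the Frisch-Waugh-Lovell result), residualizing both sides recovers the
full-model coefficient, while residualizing only the response shrinks it by
the share of Freq's variance that the other predictors do not explain.

```r
# Simulated stand-in data; Neigh is a hypothetical extra control predictor.
set.seed(2)
n     <- 500
Len   <- rnorm(n)
Neigh <- 0.5 * Len + rnorm(n)
Freq  <- -0.6 * Len + 0.3 * Neigh + rnorm(n)
lexdectime <- 600 + 20 * Len + 10 * Neigh - 30 * Freq + rnorm(n, sd = 50)

# Full model: the reference estimate for Freq.
b.full <- coef(lm(lexdectime ~ Len + Neigh + Freq))["Freq"]

# Approach (2): residualize only the response, then regress on raw Freq.
m1     <- lm(lexdectime ~ Len + Neigh)
b.resp <- coef(lm(resid(m1) ~ Freq))["Freq"]

# Residualize both the response and Freq on the same predictors.
f1     <- lm(Freq ~ Len + Neigh)
rF     <- resid(f1)
b.both <- coef(lm(resid(m1) ~ rF))["rF"]

# Share of Freq's variance not explained by Len and Neigh.
shrink <- 1 - summary(f1)$r.squared

print(c(full              = unname(b.full),
        response.only     = unname(b.resp),
        both              = unname(b.both),
        full.times.shrink = unname(b.full) * shrink))
```

Note that this only compares point estimates; the standard errors and
degrees of freedom of the residualized fits are a separate question.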

In any case, approach (2) will not work when you, as Roger asked, want to
also know whether the interaction between length and frequency is relevant.
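
With model comparison the interaction question is just one more nested
comparison. A minimal sketch on simulated data (the effect sizes, including
the size of the interaction, are assumptions for illustration):

```r
# Simulated stand-in data with a built-in Len:Freq interaction.
set.seed(3)
n    <- 300
Len  <- rnorm(n)
Freq <- -0.6 * Len + rnorm(n, sd = 0.8)
lexdectime <- 600 + 20 * Len - 30 * Freq + 15 * Len * Freq + rnorm(n, sd = 50)

m.add <- lm(lexdectime ~ Len + Freq)   # main effects only
m.int <- lm(lexdectime ~ Len * Freq)   # main effects plus Len:Freq

# F-test for the single added interaction term.
cmp.int <- anova(m.add, m.int)
print(cmp.int)
```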

Have you ever run a power simulation to compare the two methods when there
are more than the two variables of interest? That would be interesting.
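
Something along these lines is what I have in mind, as a starting point
only. The function one.run, the predictor Neigh, the sample size, the effect
sizes, and the number of runs are all placeholders. It records how often each
method finds Freq significant at the .05 level.

```r
# Sketch of a power simulation; all settings below are placeholders.
set.seed(4)

one.run <- function(n = 100, b.freq = -10) {
  Len   <- rnorm(n)
  Neigh <- 0.5 * Len + rnorm(n)                 # hypothetical extra predictor
  Freq  <- -0.6 * Len + 0.3 * Neigh + rnorm(n)
  rt    <- 600 + 20 * Len + 10 * Neigh + b.freq * Freq + rnorm(n, sd = 50)

  # p-value for Freq in the full model
  full <- lm(rt ~ Len + Neigh + Freq)
  p.full <- summary(full)$coefficients["Freq", "Pr(>|t|)"]

  # p-value for Freq under approach (2): residualize the response first
  res <- lm(resid(lm(rt ~ Len + Neigh)) ~ Freq)
  p.resid <- summary(res)$coefficients["Freq", "Pr(>|t|)"]

  c(full = p.full, resid = p.resid)
}

p     <- replicate(500, one.run())   # 2 x 500 matrix of p-values
power <- rowMeans(p < 0.05)          # proportion of significant runs
print(power)
```

Running the same thing with b.freq = 0 would give the Type I error rate of
each method, which is needed before the power numbers can be compared
fairly.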

Florian