[R-lang] Re: How to handle missing data when I try to log-transform my data
Reinhold Kliegl
kliegl@uni-potsdam.de
Wed Jun 23 23:52:38 PDT 2010
In response to the transformation issue raised by Roger (under a different subject line):
We just had a paper come out that addresses this issue in the context
of LMMs (Kliegl, Masson, & Richter, Visual Cognition; data and R
scripts are available, too). We report analyses in the original metric
as well as log and reciprocal transforms in the context of frequency
effects in masked repetition priming.
(1) Various assessments (boxcox, LMM residual plots) suggested that
the reciprocal transformation leads to an acceptable distribution of
errors; original and log values clearly do not. We have replicated
this result. (It does not hold for all types of RT experiments!)
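For readers who want to run this kind of assessment themselves, a minimal sketch in R might look like the following (the data frame `d` and the variables `rt`, `prime`, `freq`, and `subj` are illustrative names, not our actual data):

```r
library(MASS)   # for boxcox()
library(lme4)   # for lmer()

# Box-Cox profile: a lambda near -1 favours the reciprocal,
# near 0 the log, near 1 the original metric
boxcox(rt ~ prime * freq, data = d)

# Residual diagnostics for an LMM fitted on a candidate scale
# (here the negative reciprocal, which preserves the ordering of RTs)
m <- lmer(-1000 / rt ~ prime * freq + (prime + freq | subj), data = d)
qqnorm(resid(m)); qqline(resid(m))   # normality of residuals
plot(fitted(m), resid(m))            # homogeneity along fitted values
```

Refitting the same model on rt and log(rt) and comparing the residual plots is what suggested the reciprocal to us.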
(2) The transformations had very little (negligible) consequence for
inferences about fixed effects (main effects and interactions); this
supports Roger's intuition and that of much of the community.
(3) Here is the important point to consider: The transformations had a
very strong effect on the correlation of intercept, priming effect,
and frequency effect (the random-effect correlations in the LMM). Some
of these correlations even changed sign. So depending on the
transformation you infer a
positive, non-significant, or negative correlation between mean RT and
frequency effect. (The correlation between priming and frequency
effect stayed positive for all three variants.)
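Point (3) can be checked directly by refitting the same LMM on each scale and comparing the random-effect correlations; in lme4 syntax, roughly (again with illustrative names `d`, `rt`, `prime`, `freq`, `subj`):

```r
library(lme4)

fits <- list(
  raw   = lmer(rt         ~ prime * freq + (prime + freq | subj), data = d),
  log   = lmer(log(rt)    ~ prime * freq + (prime + freq | subj), data = d),
  recip = lmer(-1000 / rt ~ prime * freq + (prime + freq | subj), data = d)
)

# Fixed effects are typically stable across the three scales ...
lapply(fits, fixef)
# ... while the random-effect correlations (intercept vs. priming vs.
# frequency effect) can differ in magnitude and even in sign
lapply(fits, VarCorr)
```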
So what do you choose when it matters? Our recommendation was that, IN
THE ABSENCE OF A THEORY OR MODEL, it may be best to choose the
transformation that is in agreement with the statistical model we want
to apply. So if we use an LMM, we also have to use the reciprocal
transformation. Basically, in this case, the homogeneity of the error
distribution along the dimension of the dependent variable is used as
a criterion for equidistant units on the DV. (We do not want the
"yardstick" to become less precise when we measure large values.)
Now if your theory or model or conviction forces you to stay with a
different metric (say the original one), not all is lost either. You
could choose a GLMM, for example with a gamma distribution, which
generates the typical skew observed for RTs, that is, one where
variances increase with RT. Or you could use a Bayesian framework for
distributions outside the exponential family (e.g., Rouder et al.,
2005, Psychonomic Bulletin & Review, argue for a Weibull).
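A sketch of the GLMM alternative, with the same illustrative names as above; the identity link keeps the fixed effects on the original millisecond scale while the Gamma family lets the variance grow with the mean (in practice such fits can have convergence trouble and may need a simpler random-effect structure):

```r
library(lme4)

g <- glmer(rt ~ prime * freq + (1 | subj),
           family = Gamma(link = "identity"), data = d)
summary(g)
```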
At the outset and in the absence of a theory/model, I consider the
original metric, the log transform, and the reciprocal transform (and
possibly others) as equally plausible starting points. The metric we
are most familiar with has been handed down by cultural evolution. If
I subscribe to a linear model at this point, a defensible starting
point to me is to ensure equal standard deviations along the
measurement scale. I am happy to stand corrected.
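Incidentally, Roger's 2x2 example quoted below is easy to verify numerically: the raw condition means are exactly additive, but the same interaction contrast is nonzero after a (base-10) log transform.

```r
rt <- matrix(c(400, 600, 600, 800), nrow = 2, byrow = TRUE,
             dimnames = list(A = c("A1", "A2"), B = c("B1", "B2")))

# Interaction contrast on the raw scale: (800 - 600) - (600 - 400)
raw_int <- rt[2, 2] - rt[2, 1] - (rt[1, 2] - rt[1, 1])   # 0

# The same contrast after log10: nonzero, i.e. an interaction appears
log_int <- log10(rt[2, 2]) - log10(rt[2, 1]) -
           (log10(rt[1, 2]) - log10(rt[1, 1]))           # about -0.051
```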
Reinhold Kliegl
On 24.06.2010, at 05:25, Roger Levy wrote:
>
> On Jun 23, 2010, at 12:10 PM, Hugo Quené wrote:
>
>> BTW, typical RTs are not normally distributed, so that a
>> transformation is often necessary, using e.g. log(RT) or 1/RT as
>> your dependent variable.
>
> I am going to say something that may be somewhat controversial but I
> hope that it spurs discussion. I tend to advise *against*
> indiscriminately applying log or other transforms of RTs (or other
> continuously-distributed dependent variables, for that matter) prior
> to regression analysis in the name of compensating for non-
> normality. In my mind, the most compelling reason to transform a
> variable is to get as close as possible to the functional form
> between the independent and dependent variables that is either
> theoretically relevant or is known to exist empirically. Applying a
> non-linear transform such as log or inverse to a dependent variable
> can easily break such a functional form. As a concrete example, in
> a controlled 2x2 experiment with the following condition-mean RTs:
>
> B1 B2
> A1 400 600
> A2 600 800
>
> the (base-10) log-transformed means would be around
>
> B1 B2
> A1 2.60 2.78
> A2 2.78 2.90
>
> In order to assess whether there is an interaction between factors A
> and B, one has to answer the following question: is RT or log-RT the
> "correct" scale in which to interpret the effects? My belief is
> that the answer to this question should be the chief guiding
> principle in determining whether to transform RT before regression
> analysis or ANOVA.
>
> In terms of the negative consequences of deviation from normality on
> the inferences coming out of the analysis, there is probably some
> loss of power that comes with using linear regression on raw RT
> data, which are heavy-tailed and skewed. But my own experience --
> both with empirical data and with artificial simulations -- is that
> the loss of power is pretty minimal. If anyone else has compelling
> simulations that demonstrate a substantial loss of statistical
> power, though, I'd be interested in seeing them!
>
> Best
>
> Roger
>
> --
>
> Roger Levy Email: rlevy@ling.ucsd.edu
> Assistant Professor Phone: 858-534-7219
> Department of Linguistics Fax: 858-534-4789
> UC San Diego Web: http://ling.ucsd.edu/~rlevy
----
Reinhold Kliegl, Dept. of Psychology, University of Potsdam,
Karl-Liebknecht-Strasse 24-25, 14476 Potsdam, Germany
phone: +49 331 977 2868, fax: +49 331 977 2793
http://www.psych.uni-potsdam.de/people/kliegl/