[R-lang] Re: How to handle missing data when I try to log-transform my data
Reinhold Kliegl
kliegl@uni-potsdam.de
Wed Jun 23 23:52:38 PDT 2010
In response to the transformation issue raised by Roger (under a different subject line):
We just had a paper come out that addresses this issue in the context
of LMMs (Kliegl, Masson, & Richter, Visual Cognition; data and R
scripts are available, too). We report analyses in the original metric
as well as log and reciprocal transforms in the context of frequency
effects in masked repetition priming.
(1) Various assessments (boxcox, LMM residual plots) suggested that
the reciprocal transformation leads to an acceptable distribution of
errors; original and log values clearly do not. We have replicated
this result. (It does not hold for all types of RT experiments!)
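For readers who want to run this kind of assessment themselves, a minimal sketch in R might look like the following (the data frame `d` and the variables `rt`, `prime`, `freq`, and `subj` are illustrative names, not our actual data):

```r
library(MASS)   # for boxcox()
library(lme4)   # for lmer()

# Box-Cox profile: a lambda near -1 favours the reciprocal,
# near 0 the log, near 1 the original metric
boxcox(rt ~ prime * freq, data = d)

# Residual diagnostics for an LMM fitted on a candidate scale
# (here the negative reciprocal, which preserves the ordering of RTs)
m <- lmer(-1000 / rt ~ prime * freq + (prime + freq | subj), data = d)
qqnorm(resid(m)); qqline(resid(m))   # normality of residuals
plot(fitted(m), resid(m))            # homogeneity along fitted values
```

Refitting the same model on rt and log(rt) and comparing the residual plots is what suggested the reciprocal to us.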
(2) The transformations had very little (negligible) consequence for
inferences about fixed effects (main effects and interactions); this
supports Roger's intuition and that of much of the community.
(3) Here is the important point to consider: The transformations had a
very strong effect on the correlation of intercept, priming effect,
and frequency effect (the random-effect correlations in the LMM). Some
of these correlations even changed sign. So depending on the
transformation you infer a
positive, non-significant, or negative correlation between mean RT and
frequency effect. (The correlation between priming and frequency
effect stayed positive for all three variants.)
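Point (3) can be checked directly by refitting the same LMM on each scale and comparing the random-effect correlations; in lme4 syntax, roughly (again with illustrative names `d`, `rt`, `prime`, `freq`, `subj`):

```r
library(lme4)

fits <- list(
  raw   = lmer(rt         ~ prime * freq + (prime + freq | subj), data = d),
  log   = lmer(log(rt)    ~ prime * freq + (prime + freq | subj), data = d),
  recip = lmer(-1000 / rt ~ prime * freq + (prime + freq | subj), data = d)
)

# Fixed effects are typically stable across the three scales ...
lapply(fits, fixef)
# ... while the random-effect correlations (intercept vs. priming vs.
# frequency effect) can differ in magnitude and even in sign
lapply(fits, VarCorr)
```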
So what do you choose when it matters? Our recommendation was that, IN
THE ABSENCE OF A THEORY OR MODEL, it may be best to choose the
transformation that is in agreement with the statistical model we want
to apply. So if we use an LMM, we also have to use the reciprocal
transformation. Basically, in this case, the homogeneity of the error
distribution along the dimension of the dependent variable is used as
a criterion for equidistant units on the DV. (We do not want the
"yardstick" to become less precise when we measure large values.)
Now if your theory or model or conviction forces you to stay with a
different metric (say the original one), not all is lost either. You
could choose a GLMM, for example with a gamma distribution, which
generates the typical skew observed for RTs, that is, one where
variances increase with RT. Or you could use a Bayesian framework for
distributions outside the exponential family (e.g., Rouder et al.,
2005, Psychonomic Bulletin & Review, argue for a Weibull).
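A sketch of the GLMM alternative, with the same illustrative names as above; the identity link keeps the fixed effects on the original millisecond scale while the Gamma family lets the variance grow with the mean (in practice such fits can have convergence trouble and may need a simpler random-effect structure):

```r
library(lme4)

g <- glmer(rt ~ prime * freq + (1 | subj),
           family = Gamma(link = "identity"), data = d)
summary(g)
```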
At the outset and in the absence of a theory/model, I consider the
original metric, the log transform, and the reciprocal transform (and
possibly others) as equally plausible starting points. The metric we
are most familiar with has been handed down by cultural evolution. If
I subscribe to a linear model at this point, a defensible starting
point to me is to ensure equal standard deviations along the
measurement scale. I am happy to stand corrected.
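Incidentally, Roger's 2x2 example quoted below is easy to verify numerically: the raw condition means are exactly additive, but the same interaction contrast is nonzero after a (base-10) log transform.

```r
rt <- matrix(c(400, 600, 600, 800), nrow = 2, byrow = TRUE,
             dimnames = list(A = c("A1", "A2"), B = c("B1", "B2")))

# Interaction contrast on the raw scale: (800 - 600) - (600 - 400)
raw_int <- rt[2, 2] - rt[2, 1] - (rt[1, 2] - rt[1, 1])   # 0

# The same contrast after log10: nonzero, i.e. an interaction appears
log_int <- log10(rt[2, 2]) - log10(rt[2, 1]) -
           (log10(rt[1, 2]) - log10(rt[1, 1]))           # about -0.051
```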
Reinhold Kliegl
On 24.06.2010, at 05:25, Roger Levy wrote:
>
> On Jun 23, 2010, at 12:10 PM, Hugo Quené wrote:
>
>> BTW, typical RTs are not normally distributed, so that a
>> transformation is often necessary, using e.g. log(RT) or 1/RT as
>> your dependent variable.
>
> I am going to say something that may be somewhat controversial but I
> hope that it spurs discussion. I tend to advise *against*
> indiscriminately applying log or other transforms of RTs (or other
> continuously-distributed dependent variables, for that matter) prior
> to regression analysis in the name of compensating for non-
> normality. In my mind, the most compelling reason to transform a
> variable is to get as close as possible to the functional form
> between the independent and dependent variables that is either
> theoretically relevant or is known to exist empirically. Applying a
> non-linear transform such as log or inverse to a dependent variable
> can easily break such a functional form. As a concrete example, in
> a controlled 2x2 experiment with the following condition-mean RTs:
>
> B1 B2
> A1 400 600
> A2 600 800
>
> the (base-10) log-transformed means would be around
>
> B1 B2
> A1 2.60 2.78
> A2 2.78 2.90
>
> In order to assess whether there is an interaction between factors A
> and B, one has to answer the following question: is RT or log-RT the
> "correct" scale in which to interpret the effects? My belief is
> that the answer to this question should be the chief guiding
> principle in determining whether to transform RT before regression
> analysis or ANOVA.
>
> In terms of the negative consequences of deviation from normality on
> the inferences coming out of the analysis, there is probably some
> loss of power that comes with using linear regression on raw RT
> data, which are heavy-tailed and skewed. But my own experience --
> both with empirical data and with artificial simulations -- is that
> the loss of power is pretty minimal. If anyone else has compelling
> simulations that demonstrate a substantial loss of statistical
> power, though, I'd be interested in seeing them!
>
> Best
>
> Roger
>
> --
>
> Roger Levy Email: rlevy@ling.ucsd.edu
> Assistant Professor Phone: 858-534-7219
> Department of Linguistics Fax: 858-534-4789
> UC San Diego Web: http://ling.ucsd.edu/~rlevy
----
Reinhold Kliegl, Dept. of Psychology, University of Potsdam,
Karl-Liebknecht-Strasse 24-25, 14476 Potsdam, Germany
phone: +49 331 977 2868, fax: +49 331 977 2793
http://www.psych.uni-potsdam.de/people/kliegl/