[R-lang] Re: How to handle missing data when I try to log-transform my data
bogartz@psych.umass.edu
Thu Jun 24 07:18:38 PDT 2010
Alternatively, one might consider fitting the ex-Gaussian
using the gamlss package.
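A minimal sketch of such a fit, assuming a data frame mydata with an RT
column rt and a condition factor cond (all names hypothetical):

```r
# Sketch: fitting an ex-Gaussian to RTs with gamlss.
# exGAUS() parameterizes the distribution by mu (Gaussian mean),
# sigma (Gaussian SD), and nu (mean of the exponential component).
library(gamlss)

fit <- gamlss(rt ~ cond,            # mu depends on condition
              sigma.formula = ~ 1,  # Gaussian SD held constant
              nu.formula = ~ 1,     # exponential component held constant
              family = exGAUS(),
              data = mydata)
summary(fit)
```

Letting sigma or nu also depend on cond (e.g. sigma.formula = ~ cond) tests
whether conditions differ in spread or skew rather than just in the mean.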
Quoting Reinhold Kliegl <kliegl@uni-potsdam.de>:
> In response to the transformation issue raised by Roger (different RE?)
>
> We just had a paper come out that addresses this issue in the
> context of LMMs (Kliegl, Masson, & Richter, Visual Cognition; data
> and R scripts are available, too). We report analyses in the
> original metric as well as log and reciprocal transforms in the
> context of frequency effects in masked repetition priming.
> (1) Various assessments (boxcox, LMM residual plots) suggested that
> the reciprocal transformation leads to an acceptable distribution of
> errors; original and log values clearly do not. We have replicated
> this result. (It does not hold for all types of RT experiments!)
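The boxcox assessment mentioned in (1) can be run with the MASS package; a
sketch, assuming hypothetical variables rt, prime, and freq in a data frame
mydata:

```r
# Sketch: Box-Cox assessment of candidate transformations.
# lambda near -1 favours the reciprocal, near 0 the log, near 1 no transform.
library(MASS)

bc <- boxcox(rt ~ prime * freq, data = mydata,
             lambda = seq(-2, 1, by = 0.1))
bc$x[which.max(bc$y)]  # lambda with the highest profile log-likelihood
```

Residual diagnostics for a fitted lme4 model m give a complementary check,
e.g. qqnorm(resid(m)) and plot(fitted(m), resid(m)).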
> (2) The transformations had negligible consequences
> for inferences about fixed effects (main effects and interactions);
> this supports Roger's intuition and that of much of the community.
> (3) Here is the important point to consider: The transformations had
> a very strong effect on the correlations of intercept, priming
> effect, and frequency effect (random-effect correlations in the LMM).
> Some of these correlations even changed sign. So depending on the
> transformation you infer a positive, non-significant, or negative
> correlation between mean RT and the frequency effect. (The correlation
> between the priming and frequency effects stayed positive for all
> three variants.)
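This comparison can be reproduced in outline with lme4; a sketch under
hypothetical names (rt in ms, factors prime and freq, grouping factor subj):

```r
# Sketch: the same random-effect structure fitted on three metrics.
# The point of interest is how the random-effect correlations in
# VarCorr() shift (and can change sign) across transformations.
library(lme4)

m_raw <- lmer(rt       ~ prime * freq + (prime + freq | subj), data = mydata)
m_log <- lmer(log(rt)  ~ prime * freq + (prime + freq | subj), data = mydata)
m_rec <- lmer(-1000/rt ~ prime * freq + (prime + freq | subj), data = mydata)

lapply(list(raw = m_raw, log = m_log, reciprocal = m_rec), VarCorr)
```

(-1000/rt is a common scaling of the reciprocal that keeps the sign of
effects aligned with raw RT and the numbers in a readable range.)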
>
> So what do you choose when it matters? Our recommendation was that,
> IN THE ABSENCE OF A THEORY OR MODEL, it may be best to choose the
> transformation that is in agreement with the statistical model we
> want to apply. So if we use an LMM, we also have to use the
> reciprocal transformation. Basically, in this case, the
> homogeneity of the error distribution along the dimension of the
> dependent variable is used as a criterion for equidistant units on
> the DV. (We do not want the "yardstick" to become less precise when
> we measure large values.)
>
> Now if your theory or model or conviction forces you to stay with a
> different metric (say the original one), not all is lost either. You
> could choose a GLMM, for example with a gamma distribution, which
> generates the typical skew observed for RTs, that is, one where
> variances increase with RT. Or you could use a Bayesian framework for
> distributions outside the exponential family (e.g., Rouder et al.,
> 2005, Psychonomic Bulletin & Review, argue for a Weibull).
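The gamma GLMM option can be sketched with lme4::glmer; variable names are
hypothetical, and the identity link keeps effects additive in milliseconds:

```r
# Sketch: a GLMM on the raw RT metric with a Gamma response, whose
# variance grows with the mean, matching the typical RT skew.
library(lme4)

g <- glmer(rt ~ prime * freq + (1 | subj),
           family = Gamma(link = "identity"),
           data = mydata)
summary(g)
```

Identity-link Gamma models can be fiddly to fit; a log link (the glmer
default for Gamma is inverse) is a common fallback, at the price of effects
becoming multiplicative.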
>
> At the outset and in the absence of a theory/model, I consider the
> original metric, the log transform, and the reciprocal transform
> (and possibly others) as equally plausible starting points. The
> metric we are most familiar with has been handed down by cultural
> evolution. If I subscribe to a linear model at this point, a
> defensible starting point to me is to ensure equal standard
> deviations along the measurement scale. I am happy to stand
> corrected.
>
> Reinhold Kliegl
>
>
> On 24.06.2010, at 05:25, Roger Levy wrote:
>
>>
>> On Jun 23, 2010, at 12:10 PM, Hugo Quené wrote:
>>
>>> BTW, typical RTs are not normally distributed, so that a
>>> transformation is often necessary, using e.g. log(RT) or 1/RT as
>>> your dependent variable.
>>
>> I am going to say something that may be somewhat controversial but
>> I hope that it spurs discussion. I tend to advise *against*
>> indiscriminately applying log or other transforms of RTs (or other
>> continuously-distributed dependent variables, for that matter)
>> prior to regression analysis in the name of compensating for
>> non-normality. In my mind, the most compelling reason to transform a
>> variable is to get as close as possible to the functional form
>> between the independent and dependent variables that is either
>> theoretically relevant or is known to exist empirically. Applying
>> a non-linear transform such as log or inverse to a dependent
>> variable can easily break such a functional form. As a concrete
>> example, in a controlled 2x2 experiment with the following
>> condition-mean RTs:
>>
>> B1 B2
>> A1 400 600
>> A2 600 800
>>
>> the (base-10) log-transformed means would be around
>>
>> B1 B2
>> A1 2.60 2.78
>> A2 2.78 2.90
>>
>> In order to assess whether there is an interaction between factors
>> A and B, one has to answer the following question: is RT or log-RT
>> the "correct" scale in which to interpret the effects? My belief
>> is that the answer to this question should be the chief guiding
>> principle in determining whether to transform RT before regression
>> analysis or ANOVA.
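Roger's 2x2 example can be checked numerically; a short sketch:

```r
# Sketch: additive condition means on the raw ms scale, but a
# (negative) interaction appears after log transformation.
rt <- matrix(c(400, 600, 600, 800), nrow = 2,
             dimnames = list(A = c("A1", "A2"), B = c("B1", "B2")))

# The 2x2 interaction contrast: m[1,1] - m[1,2] - m[2,1] + m[2,2]
interaction_contrast <- function(m) m[1, 1] - m[1, 2] - m[2, 1] + m[2, 2]

interaction_contrast(rt)         # 0: perfectly additive in ms
interaction_contrast(log10(rt))  # about -0.05: interaction on the log scale
```

So whichever scale is "correct" decides whether this data pattern counts as
an interaction at all.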
>>
>> In terms of the negative consequences of deviation from normality
>> on the inferences coming out of the analysis, there is probably
>> some loss of power that comes with using linear regression on raw
>> RT data, which are heavy-tailed and skewed. But my own experience
>> -- both with empirical data and with artificial simulations -- is
>> that the loss of power is pretty minimal. If anyone else has
>> compelling simulations that demonstrate a substantial loss of
>> statistical power, though, I'd be interested in seeing them!
>>
>> Best
>>
>> Roger
>>
>> --
>>
>> Roger Levy Email: rlevy@ling.ucsd.edu
>> Assistant Professor Phone: 858-534-7219
>> Department of Linguistics Fax: 858-534-4789
>> UC San Diego Web: http://ling.ucsd.edu/~rlevy
>
> ----
> Reinhold Kliegl, Dept. of Psychology, University of Potsdam,
> Karl-Liebknecht-Strasse 24-25, 14476 Potsdam, Germany
> phone: +49 331 977 2868, fax: +49 331 977 2793
> http://www.psych.uni-potsdam.de/people/kliegl/
>
Richard S. Bogartz
Professor of Psychology
UMASS, Amherst 01003
More information about the ling-r-lang-L mailing list