[R-lang] Re: How to handle missing data when I try to log-transform my data

Wed Jun 23 11:47:24 PDT 2010

Hi Xiao,

Let me just add to the chorus of voices saying that removing the observation is better than replacing it with a 0ms response.  One additional point concerns your worry that removing observations creates imbalance in a traditional pair of by-subjects and by-items ANOVAs.  Since the quantities entered in your ANOVAs are always means of a number of observations (e.g., in a by-subjects ANOVA, the raw quantities are per-subject means averaging across items in a given condition), removing a few outliers doesn't generally create true imbalance in the ANOVA (unless you're unlucky enough that *all* the observations in a given condition for some subject or item are outliers, which is pretty rare).  Rather, removing the outliers is a source of inhomogeneity of variance, because the variance of an average depends on the number of observations being averaged.  But traditional linear ANOVA is fairly robust to mild departures from homogeneity of variance, and at any rate there are lots of other sources of inhomogeneity of variance in any self-paced reading study.  So unless you have quite a large number of RTs below 100ms and/or they are not evenly distributed across conditions, I would tend not to worry too much about the consequences of the missing observations for the inferences coming out of your traditional ANOVAs.
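
To make the point about cell means concrete, here is a minimal sketch on made-up data (the data frame `d`, its column names, and the planted fast responses are all hypothetical, not taken from your experiment).  After trimming, every subject-by-condition cell still has a mean, so the by-subjects design stays balanced; only the number of observations behind each mean differs, which is where the unequal variances would come from (for independent observations with equal variance, the variance of a mean is sigma^2/n).

```r
# Hypothetical toy design: 4 subjects x 2 conditions x 6 items.
# All names and values are illustrative assumptions.
d <- expand.grid(subject   = factor(1:4),
                 condition = factor(c("a", "b")),
                 item      = factor(1:6))
d$RT <- 300 + 10 * seq_len(nrow(d))     # deterministic placeholder RTs in ms
d$RT[c(3, 17, 30)] <- c(40, 85, 62)     # plant three implausibly fast responses

# Drop the sub-100ms trials instead of recoding them as 0
trimmed <- d[d$RT >= 100, ]

# Per-subject, per-condition means: the quantities a by-subjects ANOVA uses
cell_means <- aggregate(RT ~ subject + condition, data = trimmed, FUN = mean)

# Number of trials behind each mean; unequal counts imply unequal variances
cell_n <- aggregate(RT ~ subject + condition, data = trimmed, FUN = length)

print(cell_means)
print(cell_n)
```

The same idea carries over to the by-items analysis by aggregating over `item` instead of `subject`.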

Roger

On Jun 23, 2010, at 11:33 AM, Xiao He wrote:

> Hi all,
> 
> Thank you for the helpful explanations. I really do not know the theoretical reasons behind replacing values smaller than 100ms with 0. I vaguely recall that I read a couple of journal articles where the authors resorted to this method. After reading what you guys wrote, I do agree that such replacement would introduce more bias. The reason why I was thinking about replacement was also partly because I just started using lmer() not long ago, and coming from aov(), I guess I was more concerned about unbalanced data and missing values and did not think about the flexibility and power of lmer(). But thank you again for your help. I will do what you guys have suggested. :-)
> 
> 
> 
> Best,
> Xiao
> 
> 
> On Wed, Jun 23, 2010 at 11:08 AM, Scott Jackson <scottuba@gmail.com> wrote:
> I may be missing something, but I'm not sure why you would want to
> replace sub-100ms RTs with 0ms.  Surely that introduces much more bias
> than if you had just left the too-low RTs in?
> 
> I agree with the others that simply rejecting trials with too-low RTs
> is probably the way to go.  At least, that's a common practice,
> especially if you're using lmer or another analysis that does not
> require balanced data.
> 
> Alternatively, depending on what kind of data you have, you might try
> a more sophisticated missing-data imputation technique, like multiple
> imputation.  There's a very nice R package for imputation called
> "mice" that I have been using extensively lately, which has very
> readable and helpful documentation, even if you're new to multiple
> imputation.  There's another package called "Amelia" that I have not
> used, but which also has excellent-looking documentation.
> 
> If you go the route of substituting NAs for too-short times instead of
> completely deleting that observation from your data.frame, here's a
> tip.  Plain logical-index assignment does this:
> 
> data$RT[data$RT < 100] <- NA
> 
> The following is an equivalent way to write it:
> 
> is.na(data$RT[data$RT < 100]) <- TRUE
> 
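A small check of the two NA-assignment forms quoted above, on a made-up data frame (the values are hypothetical; `data` and `RT` simply mirror the names in the quoted code).  As far as I know, both forms mark the same cells as NA in current versions of R, and taking the log afterwards yields NA for those trials rather than -Inf.  This is only a sketch on a column with no pre-existing NAs; it is worth re-running on your own data.

```r
# Hypothetical toy data; names mirror the quoted snippet above
data <- data.frame(RT = c(250, 80, 410, 95, 330))

a <- data
a$RT[a$RT < 100] <- NA               # plain logical-index assignment

b <- data
is.na(b$RT[b$RT < 100]) <- TRUE      # the `is.na<-` replacement form

print(identical(a$RT, b$RT))         # compare the two results

# log() propagates NA, so the discarded trials never become -Inf
logRT <- log(a$RT)
print(logRT)

# Summaries then need na.rm = TRUE to skip the discarded trials
print(mean(logRT, na.rm = TRUE))
```
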
> good luck,
> -scott
> 
> On Wed, Jun 23, 2010 at 1:24 PM, Xiao He <praguewatermelon@gmail.com> wrote:
> > Dear R-lang users
> > I have a question that is, I suppose, less related to the use of R.
> > I have a set of self-paced reading data, and all the RTs that are below
> > 100ms are to be discarded. What I used to do when analyzing raw data was to
> > replace discarded values with 0. That was all simple and easy. But I
> > recently started to analyze log-transformed data. An issue then arises as to
> > how to handle missing data. Obviously, if I replace the discarded raw data
> > points with 0, log transformation does not work, as it will return "-Inf"
> > for obvious reasons. So I would like to know what you would suggest I do
> > in my case. Thank you very much in advance.
> >
> >
> > Xiao He
> >
> 

--

Roger Levy                      Email: rlevy@ling.ucsd.edu
Assistant Professor             Phone: 858-534-7219
Department of Linguistics       Fax:   858-534-4789
UC San Diego                    Web:   http://ling.ucsd.edu/~rlevy