[R-lang] Re: How to handle missing data when I try to log-transform my data

Nathaniel Smith njs@pobox.com
Wed Jun 23 10:59:15 PDT 2010


A slightly fancier approach would be to use R's built-in support for
missing values, "NA". The way this works is that NA is a special value
that means "this data point is not available". Anywhere R accepts
a number or a string or whatever, it also accepts NA, and most R
functions know how to do something more-or-less sensible with it.

For example:
> log(c(150, 300, NA, 250))
[1] 5.010635 5.703782       NA 5.521461
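
Applied to the original question, a minimal sketch might look like the
following. The vector name and the values are made up for illustration;
the only thing taken from the thread is the 100 ms cutoff.

```r
# Hypothetical reading times in ms; values under 100 ms are to be discarded.
rt <- c(150, 300, 80, 250)

# Mark the discarded values as missing instead of replacing them with 0.
rt[rt < 100] <- NA

log_rt <- log(rt)           # the NA is carried through the transform
mean(log_rt)                # NA, because one value is missing
mean(log_rt, na.rm = TRUE)  # mean of the three remaining values
```

Summary functions like mean() return NA unless you pass na.rm = TRUE,
which is a useful reminder that some data points were dropped.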

The way that modeling functions (lm, lmer, etc.) handle NA by default
is that they throw out any rows of your data that have an NA in a
variable used in the formula, so for something like reaction times
this will end up being basically the same as what Matt suggested. But
it can be more useful in other cases
-- like say you have two predictors Age and TimeOfDay (like, I don't
know, you think older people will have slower reaction times, and you
think people are fastest in the morning). You always know what time
you ran each subject, but some people declined to tell you their age.
If you code those people's ages as NA, then
    lm(RT ~ TimeOfDay)
will automatically analyze *all* subjects' data, while a formula that
includes "Age" like
    lm(RT ~ TimeOfDay + Age)
will automatically analyze just the subset of your subjects who were
willing to tell you their age.
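
One way to check this for yourself is to count the rows each fit
actually used. This is only a sketch: the data frame below is invented
for illustration, and it assumes the default na.action setting
(na.omit) has not been changed.

```r
# Hypothetical data: six subjects, two of whom declined to give their age.
d <- data.frame(
  RT        = c(420, 455, 510, 480, 530, 465),
  TimeOfDay = c(9, 10, 14, 16, 11, 15),
  Age       = c(23, NA, 41, 35, NA, 29)
)

m_time <- lm(RT ~ TimeOfDay, data = d)        # Age is not in the formula
m_both <- lm(RT ~ TimeOfDay + Age, data = d)  # Age is in the formula

nobs(m_time)  # number of rows used by the first model
nobs(m_both)  # number of rows used by the second model
```

Comparing the two nobs() results shows how many subjects were dropped
because of a missing Age.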

-- Nathaniel

On Wed, Jun 23, 2010 at 10:48 AM, Matt Goldrick
<matt-goldrick@northwestern.edu> wrote:
> Dear Xiao
> Why don't you simply discard the missing values? I don't know what type of
> analyses you're doing, but mixed-effects regression models are generally
> robust to imbalanced data, so there's no need to keep the number of
> observations fixed across conditions.
> HTH,
> Matt
> On Wed, Jun 23, 2010 at 12:24 PM, Xiao He <praguewatermelon@gmail.com>
> wrote:
>>
>> Dear R-lang users
>> I have a question that is, I suppose, less related to the use of R.
>> I have a set of self-paced reading data, and all the RTs that are below
>> 100ms are to be discarded. What I used to do when analyzing raw data was to
>> replace discarded values with 0. That was all simple and easy. But I
>> recently started to analyze log-transformed data. An issue then arises as to
>> how to handle missing data. Obviously, if I replace the discarded raw data
>> points with 0, log transformation does not work, as it will return "-Inf"
>> for obvious reasons. So I would like to know what you would suggest me to do
>> in my case. Thank you very much in advance.
>>
>>
>> Xiao He
>
>

