[R-lang] Weird outcome logit mixed-effect model
Roger Levy
rlevy at ucsd.edu
Mon Dec 22 13:30:34 PST 2008
Laura de Ruiter wrote:
> Dear R-users and -experts,
>
> I am performing a rather simple analysis on a small data set (pasted
> below this email) and keep getting a result that is inexplicable to me.
> Perhaps I am missing something here - it would be great if someone
> could point out to me what I am doing wrong.
>
> I want to test whether the factor "Info" (which has three levels: "new",
> "given", "accessible") is a significant predictor for the binary
> variable "DeaccYN". The random factor is "Subject". The distribution of
> the data looks as follows:
>
> ----------------------------------------------------------------------
>
>               Info
> DeaccYN   given   new   accessible
>     no       25    42           21
>    yes       11     0            1
> ----------------------------------------------------------------------
>
>
> This is the model:
>
> ----------------------------------------------------------------------
>
> deacc.lmer = lmer(DeaccYN ~ Info + (1|Subject), data = dat,
>                   family = "binomial")
> ----------------------------------------------------------------------
>
>
> However, given the distribution above, this outcome seems rather weird
> to me:
>
> ----------------------------------------------------------------------
>
> summary(deacc.lmer)
> Generalized linear mixed model fit using Laplace
> Formula: DeaccYN ~ Info + (1 | Subject)
>    Data: dat
>  Family: binomial(logit link)
>    AIC   BIC logLik deviance
>   60.4 70.82  -26.2     52.4
> Random effects:
>  Groups  Name        Variance Std.Dev.
>  Subject (Intercept) 0.18797  0.43356
> number of obs: 100, groups: Subject, 21
>
> Estimated scale (compare to 1) 0.7316067
>
> Fixed effects:
>                Estimate Std. Error z value Pr(>|z|)
> (Intercept)     -0.8635     0.3795 -2.2754   0.0229 *
> Infonew        -18.7451  2764.2445 -0.0068   0.9946
> Infoaccessible  -2.2496     1.1186 -2.0110   0.0443 *
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> [...]
>
> ----------------------------------------------------------------------
>
>
> Why should the difference between 25/11 and 21/1 be significant, but
> the difference between 25/11 and 42/0 not? The standard error of 2764
> seems very odd to me!
>
> [...]
> I was wondering: Is it perhaps a problem for the model that there are no
> cases in the DeaccYN == "yes" category for Info == "given"? And if this
^^^^^
I believe you mean "new" here.
> is the case, why?
> Am I overlooking something here?
Dear Laura,
Independently of the issue that Florian is raising... you are right that
the lack of "yes" cases for Info == "new" is a problem for the model (to
be precise, it's a problem for estimating the significance of the
parameter estimate using the z value). The z value, which is the basis
of the significance level
reported in the lmer summary, is based on the Wald statistic, which is
z = \hat{b} / StdError(\hat{b})
where \hat{b} is the parameter estimate. However, for parameter
estimates of large magnitudes (as is the case for "new" here), the
standard error is inflated, which leads to a small Wald statistic. This
problem is discussed in a number of places, including Agresti (1996,
2002); here's a decent mention of it online:
http://userwww.sfsu.edu/~efc/classes/biol710/logistic/logisticreg.htm
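Incidentally, you can reproduce the pathology without the random effect
at all. The following is my own toy reconstruction of the 3x2 table
above (variable names assumed to match yours, Subject ignored), fit with
a plain glm:

```r
# Toy data matching the table above: 36 "given", 42 "new", 22 "accessible".
Info  <- factor(rep(c("given", "new", "accessible"), times = c(36, 42, 22)),
                levels = c("given", "new", "accessible"))
Deacc <- c(rep(c(0, 1), c(25, 11)),  # given:      25 no, 11 yes
           rep(0, 42),               # new:        42 no,  0 yes (separated!)
           rep(c(0, 1), c(21, 1)))   # accessible: 21 no,  1 yes

fit <- glm(Deacc ~ Info, family = binomial)
summary(fit)$coefficients
# The Infonew estimate runs off toward -Inf and its standard error is
# enormous, so the Wald z = estimate/SE is close to zero.
```

Note that glm warns about fitted probabilities numerically 0 or 1
occurring; that warning is itself a symptom of complete separation.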
You can see this quite clearly if you add just one "new"/"yes" example
at the end of your data frame, for example:
101 132 new yes
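In code, appending such a row might look like this (a sketch only; the
column names and the stand-in data frame are assumptions on my part):

```r
# Stand-in for the original data frame; only the column layout matters here.
dat <- data.frame(Subject = c(7, 12),
                  Info    = c("given", "new"),
                  DeaccYN = c("yes", "no"))
# Append one Info == "new", DeaccYN == "yes" observation (here Subject 132):
dat <- rbind(dat, data.frame(Subject = 132, Info = "new", DeaccYN = "yes"))
```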
Then the model gives (being sure to correctly specify the levels of "Info"):
> summary(deacc.lmer)
Generalized linear mixed model fit by the Laplace approximation
Formula: Deacc ~ Info + (1 | Subject)
   Data: dat
   AIC   BIC logLik deviance
 69.78 80.24 -30.89    61.78
Random effects:
 Groups  Name        Variance Std.Dev.
 Subject (Intercept) 0.34332  0.58594
Number of obs: 101, groups: Subject, 21
Fixed effects:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.8996     0.3936  -2.286  0.02226 *
Iaccessible  -2.2808     1.1460  -1.990  0.04656 *
Inew         -3.0050     1.1395  -2.637  0.00836 **
The next question, of course, is how to deal with the dataset you
actually have. If you weren't using a mixed-effects model, you could
use a likelihood-ratio test by comparing your full model with a simpler
model; the likelihood-ratio test isn't susceptible to problems with
large parameter estimates the way the Wald test is. For example:
> deacc.glm = glm(Deacc ~ Info, data = dat, family = "binomial")
> deacc.glm1 = glm(Deacc ~ Info1, data = dat, family = "binomial")
> anova(deacc.glm, deacc.glm1, test = "Chisq")
Analysis of Deviance Table

Model 1: Deacc ~ Info
Model 2: Deacc ~ Info1
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1        97     52.452
2        98     71.600 -1  -19.148 1.210e-05
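Here "Info1" is Info with the "new" level collapsed into the baseline
(an assumption on my part about how it was constructed), so the one-df
comparison isolates exactly the "new" contrast. A self-contained sketch
with toy data matching your table reproduces the deviances above:

```r
# Toy data matching the table (assumed to mirror the real data frame).
Info  <- factor(rep(c("given", "new", "accessible"), times = c(36, 42, 22)),
                levels = c("given", "new", "accessible"))
Deacc <- c(rep(c(0, 1), c(25, 11)), rep(0, 42), rep(c(0, 1), c(21, 1)))

# Hypothetical construction of Info1: merge "new" into the baseline "given".
Info1 <- Info
levels(Info1)[levels(Info1) == "new"] <- "given"

full    <- glm(Deacc ~ Info,  family = binomial)  # residual deviance 52.452
reduced <- glm(Deacc ~ Info1, family = binomial)  # residual deviance 71.600
anova(full, reduced, test = "Chisq")              # difference 19.148 on 1 df
```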
In principle, you could do this with your mixed-effects model, being
sure to use ML instead of REML fitting:
> deacc.lmer = lmer(Deacc ~ Info + (1 | Subject), data = dat,
+                   family = "binomial", REML = FALSE)
> deacc.lmer1 = lmer(Deacc ~ Info1 + (1 | Subject), data = dat,
+                    family = "binomial", REML = FALSE)
> anova(deacc.lmer,deacc.lmer1)
Data: dat
Models:
deacc.lmer1: Deacc ~ Info1 + (1 | Subject)
deacc.lmer: Deacc ~ Info + (1 | Subject)
            Df    AIC    BIC  logLik Chisq Chi Df Pr(>Chisq)
deacc.lmer1  3 77.600 85.416 -35.800
deacc.lmer   4 60.400 70.821 -26.200  19.2      1  1.177e-05 ***
There is an argument that the likelihood-ratio test is anti-conservative
and hence inappropriate for comparing mixed-effects models differing
only in fixed-effects structure. (See Pinheiro & Bates, 2000, around
page 76 or so. The argument is made only for linear mixed-effects
models and I'm not sure of the status for logit mixed-effects models.)
That being said, it's not clear how strong the anti-conservativity is,
and in your case it seems like you have such an exceedingly powerful
effect that you might be safe in using the likelihood-ratio test here
and just mentioning the potential anti-conservativity as a caveat.
So the summary is: believe it or not, how to assess the significance of
the parameter estimate for "new" for your model & dataset is a bit of an
open question, but it seems pretty clear that the estimate is
significantly non-zero.
Best & hope this helps.
Roger
--
Roger Levy                      Email: rlevy at ucsd.edu
Assistant Professor             Phone: 858-534-7219
Department of Linguistics       Fax:   858-534-4789
UC San Diego                    Web:   http://ling.ucsd.edu/~rlevy