[R-lang] Re: Random slopes in LME

Klinton Bicknell kbicknell@ucsd.edu
Sun Feb 20 13:32:22 PST 2011


One other issue I'd like to flag in this discussion of best practices is that, in my experience, it's a relatively common occurrence that a model with the full random effects structure [(A*B | ppt)+(A*B | item)] fails to converge -- and certainly this is even more frequently the case in slightly more complicated factorial designs (e.g., 2 x 3). When the full model fails to converge, of course, the likelihood ratio test isn't valid, and so the model selection problem becomes still less straightforward.

The approach I've often taken in these cases is to examine the random effect estimates of the full model that didn't converge, and to (iteratively) remove the random effect(s) with the lowest estimated variance (while respecting constraints such as never having random slopes for interactions without also having the random slopes for main effects, etc.) until the model converges. But I'm certainly not convinced this is the best solution.
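
As a concrete illustration, here is a minimal sketch of that inspection step in lme4, assuming a reading-time outcome RT and a data frame d (both hypothetical), with the design notation (A, B, ppt, item) from above:

library(lme4)
## fit the full model; lmer may warn that it failed to converge
m <- lmer(RT ~ A * B + (1 + A * B | ppt) + (1 + A * B | item),
          data = d, REML = FALSE)
## inspect the estimated variance of each random term
VarCorr(m)
## drop the slope with the smallest estimated variance (never keeping an
## interaction slope without its main-effect slopes), refit, and repeat
## until the model converges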

Klinton


On Feb 20, 2011, at 13:10 , Philip Hofmeister wrote:

> This goes to what Florian just wrote about, because I don't think the issue is resolved yet.
> 
> One issue in finding the maximal random effect structure justified by the data is that it's not straightforward to compare models whose random effect structures contain different random slope adjustments but the same number of random slopes. 
> 
> For instance, in a 2 x 2 design, with factors A & B, one could do as Florian suggests and compare (1+A*B | Participant) to (1 | Participant). If it works out that the more complex model is unjustified, all is hunky-dory. But if it *is* justified, then one would want to compare (1+A*B | Participant) with (1+A | Participant).  And here's the problem: if (1+A | Participant) turns out to be as good as (1+A*B | Participant), i.e. no significant improvement in the model, it's also possible that (1+B | Participant) is as good as (1+A*B | Participant).  So now the question is, which is preferred --- (1+A | Participant) or (1+B | Participant)? This doesn't seem to be an unapproachable problem for this simple case, but once one adds in random slopes for items, and the possibility of structures like (1 | Participant) + (0 + A | Participant), the computational space of the problem seems to explode. 
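> To make the dilemma concrete, here is a minimal sketch of the two non-nested candidates (assuming a reading-time outcome RT and a data frame d, both hypothetical). Since neither model is nested in the other, a likelihood ratio test doesn't apply; one pragmatic, if debatable, option is to compare information criteria:
> 
> library(lme4)
> m.A <- lmer(RT ~ A * B + (1 + A | Participant), data = d, REML = FALSE)
> m.B <- lmer(RT ~ A * B + (1 + B | Participant), data = d, REML = FALSE)
> AIC(m.A, m.B)   # same number of parameters here, so this amounts to comparing deviances
> 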
> 
> I'm not sure if Hal's program addresses this (which I haven't been able to get to work, sadly), so apologies if this is all worked out already. But in the interest of formulating some conventions, I would be interested in hearing others' thoughts on this issue.  I know that Roger has recommended using the full random effect structure (A*B), so long as it converges, but if others have bumped up against this problem, I'd like to hear how it's been handled. 
> 
> philip
> 
> On Feb 20, 2011, at 12:48 PM, Daniel Ezra Johnson wrote:
> 
>> When comparing models with different random effect structure don't we have to grapple with the "boundary problem" - in other words, not just compare change in deviance to a chi^2(1)?
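>> For the special case of testing a single variance component (no covariances), the null value sits on the boundary of the parameter space, and the reference distribution is roughly a 50:50 mixture of chisq(0) and chisq(1), so one common correction is to halve the naive chisq(1) p-value. A minimal sketch, assuming a reading-time outcome RT and a data frame d (both hypothetical):
>> 
>> library(lme4)
>> m0 <- lm(RT ~ A * B, data = d)                         # no random effects
>> m1 <- lmer(RT ~ A * B + (1 | Participant), data = d, REML = FALSE)
>> chi <- as.numeric(2 * (logLik(m1) - logLik(m0)))
>> p.naive    <- pchisq(chi, df = 1, lower.tail = FALSE)
>> p.boundary <- p.naive / 2                              # boundary-corrected
>> 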
>> 
>> On Feb 20, 2011, at 3:28 PM, "T. Florian Jaeger" <tiflo@csli.stanford.edu> wrote:
>> 
>>> Hi R-lang,
>>> 
>>> maybe this is a good time to comment on a more general issue that Roger also brought up. As more and more people employ mixed models to analyze their data, we will need some conventions in terms of the procedures employed and the aspects of the analysis that are reported. While that is probably unattainable for the most general case, I think it's a reasonably achievable goal for standard 2 x 2 (or, by extension, factorial) psycholinguistic experiments with more or less balanced data. 
>>> 
>>> In the meantime, I have some comments on the specific issue of random slopes. For a factorial design, I would submit that it should be the norm that we test the full factorial random effect structure with crossed subject and item random effects. The goal of this procedure should be what I have called the "maximal random effect structure justified by the data" (which, of course, is a bit of a shorthand, since it's really the maximal random effect structure justified by the data under a set of assumptions, such as that the assumptions of the generalized linear mixed model and the particular family of distributions are met to a sufficient extent, that there is sufficient data to justify model comparison using a chisq over differences in deviances, that there is no overfitting issue, etc.).
>>> 
>>> To see why we should test whether random slopes are required (and include them in the model, if they are required), it might be helpful to recall what random effect structure in a linear mixed model would most closely correspond to our good old friends, the F1 and F2 (by-subject and by-item) ANOVAs. For example, the F1 ANOVA is intended to not just capture between-subject differences in terms of the intercept (e.g. overall reading time differences, if we assume sum-coding). It's also meant to account for between-subject differences in the sensitivity to all of the design conditions (e.g. the two main effects and the interaction of a 2x2 design). So, the linear mixed model that most closely corresponds to the F1 ANOVA has the following random effect structure:
>>> 
>>> (1 + Condition1 + Condition2 + Condition1:Condition2 | Participant)
>>> 
>>> or shorter
>>> 
>>> (1 + Condition1*Condition2 | Participant)
>>> 
>>> and, similarly, the closest thing to an F2 analysis would be
>>> 
>>> (1 + Condition1*Condition2 | Item)
>>> 
>>> hence, to take maximal advantage of the fact that linear mixed models allow us to combine both analyses into one (reducing, among other things, the conservativity of minF analyses), the starting model should probably have both of these as crossed random effects, plus, obviously, the fixed effects for the design:
>>> 
>>> 1 + Condition1*Condition2 + (1 + Condition1*Condition2 | Participant) + (1 + Condition1*Condition2 | Item)
>>> 
>>> I'll call this the full model (not the same as the model with "the maximal random effect structure justified by the data"!). Now, not all of the random effects may be justified, as we can test by model comparison. For example, the model might not even converge. So, we can simplify the model stepwise, starting with the higher order random effect terms (the highest interaction). I am assuming that we will leave the fixed effect structure untouched since it's directly justified by the theoretical interests that led us to create the 2x2 design.
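>>> In lme4 syntax, a sketch of this full model (assuming a continuous outcome RT and a data frame d, both hypothetical; fitted with REML = FALSE so that deviances can be compared later) would be something like:
>>> 
>>> library(lme4)
>>> full <- lmer(RT ~ 1 + Condition1 * Condition2 +
>>>                (1 + Condition1 * Condition2 | Participant) +
>>>                (1 + Condition1 * Condition2 | Item),
>>>              data = d, REML = FALSE)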
>>> 
>>> As we simplify the random effect structure stepwise, there are some tips that I find useful (see also http://hlplab.wordpress.com/2009/05/14/random-effect-structure/ for a simple recipe):
>>> 
>>> 1) Let's start by comparing the full model with a model with just the random intercepts:
>>> 
>>> 1 + Condition1*Condition2 + (1 | Participant) + (1 | Item)
>>> 
>>> Let's call this the intercept-only model.  I'll assume that we have enough data to justify model comparison using a chisq over differences in deviances (that will usually be the case for a reasonably sized psycholinguistic experiment). I'll also assume ML-fitted models. If the difference in deviance between the full model and the intercept-only model is small, i.e. if the chisq of the model comparison has a value of less than 3 (or 2, if you want to be conservative; if I recall correctly, even a chisq(1) < 3.8 is technically not significant), we don't really need to look further. There is no way then that any of the intermediate models could be significant. That's because differences in deviance between nested models are additive. I.e. if model A is nested in B and B is nested in C (and hence A is nested in C), then the difference in deviance between A and B plus the difference between B and C equals the difference between A and C.
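>>> Continuing the sketch from above (again with the hypothetical RT and d), this comparison would look like:
>>> 
>>> intercept.only <- lmer(RT ~ 1 + Condition1 * Condition2 +
>>>                          (1 | Participant) + (1 | Item),
>>>                        data = d, REML = FALSE)
>>> anova(intercept.only, full)   # chisq over the difference in deviances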
>>> 
>>> Even if the chisq between the full and the intercept-only model is larger, you'll immediately get an idea of whether a lot or just a little model improvement will be achieved by adding random slopes. So, even if you need to continue your comparisons to find the best model, you've already learned something valuable.
>>> 
>>> 2) In our typical psycholinguistic experiment, participants are exposed to a barrage of lexically and structurally rather homogeneous stimuli. Researchers and paradigms, of course, differ in how a typical set of stimuli looks, but if you live in the "The plumber that the janitor knew is from New York." stimulus world and you 'successfully' created a stimulus set that seeks its match in terms of homogeneity (a match it most certainly won't find in actual language use ;), then, in my experience, you can expect your item random effects to be rather small. I don't necessarily think that's a great thing, but it's convenient with regard to the statistical issues at hand. If you work with such stimuli, I recommend starting by comparing the full model to a model with only random subject effects:
>>> 
>>> 1 + Condition1*Condition2 + (1 + Condition1*Condition2 | Participant) 
>>> 
>>> Again, if the chisq is really small, you don't need to look into item effects any further.
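>>> In the same sketch notation as above, that comparison would be:
>>> 
>>> subj.only <- lmer(RT ~ 1 + Condition1 * Condition2 +
>>>                     (1 + Condition1 * Condition2 | Participant),
>>>                   data = d, REML = FALSE)
>>> anova(subj.only, full)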
>>> 
>>> 
>>> 3) Hal Tily has developed a function (which he posted a while ago to this list) that automatically performs stepwise model comparison. While I am *not* advocating blindly relying on automatic model simplification, I think this procedure -if restricted to random effects- could be acceptable for balanced designs. I think Roger Levy and Hal are working on a new and improved implementation, which hopefully will soon be available to this list (yippie!).
>>> 
>>> Ok, sorry for the unsolicited, long, and perhaps trivial (?) email. Also, I am sure I've screwed up somewhere, so feel free to correct or expand. I just had the feeling that the general procedure outlined above wasn't clear to some (many?) readers of this list. 
>>> 
>>> Florian
>>> 
>>> 
>>> 
>>> On Sun, Feb 20, 2011 at 8:08 AM, Zhenguang Garry Cai <z.cai@ed.ac.uk> wrote:
>>> Hi Ariel,
>>> 
>>> Sorry for the somewhat misleading information. I do not recall any formal requirement from JML for random slopes, but in a recent submission to JML, I was required to include random slopes. Also, as Roger Levy said, including random slopes is a sensible thing to do.
>>> 
>>> 
>>> Garry
>>> 
>>> 
>>> On 20/02/2011 12:37, Ariel M. Goldberg wrote:
>>> Garry,
>>> 
>>> Does JML have a set of specific requirements for mixed-effects analyses?  I wasn't able to find anything in their author instructions.
>>> 
>>> Thanks,
>>> AG
>>> 
>>> On Feb 13, 2011, at 1:17 PM, Zhenguang Cai wrote:
>>> 
>>> Hi all,
>>> 
>>> I have a question concerning random slopes in mixed effects modeling. So I ran a structural priming experiment with a 4-level variable (prime type: A, B, C and D). The dependent variable is response construction (DO dative vs. PO dative).  The following is a summary of the experiment results.
>>> 
>>> Prime           A       B       C       D
>>> DOs             85      24      38      59
>>> POs             82      144     128     109
>>> % of DOs        .51     .14     .23     .35
>>> 
>>> I am interested in whether different prime types induced different priming, e.g., whether A led to more DO responses than B, C or D. Initially, I ran LME analyses with random intercepts only. For instance, I did the following to see whether there was a main effect of prime type.
>>> 
>>> fit.0 = lmer(Response~1+(1|Subject)+(1|Item), family=binomial)
>>> fit.p = lmer(Response~Prime+(1|Subject)+(1|Item), family=binomial)
>>> anova (fit.0, fit.p)
>>> 
>>> Then, I did pairwise comparison by changing the reference level for Prime, e.g.,
>>> 
>>> fit.p = lmer(Response~relevel(Prime,ref="B")+(1|Subject)+(1|Item), family=binomial)
>>> 
>>> It seems that all the levels differed from each other. In particular, the comparison between C and D results in Estimate = -1.02, SE = .32, z = -3.21, p < .01.
>>> 
>>> But it seems I have to consider whether the slope for Prime differs across subjects or items (at least this is a requirement from JML). So the following is how I checked whether a random slope should be included in the model; I wonder whether I did the correct thing. I first determined whether a subject random slope should be included by comparing the following two models.
>>> 
>>> fit.p = lmer(Response~Prime+(1|Subject)+(1|Item), family=binomial)
>>> fit.ps = lmer(Response~Prime+(Prime+1|Subject)+(1|Item), family=binomial)
>>> anova (fit.p, fit.ps)
>>> 
>>> I did the same thing about item random slopes.
>>> 
>>> fit.p = lmer(Response~Prime+(1|Subject)+(1|Item), family=binomial)
>>> fit.pi = lmer(Response~Prime+(1|Subject)+(Prime+1|Item), family=binomial)
>>> anova (fit.p, fit.pi)
>>> 
>>> The subject random slope had a significant effect, so I included it in the final model (e.g., fit.ps). But the pairwise comparisons returned something different from my initial analyses (when random slopes were not considered). That is, the comparison between C and D became only marginally significant (Estimate = -.85, SE = .47, z = -1.79, p = .07). It is a bit strange because the 9% difference between B and C turned out to be significant, but the 12% difference between C and D did not.
>>> 
>>> Or did I do anything wrong in the analyses?
>>> 
>>> Thanks,
>>> 
>>> Garry
>>> 
> 
> -----------------------------------------------
> Philip Hofmeister
> University of California - San Diego
> La Jolla, CA 92093-0526
> phofmeister@ucsd.edu


