[R-lang] Random effect modelling and Zipf distributed corpus data

Benoit Crabbé benoit.crabbe@gmail.com
Thu Mar 3 05:23:56 PST 2011


Dear R-lang-ers,


I am currently working with my students on modelling corpus-extracted data using a random intercept model.

Our model tries to predict the position of French attributive adjectives with respect to their head noun, given several variables.
Our setting is a logistic regression with a random effect (a random intercept) for the adjective lemma.
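
For reference, a random intercept logistic model of this kind can be written as below. This is only a generic sketch: the coding of the response (y_ij = 1 when the adjective precedes the noun) and the symbols x_ij, beta, b_j and sigma_b are notational assumptions, not taken from the actual model. Here j indexes adjective lemmas and i the occurrences of lemma j.

```latex
\operatorname{logit} P(y_{ij} = 1 \mid b_j) = x_{ij}^{\top}\beta + b_j,
\qquad b_j \sim \mathcal{N}(0, \sigma_b^2)
```

The normality assumption on b_j is the one that issue (1) below calls into question.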

Using lmer(...), we observe that:
(1) the distribution of the conditional modes is anything but normal.
(2) lmer does not converge properly: the deviance function flattens as the algorithm progresses towards the solution, so lmer iterates far too many times and yields an overfitted model.

We tried to solve issue (2): we were able to obtain better convergence with the lme4a library.

However, problem (1) remains: since the word lemmas are Zipf-distributed, most random intercepts have estimates close to 0 (the grand mean).
These random intercepts are also mostly those of low-frequency words (hapax or quasi-hapax words), for which the standard error of estimation is largest.

Words with a higher frequency have better estimates, and their intercepts apparently follow a normal pattern.
But the overall distribution looks like a mixture of (1) a peak around 0 made up mostly (but not only) of poorly estimated ranefs and (2) a flat, normal-looking pattern of better estimated ranefs whose apparent mean is clearly much greater than 0.
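
One possible reading of the peak around 0 is shrinkage: in a random intercept model, the predicted intercept of a group with few observations is pulled towards 0. The sketch below illustrates this with the closed-form shrinkage factor of a *linear* random intercept model with known variances, n * tau2 / (n * tau2 + sigma2). It is only an analogy for the logistic case, and the function name and the variance values are hypothetical.

```python
def shrinkage_factor(n_obs, tau2=1.0, sigma2=1.0):
    """Weight kept by a group's raw mean deviation in a linear random intercept model.

    tau2 is the between-group (random intercept) variance and sigma2 the
    residual variance; both are treated as known, illustrative values.
    """
    return n_obs * tau2 / (n_obs * tau2 + sigma2)


# Hypothetical lemma frequencies, from a hapax to a very frequent adjective.
for n_obs in (1, 2, 5, 50, 1000):
    print(f"n = {n_obs:4d}  shrinkage factor = {shrinkage_factor(n_obs):.3f}")
```

With these illustrative values, a hapax keeps half of its raw deviation from the grand mean, while a lemma observed 1000 times keeps nearly all of it.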

We tried to fix this:
- by replacing low-frequency word forms in the data (those below a given threshold) with a single word form, say 'hapax', with the aim of reducing the estimation problem mentioned above for low-frequency words;
- by adding and removing another word frequency variable as a fixed effect in the model (with which the random effect could interact).

Yet that does not really help. The distribution remains highly skewed.
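
The first of the two attempts above (collapsing low-frequency word forms) can be sketched as follows. This is a minimal Python illustration, not the code actually used; the function name collapse_rare_lemmas, the min_count threshold and the toy data are hypothetical.

```python
from collections import Counter


def collapse_rare_lemmas(lemmas, min_count=2, placeholder="hapax"):
    """Replace every lemma occurring fewer than min_count times by placeholder."""
    counts = Counter(lemmas)
    return [lemma if counts[lemma] >= min_count else placeholder for lemma in lemmas]


# Toy data (invented for illustration): one adjective lemma per observation.
observations = ["grand", "petit", "grand", "ancien", "grand", "petit", "joli"]
print(collapse_rare_lemmas(observations, min_count=2))
```

The recoded column would then be used as the grouping factor of the random intercept in place of the original lemma.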

This question seems to be a general one, since I suspect that people working with corpus data of this kind experience similar problems.
I wonder whether someone has already run into similar problems, and what kind of solution they found.
Any hint would help.

Many thanks,
Benoit

