[R-lang] poly() and polynomials

Roger Levy rlevy at ucsd.edu
Thu Sep 4 21:33:27 PDT 2008


Sharon Goldwater wrote:
 > Hi Roger,
 >
 >> So: as long as you're using maximum likelihood, the precise polynomial
 >> basis functions you choose (=orthogonal or non-orthogonal) won't affect
 >> what model is fitted, though as you point out it will affect the
 >> correlations between the parameter estimates.  If you use any other
 >> model-fitting criterion, though, such as penalized ML or Bayesian
 >> methods, the basis functions will matter.  For lmer(), using REML to fit
 >> a model will also cause the basis functions chosen to affect the
 >> resulting model fit, because REML is effectively Bayesian model fitting
 >> with a uniform prior over the fixed effects (see Pinheiro & Bates 2000,
 >> pp. 75-76).
 >>
 >> As an example, you get different log-likelihoods from the orthogonal &
 >> raw bases in the following simulation:
 >>
 >> x <- runif(1000)
 >> subj <- rep(1:10,100)
 >> r <- rnorm(10)
 >> dat <- data.frame(y=2 * x + 0.5 * x * x + r[subj] +
 >> rnorm(1000,0,5),x=x,subj=subj)
 >> lmer(y ~ poly(x,2) + (1 | subj), dat)
 >> lmer(y ~ poly(x,2,raw=T) + (1 | subj), dat)
 >
 > OK, thanks for pointing that out!  However, I don't think it applies
 > in this case because according to the documentation for lmer(),
 > logistic models (which is what I'm using) are always fit using
 > maximum-likelihood.  (I verified this by adding method = "REML" to my
 > model specification, and nothing changed.)

Aha, good point that the situation is different for logit models in
lmer().  I'm not totally clear on the upshot, though: according to
Bates 2007, it seems that a Laplace approximation is used, which yields
the conditional modes of the random effects and approximate ML estimates
of the fixed effects.  Jaeger (2007, in press, p. 10) says that what's
being fitted is a penalized quasi-log-likelihood, but that's not quite
my reading of Bates 2007.  If a penalized likelihood is being maximized,
then the choice of basis functions could still have an effect.
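
One way to check empirically whether it matters here is to redo the
simulation above with a binary outcome and compare the log-likelihoods
reported under the two bases.  This is just a rough sketch (using
plogis() for the inverse logit):

## simulate logit data with a genuine quadratic effect of x
x <- runif(1000)
subj <- rep(1:10, 100)
r <- rnorm(10)
z <- rbinom(1000, 1, prob=plogis(2 * x + 0.5 * x^2 + r[subj]))

## fit with the orthogonal and raw polynomial bases; if the reported
## log-likelihoods differ, the basis choice matters for logit models too
m.orth <- lmer(z ~ poly(x, 2) + (1 | subj), family="binomial")
m.raw <- lmer(z ~ poly(x, 2, raw=TRUE) + (1 | subj), family="binomial")
logLik(m.orth)
logLik(m.raw)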

 >> FWIW, though, spline-based methods usually tend to behave better than
 >> polynomial regression, especially if you're doing even mild
 >> extrapolation.
 >
 > Yes, it is certainly true that the quadratic fits I have make fairly
 > extreme predictions for points whose values are slightly beyond those
 > observed in the data.  Also using splines does seem to give a slightly
 > better fit.  However, there are a couple of reasons I've been using
 > polynomials rather than splines:
 >
 > 1. I want to be able to produce plots showing the predicted effects of
 > each fixed effect on the result variable.  I've been doing this by
 > extracting the fixed effect coefficients from the fitted model and
 > then plotting them (as in this figure:
 > http://homepages.inf.ed.ac.uk/sgwater/tmp/sri_num_coeff.pdf
 > where the result variable is
 > whether a word was correctly recognized by a speech recognition
 > system.  Variables are all centered and rescaled.)  Is there some way
 > to do this type of plot if I fit using rcs()?  I don't know what the
 > formula would be to make the plot using the rcs coefficients, and I
 > don't know of any predict-type function for lmer...

Yes, there is a way to do this.  You need to extract the spline
coefficients from the fitted model and then multiply the spline basis
matrix by them.  Here's an example -- note that I can't get rcs() or
ns() to work inside lmer(), but putting the call to rcs() on the outside
works fine:


## load the packages we need (rcs() is from the Design package, lmer() from lme4)
library(lme4)
library(Design)

## simulate some data
invlogit <- function(x) exp(x) / (1 + exp(x))

x <- runif(1000, -1, 1)
y <- runif(1000, 0, 1)
subj <- rep(1:10, 100)
r <- rnorm(10)
## true model: logit P(z = 1) = 2x + 3x^2 - 3y + by-subject intercept
z <- rbinom(1000, 1, prob=invlogit(2 * x + 3 * x ^ 2 - 3 * y + r[subj]))


## generate the spline basis matrices; here we use 8 knots.
## dummy is an evenly spaced grid over the range of x, used only for plotting.
## (rcs() places the knots at quantiles of its argument; since both x and
## dummy are roughly uniform over [-1,1], the two sets of knots end up in
## nearly the same places.)
knots <- 8
dummy <- seq(-1, 1, by=0.0025)
dummy.rcs <- rcs(dummy, knots)
x.rcs <- rcs(x, knots)

## fit the model (no data argument, so the variables are taken from the
## workspace)
m <- lmer(z ~ y + x.rcs + (1 | subj), family="binomial")

## extract the spline coefficients: with 8 knots, rcs() produces 7 basis
## functions, which sit in positions 3:9 of the fixed effects
## (position 1 is the intercept, position 2 is y)
rcs.coef <- fixef(m)[3:9]

## plot the fitted effect of x on the logit scale;
## note that you need to throw in the intercept
plot(dummy, fixef(m)[1] + dummy.rcs %*% rcs.coef, type="l", ylim=c(-3, 10))
lines(dummy, 2 * dummy + 3 * dummy * dummy, lty=2)  # decent fit to the true effect of x
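
If you want something closer to a predict()-type plot, you can push the
same linear predictor through the inverse logit to get predicted
probabilities.  Here's a sketch that holds y at 0.5 (an arbitrary
reference value) and ignores the random effect (i.e. sets it to zero):

## predicted probability of z = 1 as a function of x, with y fixed at 0.5
eta <- fixef(m)[1] + fixef(m)[2] * 0.5 + dummy.rcs %*% rcs.coef
plot(dummy, invlogit(eta), type="l", ylab="predicted P(z = 1)")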


 >
 > 2.  The fit of this model to the data is fairly poor -- there is
 > simply a large amount of randomness that isn't captured.  So
 > prediction would be really bad anyway.  Mainly what I'm trying to do
 > is (as implied by point 1) figure out how the factors that I'm
 > examining influence the results.  So I'm more interested in
 > qualitative statements like "words with extreme values of certain
 > prosodic features, such as duration, jitter, and speech rate, are more
 > likely to be misrecognized than words with more moderate values" than
 > in worrying about exactly how much worse it is to have a duration
 > of 1 sec than of 0.5 sec.  I would be happy to use rcs() if I could figure
 > out how to make the plots that allow me to make the qualitative
 > statements easy to see.  Any thoughts?

Yeah -- does the above help?

 > One more question, mostly unrelated: if I am using collin.fnc to
 > compute the condition number, am I correct that I should include only
 > the fixed effects when checking for collinearity? I've got a condition
 > number of around 18, which I think is borderline but OK, when I
 > include all the fixed effects including the (non-orthogonal) quadratic
 > terms.  But it goes up to 29 if I include the speaker and word factors
 > of my data (which I'm modeling as random).  That's because I also have
 > a factor for corpus, which is predictable given speaker.  But I guess
 > that high condition number is based on the assumption that you're
 > going to try to fit separate coefficients for each speaker, which in
 > fact I'm not.  So I should exclude the random effects from the
 > computation, right?

I'm going to punt on this...I don't know anything about collin.fnc...
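
Just as a sketch, though: if all you're after is the condition number of
the fixed-effects design matrix, base R's kappa() will compute one (it
may not scale things the same way collin.fnc does).  For instance, for
the toy model above:

## condition number of the fixed-effects design matrix only
X <- model.matrix(~ y + x.rcs)
kappa(X, exact=TRUE)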

Roger

-- 

Roger Levy                      Email: rlevy at ling.ucsd.edu
Assistant Professor             Phone: 858-534-7219
Department of Linguistics       Fax:   858-534-4789
UC San Diego                    Web:   http://ling.ucsd.edu/~rlevy
