[R-lang] Re: stepwise model fitting procedure in lmer

Fri Oct 15 09:30:49 PDT 2010

Hi Rachel

sorry for the slow reply to this. About a year ago, I started working on
a "drop1" stepwise model selection procedure for lmer. I got the code to
the point where it is useable, but not well tested, and some of the
functionality of other similar tools is missing. I haven't worked on it
for a long time -- partly because I've been busy with other things, but
partly because (thanks to conversations with Roger Levy) I've started to
doubt that this is the best way to build models for the kind of datasets
I most often work with.

The major problem is that when testing a large number of models that
include combinations of many different predictors, inevitably sometimes
the fitting function will fail to converge (this is more often a problem
when you have many random effects, and more with GLMMs than vanilla
linear models). Assuming those models are "bad" can affect which
predictors are removed, even though the reason the model cannot be
fitted may be that the algorithm is having trouble getting good
estimates for some control variable you barely care about rather than
the actual predictor that will end up getting thrown out. So, I would be
very careful using this kind of approach blindly when you have large
numbers of predictors, and particularly categorical predictors and
outcomes.

Anyway, the code is attached, so please use it if you think it will be
useful. But bear in mind that I haven't given it enough testing to be
certain that it will work "correctly", and like I said above, I'm not
even sure that this is a good strategy in all cases. Be sure to check
warnings() for non-convergence messages.

I've also heard that some people on Apple computers have trouble running
this because of the multithreading -- you can try commenting out the
line that calls registerDoMC() if that is an issue.

If you do use it, I'd be interested to hear how it works out!

Hal Tily
-- 

On Wed, 2010-10-06 at 00:43 +0100, Rachel Smith wrote:
> Dear all,
>  
> I wonder if anyone can help me. Apologies that my question is a basic
> one.
>  
> I am fitting linear mixed effects models using lmer from lme4. In the
> past I have carried out model fitting beginning from a saturated model
> and removing predictors by hand, so to speak. To make the process less
> time-consuming for complex models, I am keen to find out whether
> any stepwise selection procedures are available for use with lmer. 
>  
> I'd be grateful for any pointers.
>  
> Thank you, and best regards
>  
> Rachel
>  
> ---
> Dr Rachel Smith
> Lecturer / RCUK Academic Fellow
> English Language
> University of Glasgow
> 12 University Gardens
> Glasgow G12 8QQ
> United Kingdom
> Rachel.Smith@glasgow.ac.uk
> +44 (0)141 330 5533 
> http://www.gla.ac.uk/departments/englishlanguage/staff/rachelsmith/
>  
> I work part-time and am normally unable to respond to mail on Tuesdays

-------------- next part --------------
# drop1 for mer models -- Hal Tily August-November 2009
# this code isn't very neat, and not guaranteed to work. Please make any edits
# you like! At some point this may be packaged up properly, but for the moment
# just do
# > source("drop1.R")
# > drop1(lmer.model, p.criterion)
# the return value is the table of stepwise removals (not the final model, you
# will have to refit that from the formula printed if you want it.)

# only tested with ranef specifications that using intercepts, slopes and simple
# interactions (no "/" syntax, for instance); also, the "scope" parameter isn't
# implemented.

# all (lmer) models tested will use REML if specified in the model passed, and
# ML otherwise. Since the likelihood ratio test is over likelihoods and not REML
# criteria, if you fit by maximising REML the likelihood isn't guaranteed to
# increase as parameters are added, so you may end up with weird negative chi
# square values. This is statistically incorrect, but apparently not a big
# enough problem to worry about. (Also note I don't bother flipping between REML
# and ML depending on whether ranefs or fixefs are being removed.)

# it would be better to evaluate complex expressions in the formula (e.g., 
# log(var), etc), by using evalq or match.call() or something? the ranefs are
# annoying though, I don't know how to deal with them elegantly. For the moment,
# you should pass only variable names as input (so do e.g. d$log.var <- log(d$var)
# before fitting your model)

# ambiguity about the right thing to do when some vars aren't defined for the
# whole dataset, meaning that a reduced model could potentially use much more
# data than the full model; here I use only the data that is defined for all
# predictors in the full model. But this means fitting a model with the formula
# returned by drop1 on the full dataset may not yield the same results...
# (though if it looks very different there are bigger problems in the data!)

# not sure what to do about correlation structure -- e.g. how to deal with the
# difference between (a-1|r) + (b-1|r) + (a:b|r) and (a*b|r)?
# I think the best thing is to always include the correlation parameters and
# not test models without them, though corr~0.0 might be safely removed, and
# cor ~1.0/-1.0 may indicate that the slope can't be reliably estimated but in
# practice may still be preserved by model comparison... anyway, at the moment
# if you pass a model with multiple groups indexed by the same grouping
# factor, they will be collapsed by simplify.formula() (and so correlations
# added).

library(lme4)
library(doMC)

# set up multicore parallelisation
registerDoMC(multicore:::detectCores(all.tests=TRUE))

# returns a vector of grouped variables given a vector of ranef spec strings
left.parts = function(s) {
	ret = c()
	stopifnot(all(grepl('\\|', s)))
	for(i in seq_along(s)) {
		ret = c(ret, strsplit(s[i], '\\|')[[1]][1])
	}
	trim(ret)
}

# the same but returns the grouping factors instead
right.parts = function(s) {
	ret = c()
	stopifnot(all(grepl('\\|', s)))
	for(i in seq_along(s)) {
		ret = c(ret, strsplit(s[i], '\\|')[[1]][2])
	}
	trim(ret)
}

# always use REML=F for the comparisons, since it isn't meaningful to compare
# REML criteria for models with different fixed effects structure -- this is
# not quite correct for models that aren't fitted with plain ML, but in practice
# is close enough (Doug Bates, link below)
# https://stat.ethz.ch/pipermail/r-sig-mixed-models/2008q3/001365.html
logLikML = function(m) {
	if(typeof(m) == "S4" && m@class == "mer") {
		return(logLik(m, REML=F))
	} else {
		return(logLik(m))
	}
}

simplify.formula = function(form) {
	response = as.character(form[2])
	tf = terms.formula(form, simplify=T)
	terms = attr(tf, "term.labels")
	effects = terms[!is.ranef(terms)]
	ranefs = terms[is.ranef(terms)]
	groups = right.parts(ranefs)
	ranefs = left.parts(ranefs)
	for(group in unique(groups)) {
		ranefterms = c()
		for(r in ranefs[groups==group]) {
			ranefterms = c(ranefterms, attr(terms.formula(formula(paste('~', r)),
				simplify=T), "term.labels"))
		}
		if(length(ranefterms) == 0) { ranefterms = "1" }
		ranefterms = do.call(paste, as.list(c(ranefterms, sep=" + ")))
		ranefterms = attr(terms.formula(formula(paste('~', ranefterms)),
				simplify=T), "term.labels")
		if(length(ranefterms) == 0) { ranefterms = "1" }
		effects = c(effects, bracket(do.call(paste,
				as.list(c(do.call(paste, as.list(c(ranefterms, sep=" + "))), "|", group)))))

	}
	effects = do.call(paste, as.list(c(effects, sep=" + ")))
	if(!attr(tf, "intercept")) {
		effects = do.call(paste, as.list(c(effects, "-1")))
	} else if(length(effects) == 0) {
		effects = "1"
	}
	formula(do.call(paste, as.list(c(response, "~", effects))))
}

# remove whitespace
trim = function(s) {
	gsub('(^ +)|( +$)', '', s)
}

# updated version of drop.scope that knows about ranef specifications (doesn't
# seem like I can just call factor.scope() on the output of factor.table())
drop.scope.mer = function(form) {
	form = simplify.formula(as.formula(formula(form)))
	# a table of fixed effects indicating the inputs they involve
	fixeftab = factor.table(form)
	# a list of random effects grouped by grouping variable, indicating
	# for each a vector of effects that can be removed
	ranefs = removable.ranefs(form)
	if(length(fixeftab) == 0 && length(ranefs) == 0) {
		return(NULL)
	}
	for(i in 1:ncol(fixeftab)) {
		filter = fixeftab[,i]==1
		for(j in 1:ncol(fixeftab)) {
			if(all(!fixeftab[,j][!filter]) && !all(fixeftab[,j][filter]))
				fixeftab[,j][filter] <- F
		}
	}

	# only consider removals which do not violate interaction hierarchy
	eligible = which(as.vector(colSums(fixeftab)>0))
	removable = colnames(fixeftab)[eligible]
	# add in the ranefs that can also be removed
	for(group in names(ranefs)) {
		for(r in ranefs[[group]]) {
			removable = c(removable, do.call(paste, list(r, "|", group)))
		}
	}
	removable
}

# returns a list where elements are named after the grouping variable in a
# ranef group, and valued as a vector of removable effects
removable.ranefs = function(form) {
	ret = list()
	tf = terms.formula(form, simplify=T)
	terms = attr(tf, "term.labels")
	effects = terms[!is.ranef(terms)]
	ranefs = terms[is.ranef(terms)]
	if(length(ranefs)==0) {
		return(list())
	}
	groups = right.parts(ranefs)
	ranefs = left.parts(ranefs)
	# sanity check : we're expecting a simplified formula here, so no duplicates
	stopifnot(setequal(groups, unique(groups)))
	for(i in 1:length(ranefs)) {
		terms = drop.scope(formula(do.call(paste, list("~", ranefs[i]))))
		if(length(terms) == 0) { terms = "1" }
		ret[[groups[i]]] = terms
	}
	ret
}

# this is a horrible hack, it will break on other uses of | (logical or)
# could change to use lme4:::findbars which is a little better
is.ranef = function(fx) {
	grepl('\\|', fx)
}

# make a factor table like attr(terms(formula), 'factors') with the ranefs
# parsed as well, to be used to indicate which terms can be dropped
factor.table = function(form) {
	tab = attr(terms(form), 'factors')
	# remove any rows which mention entire ranefs
	tab = subset(tab, !grepl('\\|', rownames(tab)))
# commenting out the next part which would have included ranefs, but actually I
# realise these should be treated separately
#	# loop through all ranefs
#	for(i in which(grepl('\\|', colnames(tab)))) {
#		# get all variables mentioned in this ranef
#		vars = all.vars(formula(paste('~', colnames(tab)[i])))
#		for(v in vars) {
#			if(!(v %in% rownames(tab))) {
#				tab = rbind(tab, (1:ncol(tab)==i))
#				rownames(tab)[nrow(tab)] <- v
#			} else {
#				tab[v,i] <- T
#			}
#		}
#	}
	tab
}

# TODO: this isn't being used currently, but could be
# remove a term from a formula, replacing with a backed-off version if need be;
# e.g. (a|r) -> (1|r); poly(3) -> poly(2)?
backOffTerm = function(term) {
	if(substr(term, 1, 1) == '(' && substr(term, nchar(term), nchar(term)) == ')')
		term = substr(term, 2, nchar(term)-1)
	parts = strsplit(term, '\\|')[[1]]
	# is this a ranef?
	if(length(parts) == 2) {
		# trim
		parts = trim(parts)
		if(parts[1] == '1')
			return(NULL)
		return(paste('(1 | ', parts[2], ')', sep='')) 
	} else if(substr(term, 1, 4) == 'poly') {
		# TODO
	}

	return(NULL)
}

# irritating -- all useful generics are missing for mer. There must be a better
# way to do this
setMethod('family', 'mer',
function(object) {
	cal = slot(object, 'call')
	if(is.null(cal$family)) {
		return(list(family='gaussian', link='identity'))
	}
	return(list(family=cal$family))
})

# put brackets around ranefs
bracket = function(term.labels) {
	for(t in seq_along(term.labels)) {
		if(is.ranef(term.labels[t])) {
			term.labels[t] = paste('(', term.labels[t], ')', sep='')
		}
	}
	term.labels
}

# return a formula with the appropriate term removed (in the future this may
# involve adding back-off terms like a lower order polynomial if necessary)
# TODO: this function is a mess! (but seems to work)
remove.term = function(form, term) {
	tf = terms.formula(form, simplify=T)
	terms = attr(tf, "term.labels")
	effects = terms[!is.ranef(terms)]
	ranefs = terms[is.ranef(terms)]
	if(is.ranef(term)) {
		groups = right.parts(ranefs)
		group = which(groups == right.parts(term))
		rtf = terms.formula(formula(
				paste("~ ", left.parts(ranefs)[group], " -", left.parts(term), sep="")),
			simplify=T)
		newfactors = attr(rtf, "term.labels")
		if(length(newfactors) == 0) {
			if(attr(rtf, "intercept")) {
				newranef = paste("1 | ", groups[group], sep="") 
			} else {
				# remove the random group entirely
				newranef = c()
			}
		} else {
			newfactors = do.call(paste, as.list(c(newfactors, sep=" + ")))
			newranef = do.call(paste, as.list(c(newfactors, "|", groups[group])))
		} 
		ranefs = bracket(c(ranefs[-group], newranef))
		effects = do.call(paste, as.list(c(effects, bracket(ranefs), sep=" + ")))
		response = rownames(attr(tf, "factors"))[attr(tf, "response")]
		if(!attr(tf, "intercept")) {
			effects = do.call(paste, as.list(c(effects, "-1")))
		} else {
			effects = do.call(paste, as.list(c(effects, "+1")))
		}
		return(simplify.formula(formula(do.call(paste, as.list(c(response, "~", effects))))))
	} else {
		effects = do.call(paste, as.list(c(effects, bracket(ranefs), sep=" + ")))
		effects =  paste(effects, " -", term, sep="")
		response = rownames(attr(tf, "factors"))[attr(tf, "response")]
		if(!attr(tf, "intercept")) {
			effects = do.call(paste, as.list(c(effects, "-1")))
		} else {
			effects = do.call(paste, as.list(c(effects, "+1")))
		}
		return(simplify.formula(formula(do.call(paste, as.list(c(response, "~", effects))))))
	}
}

setMethod('drop1', 'mer',
function(object, scope, ...) {
	varargs = substitute(list(...))
	full.model = object
	# expect a minimum bound on chi^2 pvalue for retention
	if(is.null(varargs$p.criterion))
		stop('p.criterion not specified.')
	p.criterion = eval(varargs$p.criterion)

	if(is.null(varargs$use.maximum.data)) {
		use.maximum.data = F
	} else {
		use.maximum.data = as.logical(eval(varargs$use.maximum.data))
	}

	form = simplify.formula(formula(object))
	all.terms = terms(form)
	term.labels = bracket(attr(all.terms, "term.labels"))

# TODO: check which vars we are considering dropping
#	if(missing(scope)) {
#		# use everything by default
#		scope = term.labels
#	} else {
#		# this isn't tested yet...
#		# otherwise expect a formula or a character vector
#		if(!is.character(scope))
#			scope <- bracket(attr(terms(update.formula(form, scope)), "term.labels"))
#		if (!all(match(scope, term.labels, 0L) > 0L))
#			stop("scope is not a subset of term labels")
#	}
	fam = family(full.model)$family
	cal = slot(full.model, 'call')
	cal$family = fam
	if(as.character(cal[1]) == 'nlmer') {
		stop('nlmer objects not supported yet.')
	}

	# determine the fitting procedure/criterion (this is taken from
	# summary(mer))
	mType = if (as.logical(length(full.model@V))) "NMM" else "LMM"
	REML = full.model@dims["REML"] == 1
	if(gen <- as.logical(length(full.model@muEta)))
		mType <- paste("G", mType, sep = "")
	method <- {
		if (mType == "LMM") 
			if(REML) "REML"
			else "ML"
		else
			if (full.model@dims["nAGQ"] == 1) "Laplace"
			else "adaptive Gaussian Hermite"
	}
	no.intercept = attr(all.terms, 'intercept') == 0
	response = as.character(attr(all.terms, 'variables')[2])

	cat(mType, '; Family =', fam, '; Criterion =', method, '\n')
	currentformula = simplify.formula(formula(full.model))

	# unless use.maximum.data is specified,
	# restrict the dataset to cases with no NAs in the *full model*'s set of
	# predictors.
	# if use.maximum.data is true, then the full data set must be available
	# somewhere, either passed explicitly or as an argument to the fit

	if(is.null(varargs$data)) {
		if(is.null(cal$data)) {
			full.data = model.frame(full.model)
			if(use.maximum.data) {
				warning("Could not access a full data frame; use.maximum.data ignored")
			}
			use.maximum.data = F
			msubset = ifelse(is.null(cal$subset), rep(T, nrow(full.data)),
				eval(cal$subset, full.data))
		} else {
			full.data = eval(cal$data)[,names(model.frame(full.model))]
			msubset = ifelse(is.null(cal$subset), rep(T, nrow(full.data)),
				eval(cal$subset, eval(cal$data)))
		}
	} else {
		full.data = eval(varargs$data)
		msubset = ifelse(is.null(cal$subset), rep(T, nrow(full.data)),
			eval(cal$subset, full.data))
	}

	passed.subset = msubset

	full.subset = as.logical(msubset & (rowSums(is.na(full.data[,names(model.frame(full.model))]))==0))

	#cal$subset <- quote(full.subset)
	cal$data <- full.data
	cal$subset <- full.subset
	cal$formula <- currentformula
	# check whether any additional arguments are accepted by both fitting
	# functions
	possargs = intersect(names(formals(lmer)), names(formals(glm)))
	extraargs = setdiff(names(cal[-1]), possargs)
	if(length(extraargs) > 0) {
		warning('Arguments passed to lmer are incompatible with glm: ', extraargs)
	}

	# refit the model (irritating, but safer)
	full.model = eval(cal)

	# formatting busy work
	syswidth = as.numeric(Sys.getenv("COLUMNS"))
	if(is.na(syswidth) || length(syswidth) == 0) { syswidth = 80 }
	labelwidth = syswidth-44
	labels.abbr = trim(abbreviate(term.labels, minlength=labelwidth))
	labelwidth = max(nchar(labels.abbr))
	labels.abbr = format(labels.abbr, width=labelwidth)	
	cat(do.call(paste, as.list(c(format('removed', width=labelwidth), format(c('logLik', 'chiSq', 'df', 'p'), width=10, justify='right')))), '\n')
	line = do.call(paste, as.list(c(rep('-', labelwidth + 44), sep='')))
	cat(line, '\n')
	cat(do.call(paste, as.list(c(format('', width=labelwidth), format(c(signif(logLikML(full.model), 4), '', '', ''), width=10, justify='right')))), '\n')

	# TODO: only consider removals which are specified to be considered in 'scope
	# removable.terms = scope
	# permanent.terms = setdiff(term.labels, removable.terms)
	# start iterating through reduced models
	model = full.model
	prevloglik = as.numeric(logLikML(full.model))
	removals = data.frame(removed=NA, logLik=prevloglik,
		chiSq=NA, df=NA, p=NA)
	p.value = Inf
	prevdf = attr(logLikML(full.model), 'df')
	currentformula = simplify.formula(formula(model))
	while(p.value > p.criterion) {
		# do not violate interaction hierarchy
		#removable.now = intersect(removable.terms, drop.scope.mer(model))
		removable.now = drop.scope.mer(currentformula)
		foreach(t=seq_along(removable.now)) %dopar% {
			removedTerm = removable.now[t]
			# remove the term
			new.formula = remove.term(currentformula, removable.now[t])
			# use lmer to fit models with ranefs and glm otherwise
			new.call = cal
			if(is.null(lme4:::findbars(new.formula))) {
				new.call[1] = expression(glm)
				useargs = names(new.call) %in% possargs
				useargs[1] = T
				new.call = new.call[useargs]
			} else {
				new.call[1] = expression(lmer)
			}
			new.call$formula = new.formula

			reduced.model = eval(new.call)

			loglikelihood = logLikML(reduced.model)
			degrees = as.vector(prevdf - attr(loglikelihood, 'df'))
			chisquare = as.vector(2 * (prevloglik - loglikelihood))
			pval = as.vector(pchisq(chisquare, degrees, lower = FALSE))
			loglikelihood = as.vector(loglikelihood)

			# return the value to .combine function of foreach
			list(ll=as.numeric(loglikelihood), df=degrees,
				cs=chisquare, p=pval, model=reduced.model)
		} -> results

		if(length(results)==0)
			break
		models = foreach(x=results, .combine=c) %dopar% { x['model'] }
		results = data.frame(t(sapply(results, function(x) { unlist(rbind(x[names(x) != "model"])) })))
		colnames(results) <- c("ll", "df", "cs", "p")
		p.value = max(results$p)
		winner = which(results$p==p.value)[1]
		removed = removable.now[winner]
		logL = results[winner, "ll"]
		chiSq = results[winner, "cs"]
		degrees = results[winner, "df"]

		if(p.value > p.criterion) {
			model = models[[winner]]
			new.call = model@call
			currentformula = formula(model)
			prevdf = attr(logLikML(model), 'df')
			removals = rbind(removals, list(removed, logL, chiSq, degrees, p.value))
			cat(do.call(paste, as.list(c(
				format(abbreviate(removed, minlength=20), width=labelwidth),
				format(as.character(
				c(signif(logL, 4),
				  signif(chiSq, 4),
				  degrees,
				  signif(p.value, 4))),
			width=10, justify='right')))), '\n')

			# if we are increasing the size of the dataset each time
			if(use.maximum.data) {
				full.subset = as.logical(passed.subset & (rowSums(is.na(full.data[,names(model.frame(model))]))==0))
				cal$subset <- full.subset
				# refit the model, potentially with more data
				new.call$subset <- full.subset
				model = eval(new.call)
				prevloglik = logLikML(model)
			}
		}
	}
	cat(line, '\n')
	for(i in seq_along(removable.now)) {
		removed = removable.now[i]
		cat(do.call(paste, as.list(c(
			format(abbreviate(removed, minlength=20), width=labelwidth),
			format(as.character(
				c(signif(results[i, "ll"], 4),
				  signif(results[i, "cs"], 4),
				  results[i, "df"],
				  signif(results[i, "p"], 4))),
				width=10, justify='right')))), '\n')
	}
	# output the final model structure
	print(formula(model))
	# return invisibly since we already printed everything relevant
	invisible(removals)
})

drop1.example = function() {
	b1 = rnorm(300)
	b2 = rnorm(300)
	c1 = rnorm(300)
	r1 = sample(1:4, 300, replace=T)
	r2 = sample(1:6, 300, replace=T)
	r3 = sample(1:6, 300, replace=T)

	y = 100 + rnorm(4, 0, 10)[r1]*b1 + 2*b2 + 1*b1*b2 + rnorm(6, 0, 8)[r1] + rnorm(6, 0, 10)[r2] + rnorm(300, 0, 10)

	model = lmer(y ~ b1 * b2 * c1 + (b1|r1) + (b2|r2) + (c1*b2|r3), REML=F)
	#model = lmer(y ~ b1 * b2 * c1 + (1|r3))

	drop1(model, p.criterion=.3)
}