[R-lang] Re: analysis of acceptability judgements

T. Florian Jaeger tiflo@csli.stanford.edu
Fri Oct 15 09:36:23 PDT 2010


Dear Kyle, Joao,

On Fri, Oct 15, 2010 at 8:44 AM, Kyle Gorman <kylebgorman@gmail.com> wrote:

> On Oct 15, 2010, at 9:47 AM, João Veríssimo wrote:
>
> > Dear all,
> >
> > Could anyone point me in the direction of papers/chapters that discuss
> > the best way to analyse acceptability judgements data (with or without
> > random effects)?
>
> There is something of a dearth of methodological discussion of
> acceptability judgements, especially in linguistics. Two people who have
> written extensively about it are Jon Sprouse (his recent work) and Carson
> Schütze (his book).


I would add the following. First, it might be good to be clear that neither
of the people you've mentioned has worked on the statistical analysis of
acceptability judgments, which is to some extent independent of the
*method* being used.

If you're looking for discussions of acceptability judgments as a
method, I'd agree that Carson's work is a good start, as well as Cowart's
1997 book "Experimental Syntax" (mostly on acceptability judgments), Frank
Keller's thesis on magnitude estimation, and several papers he wrote
with collaborators like Ash Asudeh and Antonella Sorace in the early
2000s. Several of these papers discuss trade-offs between magnitude
estimation and Likert-scale or binary judgments in depth. These papers (and
the development of the WebExp2 software) also triggered a lot of
cross-linguistic work, including methodological comparisons (a small sample
can be seen at http://www.language-experiments.org/, under "previous
experiments").

There's also been a bit of work on speeded grammaticality judgments (I
believe Fanselow is one of the people who has worked on this, as have Meng
and Bader, among others). I am mostly mentioning this because this earlier
work (e.g. by Cowart, and by Keller and colleagues) is often less well known
to theoretical linguists.

If your interest is methodological, you might also find work on the
correlation between acceptability judgments and reading times interesting. I
think several folks have looked into that (Evelina Fedorenko, Philip
Hofmeister, me, and, possibly, Sprouse?).


> I interpret this work, and older work in psychophysics (Stanley S. Stevens'
> "On the theory of scales of measurement", Science 1946), to indicate that we
> should avoid Likert scales for linguistic judgement tasks.
>

I generally agree on the a priori considerations about the methods (ME is
much more appealing), but, practically, I have yet to see an experiment where
the difference matters (we've used both types of judgments). I think that if
one does ME, sliding scales are nicer, just in terms of how fast people can
go through the experiment (and that's relevant, since I think there is
quite good evidence that increased speed of judgments reduces stylistic /
normative effects).

>
> > For judgements on a scale (say, 1 to 7), I have been thinking about
> > ordinal logistic regression, using ordered() and lrm(). I just don't
> > know whether this makes sense with 7 or more categories.
>
> Ordinal regression might make sense if you don't want to commit yourself to
> assuming a linear relationship between the ratings scale and whatever the
> predictors are. There are still some residual problems.
>
> One reason these types of Likert scales are disfavored in psychophysics is
> we don't know what to make of a subject who only uses values in the region
> [3, 5], or a subject who never uses 1, etc. This may be a central tendency
> bias, or a meaningful observation. The same is true of a subject's responses
> who are skewed away from the mean; it could be meaningful or a per-subject
> bias; a random intercept may be appropriate or not. For this reason, people
> have standardized Likert responses per subject, but the normality assumption
> may be bad, and there may be something meaningful in the stimuli that caused
> the subject not to use edge values. What if the subject has an internal
> rating of a stimulus as 3.5, but that value is not on the scale? We simply
> do not know if Likert scales are suitable for the kinds of interpretation
> we'd like to make about them.


There should be a message I sent (for some weird reason awaiting approval)
in which I point out that another problem is heterogeneity of variances when
the condition means are close to the edges of the interval. But weighted
regression might be able to deal with this. Usually, linear mixed models
seem to do just fine (compared to ordinal logistic regression) in the
analysis of a 7-point Likert scale, as long as the judgment means are
relatively close to the center. I haven't done this very often, but for the
4 or 5 data sets of that type I was asked to analyze, linear mixed models
seemed ok (based on the distributional properties of the data).
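For concreteness, here is a minimal sketch (with simulated data; the data
frame and variable names are hypothetical, not from this thread) of the two
analyses being compared, using lme4 for the linear mixed model and the
ordinal package for the cumulative link mixed model:

```r
## Minimal sketch with simulated 7-point ratings; all names hypothetical.
library(lme4)     # linear mixed models (lmer)
library(ordinal)  # cumulative link mixed models (clmm)

set.seed(1)
d <- data.frame(
  subject   = factor(rep(1:20, each = 10)),
  item      = factor(rep(1:10, times = 20)),
  condition = factor(rep(c("a", "b"), each = 5, times = 20))
)
## Ratings centered near the middle of the scale, with a condition effect.
d$rating <- pmin(7, pmax(1, round(4 + 0.8 * (d$condition == "b") +
                                    rnorm(nrow(d)))))

## Linear mixed model: treats the 1-7 scale as interval data.
m.lin <- lmer(rating ~ condition + (1 | subject) + (1 | item), data = d)

## Ordinal (cumulative link) mixed model: treats the ratings as ordered
## categories, avoiding the interval assumption.
d$rating.ord <- ordered(d$rating)
m.ord <- clmm(rating.ord ~ condition + (1 | subject) + (1 | item), data = d)

summary(m.lin)
summary(m.ord)
```

On data like these, where the means sit near the middle of the scale, the
two models will usually agree on the direction and reliability of the
condition effect.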

> > For judgements made with a glider or line where participants can choose
> > any point, the variable is continuous. But then lm() model predictions
> > can be outside the scale (and I suppose variance is not constant).
>
> Gliders are probably an improvement, but still not as good as magnitude
> estimation because they suffer from "boundary" problems: you can't take the
> slider beyond a certain point, and we face the same problems in modeling the
> results. If you're worried about values outside the glider's range,
> that's easy to address; I'll assume the glider values run from [0, 1]. One
> simple solution is to model logit(glider), since the logit function maps
> values on a (0, 1) scale to (-infinity, infinity). But it is undefined at
> 0 and 1, so you'll have to truncate or delete those values (R does the
> former by default, just FYI). There may be other functions that make sense
> that scale the values to a scale bounded only by infinity. We're then
> assuming linearity between logit(glider) and your predictors, which may be
> undesirable. We also have the same per-subject issues.
>

Let me add that the real statistical issue isn't so much the boundary itself
as the dependence of the condition variances on the condition means.

If you normalize your scale to [0, 1], you could always use weighted linear
regression over arcsine- or empirical-logit-transformed data (for more
discussion of weighted regression, see Barr, 2008 and his blog; for
discussion of the arcsine and empirical logit transforms, and of problems
with bounded data more generally, see Jaeger, 2008 in the JML special issue
on emerging methods for data analysis). This approach avoids the undesirably
arbitrary exclusion of 0 and 1 data points! Of course, at that point,
nothing prevents you from interpreting the values in [0, 1] as an estimate
of the probability that the stimulus is grammatical, in which case probit
and logistic models can also be used (but I think that would be overkill and
lead to a very specific interpretation that you might not intend).
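A small sketch of what that could look like in R, on simulated slider data.
The epsilon correction and the weight formula for raw proportions are my
assumptions, adapted from the count-based empirical logit in Jaeger (2008):

```r
## Sketch: weighted regression over transformed [0,1] slider ratings.
## Data are simulated; eps and the weights are assumptions adapted from
## the count-based empirical logit in Jaeger (2008).
set.seed(1)
y <- runif(200)    # hypothetical normalized slider ratings in [0, 1]
x <- rnorm(200)    # hypothetical predictor

## Arcsine (angular) transform: stabilizes variance of proportion-like data.
y.asin <- asin(sqrt(y))
m.asin <- lm(y.asin ~ x)

## Empirical logit: like qlogis(y), but defined at 0 and 1, so no data
## points need to be excluded.
eps      <- 0.025
y.elogit <- log((y + eps) / (1 - y + eps))

## Approximate variance of the empirical logit; use its inverse as weights.
v <- 1 / (y + eps) + 1 / (1 - y + eps)
m.elogit <- lm(y.elogit ~ x, weights = 1 / v)

summary(m.asin)
summary(m.elogit)
```

The weighting matters most when many responses sit near 0 or 1, which is
exactly where the variance of bounded data shrinks.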

HTH,
Florian

>
> Kyle Gorman
>