[R-lang] Re: ling-r-lang-L Digest, Vol 1, Issue 54
Levy, Roger
rlevy@ucsd.edu
Mon Oct 18 10:25:55 PDT 2010
Hi Marie,
On Oct 17, 2010, at 11:21 AM, Marie Coppola wrote:
> Dear Roger and Florian (and others),
>
> Thanks very much for your attention to my query. I have two follow-up questions:
>
> 1) I'm pursuing your suggestion of applying Fisher's Exact Test/Chi-Square analysis to this comparison, and want to ensure that I'm not violating the assumption of independence by ignoring subject identity (each subject contributes multiple data points). I understand Roger's demonstration that there was no meaningful cross-subject variation (pasted below). Is demonstrating a lack of variation sufficient in general to justify violating the assumption of independence of the Fisher and Chi-square tests? Are there conventions for determining an "acceptable" level of cross-subject variation that allows one to assume independence for such tests? (Note: the n is too large for the Fisher test so I would be using Chi-Square in this case. I'm pasting the relevant portions of the previous responses below:)
Brief question: what do you mean when you say that N is too large for Fisher's exact test? Fisher's Exact Test can be used for datasets of any size. (I guess for large enough datasets computation of the exact p-value might be too time-consuming, but this is a practical, not a theoretical, limitation.)
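To make the point concrete, here is a small sketch (with made-up counts, not your data) showing that fisher.test() runs quickly on a 2x2 table even when the cell counts are in the thousands; the simulate.p.value option exists for tables larger than 2x2, where exact computation can become expensive:

```r
# Hypothetical 2x2 table with large counts (N = 12,900) -- not Marie's data.
x <- matrix(c(3600, 100, 2700, 6500), nrow = 2,
            dimnames = list(Handshape = c("handling", "object"),
                            Agent = c("agent", "no_agent")))
res <- fisher.test(x)   # exact p-value; fast despite the large N
res$p.value
# For r x c tables larger than 2x2, fisher.test(x, simulate.p.value = TRUE)
# gives a Monte Carlo approximation if the exact computation is too slow.
```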
Regarding "violation" of the assumption of independence: the point that there's no appreciable cross-subject variation in your dataset means that there's no evidence that ignoring subject identity involves any violations of independence! In principle, there are countless other factors that you've probably ignored on which your Handshape observations could be dependent: perhaps, for example, some of your observations were collected in the first six minutes of a given hour of the day, others in the second six minutes, and so forth. You very legitimately ignored most of these possible factors on the basis that there's no reason to assume dependence between any of them and Handshape. (And we all ignore countless factors like these all the time!)
The only reasons to think about subject identity any differently from these myriad other factors are (a) that we empirically have found that there is often considerable inter-subject variation regarding different individuals' linguistic behavior in respects that we find ourselves interested in measuring; and (b) our knowledge of the causal structure of the world induces us to believe that inter-subject variation in these situations is probably meaningful, rather than a coincidence. In these situations, ignoring inter-subject variation in our statistical analysis would lead to exactly the violations of independence that you allude to.
However, none of this *guarantees* that there is inter-subject variation in any particular study; and if there isn't, then ignoring subject variation doesn't lead to a violation of independence assumptions. Your data suggests that your study falls into this category, just as ignoring the minute of the hour during which an observation was collected probably doesn't lead to a violation of independence assumptions for your data either.
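A quick way to look for such cross-subject variation before pooling is to tabulate Handshape by subject and compare the per-subject proportions (a rough chi-square on that table can serve as an omnibus check). This sketch uses simulated data and assumed column names (Subject, Handshape), just to show the mechanics:

```r
# Simulated data standing in for the real data frame; column names are assumed.
set.seed(1)
dat <- data.frame(
  Subject   = rep(paste0("S", 1:4), each = 50),
  Handshape = sample(c("handling", "object"), 200, replace = TRUE)
)
tab <- with(dat, xtabs(~ Subject + Handshape))
prop.table(tab, margin = 1)  # per-subject Handshape proportions; similar rows
                             # suggest little cross-subject variation
chisq.test(tab)              # rough omnibus check for a subject effect
```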
> 2) Also, just to make sure I'm following, in this scenario, wouldn't the resulting data table be 2x2 (Handshape Type (Object vs. Handling) x Construction Type (Agent+Classifier vs. the other 3 types))?
> >
> > > , , Construction = Classifier
> > >
> > > Agent
> > > Handshape agent no_agent
> > > handling 36 1 (added)
> > > object 27 65
> > >
> > > , , Construction = Lexical_item
> > >
> > > Agent
> > > Handshape agent no_agent
> > > handling 1 1 (added)
> > > object 45 54
My thought was that you'd lump all the combinations of Agent & Construction together except for the agent+classifier combination, which you'd distinguish. This would be for the goal of demonstrating that the agent+classifier condition behaves differently from the rest:
> x <- with(dat, xtabs(~Handshape + AC))
> x
          AC
Handshape  FALSE TRUE
  handling     1   36
  object     164   27
> fisher.test(x)
	Fisher's Exact Test for Count Data

data:  x
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.0001191536 0.0304077148
sample estimates:
odds ratio
0.00473102
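For concreteness, the AC column above could be derived along these lines. The toy data frame and its level names are assumptions, mirroring the labels in the tables earlier in the thread; AC is TRUE exactly for the agent + classifier combination and FALSE for the other three:

```r
# Toy rows standing in for the real data frame; level names assumed from
# the tables quoted above.
dat <- data.frame(
  Agent        = c("agent", "agent", "no_agent", "no_agent"),
  Construction = c("Classifier", "Lexical_item", "Classifier", "Lexical_item"),
  Handshape    = c("handling", "object", "object", "handling")
)
# TRUE only for the agent + classifier combination; FALSE lumps the other 3.
dat$AC <- dat$Agent == "agent" & dat$Construction == "Classifier"
with(dat, xtabs(~ Handshape + AC))
```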
But of course the precise way you analyze your data as contingency tables would depend on the points you want to make.
Best
Roger
--
Roger Levy Email: rlevy@ling.ucsd.edu
Assistant Professor Phone: 858-534-7219
Department of Linguistics Fax: 858-534-4789
UC San Diego Web: http://ling.ucsd.edu/~rlevy