[R-lang] Re: ling-r-lang-L Digest, Vol 1, Issue 54
Levy, Roger
rlevy@ucsd.edu
Mon Oct 18 10:25:55 PDT 2010
Hi Marie,
On Oct 17, 2010, at 11:21 AM, Marie Coppola wrote:
> Dear Roger and Florian (and others),
>
> Thanks very much for your attention to my query. I have two follow-up questions:
>
> 1) I'm pursuing your suggestion of applying Fisher's Exact Test/Chi-Square analysis to this comparison, and want to ensure that I'm not violating the assumption of independence by ignoring subject identity (each subject contributes multiple data points). I understand Roger's demonstration that there was no meaningful cross-subject variation (pasted below). Is demonstrating a lack of variation sufficient in general to justify violating the assumption of independence of the Fisher and Chi-square tests? Are there conventions for determining an "acceptable" level of cross-subject variation that allows one to assume independence for such tests? (Note: the n is too large for the Fisher test so I would be using Chi-Square in this case. I'm pasting the relevant portions of the previous responses below:)
Brief question: what do you mean when you say that N is too large for Fisher's exact test? Fisher's Exact Test can be used for datasets of any size. (I guess for large enough datasets computation of the exact p-value might be too time-consuming, but this is a practical, not a theoretical, limitation.)
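To make the point concrete, here is a small sketch (with made-up counts, not your data) showing that fisher.test() runs quickly on a 2x2 table even when the cell counts are in the thousands; the simulate.p.value option exists for tables larger than 2x2, where exact computation can become expensive:

```r
# Hypothetical 2x2 table with large counts (N = 12,900) -- not Marie's data.
x <- matrix(c(3600, 100, 2700, 6500), nrow = 2,
            dimnames = list(Handshape = c("handling", "object"),
                            Agent = c("agent", "no_agent")))
res <- fisher.test(x)   # exact p-value; fast despite the large N
res$p.value
# For r x c tables larger than 2x2, fisher.test(x, simulate.p.value = TRUE)
# gives a Monte Carlo approximation if the exact computation is too slow.
```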
Regarding "violation" of the assumption of independence: the point that there's no appreciable cross-subject variation in your dataset means that there's no evidence that ignoring subject identity involves any violations of independence! In principle, there are countless other factors that you've probably ignored on which your Handshape observations could be dependent: perhaps, for example, some of your observations were collected in the first six minutes of a given hour of the day, others in the second six minutes, and so forth. You very legitimately ignored most of these possible factors on the basis that there's no reason to assume dependence between any of them and Handshape. (And we all ignore countless factors like these all the time!)
The only reasons to think about subject identity any differently from these myriad other factors are (a) that we empirically have found that there is often considerable inter-subject variation regarding different individuals' linguistic behavior in respects that we find ourselves interested in measuring; and (b) our knowledge of the causal structure of the world induces us to believe that inter-subject variation in these situations is probably meaningful, rather than a coincidence. In these situations, ignoring inter-subject variation in our statistical analysis would lead to exactly the violations of independence that you allude to.
However, none of this *guarantees* that there is inter-subject variation in any particular study; and if there isn't, then ignoring subject variation doesn't lead to a violation of independence assumptions. Your data suggests that your study falls into this category, just as ignoring the minute of the hour during which an observation was collected probably doesn't lead to a violation of independence assumptions for your data either.
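A quick way to look for such cross-subject variation before pooling is to tabulate Handshape by subject and compare the per-subject proportions (a rough chi-square on that table can serve as an omnibus check). This sketch uses simulated data and assumed column names (Subject, Handshape), just to show the mechanics:

```r
# Simulated data standing in for the real data frame; column names are assumed.
set.seed(1)
dat <- data.frame(
  Subject   = rep(paste0("S", 1:4), each = 50),
  Handshape = sample(c("handling", "object"), 200, replace = TRUE)
)
tab <- with(dat, xtabs(~ Subject + Handshape))
prop.table(tab, margin = 1)  # per-subject Handshape proportions; similar rows
                             # suggest little cross-subject variation
chisq.test(tab)              # rough omnibus check for a subject effect
```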
> 2) Also, just to make sure I'm following, in this scenario, wouldn't the resulting data table be 2x2 (Handshape Type (Object vs. Handling) x Construction Type (Agent+Classifier vs. the other 3 types))?
> >
> > > , , Construction = Classifier
> > >
> > > Agent
> > > Handshape agent no_agent
> > > handling 36 1 (added)
> > > object 27 65
> > >
> > > , , Construction = Lexical_item
> > >
> > > Agent
> > > Handshape agent no_agent
> > > handling 1 1 (added)
> > > object 45 54
My thought was that you'd lump all the combinations of Agent & Construction together except for the agent+classifier combination, which you'd distinguish. This would be for the goal of demonstrating that the agent+classifier condition behaves differently from the rest:
> x <- with(dat, xtabs(~Handshape + AC))
> x
          AC
Handshape  FALSE TRUE
  handling     1   36
  object     164   27
> fisher.test(x)
	Fisher's Exact Test for Count Data

data:  x
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.0001191536 0.0304077148
sample estimates:
odds ratio
0.00473102
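For concreteness, the AC column above could be derived along these lines. The toy data frame and its level names are assumptions, mirroring the labels in the tables earlier in the thread; AC is TRUE exactly for the agent + classifier combination and FALSE for the other three:

```r
# Toy rows standing in for the real data frame; level names assumed from
# the tables quoted above.
dat <- data.frame(
  Agent        = c("agent", "agent", "no_agent", "no_agent"),
  Construction = c("Classifier", "Lexical_item", "Classifier", "Lexical_item"),
  Handshape    = c("handling", "object", "object", "handling")
)
# TRUE only for the agent + classifier combination; FALSE lumps the other 3.
dat$AC <- dat$Agent == "agent" & dat$Construction == "Classifier"
with(dat, xtabs(~ Handshape + AC))
```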
But of course the precise way you analyze your data as contingency tables would depend on the points you want to make.
Best
Roger
--
Roger Levy Email: rlevy@ling.ucsd.edu
Assistant Professor Phone: 858-534-7219
Department of Linguistics Fax: 858-534-4789
UC San Diego Web: http://ling.ucsd.edu/~rlevy