[R-lang] Re: Concerning glm with contrasts

Tue Aug 3 11:03:46 PDT 2010

Dear Zoe,

On Jul 28, 2010, at 4:15 PM, Zoe Luk wrote:

> Dear Dr. Levy,
>
> Thank you again for your response.
>
> I have two more questions:
> 1) I have tried multinom(), and I got the following results
>
> > multinom(constructions~language, weights=freq, data=comps2.table)
> # weights:  20 (12 variable)
> initial  value 2452.783379
> iter  10 value 1859.186364
> iter  20 value 1784.597860
> final  value 1784.587274
> converged
> Call:
> multinom(formula = constructions ~ language, data = comps2.table,
>     weights = freq)
>
> Coefficients:
>              (Intercept) languageEnglish languageJapanese
> Intransitive  -1.5102351        1.392500       1.12173442
> Others        -3.9951945        7.196337       2.55254006
> Passive       -2.2605019        5.222556       0.01766591
> Transitive     0.2722331        1.556114      -2.45252235
>
> Residual Deviance: 3569.175
> AIC: 3593.175
>
> So what do these numbers represent? What can I conclude from these results (since there is no p value)? I have looked it up in several books, but I could not find an answer (even Venables and Ripley do not explain what these numbers are).

Hmm, I don't quite replicate the results you've gotten here, perhaps because of the way that you've been setting contrasts. Here is what I've done using standard contrasts.

> freqs.df
       lang       constr freq
1  Japanese   Transitive  164
2  Japanese      Passive    9
3  Japanese Intransitive  291
4  Japanese   Adjectival   36
5  Japanese       Others    8
6   Chinese   Transitive  198
7   Chinese      Passive    3
8   Chinese Intransitive  221
9   Chinese   Adjectival   69
10  Chinese       Others   17
11  English   Transitive  174
12  English      Passive   31
13  English Intransitive  214
14  English   Adjectival   57
15  English       Others   32

> contrasts(freqs.df$lang)
         English Japanese
Chinese        0        0
English        1        0
Japanese       0        1
> contrasts(freqs.df$constr)
             Intransitive Others Passive Transitive
Adjectival              0      0       0          0
Intransitive            1      0       0          0
Others                  0      1       0          0
Passive                 0      0       1          0
Transitive              0      0       0          1
> library(nnet)
> multinom(constr ~ lang, weights=freq, data=freqs.df)
# weights:  20 (12 variable)
initial  value 2452.783379
iter  10 value 1819.676942
iter  20 value 1765.034693
final  value 1765.031122
converged
Call:
multinom(formula = constr ~ lang, data = freqs.df, weights = freq)

Coefficients:
             (Intercept) langEnglish langJapanese
Intransitive    1.164056  0.15887036    0.9257406
Others         -1.400869  0.82357495   -0.1036836
Passive        -3.135871  2.52677178    1.7494599
Transitive      1.054110  0.06180441    0.4621884

Residual Deviance: 3530.062
AIC: 3554.062

Here, the first column of coefficients represents the behavior for Chinese, the baseline language: for example, the upper left coefficient is the log odds of Intransitive relative to Adjectival (the baseline construction) in Chinese.  The second column is the difference in these log odds between English and Chinese for each construction type, and the third column is the corresponding difference between Japanese and Chinese.
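
Because this model has one free parameter per cell (it is saturated), the coefficients can be checked by hand against the raw counts. The following is a minimal sketch in base R, with the counts copied from freqs.df above; the object names `counts` and `logodds` are only for illustration. The values should agree with the multinom() output up to the optimizer's convergence tolerance.

```r
# Counts from freqs.df above, arranged as a language x construction matrix.
counts <- matrix(c(164,  9, 291, 36,  8,
                   198,  3, 221, 69, 17,
                   174, 31, 214, 57, 32),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(lang = c("Japanese", "Chinese", "English"),
                                 constr = c("Transitive", "Passive", "Intransitive",
                                            "Adjectival", "Others")))

# Log odds of each construction relative to the baseline (Adjectival),
# computed separately within each language (row-wise division).
logodds <- log(counts / counts[, "Adjectival"])

# "(Intercept)" column: log odds for the baseline language (Chinese).
logodds["Chinese", ]

# "langEnglish" column: English log odds minus Chinese log odds.
logodds["English", ] - logodds["Chinese", ]

# "langJapanese" column: Japanese log odds minus Chinese log odds.
logodds["Japanese", ] - logodds["Chinese", ]
```

The Adjectival entries come out as zero in every row, which is why that construction has no row of its own in the coefficient table.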

I suggest you take a look at section 6.1 of Agresti 2007 (Introduction to Categorical Data Analysis).

To infer p-values, note that you can get standard errors of the estimates by calling summary():

> summary(multinom(constr ~ lang, weights=freq, data=freqs.df))
# weights:  20 (12 variable)
initial  value 2452.783379
iter  10 value 1819.676942
iter  20 value 1765.034693
final  value 1765.031122
converged
Call:
multinom(formula = constr ~ lang, data = freqs.df, weights = freq)

Coefficients:
             (Intercept) langEnglish langJapanese
Intransitive    1.164056  0.15887036    0.9257406
Others         -1.400869  0.82357495   -0.1036836
Passive        -3.135871  2.52677178    1.7494599
Transitive      1.054110  0.06180441    0.4621884

Std. Errors:
             (Intercept) langEnglish langJapanese
Intransitive   0.1379030   0.2030598    0.2241217
Others         0.2707643   0.3494352    0.4755480
Passive        0.5898680   0.6306709    0.6977406
Transitive     0.1397966   0.2069635    0.2311227

Residual Deviance: 3530.062
AIC: 3554.062

Read section 1.4 of Agresti to understand how to get a p-value from the coefficient plus the standard error (e.g., using the Wald test, which is where the Z-score you see from glm and lmer's binomial models comes from).
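
As a rough sketch of that Wald calculation, here it is for two rows of the table above. The numbers are copied by hand from the summary() output, and the names `est`, `se`, `z`, and `p` are only for illustration.

```r
# Estimates and standard errors copied from the summary() output above
# (two rows only, to keep the example short). With a fitted model object,
# the same matrices should be available as summary(fit)$coefficients and
# summary(fit)$standard.errors.
est <- matrix(c( 1.164056, 0.15887036, 0.9257406,
                -3.135871, 2.52677178, 1.7494599),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Intransitive", "Passive"),
                              c("(Intercept)", "langEnglish", "langJapanese")))
se <- matrix(c(0.1379030, 0.2030598, 0.2241217,
               0.5898680, 0.6306709, 0.6977406),
             nrow = 2, byrow = TRUE, dimnames = dimnames(est))

# Wald z statistic: estimate divided by its standard error.
z <- est / se

# Two-sided p-value from the standard normal distribution.
p <- 2 * pnorm(-abs(z))

round(z, 2)
signif(p, 3)
```

Each p-value tests whether a single coefficient differs from zero, so it answers a question about one language pair and one construction pair at a time.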

>
> 2) Regarding your response to my second question, I actually did the treatment contrasts before I did the sum contrasts. And the results I got are as follows:
>
> > glm.out2<-glm(freq~language*constructions, family=poisson, data=comps2.data)
> > summary(glm.out2)
>
> Call:
> glm(formula = freq ~ language * constructions, family = poisson,
>     data = comps2.data)
>
> Deviance Residuals:
>  [1]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
>
> Coefficients:
>                                            Estimate Std. Error z value Pr(>|z|)
> (Intercept)                                 4.23411    0.12039  35.171  < 2e-16 ***
> languageEnglish                            -0.20875    0.17986  -1.161  0.24579
> languageJapanese                           -0.65059    0.20560  -3.164  0.00155 **
> constructionsIntransitive                   1.16406    0.13790   8.441  < 2e-16 ***
> constructionsOthers                        -1.40089    0.27077  -5.174 2.29e-07 ***
> constructionsPassive                       -3.13549    0.58977  -5.316 1.06e-07 ***
> constructionsTransitive                     1.05416    0.13980   7.541 4.68e-14 ***
> languageEnglish:constructionsIntransitive   0.17657    0.20383   0.866  0.38636
> languageJapanese:constructionsIntransitive  0.92918    0.22410   4.146 3.38e-05 ***
> languageEnglish:constructionsOthers         0.87205    0.34853   2.502  0.01235 *
> languageJapanese:constructionsOthers       -0.10318    0.47549  -0.217  0.82820
> languageEnglish:constructionsPassive        2.54413    0.63083   4.033 5.51e-05 ***
> languageJapanese:constructionsPassive       1.74920    0.69765   2.507  0.01217 *
> languageEnglish:constructionsTransitive     0.07954    0.20772   0.383  0.70177
> languageJapanese:constructionsTransitive    0.45607    0.23121   1.973  0.04854 *
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> (Dispersion parameter for poisson family taken to be 1)
>
>     Null deviance:  1.3756e+03  on 14  degrees of freedom
> Residual deviance: -1.6653e-14  on  0  degrees of freedom
> AIC: 116.66
>
> Number of Fisher Scoring iterations: 3
>
> > anova(glm.out2, test="Chisq")
> Analysis of Deviance Table
>
> Model: poisson, link: log
>
> Response: freq
>
> Terms added sequentially (first to last)
>
>
>                        Df Deviance Resid. Df Resid. Dev P(>|Chi|)
> NULL                                      14    1375.55
> language                2     0.00        12    1375.55         1
> constructions           4  1299.50         8      76.05 < 2.2e-16 ***
> language:constructions  8    76.05         0       0.00 3.032e-13 ***
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> However, I still have difficulty interpreting these results, and I do not think this provides me with what I want to compare. Could you tell me how you would interpret these results?

Chapter 7 of Agresti will explain how to interpret the coefficients in your model.  Once again, I think that you need to pinpoint what specific questions you want to ask with respect to your data.
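
As a starting point for reading a treatment-coded Poisson fit, here is a minimal sketch using the counts from freqs.df above (not your comps2.data, which I don't have); the names `freqs`, `pois`, and `b` are only for illustration. In a saturated model like this one, the construction main effects are the log odds against Adjectival in the baseline language (Chinese), and the interaction terms are how those log odds shift in English and Japanese. These are the same quantities that multinom() estimates.

```r
# Counts as in freqs.df above; factor levels are ordered so that Chinese and
# Adjectival are the baselines, matching the contrasts shown earlier.
freqs <- data.frame(
  lang   = factor(rep(c("Japanese", "Chinese", "English"), each = 5),
                  levels = c("Chinese", "English", "Japanese")),
  constr = factor(rep(c("Transitive", "Passive", "Intransitive",
                        "Adjectival", "Others"), times = 3),
                  levels = c("Adjectival", "Intransitive", "Others",
                             "Passive", "Transitive")),
  freq   = c(164,  9, 291, 36,  8,
             198,  3, 221, 69, 17,
             174, 31, 214, 57, 32)
)

# Saturated "surrogate Poisson" model, assuming R's default treatment contrasts.
pois <- glm(freq ~ lang * constr, family = poisson, data = freqs)
b <- coef(pois)

# Construction main effects: log odds vs. Adjectival in Chinese.
b[grep("^constr", names(b))]

# Interactions: how those log odds differ in English and in Japanese.
b[grep(":", names(b))]
```

The intercept and the lang main effects only account for the cell totals, so they carry no information about construction preferences.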

Best

Roger

>
> Thank you so much again for your response and time.
>
> Regards,
> Zoe
>
> On Thu, Jul 22, 2010 at 9:53 PM, Levy, Roger <rlevy@ucsd.edu> wrote:
> On Jul 21, 2010, at 9:58 AM, Zoe Luk wrote:
>
> > Dear Dr. Levy,
> >
> > Thank you very much for your detailed response. Your information is very helpful.
>
> No problem, Zoe -- my follow-ups are below.
>
> >
> > Here are my answers to your questions, and I have some further questions below.
> > 1) The totals are all 508 for all languages because I want to make sure that I am comparing the frequencies of constructions that are conveying the same meanings. The counts are based on a novel which is originally written in Japanese, and translated into Chinese and English. When I analyzed the texts, I only included those that have equivalents in all three languages. Therefore, they always have the same totals. I guess then the totals are not something that I'm interested in. I'm interested in how the languages differ in terms of the frequencies of different constructions.
>
> Yes, this makes a lot of sense and you clearly are not interested in the marginal counts in this case (though I guess that in principle you could have had a slightly different number of clauses in each language depending on the translation).
>
> > 2) The reason I used Poisson is because I thought it is appropriate for count data. Also I have a friend who is a statistician, and this is her suggestion.
>
> You're right -- it *is* appropriate for count data, and it's perfect, but my personal experience is that the interpretation of the Poisson formulation often comes a little less naturally to people than the interpretation of the (multinomial) logistic regression formulation.  Technically, the way you're using the Poisson model is what's called "surrogate Poisson" -- you're using Poisson regression to model what are actually multinomial frequencies.  Venables & Ripley section 7.3 has some discussion of this.
>
> > 3) Regarding Q3 in my original post, my hypothesis is that English prefers passive more than Japanese does, and Japanese prefers Intransitive more than English, but Chinese can be either in both cases.
>
> OK (though I'm not sure I understood the last part regarding Chinese) -- but does "prefer passive" simply mean that a larger proportion of the clauses are passive in English than in Japanese?  Likewise for intransitive.
>
> > Here are my further questions:
> > 1) You mentioned "multinomial logistic regression". I do know a little about logistic regression, but not at all about this specific kind of logistic regression. Which command should I use? glm(), mlogit(), lmer(), or something else? And what kind of distribution (e.g., binomial) should I use?
>
> There are a few options, including the "multinom" function in the nnet package, and the "multinomial" function in the VGAM package.  If you wind up interested in mixed-effects models, some people have been using the MCMCglmm package.
>
> > 2) Regarding the significant difference shown in "language1" in the model, you said that "the language1 variable more or less models the relative frequency of English observations to Japanese observations". I am not sure I am following. So what can I claim (or what do I know) based on this result?
>
> Given the way you have collected your data and set up your model, you would not be interested in the values of the intercept or the language parameters -- it's the construction and construction/language interaction parameters that matter.  But the way you've coded your contrasts makes the interpretation difficult; for the hypotheses you're interested in, I suggest you think about using "treatment" coding rather than the sum contrast coding that your original post had.
>
> > 3) As mentioned above, I do not have a specific hypothesis for Chinese. If I want to include it in the analysis anyway (e.g., comparing passive constructions in all three languages), what methods would you advise me to use?
>
> Well, I think it would depend on what underlying assumptions you have (e.g., do you implicitly think that Chinese is likely to be similar to English and Japanese overall, and so somehow including Chinese data would help you interpret the English & Japanese counts more reliably?).  If you have no hypotheses regarding Chinese at the moment, you might just consider not including those data in the analysis at this point.
>
> Best
>
> Roger
>
> >
> > Thank you in advance, and I look forward to hearing from you.
> >
> > Regards,
> > Zoe
> >
> > On Tue, Jul 20, 2010 at 4:43 PM, Levy, Roger <rlevy@ucsd.edu> wrote:
> > Dear Zoe,
> >
> > On Jul 19, 2010, at 10:28 AM, Zoe Luk wrote:
> >
> > > Dear R-lang users,
> > >
> > > I am new to the mailing list, and also rather new to R. I have a few questions concerning the results of glm().
> > >
> > > I am doing a study comparing the frequencies of different linguistic constructions used in a specific text that is in three languages (Japanese, Chinese, and English). The results I got are the following.
> > >
> > >
> > >            Transitive  Passive  Intransitive  Adjectival  Others  Total
> > > Japanese          164        9           291          36       8    508
> > > Chinese           198        3           221          69      17    508
> > > English           174       31           214          57      32    508
> > > Total             536       43           726         162      57   1524
> > >
> > > A chi-square test gave a significant result. I intended to do further analysis to see if there is any difference among the languages, so I did the following:
> > >
> > > > glm.out4<-glm(freq~language*constructions, data=comps2.data, family=poisson, contrasts=list(language=contrastml, constructions=contrastmc))
> >
> > The first question I'd like to ask is why you're using a Poisson model to analyze your data.  I see that the marginal totals for each language are the same at 508.  Were these marginal totals under your control (e.g., did you count in each text until you got 508), or are these totals something you want your model to account for?  A Poisson model devotes parameters to accounting for the marginal totals. If instead you're thinking that the language is an "independent variable" and the construction type is a "dependent variable", then analyzing the data with multinomial logistic regression might be more appropriate.  (Now, there *are* legitimate uses of Poisson models as surrogates for multinomial logistic regression, but using them as surrogates in this way affects how you interpret the model parameters -- see below.)
> >
> > > > summary(glm.out4)
> > >
> > > Call:
> > > glm(formula = freq ~ language * constructions, family = poisson,
> > >     data = comps2.data, contrasts = list(language = contrastml,
> > >         constructions = contrastmc))
> > >
> > > Deviance Residuals:
> > >  [1]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
> > >
> > > Coefficients:
> > >                          Estimate Std. Error z value Pr(>|z|)
> > > (Intercept)               3.93111    0.05890  66.746  < 2e-16 ***
> > > language1                 0.20443    0.08435   2.424 0.015363 *
> > > language2                 0.16064    0.09497   1.692 0.090740 .
> > > constructions1           -1.25129    0.06779 -18.459  < 2e-16 ***
> > > constructions2            1.68783    0.18775   8.990  < 2e-16 ***
> > > constructions3           -0.01655    0.08647  -0.191 0.848205
> > > constructions4            1.12805    0.13321   8.468  < 2e-16 ***
> > > language1:constructions1  0.12190    0.09726   1.253 0.210090
> > > language2:constructions1  0.26651    0.10562   2.523 0.011625 *
> > > language1:constructions2  0.15838    0.24722   0.641 0.521744
> > > language2:constructions2 -0.98403    0.32782  -3.002 0.002684 **
> > > language1:constructions3 -0.15971    0.12915  -1.237 0.216218
> > > language2:constructions3  0.44708    0.12620   3.543 0.000396 ***
> > > language1:constructions4 -0.51918    0.21538  -2.411 0.015931 *
> > > language2:constructions4  0.19079    0.18724   1.019 0.308207
> > > ---
> > > Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> > >
> > > (Dispersion parameter for poisson family taken to be 1)
> > >
> > >     Null deviance:  1.3744e+03  on 14  degrees of freedom
> > > Residual deviance: -3.1086e-15  on  0  degrees of freedom
> > > AIC: 116.66
> > >
> > > Number of Fisher Scoring iterations: 3
> > >
> > > > contrasts(language)
> > >          [,1] [,2]
> > > Chinese     0   -1
> > > English     1    1
> > > Japanese   -1    0
> > > > contrasts(constructions)
> > >              [,1] [,2] [,3] [,4]
> > > Adjectival      0    0   -1    0
> > > Intransitive    1    1    1    1
> > > Others          0    0    0   -1
> > > Passive         0   -1    0    0
> > > Transitive     -1    0    0    0
> > >
> > > So my questions are:
> > > (1) I am not sure how to interpret these results. Since language1 shows a significant difference, does it mean that "English and Japanese are significantly different in terms of the distribution of the different constructions used"?
> >
> > No -- since you're using Poisson regression, the language1 variable more or less models the relative frequency of English observations to Japanese observations.
> >
> > If you were to double the number of observations in each Japanese cell, the biggest change in your model would be that the language1 parameter would decrease.  (The intercept and the language2 parameter would also adjust by smaller amounts, in compensation.)  The constructions and language:constructions parameters would stay the same.
> >
> > > (2) Does the "intercept" represent anything at all? If yes, what does it represent in this case?
> >
> > It's probably not anything you're interested in.  Because you're using true contrasts (i.e. each column in your contrast matrices sums to zero), the intercept is more or less modeling the total number of observations in your dataset (keep in mind that Poisson regression is trying to model cell counts, not proportions).
> >
> > If you were to double the counts of all cells in your dataset, the intercept would increase by an additive constant -- log(2) -- and the rest of the model would stay the same.
> >
> > > (3) If I want to test whether English uses passive significantly more than Japanese, and Japanese uses intransitive significantly more than English, how should I modify the contrasts/commands?
> >
> > Let's call the passive question 3a, and the intransitive question 3b.  Answering these questions depends on the answers to a couple of other questions:
> >
> > * How, if at all, are the Chinese data relevant to either 3a or 3b?
> >
> > * How, if at all, are the distinctions among adjectival, intransitive, transitive, and "other" relevant to 3a?
> >
> > * How, if at all, are the distinctions among adjectival, passive, transitive, and "other" relevant to 3b?
> >
> > If the answer to all three questions is "irrelevant", you might just consider doing very simple chi-squared or Fisher's exact tests on 2x2 representations of the Japanese and English data as (a) passive and non-passive counts, and (b) intransitive and non-intransitive counts.
> >
> > Also, I'd recommend Maureen Gillespie's coding tutorial as background reading:
> >
> > http://go2.wordpress.com/?id=725X1342&site=hlplab.wordpress.com&url=http%3A%2F%2Fhlplab.files.wordpress.com%2F2010%2F05%2Fcodingtutorial.pdf&sref=http%3A%2F%2Fhlplab.wordpress.com%2F2010%2F05%2F10%2Fmini-womm-montreal-slides-now-available%2F
> >
> > Best
> >
> > Roger
> >
> > --
> >
> > Roger Levy                      Email: rlevy@ling.ucsd.edu
> > Assistant Professor             Phone: 858-534-7219
> > Department of Linguistics       Fax:   858-534-4789
> > UC San Diego                    Web:   http://ling.ucsd.edu/~rlevy
> >
> > --
> > Zoe Luk
> > Department of Linguistics
> > University of Pittsburgh
> >
> >
>

--

Roger Levy                      Email: rlevy@ling.ucsd.edu
Assistant Professor             Phone: 858-534-7219
Department of Linguistics       Fax:   858-534-4789
UC San Diego                    Web:   http://ling.ucsd.edu/~rlevy