From reitter at cmu.edu Wed Apr 8 05:23:00 2009
From: reitter at cmu.edu (David Reitter)
Date: Wed, 8 Apr 2009 08:23:00 -0400
Subject: [R-lang] GIGP - random sample? CDF?
Message-ID: <14E85287-446B-4633-959D-3A6CF58969D1@cmu.edu>

I have collocation data that fits the Generalized Inverse Gaussian-Poisson distribution quite well (via the ZipfR package). Now I'd like to randomly sample from such a distribution. Does anyone know how to do that?

For starters, I could probably do with an integral-free cumulative distribution function for a GIGP, as that would get me 3/4 of the way there.

(I can think of some iterative/numerical way, but that wouldn't be very elegant - I might as well sample from the corpus data in that case.)

Thanks
- David

--
Dr. David Reitter
Department of Psychology
Carnegie Mellon University
http://www.david-reitter.com

From M.N.Carminati at dundee.ac.uk Tue Apr 7 02:28:42 2009
From: M.N.Carminati at dundee.ac.uk (Maria Carminati)
Date: Tue, 07 Apr 2009 10:28:42 +0100
Subject: [R-lang] mixed logit models, coding the effects and understanding the parameters
References: <49DB22D5020000B700035F4B@gw-out.dundee.ac.uk> <49DB2A66020000B700035F55@gw-out.dundee.ac.uk> <49DB2ADA020000B700035F5C@gw-out.dundee.ac.uk>
Message-ID: <49DB2AD9.AADC.00B7.0@dundee.ac.uk>

Please, can someone help me make sense of the parameters (intercepts and coefficients) of my mixed logit model?

The DV of my expt is a target response (poresp) which can be either PO (coded as 1) or DO (coded as 0). So PO is a success and DO is a failure. The predictors are prime and nounrep (noun repetition), each with 2 levels. I have effect coded the predictors (-.5 and +.5 respectively for each of the two levels of the predictors, following Barr, 2008): with this coding, which I believe is a way of centering, the intercept should correspond to the grand mean and the coefficients to the differences between the means etc., like in an ANOVA. Please correct me if I am wrong.
I tried this coding in the past where the outcome was a continuous variable and indeed when I checked the means and differences in my data, I found a correspondence between the ANOVA measures and the regression intercept and coefficients; however, now I am dealing with binomially distributed data, where the model parameters are in log odds space, so it is a bit more complicated, because I have to do transformations.

The problem is that when I check the output of the mixed logit regression (using lmer and family = binomial) and try to make sense of the coefficients by checking, for example, whether the intercept corresponds to the grand mean, in log odds, of course, things do not match. I checked and re-checked, and I do not know where I am going wrong.

==========

> head(verbdiff)
     X subject cond item myscore poresp nounrep prime primec nounrepc
1 1069       1    2   19       p      1       0     1    0.5     -0.5
2 1070       1    3    6       p      1       1     0   -0.5      0.5
3 1071       1    3    5       p      1       1     0   -0.5      0.5
4 1072       1    2   12       p      1       0     1    0.5     -0.5
5 1073       1    4   16       p      1       1     1    0.5      0.5
6 1074       1    3   14       p      1       1     0   -0.5      0.5

#primec and nounrepc are effect coded

COUNTS OF 0s and 1s:

> xtabs (~ poresp +prime +nounrep, data=verbdiff)

nounrep = 0

      prime
poresp   0   1
     0  89  58
     1 207 235

, , nounrep = 1

      prime
poresp   0   1
     0  91  64
     1 202 228

MODEL:

> verbdiff.lmer = lmer(poresp ~ primec *nounrepc + (1|subject) + (1|item), data= verbdiff, family="binomial")

> print (verbdiff.lmer)

Generalized linear mixed model fit using Laplace
Formula: poresp ~ primec * nounrepc + (1 | subject) + (1 | item)
   Data: verbdiff
 Family: binomial(logit link)
  AIC  BIC logLik deviance
 1092 1122   -540     1080
Random effects:
 Groups  Name        Variance Std.Dev.
 item    (Intercept) 1.2143   1.1020
 subject (Intercept) 1.8470   1.3590
number of obs: 1174, groups: item, 40; subject, 32

Estimated scale (compare to 1 ) 0.917418

Fixed effects:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)       1.6605     0.3121   5.320 1.04e-07 ***
primec            0.7854     0.1645   4.773 1.81e-06 ***
nounrepc         -0.1054     0.1612  -0.653    0.513
primec:nounrepc  -0.2138     0.3224  -0.663    0.507
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) primec nonrpc
primec       0.053
nounrepc    -0.009 -0.023
primc:nnrpc -0.012 -0.028  0.128

=============

QUESTIONS:

1. THE ESTIMATED INTERCEPT IN THE LOGIT MODEL IS 1.66 IN LOG ODDS, WHICH should correspond to grand mean, i.e. likelihood of successes (=PO responses coded as "1") independent of the predictors.

THERE WERE OVERALL 872 SUCCESSES AND 302 FAILURES IN THE EXPT, SO ODDS SHOULD BE 872/302=2.88 or (in probability space) .74/.26 = 2.85; THIS SHOULD GIVE A LOG OF ODDS OF APPROX 1.05, BUT THE INTERCEPT PREDICTED BY THE MODEL IS MUCH HIGHER (1.66)

SAME THING WHEN I USE THE CORRESPONDING DUMMY CODED (0-1) VARIABLES - INTERCEPT IS MUCH HIGHER THAN WHAT MY DATA SUGGEST

2. I ALSO TRIED TO APPLY THE Somers Dxy test, BUT I GET AN ERROR MESSAGE THAT I DO NOT UNDERSTAND; ISN'T Y BINARY (1/0) IN THE DATA FILE?

> probs = binomial()$linkinv(fitted(verbdiff.lmer))
> somers2(probs,as.numeric(verbdiff$poresp)-1)
Error in somers2(probs, as.numeric(verbdiff$poresp) - 1) :
  y must be binary

DATA FRAME verbdiff ATTACHED

MANY THANKS

Dr. Maria Nella Carminati
Department of Psychology
University of Dundee
Dundee DD1 4HN
Tel: +44 1382 388258
Fax: +44 1382 229993
Email: m.n.carminati at dundee.ac.uk
mnc at interfree.it

The University of Dundee is a registered Scottish charity, No: SC015096
-------------- next part --------------
A non-text attachment was scrubbed...
Name: for langR.RData
Type: application/octet-stream
Size: 83853 bytes
Desc: not available
URL:

From reitter at cmu.edu Wed Apr 8 15:54:28 2009
From: reitter at cmu.edu (David Reitter)
Date: Wed, 8 Apr 2009 18:54:28 -0400
Subject: [R-lang] mixed logit models, coding the effects and understanding the parameters
In-Reply-To: <49DB2AD9.AADC.00B7.0@dundee.ac.uk>
References: <49DB22D5020000B700035F4B@gw-out.dundee.ac.uk> <49DB2A66020000B700035F55@gw-out.dundee.ac.uk> <49DB2ADA020000B700035F5C@gw-out.dundee.ac.uk> <49DB2AD9.AADC.00B7.0@dundee.ac.uk>
Message-ID: <68565E3D-F1A2-43B7-BEF4-2FB50EE07F9C@cmu.edu>

Hi Maria,

good to hear from you. Just briefly for lack of time:

On Apr 7, 2009, at 5:28 AM, Maria Carminati wrote:
> Generalized linear mixed model fit using Laplace
> Formula: poresp ~ primec * nounrepc + (1 | subject) + (1 | item)
> Data: verbdiff

> THERE WERE OVERALL 872 SUCCESSES AND 302 FAILURES IN THE EXPT, SO ODDS
> SHOULD BE 872/302=2.88 or (in probability space) .74/.26 = 2.85;
> THIS SHOULD GIVE A LOG OF ODDS OF APPROX 1.05, BUT THE INTERCEPT
> PREDICTED BY THE MODEL IS MUCH HIGHER (1.66)

You have a random intercept for subjects (and one for items) fitted there...
I would fit a fixed effects model and check that first. I'm not sure if, given the groups defined for your random terms, all data points are weighted equally (as they are in your max likelihood probability above).
(Finally, by coding your binary factors as -0.5,0.5, you don't necessarily center the means at 0 - unless your design is balanced, which I almost suspect. If their means aren't 0, you wouldn't expect the fitted intercept to work out the way you're suggesting.)

Also, what happens if you take the non-significant terms out?

> primec:nounrepc -0.2138 0.3224 -0.663 0.507

Pity this one didn't work. Were these low-frequency nouns? Unless your design controlled their frequency, you could try adding terms for the noun log-frequency (from a corpus)...

Best
- David

--
Dr.
David Reitter
Department of Psychology
Carnegie Mellon University
http://www.david-reitter.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 2438 bytes
Desc: not available
URL:

From H.Quene at uu.nl Wed Apr 8 15:47:07 2009
From: H.Quene at uu.nl (Hugo Quené)
Date: Thu, 09 Apr 2009 00:47:07 +0200
Subject: [R-lang] mixed logit models
In-Reply-To:
References:
Message-ID: <49DD296B.9040205@uu.nl>

Dear Maria,

On 08-04-2009 23:06, r-lang-request at ling.ucsd.edu wrote:
> The DV of my expt is a target response (poresp) which can be either
> PO (coded as 1) or DO (coded as 0). So PO is a success and DO is a
> failure. The predictors are prime and nounrep (nounrepetition), each
> with 2 levels. I have effect coded the predictors (-.5 and +.5
> respectively for each of the two levels of the predictors, following
> Barr, 2008): with this coding, which I believe is a way of centering,
> the intercept should correspond to the grand mean and the coefficients
> to the differences between the means etc., like in an ANOVA. Please
> correct me if I am wrong. I tried this coding in the past where the
> outcome was a continuous variable and indeed when I checked the means
> and differences in my data , I found a correspondence between the ANOVA
> measures and the regression intercept and coefficients; however, now I
> am dealing with binomially distributed data, where the model parameters
> are in log odds space, so it is a bit more complicated, because I have
> to do transformations).
>
> The problem is that when I check the output of the mixed logit
> regression (using lmer and family = binomial ) and try to make sense of
> the coefficients by checking, for example, whether the intercept
> corresponds to the grand mean, in log odds, of course, things do not
> match. I checked and re-checked, and I do not know where I am going
> wrong.
>
> ==========
>
>> head(verbdiff)
>      X subject cond item myscore poresp nounrep prime primec nounrepc
> 1 1069       1    2   19       p      1       0     1    0.5     -0.5
> 2 1070       1    3    6       p      1       1     0   -0.5      0.5
> 3 1071       1    3    5       p      1       1     0   -0.5      0.5
> 4 1072       1    2   12       p      1       0     1    0.5     -0.5
> 5 1073       1    4   16       p      1       1     1    0.5      0.5
> 6 1074       1    3   14       p      1       1     0   -0.5      0.5
>
> #primec and nounrepc are effect coded
>
>
>
> COUNTS OF 0s and 1s:
>
>> xtabs (~ poresp +prime +nounrep, data=verbdiff)
>
> nounrep = 0
>
>       prime
> poresp   0   1
>      0  89  58
>      1 207 235
>
> , , nounrep = 1
>
>       prime
> poresp   0   1
>      0  91  64
>      1 202 228
>
>
> MODEL:
>
>> verbdiff.lmer = lmer(poresp ~ primec *nounrepc + (1|subject) +
> (1|item), data= verbdiff, family="binomial")
>
>> print (verbdiff.lmer)
>
>
> Generalized linear mixed model fit using Laplace
> Formula: poresp ~ primec * nounrepc + (1 | subject) + (1 | item)
>    Data: verbdiff
>  Family: binomial(logit link)
>   AIC  BIC logLik deviance
>  1092 1122   -540     1080
> Random effects:
>  Groups  Name        Variance Std.Dev.
>  item    (Intercept) 1.2143   1.1020
>  subject (Intercept) 1.8470   1.3590
> number of obs: 1174, groups: item, 40; subject, 32
>
> Estimated scale (compare to 1 ) 0.917418
>
> Fixed effects:
>                 Estimate Std. Error z value Pr(>|z|)
> (Intercept)       1.6605     0.3121   5.320 1.04e-07 ***
> primec            0.7854     0.1645   4.773 1.81e-06 ***
> nounrepc         -0.1054     0.1612  -0.653    0.513
> primec:nounrepc  -0.2138     0.3224  -0.663    0.507
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
> 0.1 ' ' 1
>
> Correlation of Fixed Effects:
>             (Intr) primec nonrpc
> primec       0.053
> nounrepc    -0.009 -0.023
> primc:nnrpc -0.012 -0.028  0.128
>
> =============
>
> QUESTIONS:
>
> 1. THE ESTIMATED INTERCEPT IN THE LOGIT MODEL IS 1.66 IN LOG ODDS,
> WHICH should correspond to grand mean, i.e. likelihood of successes (=PO
> responses coded as "1") independent of the predictors.
>
> THERE WERE OVERALL 872 SUCCESSES AND 302 FAILURES IN THE EXPT, SO ODDS
> SHOULD BE 872/302=2.88 or (in probability space) .74/.26 = 2.85;
> THIS SHOULD GIVE A LOG OF ODDS OF APPROX 1.05, BUT THE INTERCEPT
> PREDICTED BY THE MODEL IS MUCH HIGHER (1.66)

Indeed:
> logit(872/(872+302))
[1] 1.060362

This may be because with your coding, you are estimating the intercept if primec==0 & nounrepc==0. This means that if prime and DV are related (e.g. because there are more hits if prime==1), or likewise if noun and DV are related, then the estimated intercept is no longer the grand mean, but rather the estimated value of your DV if all predictors were zero.

>
> SAME THING WHEN I USE THE CORRESPONDING DUMMY CODED (0-1) V
> ARIABLES
> -INTERCEPT IS MUCH HIGHER THAN WHAT MY DATA SUGGEST
>

In my experience, models are far easier to interpret if you do NOT center the binary dummy predictors. The estimated intercept then equals the mean logit for the particular cell where all dummies are zero. So, my advice is to use your dummy coded variables and not the centered dummies.

The main point, I guess, is that the intercept in your model does not reflect the grand mean; it reflects the estimated DV if all predictors were zero, which is different from the grand mean IF predictors and DV are related.

Did you try this model...

>> verbdiff.m0.lmer = lmer(poresp ~ 1+(1|subject) +
> (1|item), data= verbdiff, family="binomial")

It should give you ONLY the intercept, which should then be about 1.06.

More background and details are available at http://www.hugoquene.nl/mixedeffects/ and references in the articles mentioned there.

Hope this helps, with kind regards,
Hugo Quene

--
Dr Hugo Quené
| assoc prof Phonetics | Utrecht inst of Linguistics OTS | Utrecht University | Trans 10 | 3512 JK Utrecht | The Netherlands | T +31 30 253 6070 | F +31 30 253 6000 | H.Quene at uu.nl | www.hugoquene.nl | www.hum.uu.nl

From tiflo at csli.stanford.edu Wed Apr 8 16:57:23 2009
From: tiflo at csli.stanford.edu (T. Florian Jaeger)
Date: Wed, 8 Apr 2009 19:57:23 -0400
Subject: [R-lang] mixed logit models, coding the effects and understanding the parameters
In-Reply-To: <68565E3D-F1A2-43B7-BEF4-2FB50EE07F9C@cmu.edu>
References: <49DB22D5020000B700035F4B@gw-out.dundee.ac.uk> <49DB2A66020000B700035F55@gw-out.dundee.ac.uk> <49DB2ADA020000B700035F5C@gw-out.dundee.ac.uk> <49DB2AD9.AADC.00B7.0@dundee.ac.uk> <68565E3D-F1A2-43B7-BEF4-2FB50EE07F9C@cmu.edu>
Message-ID: <38dc9be90904081657k392cc7bdhec45228b756277c2@mail.gmail.com>

Hey Maria,

I suspect you removed some outliers and now do not have a balanced data set anymore? As David was saying, if you do not have a balanced data set, sum (aka contrast) coding does *not* center your categorical predictors. (You can center them yourself if you want to.) That is probably the reason for the mismatch.

If so, then the intercept-only model should give you the expected estimate *unless* the random effects are not actually summing up to zero. That actually does happen (and then they essentially contain part of what you would expect to be the intercept). It's a good idea to check the distribution of the random effects anyway.

Unrelated to your problem, have you tried including random slopes for the two main effects? Seems like a good idea given your data.

Finally, just out of curiosity, are you looking at whether repeated nouns between prime and target affect priming? You may find Neal Snider's work interesting in that case. He has looked at how overall prime-target similarity affects the strength of priming.
(He found an effect, but his study is more general than noun identity; btw, I recall that he once told me that noun repetition alone did not reach significance).

Florian

On Wed, Apr 8, 2009 at 6:54 PM, David Reitter wrote:

> Hi Maria,
>
> good to hear from you. Just briefly for lack of time:
>
> On Apr 7, 2009, at 5:28 AM, Maria Carminati wrote:
>
>> Generalized linear mixed model fit using Laplace
>> Formula: poresp ~ primec * nounrepc + (1 | subject) + (1 | item)
>> Data: verbdiff
>>
>
> THERE WERE OVERALL 872 SUCCESSES AND 302 FAILURES IN THE EXPT, SO ODDS
>> SHOULD BE 872/302=2.88 or (in probability space) .74/.26 = 2.85;
>> THIS SHOULD GIVE A LOG OF ODDS OF APPROX 1.05, BUT THE INTERCEPT
>> PREDICTED BY THE MODEL IS MUCH HIGHER (1.66)
>>
>
> You have a random intercept for subjects (and one for items) fitted
> there...
> I would fit a fixed effects model and check that first. I'm not sure if,
> given the groups defined for your random terms, all data points are weighted
> equally (as they are in your max likelihood probability above).
> (Finally, by coding your binary factors as -0.5,0.5, you don't necessarily
> center the means at 0 - unless your design is balanced, what I almost
> suspect. If their means aren't 0, you wouldn't expect the fitted intercept
> to work out the way you're suggesting.)
>
> Also, what happens if you take the non-significant terms out?
>
> > primec:nounrepc -0.2138 0.3224 -0.663 0.507
>
> Pity this one didn't work. Where these low-frequency nouns? Unless your
> design controlled their frequency, you could try adding terms for the noun
> log-frequency (from a corpus)...
>
>
> Best
> - David
>
> --
> Dr. David Reitter
> Department of Psychology
> Carnegie Mellon University
> http://www.david-reitter.com
>
>
> _______________________________________________
> R-lang mailing list
> R-lang at ling.ucsd.edu
> http://pidgin.ucsd.edu/mailman/listinfo/r-lang
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From stefan.evert at collocations.de Thu Apr 9 10:00:34 2009
From: stefan.evert at collocations.de (Stefan Evert)
Date: Thu, 9 Apr 2009 19:00:34 +0200
Subject: [R-lang] GIGP - random sample? CDF?
In-Reply-To: <14E85287-446B-4633-959D-3A6CF58969D1@cmu.edu>
References: <14E85287-446B-4633-959D-3A6CF58969D1@cmu.edu>
Message-ID: <756791BE-4350-42B2-ACDC-66CC63D0DCDC@collocations.de>

Hi David!

> I have collocation data that fits the Generalized Inverse Gaussian-
> Poisson distribution quite well (via the ZipfR package). Now I'd
> like to randomly sample from such a distribution. Does anyone know
> how to do that?

There's a reason why zipfR doesn't offer a random sample generator for GIGP models: my (straightforward) implementation of random sampling transforms uniform random numbers into LNRE-distributed types using the quantile function (i.e. the inverse of the cumulative distribution function) and the cumulative type distribution function. Since ...

> For starters, I could probably do with an integral-free cumulative
> distribution function for a GIGP, as that would get me 3/4 of the
> way there.

... I'm not aware of any closed-form expression (or even Taylor expansions or such) for incomplete integrals of the GIGP density function, I haven't implemented these functions yet. The complete integrals (from 0 to +inf) have closed-form expressions involving Bessel functions, given in Baayen (2001).

BTW, this is one of the main reasons why I prefer the simplistic ZM/fZM models over GIGP.

> (I can think of some iterative/numerical way, but that wouldn't be
> very elegant - I might as well sample from the corpus data in that
> case.)

Exactly. I would stay away from numerical integration in this case.
Your goal is probably to run simulation experiments, so you will need a large number of random draws, and each of these would require calculating several numerical integrals with high precision (this is essential for the transformation from a uniform distribution to an LNRE distribution).

I've toyed with the possibility of using rejection sampling or a similar approach for GIGP, but haven't found a feasible solution yet. Any suggestions (or code :-) are highly welcome.

Best regards,
Stefan Evert

[ stefan.evert at uos.de | http://purl.org/stefan.evert ]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From graff at mit.edu Thu Apr 9 21:18:16 2009
From: graff at mit.edu (Peter Graff)
Date: Fri, 10 Apr 2009 00:18:16 -0400
Subject: [R-lang] Poisson MLR
Message-ID: <49DEC888.1040802@mit.edu>

Dear R-langs,

I was wondering if someone knew how to get a glm (family=poisson) to spit out the Model Likelihood ratio, or more generally how to calculate it from the deviance.

Thanks very much,

Peter

From stgries at gmail.com Fri Apr 10 07:36:03 2009
From: stgries at gmail.com (Stefan Th. Gries)
Date: Fri, 10 Apr 2009 07:36:03 -0700
Subject: [R-lang] Poisson MLR
In-Reply-To: <49DEC888.1040802@mit.edu>
References: <49DEC888.1040802@mit.edu>
Message-ID:

Hm, can't you compare the difference of the null and the residual deviances against a chi-square distribution with df=df_nulldev-df_resdev? (Just like when you use anova(model1, model2, test="Chi") during model simplification?)

HTH,
STG
--
Stefan Th.
Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------

From austin.frank at gmail.com Fri Apr 10 10:04:58 2009
From: austin.frank at gmail.com (Austin Frank)
Date: Fri, 10 Apr 2009 13:04:58 -0400
Subject: [R-lang] [SPAM] Re: Poisson MLR
References: <49DEC888.1040802@mit.edu>
Message-ID:

On Fri, Apr 10 2009, Peter Graff wrote:

> I was wondering if someone knew how to get a glm (family=poisson) to
> spit out the Model Likelihood ratio, or more generally how to
> calculate it from the deviance.

Likelihood ratios apply between two different models; a single model has a log likelihood. Are you comparing two models?

To get the log likelihood for a single model, you can use logLik(model). To do a likelihood ratio test between two models, you can use anova(model1, model2, test="Chisq").

Using the example from ?glm,

--8<---------------cut here---------------start------------->8---
## EXAMPLE
## Dobson (1990) Page 93: Randomized Controlled Trial :
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
print(d.AD <- data.frame(treatment, outcome, counts))
glm.D93 <- glm(counts ~ outcome + treatment, family=poisson())
anova(glm.D93)
summary(glm.D93)

## LOG LIKELIHOOD
logLik(glm.D93)

## MODEL COMPARISON
## test="Chisq" requests the chi-square p-value for the deviance difference
glm.D93.1 <- glm(counts ~ 1, family=poisson())
anova(glm.D93, glm.D93.1, test="Chisq")
--8<---------------cut here---------------end--------------->8---

HTH,
/au

--
Austin Frank
http://aufrank.net
GPG Public Key (D7398C2F): http://aufrank.net/personal.asc
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 193 bytes
Desc: not available
URL:

From kylebgorman at gmail.com Fri Apr 10 16:28:15 2009
From: kylebgorman at gmail.com (Kyle Gorman)
Date: Fri, 10 Apr 2009 19:28:15 -0400
Subject: [R-lang] residualization of a three-way contrast
Message-ID: <6CBB8FED-426F-4DC7-8A65-760A450D7A6C@gmail.com>

i have three positively-correlated predictors that i'd like to include in a model. any traditional measure suggests that to include them as is would introduce a good deal of collinearity. really, these are a great candidate for either taking the sum of the three, or for PCA, but hypothetically, let's say i wanted to use a residualization trick for this three-way interaction.

(they are all on a 15 point scale and I predict they will all have similar positive betas)

X1 will remain as is.

r.X2 = residuals(lm(X2 ~ X1))
r.X3 = residuals(lm(X3 ~ X1 + r.X2))

then:

outcome ~ X1 + r.X2 + r.X3

this is the solution i vaguely recall seeing in a textbook somewhere under the name "partialization"

- is this kosher?
- should the form of r.X3 be the naive residuals(lm(X3 ~ X1 + X2))?
- should the form of r.X2 be the less-naive residuals(lm(X2 ~ X1 + X3))?

- kyle

ps: yes, i didn't say anything about language here. but it's a language study

From dutch.linguistics at gmail.com Sat Apr 11 05:55:39 2009
From: dutch.linguistics at gmail.com (Marco)
Date: Sat, 11 Apr 2009 14:55:39 +0200
Subject: [R-lang] High collinearity
Message-ID: <30631a790904110555v911ffafh67d6b63dff2e974e@mail.gmail.com>

Dear R-langs,

I have a data set that contains highly correlated variables (> .90), all of which are variables that occur on the same time scale. I crucially want to determine whether one of these variables (End) has explanatory power on top of all the other ones.
In this case, is it legitimate to take the residuals of End (fitting an lm model, in which we explain End with all other, correlated variables), and then run an lmer model that only contains resid_end? When I look at the results I obtain, it seems like the other correlated variables result in corrupted residuals for End. Are there any other methods to deal with (and distinguish between) highly correlated variables in R? Or could you tell me whether it is valid to use these residuals (and the F values obtained for these residuals), even though the beta coefficients are uninterpretable?

Thanks in advance!

Marco
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rlevy at ling.ucsd.edu Sat Apr 11 12:45:46 2009
From: rlevy at ling.ucsd.edu (Roger Levy)
Date: Sat, 11 Apr 2009 12:45:46 -0700
Subject: [R-lang] residualization of a three-way contrast
In-Reply-To: <6CBB8FED-426F-4DC7-8A65-760A450D7A6C@gmail.com>
References: <6CBB8FED-426F-4DC7-8A65-760A450D7A6C@gmail.com>
Message-ID: <0BB21366-F415-46DA-938C-ABA4CC575073@ling.ucsd.edu>

On Apr 10, 2009, at 4:28 PM, Kyle Gorman wrote:

> i have three positively-correlated predictors that i'd like to
> include in a model. any traditional measure suggests that to include
> them as is would introduce a good deal of collinearity. really,
> these are a great candidate for either taking the sum of the three,
> or for PCA, but hypothetically, let's say i wanted to use a
> residualization trick for this three-way interaction.
>
> (they are all on a 15 point scale and I predict they will all have
> similar positive betas)
>
> X1 will remain as is.
>
> r.X2 = residuals(lm(X2 ~ X1))
> r.X3 = residuals(lm(X3 ~ X1 + r.X2)
>
> then:
>
> outcome ~ X1 + r.X2 + r.X3
>
> this is the solution i vaguely recall seeing in a textbook somewhere
> under the name "partialization"

Hi Kyle,

> - is this kosher?

Yes, it's kosher, even during Passover :-) Just keep in mind what the outcome of your regression will be.
The coefficient assigned to r.X3 is "that portion of the variability in your outcome that cannot be expressed as a linear combination of X1 and X2". Likewise (more simply) for r.X2.

> - should the form of r.X3 be the naive residuals(lm(X3 ~ X1 + X2)?

It won't make a difference. r.X3 will be the same in either case (modulo numerical error).

> - should the form of r.X2 be the less-naive residuals(lm(X2 ~ X1 +
> X3))?

That would be bad. If you did this and then used your original formula

outcome ~ X1 + r.X2 + r.X3

you would be in a more restricted subspace than for

outcome ~ X1 + X2 + X3

which you don't want. Imagine the extreme case where X2 == X3 always. Then with your proposal, r.X2 and r.X3 would always both be 0.

Roger

--
Roger Levy                      Email: rlevy at ling.ucsd.edu
Assistant Professor             Phone: 858-534-7219
Department of Linguistics       Fax:   858-534-4789
UC San Diego                    Web:   http://ling.ucsd.edu/~rlevy

From njs at pobox.com Sat Apr 11 12:47:35 2009
From: njs at pobox.com (Nathaniel Smith)
Date: Sat, 11 Apr 2009 12:47:35 -0700
Subject: [R-lang] residualization of a three-way contrast
In-Reply-To: <6CBB8FED-426F-4DC7-8A65-760A450D7A6C@gmail.com>
References: <6CBB8FED-426F-4DC7-8A65-760A450D7A6C@gmail.com>
Message-ID: <961fa2b40904111247q7883642fwda00251228d1402a@mail.gmail.com>

On Fri, Apr 10, 2009 at 4:28 PM, Kyle Gorman wrote:
> X1 will remain as is.
>
> r.X2 = residuals(lm(X2 ~ X1))
> r.X3 = residuals(lm(X3 ~ X1 + r.X2)
>
> then:
>
> outcome ~ X1 + r.X2 + r.X3
>
> this is the solution i vaguely recall seeing in a textbook somewhere under
> the name "partialization"
> - is this kosher?

Sure.

r.X2 == X2 - alpha - beta*X1   (for some alpha and beta)

Which means:

X2 == r.X2 + alpha + beta*X1

and a similar rule is true of r.X3. That means that if the linear model with residuals wants to use, say, X3 to predict the outcome, then it can reconstruct it from X1, r.X2, r.X3 by choosing the right linear coefficients.
In other words,

outcome ~ X1 + r.X2 + r.X3

and

outcome ~ X1 + X2 + X3

end up fitting exactly the same linear models. The only differences are in numerical stability (the model with residuals is better), and that you have to interpret the fitted coefficients differently (and the default t-tests as well, of course, since those are testing the hypothesis that each coefficient is non-zero). If you need other hypothesis tests, you can use linear.hypothesis from library(car), which lets you test things like "these coefficients are equal to each other", or "these coefficients sum to 0".

It can help in interpretation to rescale X1 -- I often fit models like

p.X1 <- predict(lm(X2 ~ X1))
r.X2 <- resid(lm(X2 ~ X1))
lm(outcome ~ p.X1 + r.X2)

p.X1 is just a rescaling/recentering of X1 to put it on the same scale as X2. What's nice is that X2 is the simple sum of p.X1 + r.X2. That means that if I see the same coefficients on p.X1 and r.X2, the model is reconstructing X2; if both are non-zero but different, then the model wants to use a mix, etc. I haven't really thought about how to do this for more than 2 variables, but maybe it'll give you some ideas.

> - should the form of r.X3 be the naive residuals(lm(X3 ~ X1 + X2)?

It makes no difference. As in, whichever way you calculate them, you will get exactly the same values for r.X3 (except that lm(X3 ~ X1 + X2) might be less numerically stable, as above).

> - should the form of r.X2 be the less-naive residuals(lm(X2 ~ X1 + X3))?

I wouldn't, since it breaks the logic that you're fitting "the same linear model".

> ps: yes, i didn't say anything about language here. but it's a language
> study

Doesn't bother me! These issues are endemic in language studies, and most stats books are completely unhelpful. ("Don't put correlated predictors in!" is fine advice if your goal is just prediction, but when the whole point of your study is to compare the two predictors, well...)
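
The two claims above (that both ways of computing r.X3 give the same values, and that the residualized and raw models produce the same fit) can be checked with a short simulation. This is only a sketch on made-up data: the sample size, the coefficients used to generate X1, X2, X3 and outcome, and the object names m.raw and m.resid are assumptions for illustration and have nothing to do with Kyle's actual study.

```r
set.seed(1)

# Simulated, positively correlated predictors (hypothetical data)
n  <- 200
X1 <- rnorm(n)
X2 <- 0.8 * X1 + rnorm(n, sd = 0.5)
X3 <- 0.5 * X1 + 0.5 * X2 + rnorm(n, sd = 0.5)
outcome <- 1 + X1 + X2 + X3 + rnorm(n)

# Sequential residualization as proposed in the thread
r.X2 <- residuals(lm(X2 ~ X1))
r.X3 <- residuals(lm(X3 ~ X1 + r.X2))

# The "naive" alternative for r.X3
r.X3.naive <- residuals(lm(X3 ~ X1 + X2))

# Raw and residualized models of the outcome
m.raw   <- lm(outcome ~ X1 + X2 + X3)
m.resid <- lm(outcome ~ X1 + r.X2 + r.X3)

# Compare up to numerical tolerance
same.resid <- isTRUE(all.equal(unname(r.X3), unname(r.X3.naive)))
same.fit   <- isTRUE(all.equal(unname(fitted(m.raw)), unname(fitted(m.resid))))

print(same.resid)
print(same.fit)

# The coefficients differ between the two parameterizations,
# which is why they have to be interpreted differently
print(coef(m.raw))
print(coef(m.resid))
```

Comparing the two coefficient vectors side by side shows what changes between the parameterizations even though the fitted values agree.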
-- Nathaniel

From tiflo at csli.stanford.edu Sat Apr 11 14:47:18 2009
From: tiflo at csli.stanford.edu (T. Florian Jaeger)
Date: Sat, 11 Apr 2009 17:47:18 -0400
Subject: [R-lang] High collinearity
In-Reply-To: <30631a790904110555v911ffafh67d6b63dff2e974e@mail.gmail.com>
References: <30631a790904110555v911ffafh67d6b63dff2e974e@mail.gmail.com>
Message-ID: <38dc9be90904111447m256086b4g7c24089756d40674@mail.gmail.com>

On Sat, Apr 11, 2009 at 8:55 AM, Marco wrote:

> Dear R-langs,
>
> I have a data set that contains highly correlated variables (> .90), all of
> which are variables that occur on the same time scale. I crucially want to
> determine whether one of these variables (End) has explanatory power on top
> of all the other ones. In this case, is it legitimate to take the residuals
> of End (fitting an lm model, in which we explain End with all other,
> correlated variables), and then running an lmer model that only contains
> resid_end? When I look at the results I obtain, it seems like the other
> correlated variables result in corrupted residuals for End. Are there any
> other methods to deal with (and distinguish between) highly correlated
> variables in R? Or could you tell me whether it is valid to use these
> residuals (and the F values obtained for these residuals), even though the
> beta coefficients are uninterpretable?

I think in your situation, you should first do some model comparison, preferably bootstrapping over it (i.e. test e.g. 10,000 times which of the two predictors would be removed from the model if you sampled from your data randomly with replacement). that's the best thing to do if you have such high correlations. residuals can be used, but you would have to residualize both ways and test the two resulting models.

florian

>
>
> Thanks in advance!
>
> Marco
>
> _______________________________________________
> R-lang mailing list
> R-lang at ling.ucsd.edu
> http://pidgin.ucsd.edu/mailman/listinfo/r-lang
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From hammond at u.arizona.edu Sun Apr 12 15:31:29 2009
From: hammond at u.arizona.edu (Michael T Hammond)
Date: Sun, 12 Apr 2009 15:31:29 -0700 (MST)
Subject: [R-lang] minF'
In-Reply-To:
References:
Message-ID:

Hi

Has anybody implemented minF' in R? I'm just looking for a way to automate that with Sweave/LaTeX.

(And I'm sure somebody will say something here about lmer, which I agree is interesting, but I'm still looking for minF'.)

thank you!

Mike Hammond

From stgries at gmail.com Sun Apr 12 19:04:41 2009
From: stgries at gmail.com (Stefan Th. Gries)
Date: Sun, 12 Apr 2009 23:04:41 -0300
Subject: [R-lang] minF'
In-Reply-To:
References:
Message-ID:

> Has anybody implemented minF' in R?
Maybe this is useful:

HTH,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------

From hammond at u.arizona.edu Mon Apr 13 08:56:41 2009
From: hammond at u.arizona.edu (Michael T Hammond)
Date: Mon, 13 Apr 2009 08:56:41 -0700 (MST)
Subject: [R-lang] minF'
In-Reply-To:
References:
Message-ID:

Stefan

Ah, excellent. Thanks so much.

mike h.

On Sun, 12 Apr 2009, Stefan Th. Gries wrote:

> > Has anybody implemented minF' in R?
> Maybe this is useful:
>
>
> HTH,
> STG
> --
> Stefan Th.
Gries > ----------------------------------------------- > University of California, Santa Barbara > http://www.linguistics.ucsb.edu/faculty/stgries > ----------------------------------------------- > > From H.Quene at uu.nl Mon Apr 13 12:32:51 2009 From: H.Quene at uu.nl (=?UTF-8?B?SHVnbyBRdWVuwo7DqQ==?=) Date: Mon, 13 Apr 2009 21:32:51 +0200 Subject: [R-lang] R-lang Digest, Vol 22, Issue 7 -- minFprime In-Reply-To: References: Message-ID: <49E39363.9020003@uu.nl> Dear Mark, On 13-04-2009 21:00, r-lang-request at ling.ucsd.edu wrote: > Has anybody implemented minF' in R? I'm just looking for a way to automate > that with Sweave/LaTeX. > > source( file=url("http://www.let.uu.nl/~Hugo.Quene/personal/tools/minFprime.ssc") ) This function also returns a p value, yes I'm aware of the discussion. Just my 2ct. Kind regards, Hugo Quen? -- Dr Hugo Quen? | Utrecht inst of Linguistics OTS | Utrecht University | Trans 10 | 3512 JK Utrecht | The Netherlands | T +31 30 253 6070 | F +31 30 253 6000 | H.Quene at uu.nl | www.hugoquene.nl | www.hum.uu.nl From Claire.Delleluche at univ-lyon2.fr Mon Apr 20 01:41:58 2009 From: Claire.Delleluche at univ-lyon2.fr (Claire Delle Luche) Date: Mon, 20 Apr 2009 10:41:58 +0200 (CEST) Subject: [R-lang] Collinearity and condition number Message-ID: <2287884.11766.1240216919830.JavaMail.root@co4> Dear R-users, I am running a mixed effect model on a written corpus. When I check for collinearity, I get a value of 35 for condition number when my predictors are entered as names then transformed as numeric (the values are 1 and 2 for two level predictors after the transormation). However, when I enter the predictors as factors and assign levels of 0 and 1 instead of names (and convert them as numeric), I get a condition number of 12. For the same data, depending on how I code the predictors, I either have moderate or important collinearity. What shall I do? Which coding is more acceptable? Thanks very much in advance. 
Yours,

Claire Delle Luche
Laboratoire Dynamique du Langage
14, avenue Berthelot
69 007 Lyon
France

From Matthew.Roberts at ed.ac.uk Mon Apr 20 03:16:58 2009
From: Matthew.Roberts at ed.ac.uk (Matthew Roberts)
Date: Mon, 20 Apr 2009 11:16:58 +0100
Subject: [R-lang] Collinearity and condition number
In-Reply-To: <2287884.11766.1240216919830.JavaMail.root@co4>
References: <2287884.11766.1240216919830.JavaMail.root@co4>
Message-ID: <20090420101658.GB3080@Schultze>

Are you sure that the numeric values are the same in each case? When R converts a factor() to numeric(), it takes the internal integer code of each level rather than the level's label. To avoid this do:

numeric.condition.number <- as.numeric(as.character(factor.condition.number))

Hope this helps,

Matthew

--
Matthew A. J. Roberts
Department of Psychology,
University of Edinburgh,
7 George Square
EH8 9JZ
+44 (0)131 6511302
--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
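
[Editorial sketch, not part of the original thread.] The factor-conversion point above, and the way a condition number can shift with the numeric coding, can be illustrated on toy data. Everything below is an assumption made for illustration: the factor `f`, the balanced 2 x 2 design, and the use of base R's kappa() as a stand-in for whatever routine produced the values 35 and 12 in the question.

```r
# Hypothetical two-level predictor whose level labels are "0" and "1".
f <- factor(c("0", "1", "1", "0", "1"))

codes <- as.numeric(f)                # internal level codes: 1 2 2 1 2
vals  <- as.numeric(as.character(f))  # the labels as numbers: 0 1 1 0 1

# Balanced 2 x 2 toy design with an intercept column, under three codings.
x1 <- rep(c(0, 1), each = 2)
x2 <- rep(c(0, 1), times = 2)
X_01  <- cbind(1, x1, x2)              # 0/1 coding
X_12  <- cbind(1, x1 + 1, x2 + 1)      # 1/2 coding
X_ctr <- cbind(1, x1 - 0.5, x2 - 0.5)  # centered -0.5/+0.5 coding

# Ratio of the largest to the smallest singular value of each design matrix.
k_01  <- kappa(X_01,  exact = TRUE)
k_12  <- kappa(X_12,  exact = TRUE)
k_ctr <- kappa(X_ctr, exact = TRUE)
print(c(k_01 = k_01, k_12 = k_12, k_ctr = k_ctr))
```

In this toy design the 1/2 coding yields the largest condition number and the centered coding the smallest, although the predictors carry the same information under all three codings. Whether this accounts for the difference reported in the question cannot be told from the thread alone.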
From tiflo at csli.stanford.edu Mon Apr 20 07:49:27 2009
From: tiflo at csli.stanford.edu (T. Florian Jaeger)
Date: Mon, 20 Apr 2009 10:49:27 -0400
Subject: [R-lang] Collinearity and condition number
In-Reply-To: <20090420101658.GB3080@Schultze>
References: <2287884.11766.1240216919830.JavaMail.root@co4> <20090420101658.GB3080@Schultze>
Message-ID: <38dc9be90904200749v60646059id45acc255af6ae57@mail.gmail.com>

Hi Claire,

More generally, too, it is not surprising to find that two codings differ in collinearity (e.g. treatment coding vs. contrast/sum coding). Which coding you should use depends on (a) what hypothesis you want to test and (b) whether you need to reduce collinearity (because you need reliable estimates of effect direction and/or effect shape, since otherwise you could simply use model comparison).

florian

From shane.lindsay at gmail.com Thu Apr 23 09:13:57 2009
From: shane.lindsay at gmail.com (Shane Lindsay)
Date: Thu, 23 Apr 2009 17:13:57 +0100
Subject: [R-lang] Is aovlmer.fnc working?
Message-ID: <4a6cd13e0904230913r6ed85b9dn5879e3eaa4e9741f@mail.gmail.com>

Hi all,

I was getting errors trying to get aovlmer.fnc() to work, so I updated R and packages to:

R version 2.9.0 (2009-04-17)
languageR_0.953
lme4_0.999375-28

And now it still won't work, but I get a different error message:

Error in anova(object) : Calculated PWRSS for a LMM is negative

This is for the example in the help file, but I tried with other models and other parameters, with the same problem. I did a little searching, and found this:

http://www.nabble.com/Problem-with-aovlmer.fnc-in-languageR-tt20706128.html#a21365322

Is this a problem that everyone still has with this function?

Shane

From tiflo at csli.stanford.edu Wed Apr 29 06:30:13 2009
From: tiflo at csli.stanford.edu (T. Florian Jaeger)
Date: Wed, 29 Apr 2009 09:30:13 -0400
Subject: [R-lang] High collinearity
In-Reply-To: <30631a790904290605r28187abq77cf5c8d62b67940@mail.gmail.com>
References: <30631a790904110555v911ffafh67d6b63dff2e974e@mail.gmail.com> <38dc9be90904111447m256086b4g7c24089756d40674@mail.gmail.com> <30631a790904290605r28187abq77cf5c8d62b67940@mail.gmail.com>
Message-ID: <38dc9be90904290630u7d681a48l620d5f73442a7ade@mail.gmail.com>

Hi Marco,

Yes, you can use residuals as predictors even if the residuals are derived from models with multiple collinear predictors. The fitted values are not affected by collinearity (and hence neither are the residuals). Only the SE(betas) are biased, and the betas themselves become hard to interpret.

With regard to your other question: if you residualize a predictor xi in several different ways by regressing it against different combinations of other predictors x1 ... xk, leading to different residualized versions of xi, say r_xi1 to r_xik, and only one (or some) of these residualized predictors results in significance, then you have to be careful in the interpretation of the effect. You may find Victor Kuperman and my slides at http://hlplab.wordpress.com/2009-pre-cuny-workshop-on-ordinary-and-multilevel-models-womm/ useful (see residualization), where we talk about the interpretation of a residualized variable.

*Just to be clear, all predictors another predictor is residualized against should be in the final model.* Although model comparison does not always give quite the same result, significance of a residualized predictor r_xi in the SE(beta)-based test (in the absence of remaining collinearity) is essentially saying that the *un*residualized (i.e. original) predictor xi improves the model significantly *beyond the predictors that xi was residualized against.* So significance tests over different residualized r_xi (xi residualized against different sets of other predictors x1 ... xk) are actually testing different hypotheses?
Not sure this is clear from what I am saying. Let me know,

Florian

On Wed, Apr 29, 2009 at 9:05 AM, Marco wrote:

> Dear Florian,
>
> Thank you for your comments. I have more than two correlated variables,
> though. Is it possible to use the residuals of models that contain multiple
> correlated variables? For as far as I know, the residuals are not affected
> by the collinearity; only the beta estimates for the individual variables in
> the model, right? Is the equation below statistically alright?
>
> residuals_end <- lm( End ~ correlatedvar1 + correlatedvar2 +
> correlatedvar3)
>
> I have run similar models for the other variables (i.e. correlatedvar1,
> correlatedvar2, etc.). Subsequently, I fitted similar models for the other
> variables. If only one of these residualised variables shows up as
> significant, does that prove its additional value? Or should I only
> residualise for variables one by one?
>
> Thanks in advance,
>
> Marco

From dutch.linguistics at gmail.com Wed Apr 29 07:15:20 2009
From: dutch.linguistics at gmail.com (Marco)
Date: Wed, 29 Apr 2009 16:15:20 +0200
Subject: [R-lang] High collinearity
In-Reply-To: <38dc9be90904290630u7d681a48l620d5f73442a7ade@mail.gmail.com>
References: <30631a790904110555v911ffafh67d6b63dff2e974e@mail.gmail.com> <38dc9be90904111447m256086b4g7c24089756d40674@mail.gmail.com> <30631a790904290605r28187abq77cf5c8d62b67940@mail.gmail.com> <38dc9be90904290630u7d681a48l620d5f73442a7ade@mail.gmail.com>
Message-ID: <30631a790904290715m180e1449w8946d8fbc16df06f@mail.gmail.com>

Dear Florian,

Thank you for your response. I think my previous message was not entirely clear. Please let me try once more:

Let's say that we have predictor 1, 2, 3, and 4, all of which are highly correlated. I want to investigate which of these correlated variables have explanatory power on top of all the other variables when explaining the dependent variable Dep.
In order to (try to) avoid multicollinearity issues, I fitted four models:

resid_1 <- lm(1~2+3+4)
resid_2 <- lm(2~1+3+4)
resid_3 <- lm(3~2+1+4)
resid_4 <- lm(4~2+3+1)

Subsequently, I fitted four linear models, one for each resid-variable I created, thus:

1.lm <- lm(dep~resid_1)
2.lm <- lm(dep~resid_2)
3.lm <- lm(dep~resid_3)
4.lm <- lm(dep~resid_4)

I then corrected the p-values for the multiple comparisons. Is this a reasonable way to test the unique contributions of these four variables?

Thanks in advance,

Marco
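
[Editorial sketch, not part of the original thread.] The residualization procedure discussed in this thread can be tried out on simulated data. The sketch below assumes four hypothetical predictors named x1 to x4 (standing in for "1" to "4" above), built to correlate above .90, and a made-up outcome `dep`; none of these come from the posters' data. Two details differ from the schematic code in the question: residuals() is used to extract the residual vector, because an lm object itself cannot serve as a predictor, and the final model keeps the predictors that x1 was residualized against, as recommended in the reply above.

```r
set.seed(42)
n <- 200
shared <- rnorm(n)  # common component that makes the predictors collinear
d <- data.frame(
  x1 = shared + rnorm(n, sd = 0.3),
  x2 = shared + rnorm(n, sd = 0.3),
  x3 = shared + rnorm(n, sd = 0.3),
  x4 = shared + rnorm(n, sd = 0.3)
)
d$dep <- 0.5 * d$x1 + 0.5 * d$x2 + rnorm(n)  # made-up outcome

# Residualize x1 against the other predictors and keep the residual vector.
d$r_x1 <- residuals(lm(x1 ~ x2 + x3 + x4, data = d))

m_full    <- lm(dep ~ x1 + x2 + x3 + x4, data = d)    # original predictor
m_resid   <- lm(dep ~ r_x1 + x2 + x3 + x4, data = d)  # residualized, others kept
m_alone   <- lm(dep ~ r_x1, data = d)                 # residualized, others dropped
m_reduced <- lm(dep ~ x2 + x3 + x4, data = d)         # x1 left out entirely

# Estimate, standard error, t and p for the predictor of interest.
print(coef(summary(m_full))["x1", ])
print(coef(summary(m_resid))["r_x1", ])
print(coef(summary(m_alone))["r_x1", ])

# Model comparison: does x1 improve on the other three predictors?
cmp <- anova(m_reduced, m_full)
print(cmp)
```

In ordinary least squares the estimate for r_x1 equals the estimate for x1 in the full model in all three fits. The standard error and t value also match between m_full and m_resid, and the F test from the model comparison gives the same p value as that t test. Only m_alone, which drops the other predictors, reports a different standard error. The sketch uses lm() throughout; it makes no claim about how closely these identities carry over to lmer models.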