From mlopez at iattc.org Tue Sep 15 11:45:30 2009
From: mlopez at iattc.org (Milton Lopez)
Date: Tue, 15 Sep 2009 11:45:30 -0700
Subject: [R-lang] Moving from Linux to Mac OS
Message-ID:

One of our researchers is considering replacing her Red Hat Linux workstation with a Mac Pro. One of her concerns is being able to build R libraries on the Mac. As an R-illiterate sysadmin I suspect this would be no more difficult on the Mac, but we would like to make sure.

Also, some years ago I was told that R does not take advantage of multiple processors. Has there been any improvement in that area? This would have an impact on configuring the new system with more processors vs. fewer but faster ones.

Thanks in advance for your comments.

Milton F. Lopez
Inter-American Tropical Tuna Commission (IATTC)
8604 La Jolla Shores Drive
La Jolla, CA 92037 - USA
mlopez at iattc.org
www.iattc.org

From ken.williams at thomsonreuters.com Tue Sep 15 11:52:22 2009
From: ken.williams at thomsonreuters.com (Ken Williams)
Date: Tue, 15 Sep 2009 13:52:22 -0500
Subject: [R-lang] Moving from Linux to Mac OS
In-Reply-To:
Message-ID:

I use a Mac Pro as my primary workstation, and installing R packages is a snap. Just "R CMD INSTALL packagefoo.tar.gz" as usual, or use the stuff inside the R gui. Tell her that the R gui (R.app) is really excellent, btw.

Out of the box R doesn't do special stuff to use multiple processors, but there are various existing ways to get it to, and Grand Central Dispatch gives me optimism that even better multiproc support is around the corner.

-Ken

On 9/15/09 1:45 PM, "Milton Lopez" wrote:

> One of our researchers is considering replacing her Red Hat Linux
> workstation with a Mac Pro. One of her concerns is being able to build R
> libraries on the Mac. As an R-illiterate sysadmin I suspect this would
> no more difficult on the Mac, but we would like to make sure.
>
> Also, some years ago I was told that R does not take advantage of
> multiple processors.
> Has there been any improvement in that area? This
> would have an impact on configuring the new system with more processors
> vs. fewer but faster ones.
>
> Thanks in advance for your comments.
>
>
> Milton F. Lopez
> Inter-American Tropical Tuna Commission (IATTC)
> 8604 La Jolla Shores Drive
> La Jolla, CA 92037 - USA
> mlopez at iattc.org
> www.iattc.org
>
>
>
> _______________________________________________
> R-lang mailing list
> R-lang at ling.ucsd.edu
> http://pidgin.ucsd.edu/mailman/listinfo/r-lang

--
Ken Williams
Sr. Research Scientist
Thomson Reuters
Phone: 651-848-7712
ken.williams at thomsonreuters.com

From rachelbaker2010 at u.northwestern.edu Tue Sep 15 21:16:54 2009
From: rachelbaker2010 at u.northwestern.edu (Rachel Baker)
Date: Tue, 15 Sep 2009 23:16:54 -0500
Subject: [R-lang] Contrast Coding in R Regressions
Message-ID:

Hi,

I've recently started using R to do regressions, using the 'lmer' function. I am currently re-running some analyses that originally had treatment coding, so that they now have contrast coding. My question is about how to interpret contrast coded regression outputs.

One of my independent variables (nativeLanguage) has 3 levels: English, Chinese, and Korean. As this experiment was conducted in English, participants in the English group were native speakers, and participants in the other two groups were non-native speakers. In my original treatment-coded analysis, English was the reference level. My output for e.g. 'langCompare.lmer = lmer(duration~nativeLanguage+(1|Subject), data=myData)' had lines like:

                       Estimate  Std. Error  t value
nativeLanguageChinese  0.025920    0.002384   10.872
nativeLanguageKorean  -0.004416    0.002091   -2.112

As I understood it, such lines gave information about the comparison between Chinese and English, and between Korean and English, respectively.

I contrast coded this variable with the code: 'contrasts(myData$nativeLanguage) = c(-1, .5, .5)' (after ordering the levels: English, Chinese, Korean).
This was in order to compare the native (English) group to the non-native (Chinese and Korean) groups. After this contrast coding, my output had lines like:

                 Estimate  Std. Error  t value
nativeLanguage1   0.10002    0.010113   11.242
nativeLanguage2  -0.00046    0.639887    1.388

I was wondering how to interpret this output. My guess is that nativeLanguage1 is the comparison between the native and non-native groups, and nativeLanguage2 is the comparison between Chinese and Korean, but I haven't been able to find any resources to confirm this.

I would greatly appreciate any advice on how to interpret regressions after contrast coding, or pointers to appropriate resources on this topic!

Thanks very much,
Rachel

From mlopez at iattc.org Wed Sep 16 10:18:29 2009
From: mlopez at iattc.org (Milton Lopez)
Date: Wed, 16 Sep 2009 10:18:29 -0700
Subject: [R-lang] Moving from Linux to Mac OS
In-Reply-To:
References:
Message-ID:

Ken.

Thanks for the feedback. I am also curious to see how Apple's GCD will find its way into R. FYI I got another reply from campus pointing to http://www.revolution-computing.com/

I talked with a sales rep who told me they will be looking closely at GCD.

Cheers

M.

-----Original Message-----
From: Ken Williams [mailto:ken.williams at thomsonreuters.com]
Sent: Tuesday, September 15, 2009 11:52 AM
To: Milton Lopez; r-lang at ling.ucsd.edu
Cc: Cleridy Lennert
Subject: Re: [R-lang] Moving from Linux to Mac OS

I use a Mac Pro as my primary workstation, and installing R packages is a snap. Just "R CMD INSTALL packagefoo.tar.gz" as usual, or use the stuff inside the R gui. Tell her that the R gui (R.app) is really excellent, btw.

Out of the box R doesn't do special stuff to use multiple processors, but there are various existing ways to get it to, and Grand Central Dispatch gives me optimism that even better multiproc support is around the corner.

-Ken

On 9/15/09 1:45 PM, "Milton Lopez" wrote:

> One of our researchers is considering replacing her Red Hat Linux
> workstation with a Mac Pro. One of her concerns is being able to build R
> libraries on the Mac. As an R-illiterate sysadmin I suspect this would
> no more difficult on the Mac, but we would like to make sure.
>
> Also, some years ago I was told that R does not take advantage of
> multiple processors. Has there been any improvement in that area? This
> would have an impact on configuring the new system with more processors
> vs. fewer but faster ones.
>
> Thanks in advance for your comments.
>
>
> Milton F. Lopez
> Inter-American Tropical Tuna Commission (IATTC)
> 8604 La Jolla Shores Drive
> La Jolla, CA 92037 - USA
> mlopez at iattc.org
> www.iattc.org
>
>
>
> _______________________________________________
> R-lang mailing list
> R-lang at ling.ucsd.edu
> http://pidgin.ucsd.edu/mailman/listinfo/r-lang

--
Ken Williams
Sr. Research Scientist
Thomson Reuters
Phone: 651-848-7712
ken.williams at thomsonreuters.com

From njsmith at cogsci.ucsd.edu Wed Sep 16 15:08:45 2009
From: njsmith at cogsci.ucsd.edu (Nathaniel Smith)
Date: Wed, 16 Sep 2009 15:08:45 -0700
Subject: [R-lang] Moving from Linux to Mac OS
In-Reply-To:
References:
Message-ID: <961fa2b40909161508k7df643a4j959dde7e104384fb@mail.gmail.com>

On Tue, Sep 15, 2009 at 11:45 AM, Milton Lopez wrote:
> Also, some years ago I was told that R does not take advantage of
> multiple processors. Has there been any improvement in that area? This
> would have an impact on configuring the new system with more processors
> vs. fewer but faster ones.

That's correct, R cannot automatically take advantage of multiple processors.
There are more and more packages that will help you more-or-less easily write code that can run in parallel over multiple processors:
  http://cran.r-project.org/web/views/HighPerformanceComputing.html
But you still have to actually learn how to use the package, write your code in a parallel way, etc., if you want to get the speedup.

-- Nathaniel

From rlevy at ling.ucsd.edu Sat Sep 19 20:19:26 2009
From: rlevy at ling.ucsd.edu (Roger Levy)
Date: Sat, 19 Sep 2009 20:19:26 -0700
Subject: [R-lang] Contrast Coding in R Regressions
In-Reply-To:
References:
Message-ID: <2BC4A78A-C269-47FD-B531-FB3F826ED084@ling.ucsd.edu>

On Sep 15, 2009, at 9:16 PM, Rachel Baker wrote:

> Hi,
>
> I've recently started using R to do regressions, using the 'lmer'
> function. I am currently re-running some analyses that originally
> had treatment coding, so that they now have contrast coding. My
> question is about how to interpret contrast coded regression outputs.
>
> One of my independent variables (nativeLanguage) has 3 levels:
> English, Chinese, and Korean. As this experiment was conducted in
> English, participants in the English group were native speakers, and
> participants in the other two groups were non-native speakers. In
> my original treatment-coded analysis, English was the reference
> level. My output for e.g. 'langCompare.lmer =
> lmer(duration~nativeLanguage+(1|Subject), data=myData)' had lines
> like:
>
>                        Estimate  Std. Error  t value
> nativeLanguageChinese  0.025920    0.002384   10.872
> nativeLanguageKorean  -0.004416    0.002091   -2.112
>
> As I understood it, such lines gave information about the comparison
> between Chinese and English, and between Korean and English,
> respectively.
>
> I contrast coded this variable with the code:
> 'contrasts(myData$nativeLanguage) = c(-1, .5, .5)' (after ordering the
> levels: English, Chinese, Korean). This was in order to compare the native
> (English) group to the non-native (Chinese and Korean) groups.
> After this contrast coding, my output had lines like:
>
>                  Estimate  Std. Error  t value
> nativeLanguage1   0.10002    0.010113   11.242
> nativeLanguage2  -0.00046    0.639887    1.388
>
> I was wondering how to interpret this output. My guess is that
> nativeLanguage1 is the comparison between the native and non-native
> groups, and nativeLanguage2 is the comparison between Chinese and
> Korean, but I haven't been able to find any resources to confirm this.

Hi Rachel,

Your guess is correct, but the situation may be a little more complicated than you think. First, you need to realize that you didn't specify a complete contrast matrix. Here's a little code snippet to illustrate:

> m <- 20
> n <- 3
> lang <- factor(rep(c("English", "Chinese", "Korean"),m*n), levels=c("English", "Chinese", "Korean"))
> old.contrasts <- contrasts(lang)
> contrasts(lang) <- c(1,-.5,-.5)
> new.contrasts <- contrasts(lang)

Now, let's take a look at the old and new contrast matrices:

> old.contrasts
        Chinese Korean
English       0      0
Chinese       1      0
Korean        0      1
> new.contrasts
        [,1]          [,2]
English  1.0 -5.551115e-17
Chinese -0.5 -7.071068e-01
Korean  -0.5  7.071068e-01

The value of old.contrasts derives from the fact that by default, R uses contr.treatment for unordered factors, with the first level of the factor being the baseline (which for you is English, so that the contrast matrix is all zeroes in the English row):

> options()$contrasts
        unordered           ordered
"contr.treatment"      "contr.poly"

The value of contrasts(lang) reflects the fact that -- quoting from ?contrasts -- "If too few [entries for the contrast matrix] are supplied, a suitable contrast matrix is created by extending value after ensuring its columns are contrasts (orthogonal to the constant term) and not collinear."
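
To make the automatic extension concrete, here is a minimal sketch (base R only) that sets the contrasts once with a single column, letting R fill in the second, and once with both columns supplied explicitly. The column names nativeVsNonnative and chineseVsKorean and the short toy factor are illustrative assumptions, not part of the analysis above.

```r
# Toy factor with the same level ordering as in the thread
lang <- factor(rep(c("English", "Chinese", "Korean"), 4),
               levels = c("English", "Chinese", "Korean"))

# One column supplied: R extends it to a full 3 x 2 contrast matrix
contrasts(lang) <- c(1, -0.5, -0.5)
auto <- contrasts(lang)
print(auto)

# Both columns supplied explicitly (column names are hypothetical labels)
full <- cbind(nativeVsNonnative = c(1, -0.5, -0.5),
              chineseVsKorean   = c(0, -0.5,  0.5))
contrasts(lang) <- full
print(contrasts(lang))
```

With the explicit second column c(0, -.5, .5), the second coefficient would be the Korean mean minus the Chinese mean on its original scale, rather than that difference divided by the square root of two.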

Now let's generate some artificial data and look at how to interpret models fit using the old and new contrast matrices:

> set.seed(3)
> beta <- c(0,0.26,-0.004)
> speaker <- rep(1:m,3*n)
> b <- rnorm(m,0,0.1)
> y <- beta[lang] + b[speaker] + rnorm(3*m*n)
> contrasts(lang) <- old.contrasts
> print(m.old <- lmer(y ~ lang + (1 | speaker),REML=F))
[...]
Fixed effects:
            Estimate Std. Error t value
(Intercept) -0.01403    0.13236 -0.1060
langChinese  0.49860    0.18719  2.6636
langKorean  -0.11447    0.18719 -0.6115
[...]
> contrasts(lang) <- new.contrasts
> print(m.new <- lmer(y ~ lang + (1 | speaker),REML=F))
[...]
Fixed effects:
            Estimate Std. Error t value
(Intercept)  0.11402    0.07642   1.492
lang1       -0.12804    0.10807  -1.185
lang2       -0.43350    0.13236  -3.275
[...]

Ignoring speaker-specific effects, the predicted mean for a given language is the intercept plus the dot product of the language's contrast-matrix representation with the coefficients for the language factor. Since the two models are equivalent, their predicted means should be the same for each language. And they are:

> ## compare old contrasts and new contrasts
> ## English: old model
> fixef(m.old)[1] + sum(old.contrasts["English",] * fixef(m.old)[2:3])
(Intercept)
-0.01402749
> ## English: new model
> fixef(m.new)[1] + sum(new.contrasts["English",] * fixef(m.new)[2:3])
(Intercept)
-0.01402749

The same will come out to be the case for the other two languages.

So -- to get back to your question: what do the nativeLanguage1 and nativeLanguage2 coefficients mean in your new model? First, your contrast matrix has columns summing to 0, so the intercept can loosely be thought of as the predicted grand mean. The coefficient for nativeLanguage1 can be read in two equivalent ways: as (a) the difference between the English mean and the intercept, or as (b) twice the difference between the intercept and the average of the Chinese and Korean means. (That is with the signs used in my example; your c(-1, .5, .5) coding flips the sign of this coefficient.) The coefficient for nativeLanguage2 is the Korean mean minus the Chinese mean, divided by the square root of two.
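
As a rough check on this reading, the coefficients can be recovered directly from the three language means. The sketch below uses base R only; the means are the rounded fixed-effect estimates from m.old above, and the names mu, C and coefs are illustrative assumptions. The results should agree with the m.new fixed effects up to rounding.

```r
# Language means implied by the treatment-coded fit (rounded values from m.old)
mu <- c(English = -0.01403,
        Chinese = -0.01403 + 0.49860,
        Korean  = -0.01403 - 0.11447)

# Contrast matrix as R extended c(1, -.5, -.5) for a three-level factor
C <- cbind(c(1, -0.5, -0.5), c(0, -1, 1) / sqrt(2))
rownames(C) <- names(mu)

# Solve mu = intercept + C %*% coefficients for (intercept, lang1, lang2)
coefs <- solve(cbind(1, C), mu)
names(coefs) <- c("(Intercept)", "lang1", "lang2")
print(round(coefs, 4))

# The same quantities, computed from the verbal descriptions
grand  <- mean(mu)                                       # intercept
lang1a <- unname(mu["English"] - grand)                  # reading (a)
lang1b <- 2 * (grand - mean(mu[c("Chinese", "Korean")])) # reading (b)
lang2  <- unname((mu["Korean"] - mu["Chinese"]) / sqrt(2))
print(c(grand = grand, lang1a = lang1a, lang1b = lang1b, lang2 = lang2))
```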
So your guess was basically correct. But it is important to recognize that these two coefficients operate on different scales, as reflected by the fact that the two columns of new.contrasts are vectors of different lengths.

> I would greatly appreciate any advice on how to interpret
> regressions after contrast coding, or pointers to appropriate
> resources on this topic!

So -- I wish I knew a really good reference on contrast coding. There is some useful information in Chambers & Hastie 1991, Section 2.3.2, and in Venables & Ripley 2002, Section 6.2. I think that Healy 2000 ("Matrices for Statistics") is a useful book that has some pertinent information. But if anyone out there knows a great reference for contrast coding -- I'd love to hear it too!

Best

Roger

--
Roger Levy
Assistant Professor
Department of Linguistics
UC San Diego
Email: rlevy at ling.ucsd.edu
Phone: 858-534-7219
Fax: 858-534-4789
Web: http://ling.ucsd.edu/~rlevy

From dderrick at interchange.ubc.ca Wed Sep 23 23:42:25 2009
From: dderrick at interchange.ubc.ca (Donald Derrick)
Date: Wed, 23 Sep 2009 23:42:25 -0700
Subject: [R-lang] MCMCglmm with relatively small multinomial data set
Message-ID: <8122C193-959B-4E59-8120-5CBDA2DE31A6@interchange.ubc.ca>

Hello R-lang,

My name is Donald Derrick. Along with my supervisor Bryan Gick, I am researching categorical kinematic variation in flaps and taps in English, and have identified four forms:

1) Alveolar taps - where the tongue comes from below the alveolar ridge, hits it and returns to a lower position
2) Down flaps - where the tongue comes from above the alveolar ridge, hits it and continues downward
3) Up flaps - where the tongue comes from below the alveolar ridge, hits it and continues upward
4) Post-alveolar taps - where the tongue comes from above the alveolar ridge, goes straight forward, hits a spot at or above the alveolar ridge and comes back.

In words/phrases with only 1 flap, the independent variables are:

1) Subject variability (of course)
2) tongue speed (which I have measured)
3) tongue tip position before and after the flap

The tongue tip is always low with regular vowels, but with rhotic vowels it can be either high (tongue tip up bunched 'r' or retroflex 'r') or low (tongue tip down bunched 'r'). This leaves the factors of

3a) tongue tip position immediately before the flap
3b) tongue tip position immediately after the flap

If I use more traditional balanced statistics (like a Friedman test) to compare rhotic and non-rhotic vowels, I cannot add in the details of type of rhotic vowel (up or down) because different subjects use different tongue positions and I no longer have balanced data.

Also, the Friedman test is ordinal, so I have to argue for a ranking of flaps. [An easy ranking is tongue height favoring end position, where I get 1) (low-low) alveolar tap, 2) (high-low) down flap, 3) (low-high) up flap and 4) (high-high) post-alveolar tap. It's a bit bogus, but then so is every other ordinal ranking scheme I've ever seen!]

For both these reasons, I think I should be using a mixed effects linear model. Unfortunately, it seems I cannot use the glmer() command here because my dependent variable is multinomial.

SO, I am trying to use MCMCglmm, but I don't know if I am using it right, and also I cannot get convergence. I need help.

My first question is, should I even bother? I have small datasets by the standards of many of you, and since they took an entire year to build and I need to graduate someday, I will never get bigger ones. When I say small, I mean 18 subjects, and typically 12-24 phrases for any given category of analysis, as few as 216 phrases in each category, and as many as 432. On the other hand, these are typical of my datasets in that the fixed predictors are hugely predictive.

SO, my next question is, can I even use MCMCglmm and hope to ever get convergence?

Now, if all is not hopeless, I will go on:

To make it easy to help, I've provided code I am using based on the HLP/Jaeger blog, and ask detailed questions throughout:

///////////

prior = NULL
flap.glmm <- list()

k <- length(levels(flap$firstflaptext))
I <- diag(k-1)
J <- matrix(rep(1, (k-1)^2), c(k-1, k-1))
# k = the number of levels for type of flap. This is 4 - as described above. I coded them "U", "D", "H", and "L"

prior <- list(R = list(fix=1, V=0.5 * (I + J), n = 3),
              G = list(
                G1 = list(V = diag(3), n = 3),
                G2 = list(V = diag(4), n = 4)
              ))
# R = the standard setup for a "categorical" family test based on the HLP/Jaeger lab blog help
# fix=1 I believe is standard for a test of this kind
# n=3 I believe is the lowest we should use. Higher numbers seem to work better, but not good enough.
# I do not know why :(
# G1 is for the flap types
# G2 is for the vowel context. (more later)

# I run the same model 3 times for convergence testing - I may want to try different priors,
# but I barely understand how to build them and cannot even get convergence this way yet.
for(i in c(1:3))
{
  flap.glmm[[i]] <- MCMCglmm(firstflaptext ~ -1 + trait + context,
                             random = ~us(trait):subject + us(context):subject,
                             rcov = ~us(trait):units,
                             prior = prior,
                             family="categorical",
                             burnin = 5000,
                             nitt = 50000,
                             data=flap, verbose=TRUE)
}
# firstflaptext ~ -1 + trait + context
# I would love to be able to use something like
# precedingContext * followingContext
# - but every time I tried I got a "bad prior" error and absolutely no hint as to how to fix it.
# So I just combined the preceding and following context into a single variable - this gave me
# VV (for "autumn"), VR (for "otter"), RV (for "Berta"), and RR (for "murder")
# This solution does not seem ideal, so any help would be appreciated
# random = ~us(trait):subject + us(context):subject
# This seemed standard based on the HLP/Jaeger lab blog help
# nitt = minimum for raftery test below

flap.coda <- mcmc.list(
  chain1=mcmc(flap.glmm[[1]]$Sol),
  chain2=mcmc(flap.glmm[[2]]$Sol),
  chain3=mcmc(flap.glmm[[3]]$Sol))
# I built a mcmc list based on the fixed effects
# Q: How do I do this for the random effects, or does it matter?

flap.gelman <- gelman.diag(flap.coda, transform = TRUE)
flap.raftery <- raftery.diag(flap.coda, q=0.025, r=0.005, s=0.95, converge.eps=0.001)
# Two convergence diagnostics.
# Below are the results

> flap.gelman
Potential scale reduction factors:

                     Point est. 97.5% quantile
traitfirstflaptext.H       1.05           1.13
traitfirstflaptext.L       1.03           1.09
traitfirstflaptext.U       1.03           1.07
contextRV                  1.27           1.79
contextVR                  2.41           4.45
contextVV                  2.29           4.14

Multivariate psrf

3.38

# Note, these numbers do not get better with higher burnin or nitt :(

> flap.raftery
$chain1

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95

                     Burn-in  Total  Lower bound  Dependence
                     (M)      (N)    (Nmin)       factor (I)
traitfirstflaptext.H 3        4212   3746         1.12
traitfirstflaptext.L 3        4368   3746         1.17
traitfirstflaptext.U 4        4615   3746         1.23
contextRV            25       22940  3746         6.12
contextVR            4        4968   3746         1.33
contextVV            8        10200  3746         2.72

$chain2

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95

                     Burn-in  Total  Lower bound  Dependence
                     (M)      (N)    (Nmin)       factor (I)
traitfirstflaptext.H 3        4449   3746         1.19
traitfirstflaptext.L 2        3776   3746         1.01
traitfirstflaptext.U 2        3987   3746         1.06
contextRV            25       31880  3746         8.51
contextVR            15       13680  3746         3.65
contextVV            6        8844   3746         2.36

$chain3

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95

                     Burn-in  Total  Lower bound  Dependence
                     (M)      (N)    (Nmin)       factor (I)
traitfirstflaptext.H 3        4061   3746         1.08
traitfirstflaptext.L 3        4368   3746         1.17
traitfirstflaptext.U 3        4289   3746         1.14
contextRV            15       16749  3746         4.47
contextVR            18       18096  3746         4.83
contextVV            15       15918  3746         4.25

## In this run, the third chain suggested convergence. I'm not sure it is real though :(
# Especially since it has never happened before, and higher burnin or nitt tend to not help.

//////

Any and all help is appreciated - and apologies for such a hideously long post!
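
For readers unfamiliar with the "potential scale reduction factor" that gelman.diag reports above, the following is a minimal base-R sketch of the basic Gelman-Rubin statistic on simulated chains. It is only an illustration of the idea: the function name psrf and the simulated data are assumptions, and coda's gelman.diag applies additional corrections, so its numbers will not match this simplified version exactly.

```r
# Basic potential scale reduction factor for one parameter.
# chains: numeric matrix, one column per chain, one row per iteration.
psrf <- function(chains) {
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))   # mean within-chain variance
  B <- n * var(colMeans(chains))     # between-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}

set.seed(1)
# Three chains sampling the same distribution: factor should be close to 1
mixed <- matrix(rnorm(3000), ncol = 3)
# The same chains shifted to different levels: factor well above 1
unmixed <- sweep(mixed, 2, c(-2, 0, 2), "+")

print(psrf(mixed))
print(psrf(unmixed))
```

Values near 1 indicate that the chains agree with one another; values well above 1, like those for contextVR and contextVV in the output above, indicate that the chains are exploring different regions.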