Hi Randy et al,<br>As it turns out, we're mostly all behind on this... Here's an idea I had for a project:<br>I'm interested in eye movements during reading. There's a dataset called the Dundee corpus, which contains eye-tracking data from people reading large amounts of text. There's been some work on predicting reading times in this dataset using hierarchical models, e.g. <a href="http://dx.doi.org/10.1016/j.cognition.2008.07.008" target="_blank">doi:10.1016/j.cognition.2008.07.008</a>. One of the big questions is the relative roles of frequency versus contextual probability as predictors of reading time. (Of course both will on average lead to faster reading times, but what about a low-frequency word that's highly predictable in context, or a high-frequency word that's unexpected given the context?) The linked paper above begins to answer this question. Another angle on this question (with credit to Nathaniel Smith for suggesting it to me) is whether frequency and predictability play different roles across different ranges of fixation durations: he suggests that frequency plays a larger role in very short fixations, while contextual probability plays a larger role in longer fixations. Hierarchical models seem like a great tool for addressing this question. Anyone interested in working on it with me? (I'm a first-year student in the linguistics PhD program, by the way.)<br>
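To make the modeling idea concrete, here is a minimal sketch in Python on simulated data (not the Dundee corpus; all variable names and coefficient values are invented for illustration). It fits per-subject baseline intercepts alongside frequency and surprisal effects via ordinary least squares — a simple fixed-effects stand-in for a full hierarchical model:

```python
# Hypothetical sketch: simulate fixation durations driven by log word
# frequency and contextual surprisal, then recover the effect sizes with
# per-subject intercepts in the design matrix.
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_obs = 5, 400

subject = rng.integers(0, n_subjects, size=n_subjects * n_obs)
log_freq = rng.normal(0.0, 1.0, size=subject.size)   # standardized log frequency
surprisal = rng.normal(0.0, 1.0, size=subject.size)  # -log p(word | context)

# Invented generative model: higher frequency -> shorter fixations,
# higher surprisal -> longer fixations, plus a per-subject baseline (ms).
baselines = rng.normal(250.0, 20.0, size=n_subjects)
duration = (baselines[subject]
            - 15.0 * log_freq
            + 20.0 * surprisal
            + rng.normal(0.0, 10.0, size=subject.size))

# Design matrix: one intercept column per subject, then the two predictors.
X = np.column_stack([np.eye(n_subjects)[subject], log_freq, surprisal])
coef, *_ = np.linalg.lstsq(X, duration, rcond=None)

beta_freq, beta_surprisal = coef[-2], coef[-1]
print(beta_freq, beta_surprisal)  # roughly -15 and +20
```

A real analysis would use crossed random effects for subjects and items (e.g. a mixed-effects model) rather than plain dummy coding, and could interact the predictors with fixation-duration quantiles to test the short-vs-long-fixation hypothesis above.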
<br>~Emily<br><br><br><div class="gmail_quote">2010/2/22 Randy West <span dir="ltr"><<a href="mailto:rdwest@ucsd.edu" target="_blank">rdwest@ucsd.edu</a>></span><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hi All,<br>
<br>
I might be a little bit late to the party here, but I'm still looking for group member(s) for our final project. Briefly, I'm a first-year Master's student in computer science with a focus on artificial intelligence. I have a little bit of formal training in English syntax, but other than that I have very little linguistics background. I've included an outline of my idea for a final project below, but if everyone has already formed groups then please let me know so that I can help out with one of your projects.<br>
<br>
Here's my idea:<br>
<br>
I'd like to analyze the efficacy of various methods we've covered in class for use in search engine technology. The dataset could be any collection of documents, but I was thinking of building a web-crawling script to run over, say, Google News for a few days and build up a database that way. The idea would be to produce, for each model (or mixture of models), a ranking of the documents in the dataset given a vector of search terms, i.e. documents ordered by p(document | search terms). The simplest such model would score each document by the product of unigram likelihoods of the search terms in that document, while something more complex might use LDA to infer topic distributions over words and documents and leverage those distributions for search.<br>
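As a toy illustration of the simplest proposed model, here is a short Python sketch (document names and texts are invented) that ranks documents by the product of add-alpha-smoothed unigram likelihoods p(term | document), computed in log space:

```python
# Toy unigram retrieval model: rank documents by log p(query | document)
# under an add-alpha-smoothed unigram language model per document.
from collections import Counter
from math import log

def unigram_score(query_terms, doc_tokens, vocab_size, alpha=1.0):
    """Log p(query | doc): sum of smoothed per-term log likelihoods."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return sum(log((counts[t] + alpha) / (total + alpha * vocab_size))
               for t in query_terms)

docs = {
    "doc_a": "the cat sat on the mat".split(),
    "doc_b": "stock markets fell sharply on monday".split(),
}
vocab = {w for toks in docs.values() for w in toks}

def rank(query):
    scores = {name: unigram_score(query.split(), toks, len(vocab))
              for name, toks in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

print(rank("stock markets"))  # doc_b ranks first
```

Smoothing matters here: without it, a single unseen query term would zero out a document's score. An LDA-based variant would replace the per-document term counts with probabilities marginalized over the document's inferred topic mixture.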
<br>
Please let me know what you all think, and again, if everyone has already settled into groups then please let me know so that I can help out.<br>
<br>
Best,<br>
Randy<br>
<br>
_______________________________________________<br>
Lign274 mailing list<br>
<a href="mailto:Lign274@ling.ucsd.edu" target="_blank">Lign274@ling.ucsd.edu</a><br>
<a href="http://pidgin.ucsd.edu/mailman/listinfo/lign274" target="_blank">http://pidgin.ucsd.edu/mailman/listinfo/lign274</a><br>
</blockquote></div><br>