[Lign274] Final project topic and groups

Tue Feb 23 17:56:47 PST 2010

Hi Randy et al,
As it turns out, we're mostly all behind on this... Here's an idea I had for
a project:
I'm interested in eye movements during reading. There's a dataset called the
Dundee corpus with eye-tracking data for people reading large amounts of
text. There's been some work on predicting reading times from this dataset
using hierarchical models, e.g.
doi:10.1016/j.cognition.2008.07.008 <http://dx.doi.org/10.1016/j.cognition.2008.07.008>.
One of the big questions is the relative roles of frequency and contextual
probability as predictors of reading time. (Of course, higher values of
either will on average lead to faster reading times, but what about a
low-frequency word that's highly predictable in context, or a high-frequency
word that's unpredictable given the context?) The linked paper above begins
to answer this question. Another angle on it (with credit to Nathaniel Smith
for suggesting this to me) is whether frequency and predictability play
different roles for different ranges of fixation durations: he suggests that
frequency plays a larger role in very short fixations, while contextual
probability plays a larger role in longer fixations.
Hierarchical models seem like a great tool to address this question. Anyone
interested in working on it with me? (I'm a first year student in the
linguistics PhD program, by the way.)

~Emily

2010/2/22 Randy West <rdwest at ucsd.edu>

> Hi All,
>
> I might be a little bit late to the punch here, but I'm still looking for
> group member(s) for our final project. Briefly, I'm a first year Master's
> student in computer science with a focus on artificial intelligence. I have
> a little bit of formal training in English syntax, but other than that I
> have very little linguistics background. I've included an outline of my idea
> for a final project below, but if everyone has already formed groups then
> please let me know so that I can help out with one of your projects.
>
> Here's my idea:
>
> I'd like to do an analysis of the efficacy of various methods that we've
> covered in class for use in search engine technology. The dataset could be
> any collection of documents, but I was thinking of building a web-crawling
> script to run over, say, Google news for a few days and build up a database
> that way. The idea would be to produce for each model (or mixture of models)
> an ordering of documents in the dataset based on an ordered vector of search
> terms, i.e. a vector of documents ordered by p(document | search terms). The
> simplest such model would be the product of unigram likelihoods for each
> search term in the document, while something more complex might be using LDA
> to determine topic distributions over words and documents and leveraging
> those distributions for search.
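
For concreteness, the simplest model mentioned above (a product of unigram
likelihoods over the search terms) could be sketched roughly as follows. This
is only a minimal sketch under stated assumptions: the function name
rank_documents, the whitespace tokenization, and the add-alpha smoothing are
illustrative choices, not something specified in the proposal. Smoothing is
included because without it any document missing a single search term would
get a likelihood of zero. Note also that ranking by p(search terms | document)
matches ranking by p(document | search terms) only if every document is
assumed equally likely a priori.

```python
import math
from collections import Counter


def rank_documents(docs, query_terms, alpha=1.0):
    """Order document names by log p(query_terms | document).

    docs: dict mapping a document name to its raw text (hypothetical format).
    query_terms: list of search terms.
    alpha: add-alpha smoothing constant (an assumption of this sketch).
    """
    # Naive lowercase/whitespace tokenization, purely for illustration.
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    terms = [t.lower() for t in query_terms]

    # The vocabulary covers every word in the collection plus the query terms.
    vocab = {w for toks in tokenized.values() for w in toks} | set(terms)
    vocab_size = len(vocab)

    scores = {}
    for name, toks in tokenized.items():
        counts = Counter(toks)
        total = len(toks)
        # Sum of log-probabilities equals the log of the product of
        # smoothed unigram likelihoods; logs avoid numerical underflow.
        scores[name] = sum(
            math.log((counts[t] + alpha) / (total + alpha * vocab_size))
            for t in terms
        )

    # Highest log-likelihood first.
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    example_docs = {
        "a": "the election results are in",
        "b": "the team won the game",
        "c": "election election vote results",
    }
    print(rank_documents(example_docs, ["election", "results"]))
```

A more complex model (e.g. one based on LDA topic distributions) would replace
the per-document unigram estimate while keeping the same ranking interface.
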
>
> Please let me know what you all think, and again, if everyone has already
> settled into groups then please let me know so that I can help out.
>
> Best,
> Randy