[Ligncse256] Word Segmentation Mini-example

Matt Rodriguez mar008 at cs.ucsd.edu
Sun Feb 15 17:23:34 PST 2009



Hi folks,

I want to make sure I understand the word segmentation mini-example. I
have worked through the example exercise and explained my reasoning
below, and I have questions about the lexical entries terms.

From the paper:

The goal is to estimate P(h|d), which is the stationary distribution of
the Markov chain we get when the Gibbs sampling has converged. To do
this we compute the Bayes factor, which is a ratio of two hypotheses:
the first hypothesis is that there is no word boundary at the position,
and the second is that there is a boundary at the position.
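As a sanity check on that description, one Gibbs step at a single position can be sketched like this (a minimal sketch with made-up names; p_h1 and p_h2 stand for the two hypotheses' probabilities, computed from the rest of the corpus):

```python
import random

def resample_boundary(p_h1, p_h2, rng=random.random):
    """Gibbs step at one position: h1 = no boundary here, h2 = boundary
    here.  Returns (boundary_placed, bayes_factor).  Only the relative
    probabilities of the two hypotheses matter, since everything else
    in the corpus cancels in the ratio."""
    bayes_factor = p_h1 / p_h2
    prob_boundary = p_h2 / (p_h1 + p_h2)
    return rng() < prob_boundary, bayes_factor
```

So if h1 is twice as likely as h2, a boundary gets drawn with probability 1/3 on that sweep.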


Exercise

If there are two different lexical entries for ba, then what is P(ba.di.ba)?

P(ba.di.ba) is determined by 3 terms:

corpus size: (1 - p_c#)^3 p_c#  

word type distribution: (alpha/alpha)(alpha/(1 + alpha))(alpha/(2 + alpha))

lexical entries: (1 - p_w#)^6 p_w#^3 1/V^6

Is this correct?
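If those three terms are right, the whole probability is just their product; here is a quick numeric sketch (the function name and parameter names are mine, and the exponents simply transcribe the terms above):

```python
def p_badiba(p_c, p_w, alpha, V):
    """Product of the three proposed terms for P(ba.di.ba):
    corpus size, word type distribution, lexical entries."""
    corpus_size = (1 - p_c) ** 3 * p_c
    word_type = (alpha / alpha) * (alpha / (1 + alpha)) * (alpha / (2 + alpha))
    lexical = (1 - p_w) ** 6 * p_w ** 3 * (1 / V) ** 6
    return corpus_size * word_type * lexical
```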

Explanation

There are 3 words in the corpus, and the number of words follows a
geometric distribution; that is the information the corpus size term
captures. Equation 6 in the Goldwater paper contains the word type
distribution and the lexical entries for the hypothesis that a
segmentation occurs. The P_0 term is factored into the lexical entries
term in the segmentation mini-example. Equation 6 has three terms that
are multiplied together.

The first term appears to equal 1, i.e. (alpha/alpha), in both of the
examples in the mini-example, because the model hasn't seen the word
yet, so the word is generated anew from the Dirichlet process model
(see Equation 3 in Goldwater). When the word is generated anew and it's
the first word, the probability is (alpha/alpha) P_0(ba). The second
word is also generated anew, but we've seen one word previously, so the
term is (alpha/(1 + alpha)) P_0(di). For the third term I am assuming
that this is the different ba, so it is also generated anew, and the
term is equal to (alpha/(2 + alpha)) P_0(ba).
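That alpha/(n + alpha) pattern for a run of all-new words can be generated mechanically (a sketch; in the real model a word that reuses an existing entry would instead contribute count/(n + alpha)):

```python
def novel_word_factors(num_words, alpha):
    """DP factor for the i-th word when every word is generated anew:
    alpha/(n + alpha), where n is the number of words seen so far."""
    return [alpha / (n + alpha) for n in range(num_words)]
```

For ba.di.ba with all three treated as new entries this gives exactly the (alpha/alpha), (alpha/(1 + alpha)), (alpha/(2 + alpha)) factors above.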

For the lexical entries we calculate P_0(ba) P_0(di) P_0(ba). I'm a
little bit confused by this step. Is each letter considered a phoneme?
For example, H1 in the example should be P_0(ba) P_0(di). The first
term is (1 - p_w#)^4; I don't quite understand why, since I think it
should be (1 - p_w#)^2. I think P_0(ba) should be (1 - p_w#) p_w#, and
so should P_0(di). In the example it appears that P_0(ba) is
(1 - p_w#)^2 p_w#, and so is P_0(di). Is this correct? So for the ba
example, are the two positions with no word boundary .b.a, that is,
before and after the b? For H1 in the example there are 2 word
boundaries, so the second term is p_w#^2, and there are 4 phonemes
used; each phoneme is uniformly likely, so the final term is 1/V^4.
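To make the exponent question concrete, here are both readings of P_0 side by side (a sketch, not the paper's definitive formula; which convention the mini-example actually uses is exactly what I'm asking):

```python
def p0(word, p_w, V, per_phoneme=False):
    """Base probability of a word, treating each letter as a phoneme.
    per_phoneme=False: one (1 - p_w) factor between phonemes, so
        P_0('ba') = (1 - p_w) * p_w * (1/V)^2    (my reading).
    per_phoneme=True: a (1 - p_w) factor for every phoneme, so
        P_0('ba') = (1 - p_w)^2 * p_w * (1/V)^2  (the example's reading)."""
    m = len(word)
    k = m if per_phoneme else m - 1
    return (1 - p_w) ** k * p_w * (1 / V) ** m
```

The two readings differ by one factor of (1 - p_w#) per word, which is precisely the gap between my expected (1 - p_w#)^2 and the example's (1 - p_w#)^4 for H1.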

Thanks,
Matt
 



