[Lign274] lign274-homework1-questions

Fri Jan 29 12:14:09 PST 2010

Hi Wen,

On Jan 29, 2010, at 11:54 AM, Wen-Hsuan Chan wrote:

> Hi Roger,
>
> A few questions:
>
> So basically, the number of sequences in question is the number of
> possible phonological sequences, and we'd like to know which
> distribution over them is better.

Yes, this is correct.

> The weight file is a file that weights each instance by its frequency
> count.

Let me clarify this.  The output from training the log-linear model is  
an estimate of the model parameters.  In log-linear/maximum-entropy  
models, the model parameters are often called "feature weights".  This  
output -- the feature weights -- is what I am redirecting to the
~/weights file in the example below.

Distinct from the notion of "feature weights" is the notion of weights  
on the observations themselves from which the model is to be trained.   
A weight w_i on an observation o_i=<x_i,y_i> says, "in training the  
model, treat o_i as if it occurred w_i times".  Formally, the  
contribution of o_i to the log-likelihood is taken to be

   w_i log P(y_i | x_i)

By default, all observation weights are taken to be 1.
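
To make the formula concrete, here is a minimal Python sketch (not part of MegaM; the toy probabilities and the names weighted_log_likelihood and toy_prob are made up for illustration) that computes the weighted log-likelihood from (x, y, w) triples:

```python
import math

# Hypothetical toy model: P(y | x) for one context "ctx" and three classes.
TOY_PROBS = {("ctx", 0): 0.7, ("ctx", 1): 0.2, ("ctx", 2): 0.1}

def toy_prob(x, y):
    """Look up the (made-up) conditional probability P(y | x)."""
    return TOY_PROBS[(x, y)]

def weighted_log_likelihood(observations, prob):
    """Sum w_i * log P(y_i | x_i) over (x_i, y_i, w_i) triples."""
    return sum(w * math.log(prob(x, y)) for x, y, w in observations)

# One observation with weight 2 and one with the default weight 1.
weighted = [("ctx", 0, 2), ("ctx", 1, 1)]
ll_weighted = weighted_log_likelihood(weighted, toy_prob)
print(ll_weighted)
```

With every weight set to 1, this reduces to the ordinary conditional log-likelihood.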

> Is my understanding correct? In addition, what is the format of the
> weight file? Can you give an example that would turn
> sample_data_file into sample_data_file.weight?

I've added a file with the output of model training to the directory.   
It's called sample_weights, and was produced with the invocation

   ../../bin/megam.opt -lambda 0 -explicit -fvals -nobias multiclass sample_data_file > sample_weights

The file sample_data_file.weighted contains a training set with  
weighted observations which is equivalent to sample_data_file.  So  
training the model with sample_data_file.weighted gives the same  
results (modulo numerical error).
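
One way to picture the equivalence: collapsing repeated observations into a single observation whose weight is its count leaves the log-likelihood unchanged. A small Python sketch, assuming made-up toy data and probabilities (it does not read or write MegaM's file format, and collapse is a hypothetical helper):

```python
import math
from collections import Counter

# Hypothetical toy model: P(y | x) for one context "ctx" and three classes.
TOY_PROBS = {("ctx", 0): 0.7, ("ctx", 1): 0.2, ("ctx", 2): 0.1}

def collapse(observations):
    """Turn a list of (x, y) pairs into (x, y, weight) triples, weight = count."""
    counts = Counter(observations)
    return [(x, y, n) for (x, y), n in counts.items()]

# Unweighted training set with repeats, and its weighted equivalent.
raw = [("ctx", 0), ("ctx", 0), ("ctx", 1), ("ctx", 0)]
collapsed = collapse(raw)

# Log-likelihood computed both ways under the same toy model.
ll_raw = sum(math.log(TOY_PROBS[obs]) for obs in raw)
ll_collapsed = sum(w * math.log(TOY_PROBS[(x, y)]) for x, y, w in collapsed)
print(collapsed)
print(math.isclose(ll_raw, ll_collapsed))
```

Since the objective is the same for both data sets, any parameter estimate that maximizes one also maximizes the other, up to floating-point differences.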

I hope this clarifies things, and don't hesitate to continue to ask  
questions!

Roger

>
> Thanks,
> Wen
> On 2010/1/28, at 11:10 AM, Roger Levy wrote:
>
>> Hi Wen,
>>
>> Good questions.  Here are some answers:
>>
>> On Jan 28, 2010, at 1:05 AM, Wen-Hsuan Chan wrote:
>>
>>> Hi Roger,
>>>
>>> I tried to use the software on the server (just following the MegaM
>>> How-To Guide).
>>> Here are some basic questions:
>>>
>>> 1. Should we run those procedures in our home directory?
>>> When I run it in /local/contrib/lign274/data/log_linear_phonotactics,
>>> it shows "Permission denied":
>>>
>>> wec017@morel:/local/contrib/lign274/data/log_linear_phonotactics$ ../../bin/megam.opt -lambda 0 -explicit -fvals -nobias multiclass sample_data_file > weights
>>> -bash: weights: Permission denied
>>>
>>> However, running megam from my home directory does not work either
>>> (-bash: ../../bin/megam.opt: No such file or directory).
>>
>> Right -- the problem is that you don't have permission to write  
>> into the log_linear_phonotactics directory. You should redirect the  
>> output to your home directory somewhere, e.g.:
>>
>> $ /local/contrib/lign274/bin/megam.opt -lambda 0 -explicit -fvals -nobias multiclass /local/contrib/lign274/data/log_linear_phonotactics/sample_data_file > ~/weights
>>
>>>
>>> 2.
>>> Another error occurs when calling MegaM with the -predict <weight> option:
>>>
>>> wec017@morel:/local/contrib/lign274/data/log_linear_phonotactics$ ../../bin/megam.opt -samefeat -explicit -fvals -predict sample_data_file.weighted multiclass \ sample_data_file
>>> Fatal error: exception Failure("error: "float_of_string" occured on line 1 when trying to parse "#" as a float")
>>>
>>> I'm not sure what it means??
>>
>> Right -- sample_data_file.weighted is not a weights file, it is a  
>> training-data file where the observations are weighted (e.g.,  
>> setting a weight of 2 means to treat this observation as if it  
>> occurred twice).  After calling the command I mentioned above to
>> train weights and record them in your home directory, you could call
>>
>> $ /local/contrib/lign274/bin/megam.opt -explicit -fvals -predict ~/weights multiclass /local/contrib/lign274/data/log_linear_phonotactics/sample_data_file > ~/predictions
>>
>> The individual predictions would then be recorded in ~/predictions.
>>
>>> 3. For the output of 2, like this:
>>> 0	0.73949975031254744362	0.24389340542259144162	0.01660684426486126741
>>> the first column is the predicted outcome class.
>>>
>>> I think I am still confused about the response class. In the
>>> context of phonotactic knowledge, my understanding is that our
>>> goal is to find the distribution over possible sound sequences.
>>> So what does the "response class" mean here? Also, what do the
>>> following 3 numbers stand for? Can we think of them as predicted
>>> probabilities for each class?
>>
>> The response class is the phonological sequence in question. You  
>> have to associate an integer with each sequence (basically just put  
>> the sequences in some order and then number them starting from  
>> zero).  The output of -predict is then just (1) the response class  
>> with maximum probability; and (2) the probabilities of all the  
>> response classes.
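
As a rough illustration of both points, here is a small Python sketch. The sequences "bl", "br", "bn" and the helper parse_prediction_line are made up for the example; the prediction line is the one quoted in question 3, and the parsing assumes whitespace-separated fields as in that line.

```python
# Hypothetical phonological sequences, numbered from zero in a fixed order.
sequences = ["bl", "br", "bn"]
class_ids = {seq: i for i, seq in enumerate(sequences)}

def parse_prediction_line(line):
    """Split one -predict output line into (predicted class, probabilities)."""
    fields = line.split()
    predicted = int(fields[0])
    probs = [float(f) for f in fields[1:]]
    return predicted, probs

# The output line quoted in question 3 (tab-separated).
line = "0\t0.73949975031254744362\t0.24389340542259144162\t0.01660684426486126741"
predicted, probs = parse_prediction_line(line)

# The first column should agree with the index of the largest probability.
best = max(range(len(probs)), key=probs.__getitem__)
print(predicted, best, sequences[best], probs)
```

Here the first column (0) matches the position of the largest of the three probabilities, and the three probabilities sum to 1 up to rounding.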
>>
>>>
>>> 4. "Define a small set of feature functions..." Does this mean
>>> converting word_suffix into a format like that of sample_data_file? Each
>>> phoneme is represented as "i # f_11 r_11..." according to the
>>> features we define. And then we can use this file in MegaM.
>>
>> Yes, that's right.
>>
>> Don't hesitate to ask more!
>>
>> --
>>
>> Roger Levy                      Email: rlevy at ling.ucsd.edu
>> Assistant Professor             Phone: 858-534-7219
>> Department of Linguistics       Fax:   858-534-4789
>> UC San Diego                    Web:   http://ling.ucsd.edu/~rlevy
