[Lign274] lign274-homework1-questions
Roger Levy
rlevy at ling.ucsd.edu
Fri Jan 29 12:14:09 PST 2010
Hi Wen,
On Jan 29, 2010, at 11:54 AM, Wen-Hsuan Chan wrote:
> Hi Roger,
>
> A few questions:
>
> So basically, the number of sequences in question is the number of
> possible phonological sequences, and we'd like to know what a
> better distribution over them is.
Yes, this is correct.
> The weight file is a file that weights each instance by its frequency
> count.
Let me clarify this. The output from training the log-linear model is
an estimate of the model parameters. In log-linear/maximum-entropy
models, the model parameters are often called "feature weights". This
output -- the feature weights -- is what I am redirecting to the
~/weights file in the example below.
Distinct from the notion of "feature weights" is the notion of weights
on the observations themselves from which the model is to be trained.
A weight w_i on an observation o_i=<x_i,y_i> says, "in training the
model, treat o_i as if it occurred w_i times". Formally, the
contribution of o_i to the log-likelihood is taken to be
w_i log P(y_i | x_i)
By default, all observation weights are taken to be 1.
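As a minimal sketch of the formula above (the probabilities and weights here are made-up illustrative values, not taken from any course file), the following Python snippet checks that a weighted log-likelihood equals the unweighted log-likelihood of a data set in which each observation is repeated w_i times:

```python
import math

# Hypothetical model probabilities P(y_i | x_i) for three observations.
probs = [0.5, 0.25, 0.8]
# Hypothetical observation weights w_i: treat observation i as if it
# occurred w_i times.
weights = [2, 1, 3]

def weighted_log_likelihood(probs, weights):
    """Sum of w_i * log P(y_i | x_i) over all observations."""
    return sum(w * math.log(p) for p, w in zip(probs, weights))

def expand(probs, weights):
    """Repeat each observation w_i times (integer weights only)."""
    return [p for p, w in zip(probs, weights) for _ in range(w)]

weighted = weighted_log_likelihood(probs, weights)
expanded = expand(probs, weights)
# With all weights equal to 1 (the default), every repeated
# observation contributes log P(y_i | x_i) exactly once.
unweighted = weighted_log_likelihood(expanded, [1] * len(expanded))
print(weighted, unweighted)
```

The two printed values should agree up to floating-point error, which is the sense in which a weight w_i stands in for w_i copies of an observation.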
> Is my understanding correct? In addition, what is the format of the
> weight file? Could you provide an example showing how
> sample_data_file would be weighted to give sample_data_file.weight?
I've added a file with the output of model training to the directory.
It's called sample_weights, and was produced with the invocation
../../bin/megam.opt -lambda 0 -explicit -fvals -nobias multiclass sample_data_file > sample_weights
The file sample_data_file.weighted contains a training set with
weighted observations which is equivalent to sample_data_file. So
training the model with sample_data_file.weighted gives the same
results as training it with sample_data_file (modulo numerical error).
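To illustrate the idea of an equivalent weighted training set, here is a rough Python sketch that collapses repeated observations into (observation, weight) pairs. The observations and the helper name collapse are hypothetical, and the sketch deliberately does not reproduce MegaM's own file syntax for weighted observations; see sample_data_file.weighted for that.

```python
from collections import Counter

def collapse(observations):
    """Map a list of repeated observations to (observation, weight) pairs."""
    counts = Counter(observations)
    return list(counts.items())

# Hypothetical observations written as (response class, context) pairs.
data = [(0, "ctx_a"), (1, "ctx_a"), (0, "ctx_a"), (2, "ctx_b"), (0, "ctx_a")]
weighted = collapse(data)

for observation, weight in weighted:
    print(observation, weight)
```

The weights sum to the number of original observations, so no training data is lost in the collapsed form.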
I hope this clarifies things, and don't hesitate to continue to ask
questions!
Roger
>
> Thanks,
> Wen
> On 2010/1/28, at 11:10 AM, Roger Levy wrote:
>
>> Hi Wen,
>>
>> Good questions. Here are some answers:
>>
>> On Jan 28, 2010, at 1:05 AM, Wen-Hsuan Chan wrote:
>>
>>> Hi Roger,
>>>
>>> I tried to use the software on the server (just following the MegaM
>>> How-To Guide).
>>> Here are some basic questions:
>>>
>>> 1. Should we run those procedures in our home directory?
>>> If I run it in /local/contrib/lign274/data/log_linear_phonotactics,
>>> it shows "Permission denied".
>>>
>>> wec017@morel:/local/contrib/lign274/data/log_linear_phonotactics
>>> $ ../../bin/megam.opt -lambda 0 -explicit -fvals -nobias multiclass sample_data_file > weights
>>> -bash: weights: Permission denied
>>>
>>> However, running megam from my home directory does not work
>>> (-bash: ../../bin/megam.opt: No such file or directory).
>>
>> Right -- the problem is that you don't have permission to write
>> into the log_linear_phonotactics directory. You should redirect the
>> output to your home directory somewhere, e.g.:
>>
>> $ /local/contrib/lign274/bin/megam.opt -lambda 0 -explicit -fvals -nobias multiclass /local/contrib/lign274/data/log_linear_phonotactics/sample_data_file > ~/weights
>>
>>>
>>> 2. Another error occurs when calling MegaM with the -predict <weight> option:
>>>
>>> wec017@morel:/local/contrib/lign274/data/log_linear_phonotactics
>>> $ ../../bin/megam.opt -samefeat -explicit -fvals -predict sample_data_file.weighted multiclass \
>>> sample_data_file
>>> Fatal error: exception Failure("error: "float_of_string" occured
>>> on line 1 when trying to parse "#" as a float")
>>>
>>> I'm not sure what it means??
>>
>> Right -- sample_data_file.weighted is not a weights file, it is a
>> training-data file where the observations are weighted (e.g.,
>> setting a weight of 2 means to treat this observation as if it
>> occurred twice). After calling the command I mentioned above to
>> train weights and record them in your home directory, you could call
>>
>> $ /local/contrib/lign274/bin/megam.opt -explicit -fvals -predict ~/weights multiclass /local/contrib/lign274/data/log_linear_phonotactics/sample_data_file > ~/predictions
>>
>> The individual predictions would then be recorded in ~/predictions.
>>
>>> 3. The output of 2 looks like this:
>>> 0 0.73949975031254744362 0.24389340542259144162 0.01660684426486126741
>>> The first column is the predicted outcome class.
>>>
>>> I am still confused about the response class. In the
>>> context of phonotactic knowledge, my understanding is that our
>>> goal is to find the distribution over possible sound sequences.
>>> So what does the "response class" mean here? Also, what do the
>>> following 3 numbers stand for? Can we think of them as predicted
>>> probabilities for each class?
>>
>> The response class is the phonological sequence in question. You
>> have to associate an integer with each sequence (basically just put
>> the sequences in some order and then number them starting from
>> zero). The output of -predict is then just (1) the response class
>> with maximum probability; and (2) the probabilities of all the
>> response classes.
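As a small illustration of the output format described above, the following Python sketch reads one prediction line (the one quoted in question 3) and checks that the first column is the index of the largest probability. The function name parse_prediction is hypothetical, and the sketch assumes whitespace-separated fields with one probability per response class, as in that quoted line.

```python
def parse_prediction(line):
    """Split a prediction line into (predicted class, class probabilities)."""
    fields = line.split()
    predicted = int(fields[0])
    probs = [float(f) for f in fields[1:]]
    return predicted, probs

# The example output line quoted in question 3.
line = "0 0.73949975031254744362 0.24389340542259144162 0.01660684426486126741"
predicted, probs = parse_prediction(line)

# Index of the most probable response class, numbering from zero.
best = max(range(len(probs)), key=probs.__getitem__)
print(predicted, best, sum(probs))
```

Here the three probabilities correspond to response classes 0, 1, and 2 in order, and they sum to 1 up to rounding.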
>>
>>>
>>> 4. "Define a small set of feature functions..." Does this mean
>>> converting word_suffix into a format like sample_data_file, where each
>>> phoneme is represented as "i # f_11 r_11..." according to the
>>> features we define, so that we can then use this file in MegaM?
>>
>> Yes, that's right.
>>
>> Don't hesitate to ask more!
>>
>> --
>>
>> Roger Levy Email: rlevy at ling.ucsd.edu
>> Assistant Professor Phone: 858-534-7219
>> Department of Linguistics Fax: 858-534-4789
>> UC San Diego Web: http://ling.ucsd.edu/~rlevy