If I were you, I wouldn’t explicitly store counts in your countermap for trigrams that didn’t occur in training; that would require a lot of memory. Instead, store the necessary constants as instance variables in your Model, and then do the calculation on the fly in getProbability.

The semantics of storing C(h, <UNK>) would be strange. Remember that <UNK> is the out-of-vocabulary word. You may never have seen “the funny bone” together even if you did see “bone” in other contexts (in other words, “bone” would not be the out-of-vocabulary word <UNK>). In that case, “the funny bone” is an unseen trigram even though it contains no out-of-vocabulary words. Under the scheme described above, you will never have an explicit entry in the countermap for <UNK>.

Computation of C*(h) can be done analytically. Again, just store c*(h,w) as you have already done. Then:
- For each h:
  - For each w seen with h, add c*(h,w) to an accumulator

This is the summation on the left-hand side of the last equation on page 4, slide 2. The right-hand side of the equation is the sum of c*(h,w) over all of the w in the vocab (plus stop) that weren’t seen with the history. Because we assumed the probability was distributed uniformly, things simplify nicely and we just take (r/u)*(|V|-e(h)+1). The reason for this is rather simple: a single unseen n-gram has count r/u. How many unseen n-grams are there with history h? (I’ll let you think that over, and if you still don’t understand, let me know.)

A good data structure for storing c*(h) is a counter where the key is the history.
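The lab framework is Java, but here is a minimal Python sketch of the accumulation described above. All names (compute_history_totals, c_star, r_over_u, etc.) are illustrative, not part of the provided code.

```python
def compute_history_totals(c_star, vocab_size, r_over_u):
    """Compute c*(h) for every history h.

    c_star: dict mapping history h -> {word w: adjusted count c*(h,w)}
            (only n-grams actually seen in training)
    vocab_size: |V| (all training words plus <unk>)
    r_over_u: the Good-Turing count r/u assigned to each unseen n-gram
    """
    totals = {}
    for h, seen in c_star.items():
        acc = sum(seen.values())   # sum of c*(h,w) over all w seen with h
        e_h = len(seen)            # number of distinct w seen with h
        # each of the |V| - e(h) + 1 unseen continuations (the +1 is stop)
        # contributes count r/u
        acc += r_over_u * (vocab_size - e_h + 1)
        totals[h] = acc
    return totals
```

Note that the result is exactly the counter keyed by history suggested above, so nothing beyond the seen counts ever has to be stored.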

So in your language model, in getProbability, when you are asked for the probability of (h,w) where h was never seen, return 1/(|V|+1) (if you don’t follow this logic from the slide, please tell me so I can explain it better). If h was seen in training, but not with w, then return (r/u)/c*(h) (if you can’t see why, please ask me). If h and w were seen together in training, return c*(h,w)/c*(h) as computed above.
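The three cases can be sketched as follows; this is a Python illustration of the logic (the actual lab is Java), and all the parameter names are assumptions rather than the framework’s API.

```python
def get_probability(h, w, c_star, c_star_h, vocab_size, r_over_u):
    """Three-case Good-Turing probability lookup.

    c_star: dict h -> {w: c*(h,w)} for n-grams seen in training
    c_star_h: dict h -> c*(h), precomputed as described above
    vocab_size: |V| (training words plus <unk>)
    r_over_u: the count r/u given to each unseen n-gram
    """
    if h not in c_star:
        # Case 1: history never seen -- uniform over the vocab plus stop
        return 1.0 / (vocab_size + 1)
    if w not in c_star[h]:
        # Case 2: history seen, but never with w
        return r_over_u / c_star_h[h]
    # Case 3: h and w seen together in training
    return c_star[h][w] / c_star_h[h]
```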

Where does <UNK> fall into this? You will normally never be asked to compute p(<UNK>|h); nor will you normally be asked p(w|h_1 <unk>), although your algorithm should do the right thing in cases like these. Normally you will be asked p(“antidisestablishmentarianism”|“<S>” “the”) or p(“in”|“the”,“antidisestablishmentarianism”). Notice that these cases are taken care of automatically by the algorithm above. In the case of p(“anti…”|“<S>” “the”), “anti…” and the history “<S>” “the” were never seen together (if “anti…” was never seen at all in training, then obviously it was never seen with any specific history), so that equation applies. In the case of p(“in”|“the”,“anti…”), the history was never seen, so the 1/(|V|+1) applies.

The reason the math works out and we have proper distributions is because we defined V to be all words seen in training + “<unk>”.

The word “improvements” on the wiki does imply a comparison. You don’t have to analyze the unigram model as thoroughly as you do the trigram (in other words, you don’t have to answer all of the questions for that model), hence the note about analyzing one model. Everything is focused on the trigram, bringing in information about the unigram as necessary. “Improvement” also implies you do the analysis on the better model. I will ask Dr. Ringger for clarification on the wiki.

The difference between the word error rate (WER) and the n-best list is that the former is the metric and the latter is the data (used to compute the metric). I’m not sure I totally understood your confusion. You are required to (1) sift through the n-best lists to find examples where your trigram got a better WER than your unigram and explain why in terms of the extra context in the trigram, (2) look at the n-best lists and find cases where the trigram model did not choose the correct sentence (irrespective of the unigram) and explain why, and (3) propose what you would need to know and do to fix that (but obviously, you won’t actually do it).

Question 1 is success analysis, because you are examining what went right. Question 2 is error analysis, because you are looking at what went wrong. To answer question 1 you are directly looking at the WER to find examples of improvement. For question 2, you do not need to examine the WER (though choosing the wrong candidate will certainly affect the WER). You can find sentences that were incorrect regardless of the WER. For instance, it could be the case that the unigram and trigram got the same WER, but both chose the same (wrong) sentence. In fact, you don’t even need to compare the unigram to the trigram in order to find examples that the trigram failed on.

You can choose to minimize either WER or perplexity. I believe we report both in the “hall of fame”. As a part of your assignment, you should probably explain why your choices bettered one but not necessarily the other.

In the TA help session for this lab, we covered the analysis part of the assignment. I encourage you to attend whenever possible. “Local modeling scores”: can you explain the result of the given metric based on the unigram/trigram assumption? For instance, can you find some example sentences in which certain trigrams clearly led to more probable sentences because the history really provided useful information that a unigram (or even bigram) would not have? You may also consider printing out the p(w|h) for the trigram and unigram to help show this.

The n-best list is produced by a speech recognizer. It hypothesizes the top-n sentences that it thinks are most probable. You can then use a language model to re-order the list. If your language model really reflects true English, the most English-like sentence will be chosen first (there are alternative ways of explaining the n-best list, but this should suffice). The word error rate compares the “correct” sentence (which is what the person really spoke) to the hypothesis your language model chose. There are penalties for insertions/deletions/substitutions that are used to compute the word error rate.
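For concreteness, here is a small Python sketch of how WER is typically computed (standard Levenshtein edit distance over words, normalized by the reference length); the lab’s provided metric calculator does this for you, so this is only for intuition.

```python
def word_error_rate(reference, hypothesis):
    """Minimum number of insertions, deletions, and substitutions
    (each costing 1) to turn the hypothesis into the reference,
    divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```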

There are a couple of approaches to doing the analysis. One is to use the debugger. Stick a breakpoint in the metric calculators that compute perplexity and word error rate (to add a breakpoint, just expand the statnlp.jar icon in the project and navigate to the file specified in the XML). Then when things go wrong/right you can use the debugger to evaluate expressions and peek into variables that you can use in your report. This is somewhat tedious and difficult. The other way is to cut and paste the source code from the provided metric calculators to make your own metric calculator. Then you can do things like add special code when things are right/wrong. This is still not ideal, and more difficult than it sounds; in the future we plan to add hooks to make this easier.

Note also that you can train/test multiple models in the same XML file, which could make it easier to correlate sentences that one model gets wrong or right.

The main point is that to receive full credit you need specific examples from specific sentences and you will probably also want to include probability tables based on those examples. You should also compare and contrast your unigram and trigram models where appropriate.

It turns out that the regression is rather easy to do. Consider each (r, N_r) as an (x, y) pair. Then take the log of the frequencies (that is, r) and the log of the frequencies of frequencies (that is, N_r), and treat the results as a new list of (x, y) pairs. Then do a least-squares fit of the points to a line as if these new values were ordinary (x, y) pairs. Information on least-squares fitting is later in the book (ch. 15???), which I don’t have on hand here in St. Louis. Also, if you read very closely, there is other important information (for example, how to choose the cutoff point) found in your book under the section about Good-Turing (again, sorry I don’t have specific page numbers). It’s subtle and not particularly clear (part of it is in a footnote), but it is all there in some form or another.
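The log-log regression described above can be sketched in a few lines of Python (again illustrative, not the lab’s code; fit_log_log and counts_of_counts are made-up names):

```python
import math

def fit_log_log(counts_of_counts):
    """Least-squares fit of log(N_r) against log(r).

    counts_of_counts: dict mapping r -> N_r (frequency of frequencies)
    Returns (slope, intercept) of the line
    log(N_r) = slope * log(r) + intercept.
    """
    xs = [math.log(r) for r in counts_of_counts]
    ys = [math.log(counts_of_counts[r]) for r in counts_of_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # closed-form simple linear regression on the (log r, log N_r) pairs
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

The fitted line then gives you a smoothed N_r = exp(intercept) * r**slope for any r, including values of r where the raw counts of counts are zero.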

I think the least-squares method is described in http://en.wikipedia.org/wiki/Least_squares. I hope it is the same as the book.

You are correct about the vocab: include <unk>, but neither start nor stop.

You want to see unks in your output. Where you generated an unk, it might have been “antidisestablishmentarianism” or some other rare but real word. Unks have probability mass, so they should appear.

Bad smoothing algorithms will produce way too many unks. If you have a proper distribution and you see tons of unks, then you should explain this a little in your write-up (the book actually talks about this on pg. 203 and the surrounding text). However, if you’re doing Good-Turing you shouldn’t see a whole lot of them. (Unfortunately, I can’t give you a ballpark estimate because I don’t have one; try to validate the distribution as exhaustively as possible, and if that passes, you should be fine!)

Good Luck! Robbie

From: Chris Wilson [mailto:sushicw@gmail.com] Sent: Monday, October 08, 2007 7:40 PM To: Robbie Haertel Subject: Re: 401 Lab

Robbie,

Found it. You also need to add an extra evaluation for it to work. It's not documented, but it's easy enough to do once you look. :)

On a side note, my generated sentences have unknowns in them. That seems weird to me, but the sentence generator gets its ideas from the getVocabulary function, so I can see why it happens. As I've understood it, the vocabulary I return should have no stop symbols but should have the unk token in it. Or should that function return a version of the vocab that does indeed have stop and no unk?

– Chris
