## Instructions

For each experiment,

• commit your code (with code review, etc.)
• create a new experiment directory in entropy:/home/data/experiments/alfa
• place all files containing complete experimental results in that directory
  • complete results
  • experimental data
  • .xml experiment config. file
• experimenter
• SVN revision numbers (for each relevant repository)
• date
• brief statement of purpose
• brief summary of parameter settings (from the experiment .xml file)
• brief summary of results:
  • highest point reached by this technique and at what amount of data (coordinates of the point)
  • starting conditions for that best curve
• path in entropy:/home/data/experiments/alfa containing the complete results
• send a brief “experiment commit” message to alert folks on the list.

## Syriac Stem to Root Analysis

### Terminology

As a reminder, we agreed upon the following precise definitions to use while talking about Syriac words.

• Stem - the inflected part of the word between the prefix and the suffix. Stripping the prefix and suffix yields the stem.
• Baseform - the form the stem is derived from.
• Root - the form the baseform is derived from. Typically, this is the tri-literal root.
• Headword - this form is used only in conjunction with a dictionary. Depending on the “helpfulness” of the dictionary, root, baseform, and stem may all be headwords.
• Word Type - a unique word (stem + prefix + suffix). This is white-space delimited text.
• Word Token - an actual word (stem + prefix + suffix). The phrase “this word is a word token” has 5 word types and 6 tokens.

### Problem

We wanted to know how difficult it is to map from a stem to a baseform or a root. When an annotator annotates a stem, we want to be able to link that stem to a dictionary entry (possibly all three: the stem itself, the baseform, and the root).

### Methodology

I computed the average ambiguity and the expected ambiguity for the following three mappings: stem -> root, stem -> baseform, and baseform -> root. The average ambiguity is the sum of the ambiguity for each type divided by the total number of types. The expected ambiguity is the sum of the ambiguity for each type times its probability. There were a number of issues, which I discuss in the Results section.
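The two measures can be sketched as follows. This is a minimal illustration, not the code that produced the numbers on this page. It assumes that the ambiguity of a type is the number of distinct targets it maps to and that the probability of a type is its relative token frequency; the function name and toy data are hypothetical.

```python
from collections import Counter, defaultdict

def ambiguity_stats(pairs):
    """pairs: iterable of (source, target), one per corpus token, e.g. (stem, root)."""
    targets = defaultdict(set)  # source type -> distinct targets seen with it
    freq = Counter()            # source type -> token count
    for source, target in pairs:
        targets[source].add(target)
        freq[source] += 1
    n_types = len(targets)
    n_tokens = sum(freq.values())
    # Average ambiguity: every type counts equally.
    average = sum(len(t) for t in targets.values()) / n_types
    # Expected ambiguity: each type is weighted by its (assumed) token probability.
    expected = sum(len(targets[s]) * freq[s] / n_tokens for s in targets)
    return average, expected

# Hypothetical toy data: "ktb" maps to two targets, "mlk" to one.
toy = [("ktb", "KTB"), ("ktb", "KTB"), ("ktb", "KTB2"), ("mlk", "MLK")]
avg, exp = ambiguity_stats(toy)
```

Under these assumptions, frequent ambiguous types pull the expected ambiguity above the average ambiguity, which is the pattern in the results that follow.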

### Results

All counts are for types (not tokens).

NOTE: FOR THESE TESTS, STEM INCLUDES THE SUFFIX

Stem is not vocalized:

Stem-baseform average ambiguity:	1.019647201946472        [16763 / 16440]
Stem-root average ambiguity:	        1.012287104622871        [16642 / 16440]
Baseform-root average ambiguity:	1.0181040157998684       [3093  / 3038 ]
Stem-baseform expected ambiguity:	1.087058823529223
Stem-root expected ambiguity:	        1.0675695394434954
Baseform-root expected ambiguity:	1.049630642954838

With vocalized stem:

Stem-baseform average ambiguity:	1.0043230819743898       [18353 / 18274]
Stem-root average ambiguity:	        1.0020247345956004       [18311 / 18274]
Baseform-root average ambiguity:	1.0181040157998684       [3093  / 3038 ]
Stem-baseform expected ambiguity:	1.0162152302779135
Stem-root expected ambiguity:	        1.008399452804122
Baseform-root expected ambiguity:	1.049630642954838

Ambiguity from word type -> vocalized stem AND morphological tag:

Type-stem+tag average ambiguity:	1.1703739620583065       [19310 / 16499]
Type-stem+tag expected ambiguity:	1.797008663930918

### Conclusions

This last test (although not exact) represents something close to an ideal situation with the dictionary. It asks, “From the word type (a word that's unsegmented, unvoweled, and not annotated), what is the ambiguity we're looking at?” In other words, how many choices would I have to show on average in a list the user would choose from? The answer appears to be fewer than two (an expected ambiguity of about 1.8). To make sure that sinks in: show me the characters of an unvoweled word, and I can show you (on average) fewer than two choices, one of which is correct. Now that's a pretty quick way of annotating! With this in mind, perhaps we should reconsider how our annotation process is set up.

## Second Round of Diacritization Results

### Results

Known words WER:                        13.81  (48237 / 55966) 9/8/2008
Known words WITHOUT case ending WER:    2.430  (54606 / 55966) 9/8/2008
Unknown words WER:                      74.977 (832 / 3325)  9/8/2008
Unknown words WITHOUT case ending WER:  68.752 (1039 / 3325)  9/8/2008

All words WER:                          17.240 (49069 / 59291) 9/8/2008
All words WITHOUT case ending WER:      6.149  (55645 / 59291) 9/8/2008
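The parenthesized pairs above read most naturally as (correct words / total words). That reading is an inference from the numbers, not something the log states. Under it, each WER figure can be reproduced with the small helper below (the function name is hypothetical).

```python
def wer(correct, total):
    """Word error rate in percent, given the number of correctly diacritized words."""
    return 100.0 * (1.0 - correct / total)

known = wer(48237, 55966)      # known words, with case endings
all_words = wer(49069, 59291)  # all words, with case endings
```

As a consistency check under the same reading, the known and unknown correct counts sum to the all-words count (48237 + 832 = 49069).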

Kuebler, Mohamed
All words WITHOUT case ending WER:      6.64

Zitouni, Sorensen, Sarikaya
All words WER:                          17.3
All words WITHOUT case ending WER:      7.2

Habash and Rambow
All words WER:                          14.9
All words WITHOUT case ending WER:      5.5

## Preliminary Diacritization Results

These are stats about the data split suggested by Zitouni et al (2006).

Num of train tokens: 340879
Num of test tokens: 59291
Num of tokens: 400170
Num of train types: 31968
Num of test types: 11361
Num of types: 43329
Num of unknown types: 2700
Num of unknown tokens: 3325
Num of ambiguous types in training: 8280
Num of ambiguous types in test: 2364
Percent of test that are unknown tokens: 0.05607933750484897
Percent of tokens that are ambiguous types: 0.6418147287402853
Percent of tokens in test set that are ambiguous types: 0.5378725270277108
Average ambiguity: 1.4082715963904082

### Methodology

The two latest papers about vocalization (Zitouni et al. and Kuebler) work at character granularity. That is, instead of predicting the voweling of a whole word, they predict whether or not a vowel occurs after a given letter and, if it does, which vowel. We theorize that working at character granularity optimizes the character error rate but leaves the word error rate lagging. Thus, we predict that a word-level approach can ultimately achieve a lower word error rate.

For known words (seen in training), we train a maxent classifier for each word type. This, in and of itself, is novel. Robbie might recall something I don't, but the only paper I've read about word-level approaches uses an HMM and did not perform outstandingly. Since a word-level approach for unknown words is futile (plus we have no classifier for them!), we back off to what we think is a sub-par approach, namely the character-based model. Like Zitouni et al., we will train an MEMM, with each “node” in the MEMM representing a character.
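The back-off between the two models can be sketched as a simple dispatch. This is only an illustration of the control flow. A most-frequent-form lookup stands in for the per-type maxent classifier (so context is ignored), the character model is a placeholder, and all names and the transliterated toy words are hypothetical.

```python
from collections import Counter, defaultdict

def train_word_models(training_pairs):
    """training_pairs: iterable of (bare_word, diacritized_word)."""
    counts = defaultdict(Counter)
    for bare, diacritized in training_pairs:
        counts[bare][diacritized] += 1
    # Stand-in for one maxent classifier per word type:
    # always predict the most frequent diacritized form of that type.
    return {bare: c.most_common(1)[0][0] for bare, c in counts.items()}

def placeholder_char_model(word):
    """Placeholder for the character-level MEMM: returns the word unchanged."""
    return word

def diacritize(words, word_models, char_model):
    out = []
    for w in words:
        if w in word_models:           # known word: word-level prediction
            out.append(word_models[w])
        else:                          # unknown word: back off to characters
            out.append(char_model(w))
    return out

models = train_word_models([("ktb", "kataba"), ("ktb", "kutiba"), ("ktb", "kataba")])
result = diacritize(["ktb", "drs"], models, placeholder_char_model)
```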

### Results

Known words WER:                        13.81 (48237 / 55966)
Known words WITHOUT case ending WER:    3.463 (54028 / 55966)
Unknown words WER:                      75.038 (830 / 3325)
Unknown words WITHOUT case ending WER:  43.759 (1870 / 3325)

All words WER:                          17.247 (49065 / 59291)
All words WITHOUT case ending WER:      5.723  (55898 / 59291)

Zitouni, Sorensen, Sarikaya
All words WER:                          17.3
All words WITHOUT case ending WER:      7.2

Habash and Rambow
All words WER:                          14.9
All words WITHOUT case ending WER:      5.5

### Features

For each word, we extract the following features:

1. Previous word
2. Previous previous word
3. Previous previous previous word
4. Following word
5. Following following word
6. Following following following word
7. 2 previous words
8. 2 following words
9. previous word and following word
10. previous 3 words
11. following 3 words
12. previous 2 words and following 2 words
13. For each of the previous and following words, prefix and suffix features up to 10 characters in length
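The feature list above could be extracted roughly as follows. This is a sketch, not the project's extractor. It assumes that item 13 applies only to the immediately previous and following words and that positions outside the sentence are padded; the feature names and the `<PAD>` symbol are hypothetical.

```python
def extract_features(words, i, window=3, max_affix=10):
    """Context features for the word at position i of a sentence (list of strings)."""
    def at(j):
        return words[j] if 0 <= j < len(words) else "<PAD>"

    prev = [at(i - k) for k in range(1, window + 1)]  # prev[0] is the nearest word
    foll = [at(i + k) for k in range(1, window + 1)]
    feats = []
    # Items 1-6: single words at distance 1, 2, and 3 on each side.
    for k in range(window):
        feats.append(f"prev{k + 1}={prev[k]}")
        feats.append(f"next{k + 1}={foll[k]}")
    # Items 7-8 and 10-11: word n-grams in sentence order.
    feats.append("prev2gram=" + " ".join(reversed(prev[:2])))
    feats.append("next2gram=" + " ".join(foll[:2]))
    feats.append("prev3gram=" + " ".join(reversed(prev[:3])))
    feats.append("next3gram=" + " ".join(foll[:3]))
    # Items 9 and 12: conjunctions of left and right context.
    feats.append(f"prev1_next1={prev[0]}|{foll[0]}")
    feats.append("prev2_next2=" + " ".join(reversed(prev[:2])) + "|" + " ".join(foll[:2]))
    # Item 13: prefixes and suffixes of the neighbors, up to max_affix characters.
    for name, w in (("prev1", prev[0]), ("next1", foll[0])):
        for n in range(1, min(max_affix, len(w)) + 1):
            feats.append(f"{name}_prefix{n}={w[:n]}")
            feats.append(f"{name}_suffix{n}={w[-n:]}")
    return feats

feats = extract_features(["a", "b", "c", "d", "e"], 2)
```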

## Preliminary Arabic Results

The following results are preliminary and tested only on the Arabic Treebank part 1. Habash and Rambow report on parts 1 and 2 separately. We are sure that our part 1 data is correct. We use a split similar to Habash and Rambow's.

Buckwalter tags are non-vectorized tags that are segmented: prefixTag+stemTag+suffixTag+caseTag. Only the stemTag is required.

Vectorized tags are achieved by mapping the Buckwalter tags to a vectorized tag. The vectorized tag is the same vector that Hajic uses in one of his papers (I believe). I'll check this. Note, the “POS” tag is subtag 0.

Results:

Most-frequent tagger with Buckwalter tags
0.8934054054054054 (Unknown Accuracy: 0.2754569190600522), Sentence Accuracy: 0.11088295687885011 Decoder Suboptimalities Detected: 0

Maxent tagger with Buckwalter tags
0.9432072072072072 (Unknown Accuracy: 0.664490861618799), Sentence Accuracy: 0.36960985626283366 Decoder Suboptimalities Detected: 0

Most-frequent tagger with monolithic vector tags
0.8961441441441441 (Unknown Accuracy: 0.3002610966057441), Sentence Accuracy: 0.12320328542094455 Decoder Suboptimalities Detected: 0
[java] subtag: 0	0.9198558558558558	12763.0
[java] subtag: 1	0.9494774774774775	13174.0
[java] subtag: 2	0.9564684684684684	13271.0
[java] subtag: 3	0.9556036036036036	13259.0
[java] subtag: 4	0.9564684684684684	13271.0
[java] subtag: 5	0.9527207207207207	13219.0
[java] subtag: 6	0.9522162162162162	13212.0
[java] subtag: 7	0.9518558558558559	13207.0
[java] subtag: 8	0.9378018018018018	13012.0
[java] subtag: 9	0.9531531531531532	13225.0

Maxent tagger with monolithic vector tags
0.9437837837837838 (Unknown Accuracy: 0.6618798955613577), Sentence Accuracy: 0.38193018480492813 Decoder Suboptimalities Detected: 0
[java] subtag: 0   0.9512072072072072      13198.0
[java] subtag: 1   0.9757837837837838      13539.0
[java] subtag: 2   0.9833513513513513      13644.0
[java] subtag: 3   0.9823423423423423      13630.0
[java] subtag: 4   0.9834234234234234      13645.0
[java] subtag: 5   0.9790990990990991      13585.0
[java] subtag: 6   0.9775135135135136      13563.0
[java] subtag: 7   0.976072072072072       13543.0
[java] subtag: 8   0.9796036036036037      13592.0
[java] subtag: 9   0.9795315315315315      13591.0

Most-frequent tagger with vector tags (fully independent)
0.8849009009009009 (Unknown Accuracy: 0.10574412532637076), Sentence Accuracy: 0.10882956878850103 Decoder Suboptimalities Detected: 0
[java] subtag: 0   0.9231711711711712      12809.0
[java] subtag: 1   0.9626666666666667      13357.0
[java] subtag: 2   0.9772972972972973      13560.0
[java] subtag: 3   0.9771531531531531      13558.0
[java] subtag: 4   0.9781621621621621      13572.0
[java] subtag: 5   0.9667747747747748      13414.0
[java] subtag: 6   0.9553873873873874      13256.0
[java] subtag: 7   0.9538018018018019      13234.0
[java] subtag: 8   0.9553153153153153      13255.0
[java] subtag: 9   0.9612252252252252      13337.0

MaxEnt with vector tags (fully independent)

0.9176936936936937 (Unknown Accuracy: 0.5469973890339426), Sentence Accuracy: 0.29568788501026694 Decoder Suboptimalities Detected: 0
[java] subtag: 0   0.9519279279279279      13208.0
[java] subtag: 1   0.9727567567567568      13497.0
[java] subtag: 2   0.9807567567567568      13608.0
[java] subtag: 3   0.9793873873873874      13589.0
[java] subtag: 4   0.9804684684684685      13604.0
[java] subtag: 5   0.9757837837837838      13539.0
[java] subtag: 6   0.9744144144144145      13520.0
[java] subtag: 7   0.9727567567567568      13497.0
[java] subtag: 8   0.9672072072072072      13420.0
[java] subtag: 9   0.9771531531531531      13558.0

Results with the new mapping (no ????s)

MaxEnt with monolithic vector tags
0.9437837837837838 (Unknown Accuracy: 0.6631853785900783), Sentence Accuracy: 0.37166324435318276 Decoder Suboptimalities Detected: 0
[java] subtag: 0   0.9522162162162162      13212.0
[java] subtag: 1   0.9899099099099099      13735.0
[java] subtag: 2   0.9997117117117117      13871.0
[java] subtag: 3   0.9975495495495496      13841.0
[java] subtag: 4   1.0     13875.0
[java] subtag: 5   0.992936936936937       13777.0
[java] subtag: 6   0.9891171171171171      13724.0
[java] subtag: 7   0.987027027027027       13695.0
[java] subtag: 8   0.994954954954955       13805.0
[java] subtag: 9   0.9927927927927928      13775.0

Most Frequent with monolithic vector tags
0.8939099099099099 (Unknown Accuracy: 0.2754569190600522), Sentence Accuracy: 0.11088295687885011 Decoder Suboptimalities Detected: 0
[java] subtag: 0   0.9185585585585586      12745.0
[java] subtag: 1   0.9807567567567568      13608.0
[java] subtag: 2   0.998990990990991       13861.0
[java] subtag: 3   0.9978378378378379      13845.0
[java] subtag: 4   1.0     13875.0
[java] subtag: 5   0.9858018018018018      13678.0
[java] subtag: 6   0.9739819819819819      13514.0
[java] subtag: 7   0.971963963963964       13486.0
[java] subtag: 8   0.9766486486486486      13551.0
[java] subtag: 9   0.9824144144144145      13631.0

Most Frequent with independent vector tags

0.8845405405405405 (Unknown Accuracy: 0.10574412532637076), Sentence Accuracy: 0.1026694045174538 Decoder Suboptimalities Detected: 0
[java] subtag: 0   0.9233873873873873      12812.0
[java] subtag: 1   0.9811891891891892      13614.0
[java] subtag: 2   0.998990990990991       13861.0
[java] subtag: 3   0.9976936936936937      13843.0
[java] subtag: 4   1.0     13875.0
[java] subtag: 5   0.9858018018018018      13678.0
[java] subtag: 6   0.9739819819819819      13514.0
[java] subtag: 7   0.971963963963964       13486.0
[java] subtag: 8   0.9766486486486486      13551.0
[java] subtag: 9   0.9824144144144145      13631.0

We see here that the results from Arabic are similar to the Syriac results we saw in that the independent taggers achieve higher individual accuracies, but a lower total accuracy. This can be explained by the following two insights:

1. The monolithic tag gets the full tag correct more often than the independent version.
2. When the monolithic tagger is wrong, it gets more subtags incorrect than the independent tagger.
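A toy example shows how both observations can hold at once. The data below is invented purely for illustration and is not taken from the experiments. The "monolithic" predictions are either entirely right or entirely wrong, while the "independent" predictions spread single-subtag errors over different tokens.

```python
def accuracies(gold, pred):
    """Return (full-tag accuracy, list of per-subtag accuracies)."""
    n = len(gold)
    full = sum(g == p for g, p in zip(gold, pred)) / n
    per_subtag = [sum(g[k] == p[k] for g, p in zip(gold, pred)) / n
                  for k in range(len(gold[0]))]
    return full, per_subtag

# Hypothetical gold tags: ten tokens, three subtags each.
gold = [("N", "m", "sg")] * 10

# Monolithic-style errors: two tokens wrong in every subtag.
monolithic = [("N", "m", "sg")] * 8 + [("V", "f", "pl")] * 2

# Independent-style errors: three tokens, each wrong in a single subtag.
independent = list(gold)
independent[0] = ("V", "m", "sg")
independent[1] = ("N", "f", "sg")
independent[2] = ("N", "m", "pl")

mono_full, mono_sub = accuracies(gold, monolithic)
ind_full, ind_sub = accuracies(gold, independent)
```

Here the independent predictions score 0.9 on every subtag against 0.8 for the monolithic ones, yet only 0.7 on the full tag against 0.8.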

## Do Random Runs Make a Difference on POS Taggers?

• Experimenter: Peter McClanahan & Robbie Haertel
• Rev:
• Date: March 4, 2008

### Problem

Are discrepancies in accuracies okay for a POS Tagger? Even if it's not using maximum entropy?

### Method

With the new Syriac data, we ran the most frequent tagger with different random seeds. We also ran the same tagger with the same random seeds.

### Results

After some testing, we found the following to be true.

• A Greedy Decoder with a Most-Frequent Scorer produced the exact same results with different random seeds.
• A Beam Decoder with a Most-Frequent Scorer produced differing results with different random seeds. The results usually differed by one or two. For a fixed seed, the results never differed.
• Any decoder with a Most-Frequent Scorer and a different order of subtags within a cluster produced different results only for the changed cluster. For example, assume 2 clusters {0,2} and {1,3,4}. Tagging with {0,2} and {1,3,4} produces different results than tagging with {0,2} and {1,4,3}, but only subtags 1, 3, and 4 are candidates for slight differences.

We ran a quick program to count the number of ties seen by the most-frequent scorer. With the monolithic tag, there are 367 words that have more than one most-frequent tag. With almost all clusterings, at least one word has an ambiguity (a tie for the max). Some clusterings produce counts close to that of the monolithic tag.
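A tie count of this kind can be computed along the following lines. This is a sketch of the idea, not the program we ran; the function name and input format are assumptions.

```python
from collections import Counter, defaultdict

def count_tied_words(tagged_tokens):
    """Count word types whose highest tag count is shared by two or more tags."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    tied = 0
    for tag_counts in counts.values():
        best = max(tag_counts.values())
        if sum(1 for c in tag_counts.values() if c == best) > 1:
            tied += 1
    return tied

# Hypothetical toy data: "a" has a tie between X and Y, "b" does not.
toy = [("a", "X"), ("a", "Y"), ("b", "X"), ("b", "X"), ("b", "Y")]
n_tied = count_tied_words(toy)
```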

Note: We never saw subtags 0-5 change.

### Conclusions

We hypothesize that a change in the order of subtags within a cluster changes the hash code of the string that is hashed in the Counter. For instance, if we swap subtag one and zero in “1#0” we obtain “0#1”; the hashcodes of these strings should be different. This allows the possibility of switches to the chosen max in ambiguous cases (the counter stores the tags in different orders based on hashcode). Thus, for the most frequent tagger, the ties will be broken differently depending on the random seed. The beam decoder is similarly affected. However, we cannot yet find evidence of why dataset ordering affects results.
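The mechanism can be illustrated in a few lines. The sketch uses Python dictionaries, which iterate in insertion order, as a stand-in for the hash-ordered Counter in the Java code. In both cases a "first maximum wins" scan makes the tie-break depend on iteration order. The tag strings are hypothetical.

```python
def most_frequent(tag_counts):
    """Return the first tag with the maximum count in iteration order."""
    return max(tag_counts, key=tag_counts.get)

def most_frequent_stable(tag_counts):
    """Same, but ties are broken by sorted tag name, independent of storage order."""
    return max(sorted(tag_counts), key=tag_counts.get)

# The same tied counts stored in two different orders.
order_a = {"N#sg": 5, "V#sg": 5}
order_b = {"V#sg": 5, "N#sg": 5}

unstable = (most_frequent(order_a), most_frequent(order_b))
stable = (most_frequent_stable(order_a), most_frequent_stable(order_b))
```

An explicit tie-break such as the sorted variant is one way to make the most-frequent scorer independent of hash order; whether that is the right fix for our code is a separate question.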

We finally note that, with George's help, this problem has disappeared in at least one case when the convergence criterion was tightened.

## Initial Sentence Selection

• Experimenter: Robbie Haertel
• Rev: 182
• Date: Jan 14, 2008

### Problem

How do I choose the first sentence (without unsupervised methods)?

Basic idea: we know that the user is going to have to correct every tag in the first sentence, hence we can directly compute the cost. Furthermore, we can estimate the accuracy we might achieve by choosing a particular sentence (see below). Thus, we choose the sentence that gives us the biggest bang for our buck.

### Method

First, we assume that when the oracle corrects a word type, the machine will correctly tag that word with 100% accuracy in the rest of the data. Therefore, we can estimate what the tagger accuracy would be after seeing each sentence by summing the number of times each word type in the sentence (i.e. duplicates removed) occurs in the rest of the sentences and dividing by the total number of words. Since we are interested in bang-per-buck, we divide this estimate by the estimate of our cost. We can estimate the number of words to be changed in that sentence as the number of occurrences of words NOT in the training data (which, for the very first sentence, is all of them).

After some initial testing, two approaches were tried. In the first (QBP2), the accuracy is updated as explained, but the cost estimate is not. In other words QBP2 always assumes that the user will have to correct every word. QBP3 attempts to estimate how many words will need to be changed as described above.
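The scoring described above can be sketched as follows. This is an illustration under stated assumptions, not the versioned implementation. It assumes the accuracy estimate counts only word types not already in the training data, and the function and parameter names are hypothetical.

```python
from collections import Counter

def qbp_scores(sentences, known_types=frozenset(), estimate_cost=True):
    """Bang-per-buck score for each candidate sentence (a list of words).

    estimate_cost=False corresponds to QBP2 (every word is assumed to need
    correcting); estimate_cost=True corresponds to QBP3 (only words not in the
    training data are assumed to need correcting).
    """
    total_words = sum(len(s) for s in sentences)
    corpus_counts = Counter(w for s in sentences for w in s)
    scores = []
    for s in sentences:
        in_sentence = Counter(s)
        new_types = set(s) - known_types
        # Occurrences of the newly learned types in the rest of the data.
        gain = sum(corpus_counts[w] - in_sentence[w] for w in new_types) / total_words
        if estimate_cost:
            cost = sum(1 for w in s if w not in known_types)
        else:
            cost = len(s)
        scores.append(gain / cost if cost else 0.0)
    return scores

# Hypothetical toy corpus.
corpus = [["the", "cat"], ["the", "dog", "the"], ["a", "cat"]]
qbp3 = qbp_scores(corpus)
qbp2 = qbp_scores(corpus, estimate_cost=False)
```

With no training data yet, both variants give identical scores. Once `known_types` is non-empty, the two cost estimates diverge.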

### Results

All three algorithms were run a total of 1000 times. Even though QBP2 and QBP3 are deterministic, ties are broken arbitrarily.

As can be seen from the figure, QBP2 and QBP3 yield very similar results for the first three sentences (arguably, until about 18 hours of cost). QBP2 has a slight advantage for points two and three (the algorithms are algebraically equivalent for the first data point). This may indicate that, since we are overestimating accuracy, it's also better to overestimate cost–at first.

The random baseline, on average, starts with a minimum cost of around 3.7. On the other hand, QBP2 and QBP3 have 2 and 3 data points, respectively, before that cost. By the time the baseline has 3 data points, the QBP variants have around 8. This is important because the other algorithms need a minimum of 1-3 data points (depending on the algorithm and on the waiting scheme, i.e. computer waits or human waits) before they can begin. Thus, even though the QBP variants seem to offer little advantage over the baseline when the baseline actually starts (.92 and .98 for QBP2 and QBP3, respectively, but note that these are interpolated over a large enough range to be wary of the actual results), smarter algorithms may be able to start with a cost below one hour rather than around 4 hours. These savings hopefully accrue.

### Conclusion

Query by “perfect information” has been shown to start earlier than the baseline.

## Old QBU vs. new QBU (with bug fix)

• Experimenter: Peter McClanahan
• SVN Revision Number: 157, 158
• Date: December 17, 2007

### Purpose

To see the effects of the new bug fix.

### Method

Ran QBU with revision number 157 and compared it to 158. 5 averaged runs on the supercomputer.

### Results

As seen below, there is not a significant change. It is interesting to note that the fixed version actually does better for part of the curve. I haven't thought much about why this is, but there should be a logical explanation.

Here's also a quick version of time comparisons.

## QBUE

• Experimenter: George Busby
• SVN Revision Number: 130
• Date: October 27, 2007

### Purpose

Implement true query by uncertainty using the forward entropy algorithm, as QBUE. Compare this to the previous approximation, QBU, which estimated the true sequence entropy by computing some of the entropy using the Viterbi sequence.
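For intuition, here is one way to compute an exact sequence entropy with a single forward pass. It is a sketch that assumes a first-order, locally normalized model (an initial distribution plus one transition matrix per position, already conditioned on the input) and is not necessarily how QBUE is implemented. A brute-force enumeration is included only to check the result on a tiny example, and all names and numbers are hypothetical.

```python
import math
from itertools import product

def entropy(dist):
    return -sum(p * math.log(p) for p in dist if p > 0)

def sequence_entropy(init, transitions):
    """H(Y_1..Y_T) by the chain rule, propagating state marginals forward.

    init[i] = p(y_1 = i); transitions[t][i][j] = p(y_{t+2} = j | y_{t+1} = i).
    """
    marginal = list(init)
    total = entropy(init)
    for T in transitions:
        # Expected entropy of the next tag given the current tag.
        total += sum(marginal[i] * entropy(T[i]) for i in range(len(marginal)))
        marginal = [sum(marginal[i] * T[i][j] for i in range(len(marginal)))
                    for j in range(len(T[0]))]
    return total

def brute_force_entropy(init, transitions):
    """Enumerate every tag sequence; feasible only for tiny examples."""
    total = 0.0
    for path in product(range(len(init)), repeat=len(transitions) + 1):
        p = init[path[0]]
        for t, T in enumerate(transitions):
            p *= T[path[t]][path[t + 1]]
        if p > 0:
            total -= p * math.log(p)
    return total

init = [0.7, 0.3]
transitions = [[[0.9, 0.1], [0.4, 0.6]], [[0.5, 0.5], [0.2, 0.8]]]
exact = sequence_entropy(init, transitions)
check = brute_force_entropy(init, transitions)
```

The forward version costs time linear in the sentence length, whereas enumeration is exponential.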

### Method

Using 5 generated random seeds (1192729648377, 1192729649225, 1192729648535, 1192729648011, 1192729648770), we ran Baseline, LS, QBU, QBUV, and QBUE and produced graphs.

### Results

The data is located at entropy:/home/data/experiments/alfa/QBUE, including spreadsheets of averaged runs and images. First notice, especially from the 4th graph, that QBU and QBUE are very similar. Further experiments will follow to test whether the difference is statistically significant.

Graphs: plots labeled 0%-100% and 90%-96%; difference from QBU (y-axis: sentences trained on; x-axis: difference in accuracy from QBU's accuracy); difference from QBUV (y-axis: sentences trained on; x-axis: difference in accuracy from QBUV's accuracy).

## QBU Derivative

• Experimenters: Robbie Haertel and Peter McClanahan
• SVN Revision Number: 127
• Date: October 17, 2007

### Purpose

To try to find a meaningful place to switch from QBU to random.

### Method

We ran baseline and QBU with no cutoffs and no switchover. The derivative is a simple one, computed at each point by taking the following point and subtracting the previous point.

### Results

Here is a graph showing the derivatives and the 10-period moving average of the lines in order to remove noise.

(Image not found.)
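The derivative and smoothing described in this section can be sketched as follows. As in the Method above, the difference is not divided by the step size. A trailing window is assumed for the moving average, and the function names are hypothetical.

```python
def central_difference(values):
    """d[i] = values[i + 1] - values[i - 1] for every interior point."""
    return [values[i + 1] - values[i - 1] for i in range(1, len(values) - 1)]

def moving_average(values, period=10):
    """Trailing moving average; the first output covers values[0:period]."""
    return [sum(values[i - period + 1:i + 1]) / period
            for i in range(period - 1, len(values))]

# Hypothetical accuracy curve.
accuracy = [1, 2, 4, 7]
slopes = central_difference(accuracy)
smoothed = moving_average([1, 2, 3, 4], period=2)
```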

## Switch-over Point

• Experimenters: Robbie Haertel and Peter McClanahan
• SVN Revision Number: 127
• Date: October 17, 2007

### Purpose

To see where we should switch over from a specific active learning algorithm to a random baseline.

### Method

Since zero count cutoffs were the best, we used those, with a batch-query size of 1 sentence per iteration. We ran switchover points at 50, 100, and 200 sentences compared to a baseline and to a QBU run with no switch over point.
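The switchover scheme can be sketched as a selection loop. This is an illustration only. In the real system the model is retrained and sentences are rescored after every query, whereas this sketch uses a fixed, hypothetical `uncertainty` function (sentence length) to stay self-contained.

```python
import random

def selection_order(pool, switchover, uncertainty, seed=0):
    """Pick one sentence per iteration: the most uncertain one before the
    switchover point, a uniformly random one afterwards."""
    rng = random.Random(seed)
    remaining = list(pool)
    order = []
    iteration = 0
    while remaining:
        if iteration < switchover:
            choice = max(remaining, key=uncertainty)  # active learning phase
        else:
            choice = rng.choice(remaining)            # random baseline phase
        remaining.remove(choice)
        order.append(choice)
        iteration += 1
    return order

# Hypothetical pool; string length stands in for model uncertainty.
pool = ["a", "bbb", "cc", "dddd"]
order = selection_order(pool, switchover=2, uncertainty=len)
```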

### Results

Here is a graph clearly showing that each of the tested switchover points is still too early.

A graph comparing cumulative word changes and batch size.

## Feature Cutoff & Derivative

• Experimenters: Robbie Haertel and Peter McClanahan
• SVN Revision Number: 121
• Date: October 12, 2007

### Purpose

To investigate the effects of the feature selector on the efficacy of active learning, especially during initial iterations. As a secondary goal, we provide an initial attempt at identifying the crossover point: the point after which expensive querying methods are no more effective than random selection.

### Method

Using 10% of the PTB (seed 1192131744), we removed feature cutoffs completely and used a batch query size of 1 sentence. A batch query size of 1 sentence seems to be *much* more efficient than the batch query size of 100, as we had originally thought.

### Results

Initial results show that batch query size of 1 sentence and no feature cutoffs prove to be much more effective than earlier methods used.

On ten percent of the PTB, with QBU we can reach 90% accuracy with 101 sentences (5198 words). Previous experiments reached 90% with 300 sentences (18364 words). This is a third of the number of sentences required and (possibly more importantly) 28.3% of the words.

It is also important to note that the baseline does better as well with the changes made; however, QBU outperforms the baseline by approximately the same factor as with the count cutoffs; as far as we've run, we consistently see that 1/3 of the data using QBU produces the same accuracy as the new baseline.

Results with 100% of the PTB are slightly better.

Also note that for both the baseline and QBU the slopes of the respective curves are approximately equal at around 125 sentences (this is by eyeball). This possibly indicates that we are okay to apply count cutoffs around this point, though the exact point is probably dataset dependent (including total amount of data, highest possible accuracy, etc.).

Accuracy per Iteration

On the first graph, the y-axis is accuracy, and the x-axis is the number of iterations (also the number of sentences trained on). On the second graph the axes are reversed (hence the graph is mirrored about the y=x line). This simplifies visual estimation of the benefit over baseline.

We hypothesize that the crossover point (the point at which using a random baseline thereafter will provide the same results as other schemes) is where the slopes of an active learning technique and the random baseline cross. We therefore considered the (discrete) derivative of these curves. The derivative was estimated using the central difference (i.e. next accuracy - previous accuracy). The results are as follows:

(Y-axis is the derivative, x-axis is iteration/sentence number)

The graph is fairly noisy, which makes it difficult to estimate precisely where the slope of the QBU line equals that of the baseline; it may be anywhere from 3-50 sentences (or more, but not likely)! Averaging over several runs should remove most of this noise.

We next ran QBU for three iterations after which we switched to the random baseline (code not versioned); the process was repeated for a switchover point of 5:

I wish to further note that while it is possible to compute the derivative of the QBU curve in a real situation, the derivative of the random curve will not usually be available, hence this could not be used as a stopping criterion in a real-world task.

### Future Work

• Smooth the derivatives by averaging over several runs
• See if results (particularly crossover) hold for other algorithms, in particular, QBUV and a more robust QBC
• Suggest instead of purely random, we leave the remaining sentences ordered the way the last worthwhile iteration of QBU ordered them. They should be in that order anyways, and in practice, that could provide slightly better results for any given round. (By the way, this is essentially what was happening in the bug that only used the model from the first round to score sentences)

## 200-hour Syriac Active Learning

• Experimenter: Peter McClanahan
• SVN Revision Number: 116
• Date: October 12, 2007

### Purpose

Same as 100-hour experiment

### Method

Similar method. I validated that we are using fast maxent (at least for this experiment set).

Pending

## 100-hour Syriac Active Learning

Note - This experiment is not accurate due to the October 10 bug

• Experimenter: Peter McClanahan
• SVN Revision Number: 116
• Date: October 2, 2007

### Purpose

1) To plot time graphs for Syriac active learning.<br> 2) To plot max uncertainty vs. min uncertainty for each iteration (i.e. top value and last value of queue)

### Method

This wasn't an exhaustive test; I just put a few jobs on the supercomputer in order to see what times were for Syriac active learning. I only tested QBU, Baseline, and LS<br> These graphs are only one run; keep that in mind<br> We should run this again with other “comp” values

## Quick Syriac Data Analysis

• Experimenter: Peter McClanahan
• SVN Revision Number: 116 - 142
• Date: September 7, 2007 - updated November 5, 2007

### Purpose

To weigh the possibilities of creating different trie types

### Method

You can use

edu.byu.nlp.alfa.activeLearner.syriac.tools.ComputeDatasetStats

to get stats about the dataset. I added a few more lines of code, which currently aren't checked in, to see the number of tag/word and word/tag combinations.

### Results

Here are the summarized results for the dataset:

Rare word tokens: 32661.0
Rare word types: 14475
Rare tag tokens: 6414.0
Rare tag types: 1861
Word tokens: 87074.0
Word types: 15214
Tag tokens: 87074.0
Tag types: 2370
Tag tokens in devtest: 10791.0
Tag types in devtest: 1049
Word tokens in devtest, but not in training: 1050.0
Word types in devtest, but not in training: 1006
Tag tokens in devtest, but not in training: 78.0 (.0898%)
Tag types in devtest, but not in training: 75 (3.2%)
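The "in devtest, but not in training" counts can be computed along these lines. This is a sketch of the idea and not the code in ComputeDatasetStats; the function name and toy data are hypothetical.

```python
def unseen_stats(train_items, devtest_items):
    """Return (tokens, types) that occur in devtest but never in training."""
    seen = set(train_items)
    unseen_tokens = [x for x in devtest_items if x not in seen]
    return len(unseen_tokens), len(set(unseen_tokens))

# Hypothetical toy data; items may be words or tags.
tokens, types = unseen_stats(["a", "b"], ["a", "c", "c", "d"])
```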

## Full Sweep on PTB

Note - This experiment is not accurate due to the October 10 bug

• Experimenter: Peter McClanahan
• SVN Revision Number: 113
• Date: August 14, 2007

### Purpose

To determine what possible changes we would see if we ran our results on the full treebank.

### Method

I ran Active Learning (AL) over the full PTB using the following COMP parameters: Baseline, LS, QBU, and QBUV. Each experiment was run 5 times, and the results were averaged by COMP parameter.

### Commands

For similar results, use the following command on Marylou4:

python scripts/submit.py -t 200 -P 100 -a 1 -cQBUV -cQBU -cLS -cBaseline -m 1 -n5 -v -dPTB -xActiveLearner.xml

### Results

Surprisingly, our POS Tagger did much better on the full PTB set than I would have guessed. Averaged final values were: 96.6858 (QBUV), 96.6871 (QBU), 96.6865 (LS), and 96.6830 (Baseline) with an average over all 20 runs equaling 96.6856. Our previous high using the first 25% of the PTB ended around 95.7 percent.

Figures 1-3

Figure 1 shows the Baseline, LS, QBU, and QBUV on the first 25 percent of the PTB. It is worth noticing that there is no distinguishable difference (at this resolution) between LS, QBU, and QBUV. All three, however, are superior to the random baseline.

Figure 2 shows similar results to those of Figure 1 – QBU, LS, and QBUV all do much better than the baseline. There are, however, a few interesting results we can see from this experiment that I will mention later.

Figure 3 shows three major groups: 1) the baselines, 2) algorithms at 25% of the PTB and 3) algorithms at 100% of the PTB. From this graph it is easy to see the advantage of having more data – the accuracies grow quicker.



Figures 4-9

Figures 3-5 show that with more data, QBU, QBUV, and LS all tend to pick longer sentences, getting more 'bang for their buck.' Thus, 100% of the PTB has a distinct advantage, because there are more long sentences.

Figures 6-8 confirm that this is indeed happening, as the number of words changed by each algorithm is significantly higher than for the baseline. (The graphs for comparison with total words are not included because all of those graphs have looked remarkably similar to the words_changed graphs.) With this metric, however, the algorithms use many more words than the baseline to achieve similar accuracies. To me, this raises the question, “When should we switch to the baseline?”

Here are the results I have found most interesting:

1. Figure 1 appears to level off near the end of its tail, but Figure 2 shows that this isn't the case. At 5,000 sentences, there is still about 1 percent of room to grow.

2. Figure 3 demonstrates that more data allows the AL to grow much more quickly than with small amounts of data.

3. Figures 4-6 indicate that when counting sentences, more data is much better. When counting words, however, we need a way to decide when to switch to the baseline, so that we don't spend long sentences on small accuracy gains that mostly cost the annotators more time.

### Future Work

Apply the results of the user study to see whether sentences or words seem to make the most difference in annotating a sentence.

Find a way where we can “back off” to the random baseline so that we don't send the oracle long sentences when short ones will be just as effective.

## Clustering Syriac

• Experimenter: Peter McClanahan
• SVN Revision Number: ?
• Date: Summer of 2007 sometime

### Purpose

To determine if clustering can help reduce time and improve accuracy for POS tagging. Clustering will hopefully help us find independence assumptions, therefore allowing us to use multiple POS taggers.

### Method

Run different agglomerative clustering algorithms with a distance metric determined by mutual information between subtag pairs. I ran the following:

• Single Link
• Average Link
• Complete Link
• Complex Link (with Robbie's normalization)
• Normalized Total Correlation
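The general approach can be sketched as follows. This is an illustration and not the clustering code used for the table. It assumes that a higher mutual information means a smaller distance, so the most informative pair is merged first. It covers only single and complete link, and the subtag names and values are hypothetical toy data.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two aligned value lists."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def agglomerate(columns, n_clusters, linkage=max):
    """columns: dict of subtag name -> list of values, one per token.

    linkage=max on similarity is single link; linkage=min is complete link.
    """
    names = list(columns)
    mi = {frozenset((a, b)): mutual_information(columns[a], columns[b])
          for a, b in combinations(names, 2)}
    clusters = [frozenset([name]) for name in names]
    while len(clusters) > n_clusters:
        def similarity(pair):
            c1, c2 = pair
            return linkage(mi[frozenset((a, b))] for a in c1 for b in c2)
        c1, c2 = max(combinations(clusters, 2), key=similarity)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters

# Hypothetical toy data: the two gender subtags always agree; tense is independent.
columns = {
    "gender": ["m", "f", "m", "f"],
    "suffixGender": ["m", "f", "m", "f"],
    "tense": ["p", "p", "q", "q"],
}
clusters = agglomerate(columns, n_clusters=2)
```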

### Table

| Subtag | Complex | Hand-picked | Single Link | Average | HP 2 | HP 3 | HP 4 | HP 5 | Newest 1 | Newest 2 | HP 6 | HP 7 | HP 8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Enclitic | 0.9893 | 0.9917 | 0.9857 | 0.9857 | 0.9893 | 0.9917 | 0.9902 | 0.9909 | 0.9894 | 0.9880 | 0.9857 | 0.9857 | 0.9910 |
| suffixGender | 0.9829 | 0.9851 | 0.9881 | 0.9881 | 0.9869 | 0.9881 | 0.9881 | 0.9881 | 0.9851 | 0.9871 | 0.9881 | 0.9881 | 0.9881 |
| suffixPerson | 0.9900 | 0.9882 | 0.9904 | 0.9900 | 0.9904 | 0.9904 | 0.9896 | 0.9893 | 0.9882 | 0.9900 | 0.9904 | 0.9898 | 0.9895 |
| suffixNumber | 0.9933 | 0.9914 | 0.9942 | 0.9931 | 0.9942 | 0.9942 | 0.9927 | 0.9929 | 0.9914 | 0.9931 | 0.9942 | 0.9929 | 0.9925 |
| suffixContraction | 0.9968 | 0.9968 | 0.9975 | 0.9975 | 0.9975 | 0.9975 | 0.9975 | 0.9977 | 0.9968 | 0.9973 | 0.9975 | 0.9975 | 0.9975 |
| Prefix | 0.9932 | 0.9914 | 0.9941 | 0.9931 | 0.9941 | 0.9941 | 0.9922 | 0.9929 | 0.9914 | 0.9931 | 0.9941 | 0.9929 | 0.9925 |
| Gender | 0.9318 | 0.9411 | 0.9410 | 0.9327 | 0.9400 | 0.9411 | 0.9406 | 0.9432 | 0.9442 | 0.9327 | 0.9442 | 0.9451 | 0.9436 |
| Person | 0.9430 | 0.9447 | 0.9444 | 0.9429 | 0.9389 | 0.9447 | 0.9435 | 0.9477 | 0.9438 | 0.9342 | 0.9430 | 0.9481 | 0.9467 |
| Number | 0.9618 | 0.9699 | 0.9706 | 0.9639 | 0.9702 | 0.9699 | 0.9690 | 0.9710 | 0.9726 | 0.9639 | 0.9726 | 0.9725 | 0.9715 |
| State | 0.9565 | 0.9625 | 0.9629 | 0.9539 | 0.9629 | 0.9625 | 0.9609 | 0.9627 | 0.9676 | 0.9643 | 0.9676 | 0.9632 | 0.9625 |
| Tense | 0.941 | 0.9425 | 0.9426 | 0.9400 | 0.9421 | 0.9425 | 0.9426 | 0.9440 | 0.9416 | 0.9383 | 0.9410 | 0.9461 | 0.9448 |
| Form | 0.9347 | 0.9364 | 0.9360 | 0.9336 | 0.9320 | 0.9364 | 0.9352 | 0.9380 | 0.9363 | 0.9334 | 0.9347 | 0.9376 | 0.9368 |
| Grammatical Category | 0.9249 | 0.9313 | 0.9323 | 0.9276 | 0.9302 | 0.9313 | 0.9319 | 0.9337 | 0.9373 | 0.9325 | 0.9373 | 0.9350 | 0.9339 |

## Fast Maxent

• Experimenter: Robbie Haertel
• SVN Revision Number: 81

### Purpose

To determine if FastMaxent needs to be tuned better, i.e. if it is hurting our accuracy.

### Method

Run active learning over the full PTB for 10 iterations starting with 1% of the (sentence) data and adding 3960 sentences each iteration. The comparator used is irrelevant (LS was used). Also run active learning over the full PTB using all data. If the difference in the final accuracy (i.e. when active learning is done) between the two is not statistically significant, then fast maxent is properly tuned and appropriate.

The following commands were executed on Marylou4 on 3/9/2007 @ 4:06 pm:

python scripts/submit.py -cLS -p1 -isentence -s3960 -bsentence -P100 -Tsentence -n 10
python scripts/submit.py -cLS -p100 -isentence -s1 -bword -P100 -Tsentence -n 10 -t2

### Results

I somehow ended up with 11 runs of the full maxent (from an earlier run), so I threw out the smallest value when computing the statistics. The average final accuracy was 0.966870605 for full maxent and 0.966891912 for incremental maxent. Fast maxent had higher accuracy on 6 paired trials. These results are not significant at the .95 level (and even if they were, they would favor the incremental maxent).
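As a rough sanity check on the paired comparison, an exact two-sided sign test can be computed from the win count alone. This is only a sketch: the report does not say which significance test was actually used, and the total of 10 paired trials with no ties is an assumption (inferred from `-n 10` and from discarding one of the 11 full maxent runs).

```python
from math import comb


def sign_test_p_value(wins, n):
    """Exact two-sided sign test p-value under H0: each system wins a pair with probability 0.5."""
    k = max(wins, n - wins)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Assumed: 10 paired trials, fast maxent ahead on 6 of them.
p = sign_test_p_value(wins=6, n=10)
print(f"p = {p:.3f}")  # p = 0.754
```

Under these assumptions a 6-to-4 split is far from significant at the .95 level, which is consistent with the conclusion above.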

### Future Work

• Try an experiment that only updates maxent once (preferably with no features being added and/or cutoff)
• See if count cutoffs are playing a role
• Verify results on Poetry data (smaller data set, lower overall accuracy)
• Would it be necessary to fix the random seed?

## Word-based batch query

• Experimenter - Peter
• SVN Revision - 74
• Date - Feb 21, 2006
• This continuing experiment is to see if a word-based batch query will remove some of the “interesting results” we saw in the word-based batch query.
• The first test checked whether the size of the batch query matters for words. I tested QBU, QBUN, and WQBU at sizes of 10, 50, 100, 500, 1000, 5000, and 10000. The different sizes make a small difference (mostly in the early stages), but the curves soon converge and become very similar.
• Since this experiment is in progress, I will continue updating the wiki. After completion, I will notify everyone on the ALA list.
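For readers unfamiliar with the setup, word-based batch selection can be sketched as below. This is a hypothetical illustration, not the ALFA code: it assumes QBU means query by uncertainty with entropy as the uncertainty score, and the names `entropy`, `select_batch`, and the toy distributions are invented for the example. QBUN and WQBU are not modeled.

```python
import heapq
import math


def entropy(dist):
    """Shannon entropy (in bits) of a predicted tag distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def select_batch(candidates, batch_size):
    """Return the ids of the batch_size most uncertain words, most uncertain first.

    candidates maps a word id to the model's predicted tag distribution.
    """
    return heapq.nlargest(
        batch_size, candidates, key=lambda word_id: entropy(candidates[word_id])
    )


# Hypothetical predictions for four unannotated words.
candidates = {
    "w1": [0.5, 0.5],
    "w2": [0.9, 0.1],
    "w3": [1.0, 0.0],
    "w4": [0.6, 0.4],
}

batch = select_batch(candidates, batch_size=2)
print(batch)  # ['w1', 'w4']
```

With a larger `batch_size` the model is retrained less often between queries, which is the trade-off the size sweep above is probing.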

## QBC

### Experiment 1

Experimenter: Marc Carmen<br/> SVN Revision Numbers:<br/>

• ALFA 35<br/>
• Statistical NLP 28<br/>

Date Completed: 2/5/2007<br/> Purpose: Get initial QBC results using new framework along with baseline<br/> Path on Entropy: /home/data/experiments/alfa/bnc/qbc/experiment1<br/> Results:<br/>

• All of the experiments achieved around 88% overall accuracy<br/>

## Batch Query

Experimenter: Marc Carmen<br/> SVN Revision Numbers<br/>

• ALFA 35<br/>
• Statistical NLP 28<br/>

Path on Entropy: /home/data/experiments/alfa/pennTreeBank/batchQuery

## MC

Experimenter: George Busby<br/> Purpose: This is a small report of all the results and work on the MC algorithm to date.<br/> Results: