For each experiment,
As a reminder, we agreed upon the following precise definitions to use while talking about Syriac words.<br>
Stem - the inflected part of the word between the prefix and the suffix. Stripping the prefix and suffix leaves the stem.<br>
Baseform - the form the stem is derived from.<br>
Root - the form the baseform is derived from. Typically, this is the tri-literal root.<br>
Headword - this form is used only in conjunction with a dictionary. Depending on the “helpfulness” of the dictionary, root, baseform, and stem may all be headwords.<br>
Word Type - a unique word (stem + prefix + suffix). This is white-space delimited text.<br>
Word Token - an actual word (stem + prefix + suffix). The phrase “this word is a word token” has 5 word types and 6 tokens.<br>
We wanted to know how difficult it is to map from a stem to a baseform or a root. When an annotator annotates a stem, we want to be able to link that to a dictionary entry (possibly all three: the stem itself, the baseform, and the root).
I computed the average ambiguity and the expected ambiguity for the following three mappings: stem -> root, stem -> baseform, and baseform -> root. The average ambiguity is the sum of the ambiguity for each type divided by the total number of types. The expected ambiguity is the sum of the ambiguity for each type times its probability. There were a number of issues, as I will discuss in the results section.
All counts are for types (not tokens).
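The two measures defined above can be sketched as follows. This is a minimal illustration, not the code that produced the numbers: the function names and the toy stems are invented, and it assumes that a type's probability is its relative token frequency, which the notes do not state explicitly.

```python
from collections import Counter

def average_ambiguity(mapping):
    """Sum of per-type ambiguity divided by the number of types."""
    return sum(len(targets) for targets in mapping.values()) / len(mapping)

def expected_ambiguity(mapping, token_counts):
    """Per-type ambiguity weighted by the type's relative token frequency
    (assumption: probability = token count / total tokens)."""
    total = sum(token_counts[t] for t in mapping)
    return sum(len(targets) * token_counts[t] / total
               for t, targets in mapping.items())

# Toy data, invented for illustration: stem -> set of candidate baseforms.
stem_to_baseforms = {"ktb": {"ktb"}, "mlk": {"mlk", "mlkA"}, "br": {"br"}}
stem_token_counts = Counter({"ktb": 5, "mlk": 4, "br": 1})

avg = average_ambiguity(stem_to_baseforms)                      # (1 + 2 + 1) / 3
exp = expected_ambiguity(stem_to_baseforms, stem_token_counts)  # (5*1 + 4*2 + 1*1) / 10
```

The bracketed pairs in the results below (for example `[16763 / 16440]`) have the same shape as the numerator and denominator of `average_ambiguity`.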
NOTE: FOR THESE TESTS, STEM INCLUDES THE SUFFIX

Stem is not vocalized:
Stem-baseform average ambiguity: 1.019647201946472 [16763 / 16440]
Stem-root average ambiguity: 1.012287104622871 [16642 / 16440]
Baseform-root average ambiguity: 1.0181040157998684 [3093 / 3038 ]
Stem-baseform expected ambiguity: 1.087058823529223
Stem-root expected ambiguity: 1.0675695394434954
Baseform-root expected ambiguity: 1.049630642954838

With vocalized stem:
Stem-baseform average ambiguity: 1.0043230819743898 [18353 / 18274]
Stem-root average ambiguity: 1.0020247345956004 [18311 / 18274]
Baseform-root average ambiguity: 1.0181040157998684 [3093 / 3038 ]
Stem-baseform expected ambiguity: 1.0162152302779135
Stem-root expected ambiguity: 1.008399452804122
Baseform-root expected ambiguity: 1.049630642954838

Ambiguity from word type -> vocalized stem AND morphological tag:
Type-stem+tag average ambiguity: 1.1703739620583065 [19310 / 16499]
Type-stem+tag expected ambiguity: 1.797008663930918
This last test (although not exact) represents a potentially ideal situation with the dictionary. It asks, “from the word type (a word that's unsegmented, unvoweled, and not annotated), what is the ambiguity we're looking at?” In other words, how many choices would I have to show on average in a list the user would choose from? The answer appears to be fewer than two. To make sure that sinks in: show me the characters of an unvoweled word, and I can show you (on average) fewer than two choices – one of which is correct. Now that's a pretty quick way of annotating! With this in mind, perhaps we should reconsider how our annotation process is set up.
Known words WER: 13.81 (48237 / 55966) 9/8/2008
Known words WITHOUT case ending WER: 2.430 (54606 / 55966) 9/8/2008
Unknown words WER: 74.977 (832 / 3325) 9/8/2008
Unknown words WITHOUT case ending WER: 68.752 (1039 / 3325) 9/8/2008
All words WER: 17.240 (49069 / 59291) 9/8/2008
All words WITHOUT case ending WER: 6.149 (55645 / 59291) 9/8/2008

Kuebler, Mohamed
All words WITHOUT case ending WER: 6.64

Zitouni, Sorensen, Sarikaya
All words WER: 17.3
All words WITHOUT case ending WER: 7.2

Habash and Rambow
All words WER: 14.9
All words WITHOUT case ending WER: 5.5
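For reading the figures above: the bracketed pairs appear to be (correct words / total words), since that interpretation reproduces the reported percentages. A small sketch of the arithmetic, with a hypothetical helper name:

```python
def word_error_rate(correct, total):
    """WER as a percentage, given the number of correctly vocalized words."""
    return 100.0 * (1.0 - correct / total)

# Counts taken from the results listed above.
known_wer = word_error_rate(48237, 55966)
all_wer = word_error_rate(49069, 59291)
```

Rounded to two decimals, these give 13.81 and 17.24, matching the "Known words" and "All words" lines.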
These are stats about the data split suggested by Zitouni et al (2006).
Num of train tokens: 340879
Num of test tokens: 59291
Num of tokens: 400170
Num of train types: 31968
Num of test types: 11361
Num of types: 43329
Num of unknown types: 2700
Num of unknown tokens: 3325
Num of ambiguous types in training: 8280
Num of ambiguous types in test: 2364
Percent of test that are unknown tokens: 0.05607933750484897
Percent of tokens that are ambiguous types: 0.6418147287402853
Percent of tokens in test set that are ambiguous types: 0.5378725270277108
Average ambiguity: 1.4082715963904082
The latest two papers about vocalization (Zitouni et al., and Kubler) use a character-granularity. That is, instead of predicting the voweling of a word, they predict whether or not a vowel occurs after a given letter, and if it does, which vowel. We theorize that using a character-granularity optimizes the character error rate with respect to accuracy, but leaves the word error rate lagging. Thus, we predict that a word-level approach can ultimately achieve a lower word error rate. For known words (seen in training), we train a maxent classifier for each word type. This, in and of itself, is novel. Robbie might recall something I don't, but the only paper I've read about word-level approaches uses an HMM and did not perform outstandingly. Since a word-level approach for unknown words is futile (plus we have no classifier for them!), we back off to what we think is a sub-par approach–namely, the character-based model. Like Zitouni et al., we will train an MEMM, with each “node” in the MEMM representing a character.
Known words WER: 13.81 (48237 / 55966)
Known words WITHOUT case ending WER: 3.463 (54028 / 55966)
Unknown words WER: 75.038 (830 / 3325)
Unknown words WITHOUT case ending WER: 43.759 (1870 / 3325)
All words WER: 17.247 (49065 / 59291)
All words WITHOUT case ending WER: 5.723 (55898 / 59291)

Zitouni, Sorensen, Sarikaya
All words WER: 17.3
All words WITHOUT case ending WER: 7.2

Habash and Rambow
All words WER: 14.9
All words WITHOUT case ending WER: 5.5
For each word, we extract the following features:
The following results are preliminary and tested only on the Arabic Treebank part 1. Habash and Rambow report on parts 1 and 2 separately. We are sure that our part 1 data is correct. We use a split similar to Habash and Rambow's.
Buckwalter tags are non-vectorized, segmented tags of the form prefixTag+stemTag+suffixTag+caseTag; only the stemTag is required.
Vectorized tags are achieved by mapping the Buckwalter tags to a vectorized tag. The vectorized tag is the same vector that Hajic uses in one of his papers (I believe). I'll check this. Note, the “POS” tag is subtag 0.
Results:

Most-frequent tagger with Buckwalter tags
0.8934054054054054 (Unknown Accuracy: 0.2754569190600522), Sentence Accuracy: 0.11088295687885011
Decoder Suboptimalities Detected: 0

Maxent tagger with Buckwalter tags
0.9432072072072072 (Unknown Accuracy: 0.664490861618799), Sentence Accuracy: 0.36960985626283366
Decoder Suboptimalities Detected: 0

Most-frequent tagger with monolithic vector tags
0.8961441441441441 (Unknown Accuracy: 0.3002610966057441), Sentence Accuracy: 0.12320328542094455
Decoder Suboptimalities Detected: 0
[java] subtag: 0 0.9198558558558558 12763.0
[java] subtag: 1 0.9494774774774775 13174.0
[java] subtag: 2 0.9564684684684684 13271.0
[java] subtag: 3 0.9556036036036036 13259.0
[java] subtag: 4 0.9564684684684684 13271.0
[java] subtag: 5 0.9527207207207207 13219.0
[java] subtag: 6 0.9522162162162162 13212.0
[java] subtag: 7 0.9518558558558559 13207.0
[java] subtag: 8 0.9378018018018018 13012.0
[java] subtag: 9 0.9531531531531532 13225.0

Maxent tagger with monolithic vector tags
0.9437837837837838 (Unknown Accuracy: 0.6618798955613577), Sentence Accuracy: 0.38193018480492813
Decoder Suboptimalities Detected: 0
[java] subtag: 0 0.9512072072072072 13198.0
[java] subtag: 1 0.9757837837837838 13539.0
[java] subtag: 2 0.9833513513513513 13644.0
[java] subtag: 3 0.9823423423423423 13630.0
[java] subtag: 4 0.9834234234234234 13645.0
[java] subtag: 5 0.9790990990990991 13585.0
[java] subtag: 6 0.9775135135135136 13563.0
[java] subtag: 7 0.976072072072072 13543.0
[java] subtag: 8 0.9796036036036037 13592.0
[java] subtag: 9 0.9795315315315315 13591.0

Most-frequent tagger with vector tags (fully independent)
0.8849009009009009 (Unknown Accuracy: 0.10574412532637076), Sentence Accuracy: 0.10882956878850103
Decoder Suboptimalities Detected: 0
[java] subtag: 0 0.9231711711711712 12809.0
[java] subtag: 1 0.9626666666666667 13357.0
[java] subtag: 2 0.9772972972972973 13560.0
[java] subtag: 3 0.9771531531531531 13558.0
[java] subtag: 4 0.9781621621621621 13572.0
[java] subtag: 5 0.9667747747747748 13414.0
[java] subtag: 6 0.9553873873873874 13256.0
[java] subtag: 7 0.9538018018018019 13234.0
[java] subtag: 8 0.9553153153153153 13255.0
[java] subtag: 9 0.9612252252252252 13337.0

MaxEnt with vector tags (fully independent)
0.9176936936936937 (Unknown Accuracy: 0.5469973890339426), Sentence Accuracy: 0.29568788501026694
Decoder Suboptimalities Detected: 0
[java] subtag: 0 0.9519279279279279 13208.0
[java] subtag: 1 0.9727567567567568 13497.0
[java] subtag: 2 0.9807567567567568 13608.0
[java] subtag: 3 0.9793873873873874 13589.0
[java] subtag: 4 0.9804684684684685 13604.0
[java] subtag: 5 0.9757837837837838 13539.0
[java] subtag: 6 0.9744144144144145 13520.0
[java] subtag: 7 0.9727567567567568 13497.0
[java] subtag: 8 0.9672072072072072 13420.0
[java] subtag: 9 0.9771531531531531 13558.0
Results with the new mapping (no ????s)
MaxEnt with monolithic vector tags
0.9437837837837838 (Unknown Accuracy: 0.6631853785900783), Sentence Accuracy: 0.37166324435318276
Decoder Suboptimalities Detected: 0
[java] subtag: 0 0.9522162162162162 13212.0
[java] subtag: 1 0.9899099099099099 13735.0
[java] subtag: 2 0.9997117117117117 13871.0
[java] subtag: 3 0.9975495495495496 13841.0
[java] subtag: 4 1.0 13875.0
[java] subtag: 5 0.992936936936937 13777.0
[java] subtag: 6 0.9891171171171171 13724.0
[java] subtag: 7 0.987027027027027 13695.0
[java] subtag: 8 0.994954954954955 13805.0
[java] subtag: 9 0.9927927927927928 13775.0

Most Frequent with monolithic vector tags
0.8939099099099099 (Unknown Accuracy: 0.2754569190600522), Sentence Accuracy: 0.11088295687885011
Decoder Suboptimalities Detected: 0
[java] subtag: 0 0.9185585585585586 12745.0
[java] subtag: 1 0.9807567567567568 13608.0
[java] subtag: 2 0.998990990990991 13861.0
[java] subtag: 3 0.9978378378378379 13845.0
[java] subtag: 4 1.0 13875.0
[java] subtag: 5 0.9858018018018018 13678.0
[java] subtag: 6 0.9739819819819819 13514.0
[java] subtag: 7 0.971963963963964 13486.0
[java] subtag: 8 0.9766486486486486 13551.0
[java] subtag: 9 0.9824144144144145 13631.0

Most Frequent with independent vector tags
0.8845405405405405 (Unknown Accuracy: 0.10574412532637076), Sentence Accuracy: 0.1026694045174538
Decoder Suboptimalities Detected: 0
[java] subtag: 0 0.9233873873873873 12812.0
[java] subtag: 1 0.9811891891891892 13614.0
[java] subtag: 2 0.998990990990991 13861.0
[java] subtag: 3 0.9976936936936937 13843.0
[java] subtag: 4 1.0 13875.0
[java] subtag: 5 0.9858018018018018 13678.0
[java] subtag: 6 0.9739819819819819 13514.0
[java] subtag: 7 0.971963963963964 13486.0
[java] subtag: 8 0.9766486486486486 13551.0
[java] subtag: 9 0.9824144144144145 13631.0
We see here that the results from Arabic are similar to the Syriac results we saw in that the independent taggers achieve higher individual accuracies, but a lower total accuracy. This can be explained by the following two insights:
Are discrepancies in accuracies okay for a POS Tagger? Even if it's not using maximum entropy?
With the new Syriac data, we ran the most frequent tagger with different random seeds. We also ran the same tagger with the same random seeds.
After some testing, we found the following to be true.
We ran a quick program to count the number of ties seen by the most-frequent scorer. With the monolithic tag, there are 367 words that have more than one most-frequent tag. With almost all clusterings, at least one word has more than one most-frequent tag (a tie for the max). Some clusterings produce counts close to that of the monolithic tag.
Note: We never saw subtags 0-5 change.
We hypothesize that a change in the order of subtags within a cluster changes the hash code of the string that is hashed in the Counter. For instance, if we swap subtag one and zero in “1#0” we obtain “0#1”; the hashcodes of these strings should be different. This allows the possibility of switches to the chosen max in ambiguous cases (the counter stores the tags in different orders based on hashcode). Thus, for the most frequent tagger, the ties will be broken differently depending on the random seed. The beam decoder is similarly affected. However, we cannot yet find evidence of why dataset ordering affects results.
We finally note that, with George's help, this problem has disappeared on at least one problem when the convergence criterion was tightened.
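The tie-breaking hypothesis above can be illustrated with a small sketch. It does not reproduce the Java Counter used in the experiments: Python dictionaries iterate in insertion order rather than hash order, so insertion order stands in here for whatever ordering the real counter imposes. The function names are hypothetical.

```python
from collections import Counter

def most_frequent_tag(tag_counts):
    """Return a max-count tag; ties go to whichever tag is iterated first."""
    return max(tag_counts, key=tag_counts.get)

def most_frequent_tag_stable(tag_counts):
    """Break ties by the tag string, so iteration order no longer matters."""
    return min(tag_counts, key=lambda tag: (-tag_counts[tag], tag))

# The same two tied tags, stored in two different orders.
order_a = Counter()
for tag in ["1#0", "0#1"]:
    order_a[tag] += 1
order_b = Counter()
for tag in ["0#1", "1#0"]:
    order_b[tag] += 1

unstable = (most_frequent_tag(order_a), most_frequent_tag(order_b))
stable = (most_frequent_tag_stable(order_a), most_frequent_tag_stable(order_b))
```

With an explicit secondary sort key, the chosen tag for a tied word is the same regardless of how the counter happens to order its entries, which would remove this particular source of run-to-run variation.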
How do I choose the first sentence (without unsupervised methods)?
Basic idea: we know that the user is going to have to correct every tag in the first sentence, hence we can directly compute the cost. Furthermore, we can estimate the accuracy we might achieve by choosing a particular sentence (see below). Thus, we choose the sentence that gives us the biggest bang for our buck.
First, we assume that when the oracle corrects a word type, the machine will correctly tag that word with 100% accuracy in the rest of the data. Therefore, we can estimate what the tagger accuracy would be after seeing each sentence by summing the number of times each word type in the sentence (i.e. duplicates removed) occurs in the rest of the sentences and dividing by the total number of words. Since we are interested in bang-per-buck, we divide this estimate by the estimate of our cost. We can estimate the number of words to be changed in that sentence as the number of occurrences of words NOT in the training data (which, for the very first sentence, is all of them).
After some initial testing, two approaches were tried. In the first (QBP2), the accuracy is updated as explained, but the cost estimate is not. In other words QBP2 always assumes that the user will have to correct every word. QBP3 attempts to estimate how many words will need to be changed as described above.
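A minimal sketch of the scoring just described, for a single selection step. The function name, the `estimate_cost` flag, and the toy sentences are invented for illustration; how the accuracy estimate is updated across iterations is not modeled, and treating a zero-cost sentence as infinitely attractive is an assumption of this sketch.

```python
from collections import Counter

def qbp_scores(sentences, known_types, estimate_cost=True):
    """Score each sentence by estimated accuracy gain per unit of cost."""
    corpus_counts = Counter(w for s in sentences for w in s)
    total_words = sum(corpus_counts.values())
    scores = []
    for sentence in sentences:
        own_counts = Counter(sentence)
        # Occurrences of this sentence's word types in the *other* sentences.
        covered = sum(corpus_counts[w] - own_counts[w] for w in own_counts)
        gain = covered / total_words
        if estimate_cost:
            # QBP3-style: only tokens whose type is not yet in training cost anything.
            cost = sum(1 for w in sentence if w not in known_types)
        else:
            # QBP2-style: assume every token must be corrected.
            cost = len(sentence)
        scores.append(gain / cost if cost else float("inf"))
    return scores

toy = [["the", "dog", "barks"], ["the", "cat"], ["the", "the", "dog"]]
first_step_qbp2 = qbp_scores(toy, set(), estimate_cost=False)
first_step_qbp3 = qbp_scores(toy, set(), estimate_cost=True)
later_step_qbp3 = qbp_scores(toy, {"the"}, estimate_cost=True)
```

With no training data, both variants produce identical scores, which is the algebraic equivalence on the first data point. Once some types are known, only the QBP3-style cost shrinks.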
All three algorithms were run a total of 1000 times. Even though QBP2 and QBP3 are deterministic, ties are broken arbitrarily.
As can be seen from the figure, QBP2 and QBP3 yield very similar results for the first three sentences (arguably, until about 18 hours of cost). QBP2 has a slight advantage for points two and three (the algorithms are algebraically equivalent for the first data point). This may indicate that, since we are overestimating accuracy, it's also better to overestimate cost–at first.
The random baseline, on average, starts with a minimum cost of around 3.7. On the other hand, QBP2 and QBP3 have 2 and 3 data points, respectively, before that cost. By the time the baseline has 3 data points, the QBP variants have around 8. This is important because the other algorithms need a minimum of 1-3 data points (depending on the algorithm and on the waiting scheme, i.e. computer waits or human waits) before they begin. Thus, even though the QBP variants seem to offer little advantage over the baseline when the baseline actually starts (.92 and .98 for QBP2 and QBP3, respectively–but note that these are interpolated over a large enough range to be wary of the actual results), smarter algorithms may be able to start with a cost below one hour rather than around 4 hours. This savings hopefully accrues.
Query by “perfect information” has been shown to start earlier than baseline
To see the effects of the new bug fix.
Ran QBU with revision number 157 and compared it to 158. 5 averaged runs on the supercomputer.
As seen below, there is not a significant change. It is interesting to note that the fixed version actually does better for part of the curve. I haven't thought much about why this is, but there should be a logical explanation.
Here's also a quick version of time comparisons.
Implement true query by uncertainty using the forward entropy algorithm as QBUE. Compare this to the previous approximate, QBU, which estimated the true sequence entropy by computing some of the entropy using the Viterbi sequence.
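The notes do not show the QBUE implementation. As a rough illustration of computing exact sequence entropy with a forward pass, the sketch below assumes a first-order, locally normalized tagger (such as an MEMM) whose per-position conditional distributions have already been evaluated for the sentence. All names and the toy probabilities are hypothetical; a brute-force enumeration is included only as a check.

```python
import math
from itertools import product

def entropy(dist):
    """Shannon entropy (in nats) of one discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def sequence_entropy(initial, transitions):
    """Exact entropy of the tag-sequence distribution via a forward pass.

    initial[j] = p(first tag = j)
    transitions[t][i][j] = p(next tag = j | current tag = i) at step t
    """
    marginal = list(initial)
    total = entropy(marginal)
    for matrix in transitions:
        # Chain rule: add the conditional entropy of the next tag.
        total += sum(marginal[i] * entropy(matrix[i]) for i in range(len(marginal)))
        # Forward step: marginal distribution over the next tag.
        marginal = [sum(marginal[i] * matrix[i][j] for i in range(len(marginal)))
                    for j in range(len(matrix[0]))]
    return total

def brute_force_entropy(initial, transitions):
    """Enumerate every tag sequence; exponential, for checking only."""
    total = 0.0
    for seq in product(range(len(initial)), repeat=len(transitions) + 1):
        p = initial[seq[0]]
        for t, matrix in enumerate(transitions):
            p *= matrix[seq[t]][seq[t + 1]]
        if p > 0:
            total -= p * math.log(p)
    return total

toy_initial = [0.6, 0.4]
toy_transitions = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.3, 0.7]]]
fast = sequence_entropy(toy_initial, toy_transitions)
slow = brute_force_entropy(toy_initial, toy_transitions)
```

The forward version is linear in sentence length, whereas an approximation built around the Viterbi sequence only accounts for part of the probability mass.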
Using 5 generated random seeds (1192729648377, 1192729649225, 1192729648535, 1192729648011, 1192729648770), we ran Baseline, LS, QBU, QBUV, and QBUE and produced graphs.
Data located on entropy at home/data/experiments/alfa/QBUE including spreadsheets of averaged runs and images. First notice, especially from the 4th graph, that QBU and QBUE are very similar. Further experiments will follow to test whether the difference is statistically significant.
Difference from QBU: Y-Axis: sentences trained on X-Axis: Difference in Accuracy from the QBU's Accuracy
Difference from QBUV: Y-Axis: sentences trained on X-Axis: Difference in Accuracy from the QBUV's Accuracy
To try to find a meaningful place to switch from QBU to random.
We ran baseline and QBU with no cutoffs and no switchover. The derivative is a simple one, computed at each point by taking the following point and subtracting the previous point.
Here is a graph showing the derivatives and the 10-period moving average of the lines in order to remove noise.
[Figure not available: derivatives and their 10-period moving averages]
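The derivative and smoothing described above can be sketched as follows. Following the description, the difference is not divided by the step width. Whether the original moving average was trailing or centered is not stated; a trailing window is assumed here, and the accuracy values are made up.

```python
def central_difference(values):
    """Following point minus previous point, for every interior index."""
    return [values[i + 1] - values[i - 1] for i in range(1, len(values) - 1)]

def moving_average(values, window=10):
    """Trailing simple moving average over `window` points (an assumption)."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

# Made-up accuracy curve, one value per iteration.
accuracies = [0.50, 0.60, 0.68, 0.74, 0.78, 0.80]
slopes = central_difference(accuracies)
smoothed = moving_average(slopes, window=2)  # small window only for the toy data
```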
To see where we should switch over from a specific active learning algorithm to a random baseline.
Since zero count cutoffs were the best, we used those, with a batch-query size of 1 sentence per iteration. We ran switchover points at 50, 100, and 200 sentences compared to a baseline and to a QBU run with no switch over point.
Here is a graph clearly showing that each of the tested switchover points is still too early. [Figure not available]
A graph comparing cumulative word changes and batch size. [Figure not available]
To investigate the effects of the feature selector on efficacy of active learning, especially during initial iterations. As a secondary goal, we provide an initial attempt at identifying the crossover point–the point after which it is equally effective to use expensive querying methods as random selection.
Using 10% of the PTB (seed 1192131744), we removed feature cutoffs completely and used a batch query size of 1 sentence. A batch query size of 1 sentence seems to be *much* more efficient than the batch query size of 100, as we had originally thought.
Initial results show that batch query size of 1 sentence and no feature cutoffs prove to be much more effective than earlier methods used.
On ten percent of the PTB, with QBU we can reach 90% accuracy with 101 sentences (5198 words). Previous experiments reach 90% with 300 sentences (18364 words). This is a third of the number of sentences required and, possibly more importantly, 28.3% of the words.
It is also important to note that the baseline does better as well with the changes made; however, QBU outperforms the baseline by approximately the same factor as with the count cutoffs; as far as we've run, we consistently see that 1/3 of the data using QBU produces the same accuracy as the new baseline.
Results with 100% of the PTB are slightly better.
Also note that for both the baseline and QBU the slopes of the respective curves are approximately equal at around 125 sentences (this is by eyeball). This possibly indicates that we are okay to apply count cutoffs around this point, though the exact point is probably dataset dependent (including total amount of data, highest possible accuracy, etc.).
On the first graph, the y-axis is accuracy, and the x-axis is the number of iterations (also the number of sentences trained on). On the second graph the axes are reversed (hence the graph is mirrored about the y=x line). This simplifies visual estimation of the benefit over baseline.
We hypothesize that the crossover point (the point at which using a random baseline thereafter will provide the same results as other schemes) is where the slopes of an active learning technique and the random baseline cross. We therefore considered the (discrete) derivative of these curves. The derivative was estimated using the central difference (i.e. next accuracy - previous accuracy). The results are as follows:
<center> thumbnail|none (Y-axis is the derivative, x-axis is iteration/sentence number) </center>
The graph is fairly noisy, which makes it difficult to estimate precisely where the slope of the QBU line equals that of the baseline; it may be anywhere from 3-50 sentences (or more, but not likely)! This noise would be almost entirely removed by averaging over several runs.
We next ran QBU for three iterations after which we switched to the random baseline (code not versioned); the process was repeated for a switchover point of 5:
I wish to further note that while it is possible to compute the derivative of the QBU curve in a real situation, the derivative of the random curve will not usually be available, hence this could not be used as a stopping criterion in a real-world task.
Same as 100-hour experiment
Similar method. I validated that we are using fast maxent (at least for this experiment set).
Pending
Note - This experiment is not accurate due to the October 10 bug
1) To plot time graphs for Syriac active learning.<br> 2) To plot max uncertainty vs. min uncertainty for each iteration (i.e. top value and last value of queue)
This wasn't an exhaustive test; I just put a few jobs on the supercomputer in order to see what times were for Syriac active learning. I only tested QBU, Baseline, and LS<br> These graphs are only one run; keep that in mind<br> We should run this again with other “comp” values
Figures (images not available): Time; Min vs. Max (with accuracies)
Excel file: 100hrSyriacActiveLearning.xlsx
To weigh the possibilities of creating different trie types
You can use
edu.byu.nlp.alfa.activeLearner.syriac.tools.ComputeDatasetStats
to get stats about the dataset. I added a few more lines of code, not currently checked in, to see the number of tag/word and word/tag combinations.
Here are the summarized results for the dataset:
Rare word tokens: 32661.0
Rare word types: 14475
Rare tag tokens: 6414.0
Rare tag types: 1861
Word tokens: 87074.0
Word types: 15214
Tag tokens: 87074.0
Tag types: 2370
Tag tokens in devtest: 10791.0
Tag types in devtest: 1049
Word tokens in devtest, but not in training: 1050.0
Word types in devtest, but not in training: 1006
Tag tokens in devtest, but not in training: 78.0 (.0898%)
Tag types in devtest, but not in training: 75 (3.2%)
<br /> Number of words associated with each tag Number of tags associated with each word
Excel 2007 Data WordsAndTags.xlsx
Note - This experiment is not accurate due to the October 10 bug
To determine what possible changes we would see if we ran our results on the full treebank.
I ran Active Learning (AL) over the full PTB using the following COMP parameters: Baseline, LS, QBU, and QBUV . Each experiment was run 5 times, and the results were averaged by COMP parameter.
For similar results, use the following command on Marylou4:
python scripts/submit.py -t 200 -P 100 -a 1 -cQBUV -cQBU -cLS -cBaseline -m 1 -n5 -v -dPTB -xActiveLearner.xml
Surprisingly, our POS Tagger did much better on the full PTB set than I would have guessed. Averaged final values were: 96.6858 (QBUV), 96.6871 (QBU), 96.6865 (LS), and 96.6830 (Baseline) with an average over all 20 runs equaling 96.6856. Our previous high using the first 25% of the PTB ended around 95.7 percent.
Figure 1 shows the Baseline, LS, QBU, and QBUV on the first 25 percent of the PTB. It is worth noticing that there is no distinguishable difference (at this resolution) between LS, QBU, and QBUV. All three, however, are superior to the random baseline.
Figure 2 shows similar results to those of Figure 1 – QBU, LS, and QBUV all do much better than the baseline. There are, however, a few interesting results we can see from this experiment that I will mention later.
Figure 3 shows three major groups: 1) the baselines, 2) algorithms at 25% of the PTB and 3) algorithms at 100% of the PTB. From this graph it is easy to see the advantage of having more data – the accuracies grow quicker.
Figures 3-5 show that with more data, QBU, QBUV, and LS all tend to pick longer sentences, getting more 'bang for their buck.' Thus, 100% of the PTB has a distinct advantage, because there are more long sentences.
Figures 6-8 confirm that this is indeed happening, as the number of words changed on each algorithm is significantly higher than the baseline. (The graphs for comparison with total words are not included because all of those graphs have looked remarkably similar to the words_changed graphs.) With this metric, however, the algorithms use many more words than the baseline to achieve similar accuracies. To me, this raises the question, “When should we switch to the baseline?”
Here are the results I have found most interesting:
1. Figure 1 appears to level off near the end of its tail, but Figure 2 shows that that isn't the case. At 5,000 sentences, there is at least about 1 percent to grow.
2. Figure 3 demonstrates that more data allows the AL to grow much more quickly than with small amounts of data.
3. Figures 4-6 indicate that, measured in sentences, more data is much better. Measured in words, however, we need a way to figure out when to switch to the baseline, so that we don't spend long sentences on accuracy gains that mainly cost the annotators more time.
Apply the results of the user study to see whether sentences or words seem to make the most difference in annotating a sentence.
Find a way where we can “back off” to the random baseline so that we don't send the oracle long sentences when short ones will be just as effective.
To determine if clustering can help reduce time and improve accuracy for POS tagging. Clustering will hopefully help us find independence assumptions, therefore allowing us to use multiple POS taggers.
Run different agglomerative clustering algorithms with a distance metric determined by mutual information between subtag pairs. I ran the following: Single Link:<br> Average Link:<br> Complete Link:<br> Complex Link (with Robbie's normalization):<br> Normalized Total Correlation:<br>
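The quantity underlying the distance metric, mutual information between a pair of subtags, can be sketched as below. How the value was converted into a distance, and the normalizations mentioned above, are not specified in these notes and are not reproduced. The function name and toy columns are invented.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two aligned value lists."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # p(x,y) * log(p(x,y) / (p(x) p(y))), written in terms of raw counts.
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

# Toy subtag columns, one value per token (invented for illustration).
gender = ["m", "f", "m", "f"]
number = ["s", "p", "s", "p"]   # perfectly predictable from gender here
state = ["a", "a", "b", "b"]    # independent of gender here

mi_dependent = mutual_information(gender, number)
mi_independent = mutual_information(gender, state)
```

Subtag pairs with high mutual information are candidates for sharing a cluster, while pairs near zero are candidates for independent taggers.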
Figures (images not available): Single Link, Average Link, Complete Link, Complex Link, Normalized Total Correlation
Clustering - table coming soon
Subtag | Complex | Hand-picked | Single Link | Average | HP 2 | HP 3 | HP 4 | HP 5 | Newest 1 | Newest 2 | HP 6 | HP 7 | HP 8 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Enclitic | 0.9893 | 0.9917 | 0.9857 | 0.9857 | 0.9893 | 0.9917 | 0.9902 | 0.9909 | 0.9894 | 0.9880 | 0.9857 | 0.9857 | 0.9910 |
suffixGender | 0.9829 | 0.9851 | 0.9881 | 0.9881 | 0.9869 | 0.9881 | 0.9881 | 0.9881 | 0.9851 | 0.9871 | 0.9881 | 0.9881 | 0.9881 |
suffixPerson | 0.9900 | 0.9882 | 0.9904 | 0.9900 | 0.9904 | 0.9904 | 0.9896 | 0.9893 | 0.9882 | 0.9900 | 0.9904 | 0.9898 | 0.9895 |
suffixNumber | 0.9933 | 0.9914 | 0.9942 | 0.9931 | 0.9942 | 0.9942 | 0.9927 | 0.9929 | 0.9914 | 0.9931 | 0.9942 | 0.9929 | 0.9925 |
suffixContraction | 0.9968 | 0.9968 | 0.9975 | 0.9975 | 0.9975 | 0.9975 | 0.9975 | 0.9977 | 0.9968 | 0.9973 | 0.9975 | 0.9975 | 0.9975 |
Prefix | 0.9932 | 0.9914 | 0.9941 | 0.9931 | 0.9941 | 0.9941 | 0.9922 | 0.9929 | 0.9914 | 0.9931 | 0.9941 | 0.9929 | 0.9925 |
Gender | 0.9318 | 0.9411 | 0.9410 | 0.9327 | 0.9400 | 0.9411 | 0.9406 | 0.9432 | 0.9442 | 0.9327 | 0.9442 | 0.9451 | 0.9436 |
Person | 0.9430 | 0.9447 | 0.9444 | 0.9429 | 0.9389 | 0.9447 | 0.9435 | 0.9477 | 0.9438 | 0.9342 | 0.9430 | 0.9481 | 0.9467 |
Number | 0.9618 | 0.9699 | 0.9706 | 0.9639 | 0.9702 | 0.9699 | 0.9690 | 0.9710 | 0.9726 | 0.9639 | 0.9726 | 0.9725 | 0.9715 |
State | 0.9565 | 0.9625 | 0.9629 | 0.9539 | 0.9629 | 0.9625 | 0.9609 | 0.9627 | 0.9676 | 0.9643 | 0.9676 | 0.9632 | 0.9625 |
Tense | 0.941 | 0.9425 | 0.9426 | 0.9400 | 0.9421 | 0.9425 | 0.9426 | 0.9440 | 0.9416 | 0.9383 | 0.9410 | 0.9461 | 0.9448 |
Form | 0.9347 | 0.9364 | 0.9360 | 0.9336 | 0.9320 | 0.9364 | 0.9352 | 0.9380 | 0.9363 | 0.9334 | 0.9347 | 0.9376 | 0.9368 |
Grammatical Category | 0.9249 | 0.9313 | 0.9323 | 0.9276 | 0.9302 | 0.9313 | 0.9319 | 0.9337 | 0.9373 | 0.9325 | 0.9373 | 0.9350 | 0.9339 |
To determine if FastMaxent needs to be tuned better, i.e. if it is hurting our accuracy.
Run active learning over the full PTB for 10 iterations starting with 1% of the (sentence) data and adding 3960 sentences each iteration. The comparator used is irrelevant (LS was used). Also run active learning over full PTB using all data. If the difference in the final accuracy (i.e. when active learning is done) between the two is not statistically significant then fast maxent is properly tuned and appropriate.
The following commands were executed on Marylou4 on 3/9/2007 @ 4:06 pm:
python scripts/submit.py -cLS -p1 -isentence -s3960 -bsentence -P100 -Tsentence -n 10
python scripts/submit.py -cLS -p100 -isentence -s1 -bword -P100 -Tsentence -n 10 -t2
I somehow ended up with 11 runs of the Full maxent (from an earlier run), so I threw out the smallest value when computing the statistics. The average final accuracy for full maxent was 0.966870605 and the average for the incremental maxent was 0.966891912. Fast maxent had higher accuracy on 6 paired trials. These results are not significant at a .95 level (and even if they were, they would favor the incremental maxent).
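The notes do not say which significance test was used. One simple check that fits "higher accuracy on 6 paired trials" is an exact sign test, sketched here under the assumptions that there were 10 paired trials (suggested by `-n 10`) and no ties. The function name is hypothetical.

```python
from math import comb

def sign_test_p_value(wins, trials):
    """Two-sided exact sign test; null hypothesis: each system wins half the pairs."""
    pmf = [comb(trials, k) / 2 ** trials for k in range(trials + 1)]
    observed = pmf[wins]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    return min(1.0, sum(p for p in pmf if p <= observed + 1e-12))

p_value = sign_test_p_value(6, 10)
```

Under these assumptions the p-value is far above 0.05, which agrees with the conclusion that the difference is not significant.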
Experimenter: Marc Carmen<br/> SVN Revision Numbers:<br/>
Date Completed: 2/5/2007<br/> Purpose: Get initial QBC results using new framework along with baseline<br/> Path on Entropy: /home/data/experiments/alfa/bnc/qbc/experiment1<br/> Results:<br/>
Experimenter: Marc Carmen<br/> SVN Revision Numbers<br/>
Path on Entropy: /home/data/experiments/alfa/pennTreeBank/batchQuery
Experimenter: George Busby<br/> Purpose: This is a small report of all the results and work on the MC algorithm to date. Results: