Update from my laptop:
I discovered a bug (with Dr. Ringger's help) in my distributional word clusterer: only one label from each document was being considered when estimating the various probabilities. This was causing an unacceptable drop in accuracy. Having fixed it, here is one result, which does not suggest that I've actually fixed the problem.
data3/full_set MI
20,13.29
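To make the label bug concrete, here is a minimal sketch of counting word/label co-occurrences over every label of a document rather than only the first. This is not the clusterer's actual code: the function names, the (tokens, labels) document layout, and the toy documents are all assumptions for illustration.

```python
from collections import Counter, defaultdict


def count_word_labels(documents):
    """Count (word, label) co-occurrences, crediting every label of a document.

    `documents` is a hypothetical list of (tokens, labels) pairs.
    """
    counts = defaultdict(Counter)
    for tokens, labels in documents:
        for token in tokens:
            # The buggy version would only have looked at labels[0] here.
            for label in labels:
                counts[token][label] += 1
    return counts


def label_distribution(counts, token):
    """Normalize one token's label counts into an estimate of P(label | token)."""
    total = sum(counts[token].values())
    return {label: n / total for label, n in counts[token].items()}


# Toy data, invented for this example.
docs = [
    (["vote", "election"], ["GVOTE", "GREL"]),
    (["church", "vote"], ["GREL"]),
]
counts = count_word_labels(docs)
print(label_distribution(counts, "vote"))
```

With only the first label counted, "vote" would be split evenly between GVOTE and GREL in this toy data; counting all labels shifts the estimate toward GREL.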
Update from marylou:
data3/full_set LFF
20,14.0 40,23.4 60,33.86 80,36.34 100,39.0 140,43.08
This shows accuracy rising steadily with the number of clusters, although it is still awful.
data3/full_set MI
20,12.41 40,19.68 60,24.11 80,27.83 100,34.39 140,38.47
LFF beats MI at every cluster count tested, so it looks like the better choice for initializing clusters.
data3/full_set IDF
20,10.46 40,25.88 60,30.49 80,36.52 100,39.36 140,46.63
data3/full_set TFIDF
20,9.04 40,22.34 60,27.83 80,30.85 100,39.0 140,42.19
data3/full_set TF
20,10.99 40,21.45 60,25.35 80,30.31 100,37.41 140,41.84
Update:
I found a bug in my clusterer: it was always and exclusively using the most-frequent-feature ranking block of code, which completely invalidates all the results from marylou thus far. However, I have remedied the situation and am currently running 35 jobs (well, 33; 2 won't start for some reason). Hopefully I will have more results in a day or so.
Latest results from marylou4:
data3/full_set LFF
50,28.36 70,37.23 90,40.95 110,45.56 130,44.68
data3/full_set IDF-1
50,28.36 70,37.23 90,40.95 110,45.56 130,44.68
data3/full_set TF-1
50,28.36 70,37.23 90,40.95 110,45.56 130,44.68
Latest results for data3/full_set and LFF clusters with multiple labels.
50,28.36 70,35.81 90,39.53 110,42.55
Latest results for data3/full_set and LFF clusters. This is with predictions limited to one label.
k,accuracy 50,16.47 70,23.76 90,23.99 110,27.35
I haven't run a large enough batch yet, but I'm working on that. In the meantime, however, culling clusters whose distributions are mostly uniform does help as the number of clusters increases.
Here's the confusion matrix for 150 clusters with culling turned on:
        GVOTE    GREL     GENT
GVOTE   0.96825  0.01587  0.01587
GREL    0.03703  0.77777  0.18518
GENT    0.08333  0.05555  0.86111
Accuracy: 0.8968253968253969
and without culling:
        GVOTE    GREL     GENT
GVOTE   0.98412  0.00000  0.01587
GREL    0.14814  0.77777  0.07407
GENT    0.16666  0.11111  0.72222
Accuracy: 0.8650793650793651
So we get an increase in accuracy of about 3.2 percentage points (0.865 to 0.897). 11 clusters were culled. epsilon = 0.0001 (this is the variable used when estimating the distributional uniformity of a cluster).
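As a rough illustration of the culling step, the sketch below drops any cluster whose class distribution is within epsilon of uniform. The uniformity measure here (KL divergence from the uniform distribution) is an assumption, since these notes do not say which measure the clusterer uses; the function names and the toy cluster distributions are also hypothetical. Only the epsilon value comes from the run above.

```python
import math


def kl_from_uniform(class_probs):
    """KL divergence D(p || uniform), in nats, for a class distribution p."""
    n = len(class_probs)
    return sum(p * math.log(p * n) for p in class_probs if p > 0)


def cull_uniform_clusters(clusters, epsilon=1e-4):
    """Keep only clusters whose class distribution is at least epsilon from uniform.

    `clusters` maps a cluster name to its class distribution (a hypothetical layout).
    """
    return {
        name: probs
        for name, probs in clusters.items()
        if kl_from_uniform(probs) >= epsilon
    }


# Toy distributions over three classes, invented for this example.
clusters = {
    "c0": [0.90, 0.05, 0.05],     # strongly peaked: informative
    "c1": [0.334, 0.333, 0.333],  # almost uniform: says little about the class
    "c2": [0.20, 0.30, 0.50],     # moderately peaked
}
kept = cull_uniform_clusters(clusters)
print(sorted(kept))
```

The intuition is that a cluster spread almost evenly across GVOTE, GREL, and GENT contributes nearly the same evidence to every class, so removing it should mostly remove noise.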
Cluster (100) initialization by token descending frequency feature ranking:
Now testing classifier: Naive Bayes Multinomial (Mark's)
        GVOTE    GREL     GENT
GVOTE   0.96825  0.01587  0.01587
GREL    0.25925  0.62962  0.11111
GENT    0.22222  0.11111  0.66666
Accuracy: 0.8095238095238095
Cluster (100) initialization by token ascending frequency feature ranking:
Now testing classifier: Naive Bayes Multinomial (Mark's)
        GVOTE    GREL     GENT
GVOTE   0.98412  0.01587  0.00000
GREL    0.03703  0.92592  0.03703
GENT    0.11111  0.16666  0.72222
Accuracy: 0.8968253968253969
Cluster (200) initialization by token ascending frequency feature ranking:
Now testing classifier: Naive Bayes Multinomial (Mark's)
        GVOTE    GREL     GENT
GVOTE   0.98412  0.01587  0.00000
GREL    0.18518  0.74074  0.07407
GENT    0.19444  0.05555  0.75000
Accuracy: 0.8650793650793651
Cluster (50) initialization by token ascending frequency feature ranking:
Now testing classifier: Naive Bayes Multinomial (Mark's)
        GVOTE    GREL     GENT
GVOTE   0.96825  0.01587  0.01587
GREL    0.22222  0.66666  0.11111
GENT    0.05555  0.13888  0.80555
Accuracy: 0.8571428571428571
Cluster (90) initialization by token ascending frequency feature ranking:
Now testing classifier: Naive Bayes Multinomial (Mark's)
        GVOTE    GREL     GENT
GVOTE   0.93650  0.03174  0.03174
GREL    0.11111  0.81481  0.07407
GENT    0.08333  0.00000  0.91666
Accuracy: 0.9047619047619048
Cluster (80) initialization by token ascending frequency feature ranking:
Now testing classifier: Naive Bayes Multinomial (Mark's)
        GVOTE    GREL     GENT
GVOTE   0.95238  0.00000  0.04761
GREL    0.22222  0.62962  0.14814
GENT    0.08333  0.02777  0.88888
Accuracy: 0.8650793650793651
Cluster (110) initialization by token ascending frequency feature ranking:
Now testing classifier: Naive Bayes Multinomial (Mark's)
        GVOTE    GREL     GENT
GVOTE   0.92063  0.03174  0.04761
GREL    0.07407  0.85185  0.07407
GENT    0.05555  0.08333  0.86111
Accuracy: 0.8888888888888888
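For reference, here is one way the ascending and descending frequency rankings used to initialize the clusters in these runs could be sketched. It assumes that initialization means seeding k singleton clusters with the first k tokens of the ranking, which these notes do not spell out; the function names and toy documents are hypothetical, not the clusterer's real code.

```python
from collections import Counter


def rank_tokens_by_frequency(documents, ascending=True):
    """Rank tokens by corpus frequency, rarest first when ascending is True."""
    counts = Counter(token for tokens in documents for token in tokens)
    sign = 1 if ascending else -1
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(counts, key=lambda token: (sign * counts[token], token))


def seed_clusters(ranked_tokens, k):
    """Start each of the first k ranked tokens in its own singleton cluster."""
    return [[token] for token in ranked_tokens[:k]]


# Toy corpus, invented for this example: a occurs 2x, b 3x, c 4x.
docs = [["a", "b", "b"], ["b", "c", "c", "a"], ["c", "c"]]
ascending = rank_tokens_by_frequency(docs, ascending=True)
descending = rank_tokens_by_frequency(docs, ascending=False)
print(seed_clusters(ascending, 2))
print(seed_clusters(descending, 2))
```

Under this assumption the two rankings differ only in which tokens get their own cluster at the start; the remaining tokens would then be merged in by the clustering procedure itself.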