nlp:distributional-word-clustering [2015/04/23 21:43] ryancha created
===Wed Apr 19 21:21:38 MDT 2006===
Update from my laptop:

I discovered a bug (with Dr. Ringger's help) in my distributional word clusterer: only one label from each document was being considered when estimating the various probabilities. This was causing an unacceptable drop in accuracy. Having fixed it, here is a result, which does not yet show that the fix actually solved the problem.

data3/full_set MI
<pre>
k,accuracy
20,13.29
</pre>
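For the record, the fix boils down to iterating over every label attached to a document instead of stopping at the first one. A minimal sketch (the data layout and function name here are hypothetical, not the clusterer's actual code):

```python
from collections import Counter

def label_counts(documents):
    """Count label occurrences across ALL labels of each document.

    The bug was equivalent to using only doc["labels"][0], which
    silently drops every label after the first.
    """
    counts = Counter()
    for doc in documents:
        for label in doc["labels"]:  # all labels, not just the first
            counts[label] += 1
    return counts

docs = [{"labels": ["GVOTE", "GREL"]}, {"labels": ["GENT"]}]
print(label_counts(docs))  # Counter({'GVOTE': 1, 'GREL': 1, 'GENT': 1})
```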
 +
===Wed Apr 19 09:16:20 MDT 2006===
Update from marylou:

data3/full_set LFF
<pre>
k,accuracy
20,14.0
40,23.4
60,33.86
80,36.34
100,39.0
140,43.08
</pre>
This shows a clear upward trend as the number of clusters grows, although absolute accuracy is still awful.

data3/full_set MI
<pre>
k,accuracy
20,12.41
40,19.68
60,24.11
80,27.83
100,34.39
140,38.47
</pre>
Compared with the LFF numbers above, this shows that LFF is a better choice than MI for initializing clusters.

data3/full_set IDF
<pre>
k,accuracy
20,10.46
40,25.88
60,30.49
80,36.52
100,39.36
140,46.63
</pre>

data3/full_set TFIDF
<pre>
k,accuracy
20,9.04
40,22.34
60,27.83
80,30.85
100,39.0
140,42.19
</pre>

data3/full_set TF
<pre>
k,accuracy
20,10.99
40,21.45
60,25.35
80,30.31
100,37.41
140,41.84
</pre>
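For context, these rankings are all cheap corpus statistics over the vocabulary. A sketch of the TF, IDF, and TFIDF variants (hypothetical code, assuming documents are plain token lists; the MI and LFF rankings would slot into the same scoring loop):

```python
import math
from collections import Counter

def rank_features(docs, metric="tfidf", top_k=100):
    """Rank vocabulary terms for seeding the initial clusters."""
    tf = Counter(tok for doc in docs for tok in doc)        # corpus term frequency
    df = Counter(tok for doc in docs for tok in set(doc))   # document frequency
    n_docs = len(docs)
    scores = {}
    for tok in tf:
        idf = math.log(n_docs / df[tok])
        scores[tok] = {"tf": tf[tok], "idf": idf, "tfidf": tf[tok] * idf}[metric]
    ranked = sorted(scores, key=scores.get, reverse=True)   # descending score
    return ranked[:top_k]

docs = [["vote", "ballot", "vote"], ["church", "faith"], ["vote", "court"]]
print(rank_features(docs, metric="tf", top_k=3)[0])  # vote (tf=3)
```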
 +
===Tue Apr 18 22:47:58 MDT 2006===
Update:

I found a bug in my clusterer -- it was always and exclusively running the most-frequent-feature ranking block of code, which invalidates all the marylou results so far. However, I have remedied the situation and am currently running 35 jobs (well, 33; 2 won't start for some reason). Hopefully I will have more results in a day or so.

Latest results from marylou4:
data3/full_set LFF
<pre>
k,accuracy
50,28.36
70,37.23
90,40.95
110,45.56
130,44.68
</pre>

data3/full_set IDF-1
<pre>
k,accuracy
50,28.36
70,37.23
90,40.95
110,45.56
130,44.68
</pre>

data3/full_set TF-1
<pre>
k,accuracy
50,28.36
70,37.23
90,40.95
110,45.56
130,44.68
</pre>

Latest results for data3/full_set and LFF clusters with multiple labels:
<pre>
k,accuracy
50,28.36
70,35.81
90,39.53
110,42.55
</pre>

Latest results for data3/full_set and LFF clusters, with predictions limited to one label:
<pre>
k,accuracy
50,16.47
70,23.76
90,23.99
110,27.35
</pre>
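The one-label restriction is just a different decision rule over the same per-class scores. A sketch of the two modes (hypothetical: the actual multi-label rule used by the clusterer isn't described in this log):

```python
def predict(log_posteriors, multi_label=False, margin=1.0):
    """Single-label = argmax; multi-label keeps every class whose
    log posterior is within `margin` of the best (assumed rule)."""
    best = max(log_posteriors, key=log_posteriors.get)
    if not multi_label:
        return [best]
    top = log_posteriors[best]
    return [c for c, lp in log_posteriors.items() if top - lp <= margin]

scores = {"GVOTE": -10.2, "GREL": -10.9, "GENT": -14.5}
print(predict(scores))                    # ['GVOTE']
print(predict(scores, multi_label=True))  # ['GVOTE', 'GREL']
```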
 +

I haven't run a large enough batch yet, but I'm working on that. In the meantime, however, culling mostly uniformly distributed clusters does help as the number of clusters increases.

Here's the confusion matrix for 150 clusters with culling turned on:
<pre>
      GVOTE   GREL    GENT
GVOTE 0.96825 0.01587 0.01587
GREL  0.03703 0.77777 0.18518
GENT  0.08333 0.05555 0.86111
Accuracy: 0.8968253968253969
</pre>

and without culling:
<pre>
      GVOTE   GREL    GENT
GVOTE 0.98412 0.00000 0.01587
GREL  0.14814 0.77777 0.07407
GENT  0.16666 0.11111 0.72222
Accuracy: 0.8650793650793651
</pre>
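A sanity check on these Accuracy lines: each row of the matrix is normalized, so the overall number can be recovered by weighting the diagonal by the per-class test counts. From the cell fractions (1/63 = 0.01587, 1/27 = 0.03703, 1/36 = 0.02777) the class sizes appear to be 63/27/36, though that is an inference, since the log doesn't state them:

```python
def accuracy_from_cm(cm_rows, class_sizes):
    """Overall accuracy from a row-normalized confusion matrix."""
    correct = sum(round(row[i] * n)               # diagonal cell * class size
                  for i, (row, n) in enumerate(zip(cm_rows, class_sizes)))
    return correct / sum(class_sizes)

# Culling-on matrix from the log; class sizes 63/27/36 are inferred.
cm = [[0.96825, 0.01587, 0.01587],
      [0.03703, 0.77777, 0.18518],
      [0.08333, 0.05555, 0.86111]]
print(accuracy_from_cm(cm, [63, 27, 36]))  # 0.8968253968253969, matching the log
```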

So we get roughly a 3.2% (absolute) increase in accuracy. 11 clusters were culled; epsilon = .0001 (the threshold used to estimate distributional uniformity for a cluster).
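The uniformity test itself isn't spelled out here. One plausible reading, with epsilon bounding a cluster's deviation from the uniform class distribution (the criterion below is an assumption; only the epsilon value comes from the log):

```python
def is_uniform(class_probs, epsilon=1e-4):
    """True if a cluster's class distribution is within epsilon of uniform.

    Hypothetical criterion: max absolute deviation from 1/K. The log
    only gives epsilon = .0001, not the exact test.
    """
    k = len(class_probs)
    return max(abs(p - 1.0 / k) for p in class_probs) < epsilon

clusters = {"c1": [0.3334, 0.3333, 0.3333], "c2": [0.90, 0.05, 0.05]}
kept = {name: d for name, d in clusters.items() if not is_uniform(d, epsilon=1e-3)}
print(sorted(kept))  # ['c2'] -- the near-uniform cluster c1 was culled
```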

Cluster (100) initialization by token descending-frequency feature ranking:
<pre>
Now testing classifier: Naive Bayes Multinomial (Mark's)
      GVOTE   GREL    GENT
GVOTE 0.96825 0.01587 0.01587
GREL  0.25925 0.62962 0.11111
GENT  0.22222 0.11111 0.66666
Accuracy: 0.8095238095238095
</pre>

Cluster (100) initialization by token ascending-frequency feature ranking:
<pre>
Now testing classifier: Naive Bayes Multinomial (Mark's)
      GVOTE   GREL    GENT
GVOTE 0.98412 0.01587 0.00000
GREL  0.03703 0.92592 0.03703
GENT  0.11111 0.16666 0.72222
Accuracy: 0.8968253968253969
</pre>

Cluster (200) initialization by token ascending-frequency feature ranking:
<pre>
Now testing classifier: Naive Bayes Multinomial (Mark's)
      GVOTE   GREL    GENT
GVOTE 0.98412 0.01587 0.00000
GREL  0.18518 0.74074 0.07407
GENT  0.19444 0.05555 0.75000
Accuracy: 0.8650793650793651
</pre>

Cluster (50) initialization by token ascending-frequency feature ranking:
<pre>
Now testing classifier: Naive Bayes Multinomial (Mark's)
      GVOTE   GREL    GENT
GVOTE 0.96825 0.01587 0.01587
GREL  0.22222 0.66666 0.11111
GENT  0.05555 0.13888 0.80555
Accuracy: 0.8571428571428571
</pre>

Cluster (90) initialization by token ascending-frequency feature ranking:
<pre>
Now testing classifier: Naive Bayes Multinomial (Mark's)
      GVOTE   GREL    GENT
GVOTE 0.93650 0.03174 0.03174
GREL  0.11111 0.81481 0.07407
GENT  0.08333 0.00000 0.91666
Accuracy: 0.9047619047619048
</pre>

Cluster (80) initialization by token ascending-frequency feature ranking:
<pre>
Now testing classifier: Naive Bayes Multinomial (Mark's)
      GVOTE   GREL    GENT
GVOTE 0.95238 0.00000 0.04761
GREL  0.22222 0.62962 0.14814
GENT  0.08333 0.02777 0.88888
Accuracy: 0.8650793650793651
</pre>

Cluster (110) initialization by token ascending-frequency feature ranking:
<pre>
Now testing classifier: Naive Bayes Multinomial (Mark's)
      GVOTE   GREL    GENT
GVOTE 0.92063 0.03174 0.04761
GREL  0.07407 0.85185 0.07407
GENT  0.05555 0.08333 0.86111
Accuracy: 0.8888888888888888
</pre>