nlp:distributional-word-clustering [CS Wiki]

Wed Apr 19 21:21:38 MDT 2006

Update from my laptop:

I discovered a bug (with Dr. Ringger's help) in my distributional word clusterer: only one label from each document was being considered when estimating the different probabilities. This was causing an unacceptable drop in accuracy. Having fixed this, here is a result, which does not indicate that I've actually fixed the problem.

data3/full_set MI

20,13.29
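
For reference, a minimal sketch of the kind of counting the bug in this entry affects: every label on a multi-label document is credited when estimating P(label | word). This is not the actual clusterer code. The `(tokens, labels)` document shape, the function names, and the example tokens are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def word_label_counts(docs):
    """Count (word, label) co-occurrences, crediting every label of a document.

    docs: iterable of (tokens, labels) pairs. This shape is an assumption.
    """
    counts = defaultdict(Counter)
    for tokens, labels in docs:
        # Loop over ALL labels; a version that looks at only one label per
        # document would undercount the others.
        for label in labels:
            for token in tokens:
                counts[token][label] += 1
    return counts


def p_label_given_word(counts, word):
    """Maximum-likelihood estimate of P(label | word) from the counts."""
    total = sum(counts[word].values())
    return {label: n / total for label, n in counts[word].items()}


# Hypothetical two-document corpus; the second document carries two labels.
docs = [
    (["vote", "election"], ["GVOTE"]),
    (["vote", "church"], ["GVOTE", "GREL"]),
]
counts = word_label_counts(docs)
print(p_label_given_word(counts, "church"))
```

With only one label considered per document, "church" would never be associated with the second label at all, which is the sort of skew that could hurt accuracy.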

Wed Apr 19 09:16:20 MDT 2006

Update from marylou:

data3/full_set LFF

20,14.0
40,23.4
60,33.86
80,36.34
100,39.0
140,43.08

Accuracy rises steadily as the number of clusters increases, although it is still awful.

data3/full_set MI

20,12.41
40,19.68
60,24.11
80,27.83
100,34.39
140,38.47

LFF scores higher than MI at every cluster count tested, so it looks like the better choice for initializing clusters.

data3/full_set IDF

20,10.46
40,25.88
60,30.49
80,36.52
100,39.36
140,46.63

data3/full_set TFIDF

20,9.04
40,22.34
60,27.83
80,30.85
100,39.0
140,42.19

data3/full_set TF

20,10.99
40,21.45
60,25.35
80,30.31
100,37.41
140,41.84
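
The five runs in this entry are easier to compare side by side. The sketch below copies the numbers from the tables above into a dictionary and reports which ranking scores highest at each cluster count. The variable and function names are made up for illustration.

```python
# Accuracy (%) by number of clusters, copied from the tables in this entry.
results = {
    "LFF":   {20: 14.0,  40: 23.4,  60: 33.86, 80: 36.34, 100: 39.0,  140: 43.08},
    "MI":    {20: 12.41, 40: 19.68, 60: 24.11, 80: 27.83, 100: 34.39, 140: 38.47},
    "IDF":   {20: 10.46, 40: 25.88, 60: 30.49, 80: 36.52, 100: 39.36, 140: 46.63},
    "TFIDF": {20: 9.04,  40: 22.34, 60: 27.83, 80: 30.85, 100: 39.0,  140: 42.19},
    "TF":    {20: 10.99, 40: 21.45, 60: 25.35, 80: 30.31, 100: 37.41, 140: 41.84},
}


def best_ranking_per_k(results):
    """Return {k: name of the ranking with the highest accuracy at k}."""
    ks = sorted(next(iter(results.values())))
    return {k: max(results, key=lambda name: results[name][k]) for k in ks}


best = best_ranking_per_k(results)
for k, name in best.items():
    print(k, name, results[name][k])

# Check the LFF-versus-MI comparison at every k.
lff_beats_mi = all(results["LFF"][k] > results["MI"][k] for k in results["LFF"])
print("LFF > MI everywhere:", lff_beats_mi)
```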

Tue Apr 18 22:47:58 MDT 2006

Update:

I found a bug in my clusterer: it was always using the most-frequent-feature ranking block of code, no matter which ranking was requested, which completely invalidates all the results from marylou thus far. I have remedied the situation and am currently running 35 jobs (well, 33; 2 won't start for some reason). Hopefully I will have more results in a day or so.

Latest results from marylou4:

data3/full_set LFF

50,28.36
70,37.23
90,40.95
110,45.56
130,44.68

data3/full_set IDF-1

50,28.36
70,37.23
90,40.95
110,45.56
130,44.68

data3/full_set TF-1

50,28.36
70,37.23
90,40.95
110,45.56
130,44.68

Latest results for data3/full_set and LFF clusters with multiple labels.

50,28.36
70,35.81
90,39.53
110,42.55

Latest results for data3/full_set and LFF clusters. This is with predictions limited to one label.

k,accuracy
50,16.47
70,23.76
90,23.99
110,27.35

I haven't run a large enough batch yet, but I'm working on that. In the meantime, culling clusters whose distributions are mostly uniform does help as the number of clusters increases.

Here's the confusion matrix for 150 clusters with culling turned on:

		GVOTE		GREL		GENT		
GVOTE		0.96825		0.01587		0.01587		
GREL		0.03703		0.77777		0.18518		
GENT		0.08333		0.05555		0.86111		
Accuracy: 0.8968253968253969

and without culling:

		GVOTE		GREL		GENT		
GVOTE		0.98412		0.00000		0.01587		
GREL		0.14814		0.77777		0.07407		
GENT		0.16666		0.11111		0.72222		
Accuracy: 0.8650793650793651

So, we get an increase of about 3.2 percentage points in accuracy (0.8651 to 0.8968). 11 clusters were culled, with epsilon = .0001 (this is the variable for estimating distributional uniformity for a cluster).
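
A rough sketch of what epsilon-based culling could look like. The notes do not say which uniformity measure the clusterer uses. The KL divergence from the uniform distribution used here is an assumption, as are the function names and the example cluster distributions.

```python
import math


def kl_from_uniform(dist):
    """KL(dist || uniform) in nats; 0 when dist is exactly uniform.

    Assumed uniformity measure, not necessarily the one in the clusterer.
    """
    n = len(dist)
    return sum(p * math.log(p * n) for p in dist if p > 0)


def cull_uniform_clusters(cluster_dists, epsilon=1e-4):
    """Keep only clusters whose class distribution is at least epsilon away from uniform."""
    return {
        cid: dist
        for cid, dist in cluster_dists.items()
        if kl_from_uniform(dist) >= epsilon
    }


# Hypothetical per-cluster distributions over the three classes.
clusters = {
    "c0": [0.90, 0.05, 0.05],     # informative: kept
    "c1": [1 / 3, 1 / 3, 1 / 3],  # exactly uniform: culled
    "c2": [0.334, 0.333, 0.333],  # nearly uniform: culled
}
kept = cull_uniform_clusters(clusters)
print(sorted(kept))
```

A cluster whose class distribution is nearly uniform says almost nothing about the label, so dropping it removes a feature that mostly adds noise.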

Cluster (100) initialization by token descending frequency feature ranking:

Now testing classifier: Naive Bayes Multinomial (Mark's)
		GVOTE		GREL		GENT		
GVOTE		0.96825		0.01587		0.01587		
GREL		0.25925		0.62962		0.11111		
GENT		0.22222		0.11111		0.66666		
Accuracy: 0.8095238095238095

Cluster (100) initialization by token ascending frequency feature ranking:

Now testing classifier: Naive Bayes Multinomial (Mark's)
		GVOTE		GREL		GENT		
GVOTE		0.98412		0.01587		0.00000		
GREL		0.03703		0.92592		0.03703		
GENT		0.11111		0.16666		0.72222		
Accuracy: 0.8968253968253969
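
For comparison, the descending and ascending frequency rankings differ only in sort order. The sketch below ranks tokens by corpus frequency and seeds k singleton clusters from the top of the ranking. Seeding with singletons is an assumption about how initialization works, and the function names and token counts are hypothetical.

```python
from collections import Counter


def rank_by_frequency(token_counts, descending=True):
    """Return tokens sorted by frequency; ties are broken alphabetically."""
    return sorted(
        token_counts,
        key=lambda t: (-token_counts[t] if descending else token_counts[t], t),
    )


def seed_clusters(ranking, k):
    """Assumed initialization: the first k ranked tokens each start their own cluster."""
    return [{token} for token in ranking[:k]]


# Hypothetical corpus counts.
counts = Counter({"the": 50, "vote": 12, "church": 7, "film": 3})

descending = rank_by_frequency(counts, descending=True)
ascending = rank_by_frequency(counts, descending=False)
print(descending)
print(ascending)
print(seed_clusters(ascending, 2))
```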

Cluster (200) initialization by token ascending frequency feature ranking:

Now testing classifier: Naive Bayes Multinomial (Mark's)
		GVOTE		GREL		GENT		
GVOTE		0.98412		0.01587		0.00000		
GREL		0.18518		0.74074		0.07407		
GENT		0.19444		0.05555		0.75000		
Accuracy: 0.8650793650793651

Cluster (50) initialization by token ascending frequency feature ranking:

Now testing classifier: Naive Bayes Multinomial (Mark's)
		GVOTE		GREL		GENT		
GVOTE		0.96825		0.01587		0.01587		
GREL		0.22222		0.66666		0.11111		
GENT		0.05555		0.13888		0.80555		
Accuracy: 0.8571428571428571

Cluster (90) initialization by token ascending frequency feature ranking:

Now testing classifier: Naive Bayes Multinomial (Mark's)
		GVOTE		GREL		GENT		
GVOTE		0.93650		0.03174		0.03174		
GREL		0.11111		0.81481		0.07407		
GENT		0.08333		0.00000		0.91666		
Accuracy: 0.9047619047619048

Cluster (80) initialization by token ascending frequency feature ranking:

Now testing classifier: Naive Bayes Multinomial (Mark's)
		GVOTE		GREL		GENT		
GVOTE		0.95238		0.00000		0.04761		
GREL		0.22222		0.62962		0.14814		
GENT		0.08333		0.02777		0.88888		
Accuracy: 0.8650793650793651

Cluster (110) initialization by token ascending frequency feature ranking:

Now testing classifier: Naive Bayes Multinomial (Mark's)
		GVOTE		GREL		GENT		
GVOTE		0.92063		0.03174		0.04761		
GREL		0.07407		0.85185		0.07407		
GENT		0.05555		0.08333		0.86111		
Accuracy: 0.8888888888888888
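
The confusion matrices in this entry are row-normalized, so the overall accuracy cannot be read off the diagonal directly; each row has to be weighted by its class size. The sketch below does that for the two culling matrices. The class sizes (63, 27, 36) are not stated anywhere in these notes; they are inferred from the fractions (for example 0.01587 is about 1/63) and should be treated as an assumption.

```python
def overall_accuracy(row_normalized, support):
    """Overall accuracy from a row-normalized confusion matrix and per-class counts."""
    correct = sum(row[i] * n for i, (row, n) in enumerate(zip(row_normalized, support)))
    return correct / sum(support)


# Assumed test-set sizes for GVOTE, GREL, GENT (inferred, not from the notes).
support = [63, 27, 36]

with_culling = [
    [0.96825, 0.01587, 0.01587],
    [0.03703, 0.77777, 0.18518],
    [0.08333, 0.05555, 0.86111],
]
without_culling = [
    [0.98412, 0.00000, 0.01587],
    [0.14814, 0.77777, 0.07407],
    [0.16666, 0.11111, 0.72222],
]

acc_with = overall_accuracy(with_culling, support)
acc_without = overall_accuracy(without_culling, support)
print(round(acc_with, 4), round(acc_without, 4), round(acc_with - acc_without, 4))
```

Under these assumed class sizes the results agree with the printed accuracies to within rounding of the matrix entries.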

nlp/distributional-word-clustering.txt · Last modified: 2015/04/23 15:43 by ryancha
