Differences

This shows you the differences between two versions of the page.

cs-401r:adjusted-rand-index [2014/11/18 03:36]
cs401rPML [Adjusted Rand Index] clarified that ARI is defined for clusterings with different numbers of clusters.
cs-401r:adjusted-rand-index [2014/11/18 03:38] (current)
cs401rPML [Adjusted Rand Index] Added note about number of clusters in the clusterings.
Line 4: Line 4:
  
 An ARI score of 1 indicates that the two clusterings are the same apart from the cluster ID/name.
 +
 +Note that the ARI can be computed even for clusterings with a different number of clusters in each clustering.
  
 To compute the ARI, start by building the contingency table (similar to a confusion matrix) for the two clusterings. Say our data is the set of items (e.g., documents) $\{A, B, C, D, E, F\}$, that the gold-standard (GS) clustering is $(\{A, D\}, \{B,C\}, \{E,F\})$---where $\{A,D\}$ is the first cluster, $\{B,C\}$ is the second, and so on---and that the EM clustering is $(\{A, B\}, \{E,F\}, \{C,D\})$. The contingency table is then filled in by calculating the size of the intersection of each EM cluster with each GS cluster:
Line 59: Line 61:
 ^  EM' 3         |  $|\emptyset| = 0$ |  $|\{C,D\}| = 2$ |  $|\emptyset| = 0$ |  2   |
 ^  Column Sums  |  2  |  2  |  2  |  6  |
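
The contingency-table construction described above, and the ARI computed from it, can be sketched in Python as follows. This is a minimal illustration, not part of the original tutorial: the function names `contingency_table` and `adjusted_rand_index` are hypothetical, the ARI is computed with the standard pair-counting (Hubert and Arabie) formula, and returning 1.0 in the degenerate case where the denominator is zero is an assumption. The data are the GS and EM clusterings of $\{A, B, C, D, E, F\}$ from the example text.

```python
from math import comb


def contingency_table(rows, cols):
    """Entry [i][j] is the size of the intersection of rows[i] and cols[j]."""
    return [[len(r & c) for c in cols] for r in rows]


def adjusted_rand_index(rows, cols):
    """ARI of two clusterings, each given as a list of disjoint sets of items.

    The two clusterings may contain different numbers of clusters.
    """
    table = contingency_table(rows, cols)
    n = sum(sum(row) for row in table)

    # Pairs of items placed together in both clusterings, in each row
    # clustering, and in each column clustering.
    sum_cells = sum(comb(v, 2) for row in table for v in row)
    sum_rows = sum(comb(sum(row), 2) for row in table)
    sum_cols = sum(comb(sum(col), 2) for col in zip(*table))

    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # assumption: fewer than two items counts as agreement

    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # assumption: degenerate case (e.g. one cluster on both sides)
    return (sum_cells - expected) / (max_index - expected)


# Clusterings from the example text.
gs = [{"A", "D"}, {"B", "C"}, {"E", "F"}]
em = [{"A", "B"}, {"E", "F"}, {"C", "D"}]

# A hypothetical two-cluster result, to show that the cluster counts may differ.
em_two = [{"A", "B", "C"}, {"D", "E", "F"}]

print(contingency_table(em, gs))  # [[1, 1, 0], [0, 0, 2], [1, 1, 0]]
print(adjusted_rand_index(em, gs))
print(adjusted_rand_index(em_two, gs))
```

For the example clusterings the ARI works out to 1/6 (about 0.167), and comparing the GS clustering with a relabeled copy of itself gives exactly 1, consistent with the note that a score of 1 means the clusterings agree apart from cluster IDs.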
- 
-Further note that the ARI can be computed even for clusterings with a different number of clusters. For example, the GS with three clusters can be compared to an EM'' clustering with fewer than three clusters or more than three clusters.
 === Acknowledgements ===
  
 Thanks to Mike Brodie and Abraham Frandsen for help in developing this tutorial in a Google Group discussion.
cs-401r/adjusted-rand-index.txt · Last modified: 2014/11/18 03:38 by cs401rPML
CC Attribution-Share Alike 4.0 International