This shows you the differences between two versions of the page.


cs-401r:adjusted-rand-index [2014/11/18 03:36] cs401rPML [Adjusted Rand Index] clarified that ARI is defined for clusterings with different numbers of clusters. |
cs-401r:adjusted-rand-index [2014/11/18 03:38] (current) cs401rPML [Adjusted Rand Index] Added note about number of clusters in the clusterings. |
||
---|---|---|---|

Line 4: | Line 4: | ||

An ARI score of 1 indicates that the two clusterings are the same apart from the cluster ID/name. | An ARI score of 1 indicates that the two clusterings are the same apart from the cluster ID/name. | ||

+ | |||

+ | Note that the ARI can be computed even when the two clusterings have different numbers of clusters. | ||

To compute the ARI, start by building the contingency table (similar to a confusion matrix) for the two clusterings. Say our data is the set of items (e.g., documents) $\{A, B, C, D, E, F\}$, that the gold-standard (GS) clustering is $(\{A, D\}, \{B,C\}, \{E,F\})$---where $\{A,D\}$ is the first cluster, $\{B,C\}$ is the second and so on---and that the EM clustering is $(\{A, B\}, \{E,F\}, \{C,D\})$. The contingency table is then filled in by calculating the size of the intersection of each EM cluster with each GS cluster: | To compute the ARI, start by building the contingency table (similar to a confusion matrix) for the two clusterings. Say our data is the set of items (e.g., documents) $\{A, B, C, D, E, F\}$, that the gold-standard (GS) clustering is $(\{A, D\}, \{B,C\}, \{E,F\})$---where $\{A,D\}$ is the first cluster, $\{B,C\}$ is the second and so on---and that the EM clustering is $(\{A, B\}, \{E,F\}, \{C,D\})$. The contingency table is then filled in by calculating the size of the intersection of each EM cluster with each GS cluster: | ||

Line 59: | Line 61: | ||

^ EM' 3 | $|\emptyset| = 0$ | $|\{C,D\}| = 2$ | $|\emptyset| = 0$ | 2 | | ^ EM' 3 | $|\emptyset| = 0$ | $|\{C,D\}| = 2$ | $|\emptyset| = 0$ | 2 | | ||

^ Column Sums | 2 | 2 | 2 | 6 | | ^ Column Sums | 2 | 2 | 2 | 6 | | ||
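
The contingency-table construction can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the tutorial: the function names `contingency_table` and `adjusted_rand_index` are hypothetical, the ARI is assumed to be the standard pair-counting (Hubert and Arabie) form, and the degenerate-case handling is a convention chosen here. The data is the GS and EM clusterings $(\{A, D\}, \{B,C\}, \{E,F\})$ and $(\{A, B\}, \{E,F\}, \{C,D\})$ from the example.

```python
from math import comb


def contingency_table(em_clusters, gs_clusters):
    """Rows are EM clusters, columns are GS clusters.

    Each cell holds the size of the intersection of one EM cluster
    with one GS cluster.
    """
    return [[len(em & gs) for gs in gs_clusters] for em in em_clusters]


def adjusted_rand_index(table):
    """ARI from a contingency table (assumed: standard pair-counting form)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)

    # Pairs of items that are together in both clusterings.
    index = sum(comb(cell, 2) for row in table for cell in row)
    # Pairs that are together within each clustering on its own.
    row_pairs = sum(comb(r, 2) for r in row_sums)
    col_pairs = sum(comb(c, 2) for c in col_sums)

    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # assumption: fewer than two items counts as agreement
    expected = row_pairs * col_pairs / total_pairs
    max_index = (row_pairs + col_pairs) / 2
    if max_index == expected:
        return 1.0  # assumption: degenerate clusterings count as identical
    return (index - expected) / (max_index - expected)


gs = [{"A", "D"}, {"B", "C"}, {"E", "F"}]
em = [{"A", "B"}, {"E", "F"}, {"C", "D"}]

table = contingency_table(em, gs)
ari = adjusted_rand_index(table)
print(table)          # [[1, 1, 0], [0, 0, 2], [1, 1, 0]]
print(round(ari, 4))  # 0.1667

# The two clusterings need not have the same number of clusters:
em_two = [{"A", "B", "C"}, {"D", "E", "F"}]
ari_two = adjusted_rand_index(contingency_table(em_two, gs))
```

Because the table is indexed by EM cluster on one axis and GS cluster on the other, nothing in the computation requires the two clusterings to have the same number of clusters; `em_two` above simply yields a table with two rows and three columns.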

- | |||

- | Further note that the ARI can be computed even for clusterings with a different number of clusters. For example, the GS with three clusters can be compared to an EM'' clustering with fewer than three clusters or more than three clusters. | ||

=== Acknowledgements === | === Acknowledgements === | ||

Thanks to Mike Brodie and Abraham Frandsen for help in developing this tutorial in a Google Group discussion. | Thanks to Mike Brodie and Abraham Frandsen for help in developing this tutorial in a Google Group discussion. |