The "Adjusted Rand Index" (ARI) is a data clustering metric that measures the similarity between two clusterings. A clustering is the assignment of each data item to one cluster. In Assignment 6 one clustering comes from the EM/CEM results and the other from the gold-standard class labels. In general however, the ARI could be computed between two automated clusterings.

An ARI score of 1 indicates that the two clusterings are the same apart from the cluster ID/name.

Note that the ARI can be computed even for clusterings with a different number of clusters in each clustering.

To compute the ARI start by building the contingency table (similar to a confusion matrix) for the two clusterings. Say our data is the set of items (e.g., documents) $\{A, B, C, D, E, F\}$, that the gold-standard (GS) clustering is $(\{A, D\}, \{B,C\}, \{E,F\})$—where $\{A,D\}$ is the first cluster, $\{B,C\}$ is the second and so on—and that the EM clusterings is $(\{A, B\}, \{E,F\}, \{C,D\})$. The contingency table is then filled in by calculating the size of the intersection of each EM cluster with each GS cluster:

GS 1 GS 2 GS 3 Row Sums
EM 1 $|\{A\}| = 1$ $|\{B\}| = 1$ $|\emptyset| = 0$ 2
EM 2 $|\emptyset| = 0$ $|\emptyset| = 0$ $|\{E,F\}| = 2$ 2
EM 3 $|\{D\}| = 1$ $|\{C\}| = 1$ $|\emptyset| = 0$ 2
Column Sums 2 2 2 6

Let $n_{i,j}$ be the count in the cell at row $i$ column $j$, $a_i$ the sum for row $i$, $b_j$ the sum for column $j$, and $n$ the total count in the table. For our example:

GS 1 GS 2 GS 3 Row Sums
EM 1 $n_{1,1} = 1$ $n_{1,2} = 1$ $n_{1,3} = 0$ $a_1 = 2$
EM 2 $n_{2,1} = 0$ $n_{2,2} = 0$ $n_{2,3} = 2$ $a_2 = 2$
EM 3 $n_{3,1} = 1$ $n_{3,2} = 1$ $n_{3,3} = 0$ $a_3 = 2$
Column Sums $b_1 = 2$ $b_2 = 2$ $b_3 = 2$ $n = 6$

The ARI is then calculated with the following equation (copied from Wikipedia):

$ARI = \frac{ \sum_{ij} \binom{n_{ij}}{2} - [\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}] / \binom{n}{2} }{ \frac{1}{2} [\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}] - [\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}] / \binom{n}{2} }$

The notation $\binom{m}{k}$ is standard m-choose-k notation (see Binomial coefficient):

$\binom{m}{k} = \frac{m!}{k!(m-k)!}$ for $1 \leq k \leq m$

$\binom{m}{k} = 0$ for $k < 0, m < k$

For efficiency, for $m \geq 2$, $\binom{m}{2} = \frac{m!}{2!(m-2)!} = \frac{m(m-1)(m-2)!}{2!(m-2)!} = \frac{m(m-1)}{2}$. Similar patterns hold for other values of $k$.

For our example we will compute portions of the ARI and then combine them together, like so:

$ARI = \frac{ A - \frac{B}{C} }{ \frac{D}{2} - \frac{B}{C} }$

where $A = \sum_{i,j} \binom{n_{i,j}}{2}$, $B = \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}$, $C = \binom{n}{2}$, and $D = \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}$.

$A = \sum_{i,j} \binom{n_{i,j}}{2} = 4 \binom{0}{2} + 4 \binom{1}{2} + \binom{2}{2} = 4 * 0 + 4 * 0 + 1 = 1$

$B = \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} = (3 \binom{2}{2})(3 \binom{2}{2}) = (3 * 1) (3 * 1) = 9$

$C = \binom{n}{2} = \binom{6}{2} = \frac{6 * 5}{2} = 15$

$D = \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2} = (3 \binom{2}{2}) + (3 \binom{2}{2}) = (3 * 1) + (3 * 1) = 6$

Putting the values together, then:

$ARI = \frac{ A - \frac{B}{C} }{ \frac{D}{2} - \frac{B}{C} } = \frac{ 1 - \frac{9}{15} }{ \frac{6}{2} - \frac{9}{15} } = \frac{ 1 - \frac{3}{5} }{ 3 - \frac{3}{5} } = \frac{ \frac{2}{5} }{ \frac{12}{5} } = \frac{2}{12} = \frac{1}{6}$.

If the two clusterings were perfectly aligned, then $A = 6 \binom{6}{2} + 3 \binom{2}{2} = 6 * 0 + 3 * 1 = 3$ and $ARI = 1$. For example, for a clustering EM' versus the GS, a perfect alignment could be as follows, where you will note that cluster IDs are not the same but cluster contents are:

GS 1 GS 2 GS 3 Row Sums
EM' 1 $|\{A,B\}| = 2$ $|\emptyset| = 0$ $|\emptyset| = 0$ 2
EM' 2 $|\emptyset| = 0$ $|\emptyset| = 0$ $|\{E,F\}| = 2$ 2
EM' 3 $|\emptyset| = 0$ $|\{C,D\}| = 2$ $|\emptyset| = 0$ 2
Column Sums 2 2 2 6

### Acknowledgements

Thanks to Mike Brodie and Abraham Frandsen for help in developing this tutorial in a Google Group discussion.