The "Adjusted Rand Index" (ARI) is a data clustering metric that measures the similarity between two clusterings. A clustering is the assignment of each data item to one cluster. In Assignment 6 one clustering comes from the EM/CEM results and the other from the gold-standard class labels. In general however, the ARI could be computed between two automated clusterings.
An ARI score of 1 indicates that the two clusterings are the same apart from the cluster ID/name.
Note that the ARI can be computed even for clusterings with a different number of clusters in each clustering.
To compute the ARI start by building the contingency table (similar to a confusion matrix) for the two clusterings. Say our data is the set of items (e.g., documents) $\{A, B, C, D, E, F\}$, that the gold-standard (GS) clustering is $(\{A, D\}, \{B,C\}, \{E,F\})$—where $\{A,D\}$ is the first cluster, $\{B,C\}$ is the second and so on—and that the EM clusterings is $(\{A, B\}, \{E,F\}, \{C,D\})$. The contingency table is then filled in by calculating the size of the intersection of each EM cluster with each GS cluster:
GS 1 | GS 2 | GS 3 | Row Sums | |
---|---|---|---|---|
EM 1 | $|\{A\}| = 1$ | $|\{B\}| = 1$ | $|\emptyset| = 0$ | 2 |
EM 2 | $|\emptyset| = 0$ | $|\emptyset| = 0$ | $|\{E,F\}| = 2$ | 2 |
EM 3 | $|\{D\}| = 1$ | $|\{C\}| = 1$ | $|\emptyset| = 0$ | 2 |
Column Sums | 2 | 2 | 2 | 6 |
Let $n_{i,j}$ be the count in the cell at row $i$ column $j$, $a_i$ the sum for row $i$, $b_j$ the sum for column $j$, and $n$ the total count in the table. For our example:
GS 1 | GS 2 | GS 3 | Row Sums | |
---|---|---|---|---|
EM 1 | $n_{1,1} = 1$ | $n_{1,2} = 1$ | $n_{1,3} = 0$ | $a_1 = 2$ |
EM 2 | $n_{2,1} = 0$ | $n_{2,2} = 0$ | $n_{2,3} = 2$ | $a_2 = 2$ |
EM 3 | $n_{3,1} = 1$ | $n_{3,2} = 1$ | $n_{3,3} = 0$ | $a_3 = 2$ |
Column Sums | $b_1 = 2$ | $b_2 = 2$ | $b_3 = 2$ | $n = 6$ |
The ARI is then calculated with the following equation (copied from Wikipedia):
$ARI = \frac{ \sum_{ij} \binom{n_{ij}}{2} - [\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}] / \binom{n}{2} }{ \frac{1}{2} [\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}] - [\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}] / \binom{n}{2} }$
The notation $\binom{m}{k}$ is standard m-choose-k notation (see Binomial coefficient):
$\binom{m}{k} = \frac{m!}{k!(m-k)!}$ for $1 \leq k \leq m$
$\binom{m}{k} = 0$ for $k < 0, m < k$
For efficiency, for $m \geq 2$, $\binom{m}{2} = \frac{m!}{2!(m-2)!} = \frac{m(m-1)(m-2)!}{2!(m-2)!} = \frac{m(m-1)}{2}$. Similar patterns hold for other values of $k$.
For our example we will compute portions of the ARI and then combine them together, like so:
$ARI = \frac{ A - \frac{B}{C} }{ \frac{D}{2} - \frac{B}{C} }$
where $A = \sum_{i,j} \binom{n_{i,j}}{2}$, $B = \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}$, $C = \binom{n}{2}$, and $D = \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}$.
$A = \sum_{i,j} \binom{n_{i,j}}{2} = 4 \binom{0}{2} + 4 \binom{1}{2} + \binom{2}{2} = 4 * 0 + 4 * 0 + 1 = 1$
$B = \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} = (3 \binom{2}{2})(3 \binom{2}{2}) = (3 * 1) (3 * 1) = 9$
$C = \binom{n}{2} = \binom{6}{2} = \frac{6 * 5}{2} = 15$
$D = \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2} = (3 \binom{2}{2}) + (3 \binom{2}{2}) = (3 * 1) + (3 * 1) = 6$
Putting the values together, then:
$ARI = \frac{ A - \frac{B}{C} }{ \frac{D}{2} - \frac{B}{C} } = \frac{ 1 - \frac{9}{15} }{ \frac{6}{2} - \frac{9}{15} } = \frac{ 1 - \frac{3}{5} }{ 3 - \frac{3}{5} } = \frac{ \frac{2}{5} }{ \frac{12}{5} } = \frac{2}{12} = \frac{1}{6}$.
If the two clusterings were perfectly aligned, then $A = 6 \binom{6}{2} + 3 \binom{2}{2} = 6 * 0 + 3 * 1 = 3$ and $ARI = 1$. For example, for a clustering EM' versus the GS, a perfect alignment could be as follows, where you will note that cluster IDs are not the same but cluster contents are:
GS 1 | GS 2 | GS 3 | Row Sums | |
---|---|---|---|---|
EM' 1 | $|\{A,B\}| = 2$ | $|\emptyset| = 0$ | $|\emptyset| = 0$ | 2 |
EM' 2 | $|\emptyset| = 0$ | $|\emptyset| = 0$ | $|\{E,F\}| = 2$ | 2 |
EM' 3 | $|\emptyset| = 0$ | $|\{C,D\}| = 2$ | $|\emptyset| = 0$ | 2 |
Column Sums | 2 | 2 | 2 | 6 |
Thanks to Mike Brodie and Abraham Frandsen for help in developing this tutorial in a Google Group discussion.