We have implemented a NW-alignment algorithm in Java. We then implemented a banded NW-aligner. We then optimized the banded NW-aligner to run in linear space.
minPercOverlap banded_identity band_noband_identity unbandedTime bandedTime optimizedBandTime totalAlns 0.03125 1.0 0.893 3.0931 0.7097 0.4767 225.0 0.0625 1.0 0.956 2.9003 0.8283 0.5601 225.0 0.125 1.0 0.9734 2.94 1.1837 0.9215 225.0 0.25 1.0 0.9867 2.925 1.8431 1.569 225.0 0.5 1.0 1.0 2.9287 2.9216 2.493 225.0 1.0 1.0 1.0 2.9299 3.8145 2.5307 225.0
Initially the optimizedBandTime was running quite a bit slower. I greased it up by minimizing repetitive calculations and now it seems to be working faster than the unbanded aligner even when the “band” is the full sequence length. It's interesting to see that with the minimum percent overlap at 50%, the banded alignment still finds the same alignment as the unbanded 100% of the time. These results are exact character matching by the way, and I imagine those numbers will go up slightly if we change everything to lowercase:
minPercOverlap banded_identity band_noband_identity unbandedTime bandedTime optimizedBandTime totalAlns 0.03125 1.0 0.8978 2.9952 0.7033 0.4429 225.0 0.0625 1.0 0.96 3.0146 0.869 0.5882 225.0 0.125 1.0 0.98 2.924 1.1947 0.9135 225.0 0.25 1.0 0.9867 2.9173 1.8567 1.576 225.0 0.5 1.0 1.0 2.9185 2.996 2.5493 225.0 1.0 1.0 1.0 2.9427 3.925 2.6588 225.0
Interesting that the band_noband_identity went up slightly, but only for smaller bands. What this means is that there are some cases where case makes a difference, but with sufficiently large context, it figures it out anyway. Also keep in mind that the listed times are for all 225 alignments. The per alignment time is thus on the order of .012 seconds. Also the times are averages of 10 iterations.
The graph on the left shows how often the alignment results from the banded and unbanded alignment algorithms are identical. Even with a very, very small bandwidth (i.e., .03% of the sequence length), the alignments are identical nearly 90% of the time. That number reaches 100% when the bandwidth is 50% of the sequence length. This to me says that what we are aligning has a pretty no-duh, almost-no-indels answer most of the time. There are about 10% of the lyrics that have metadata in them that require a bit bigger bandwidth to get the optimal alignment.
In terms of the time, the smaller the bandwidth, the faster the banded alignment is. The unbanded alignment is obviously unaffected by the bandwidth. The difference between the banded and the optimized banded is that the banded uses linear time but polynomial space; the optimized uses both linear time and linear space (p.s., this was a lot harder to implement than I expected :)). The reason that the optimized runs a little faster than the banded or (even the unbanded when bandwith ratio is greater than .6) has less to do with the time savings of linear space usage and more to do with the fact that I greased up the optimized aligner to frontload all possible repetitive calculations. Thus the red (and blue) line could also be greased to appropriately match the green line.
Results are averages of 10 iterations on 225 alignments of Billy Joel lyrics (no tabs).
So, my plan then will be to do a crude MSA of all lyrics to get a “gold standard” lyric for each song (presumably with no metadata); then align the gold standard lyric to the tabs to distinguish the actual song content in the tab from the metadata (and simultaneously check the completeness in terms of song coverage of the tablature). Graphs: