Alignment

We have implemented a NW-alignment algorithm in Java. We then implemented a banded NW-aligner. We then optimized the banded NW-aligner to run in linear space.

minPercOverlap	banded_identity	band_noband_identity	unbandedTime	bandedTime	optimizedBandTime	totalAlns
0.03125		1.0		0.893			3.0931		0.7097		0.4767			225.0
0.0625		1.0		0.956			2.9003		0.8283		0.5601			225.0
0.125		1.0		0.9734			2.94		1.1837		0.9215			225.0
0.25		1.0		0.9867			2.925		1.8431		1.569			225.0
0.5		1.0		1.0			2.9287		2.9216		2.493			225.0
1.0		1.0		1.0			2.9299		3.8145		2.5307			225.0

Initially the optimizedBandTime was running quite a bit slower. I greased it up by minimizing repetitive calculations and now it seems to be working faster than the unbanded aligner even when the “band” is the full sequence length. It's interesting to see that with the minimum percent overlap at 50%, the banded alignment still finds the same alignment as the unbanded 100% of the time. These results are exact character matching by the way, and I imagine those numbers will go up slightly if we change everything to lowercase:

minPercOverlap	banded_identity	band_noband_identity	unbandedTime	bandedTime	optimizedBandTime	totalAlns
0.03125		1.0		0.8978			2.9952		0.7033		0.4429			225.0
0.0625		1.0		0.96			3.0146		0.869		0.5882			225.0
0.125		1.0		0.98			2.924		1.1947		0.9135			225.0
0.25		1.0		0.9867			2.9173		1.8567		1.576			225.0
0.5		1.0		1.0			2.9185		2.996		2.5493			225.0
1.0		1.0		1.0			2.9427		3.925		2.6588			225.0

Interesting that the band_noband_identity went up slightly, but only for smaller bands. What this means is that there are some cases where case makes a difference, but with sufficiently large context, it figures it out anyway. Also keep in mind that the listed times are for all 225 alignments. The per alignment time is thus on the order of .012 seconds. Also the times are averages of 10 iterations.

The graph on the left shows how often the alignment results from the banded and unbanded alignment algorithms are identical. Even with a very, very small bandwidth (i.e., .03% of the sequence length), the alignments are identical nearly 90% of the time. That number reaches 100% when the bandwidth is 50% of the sequence length. This to me says that what we are aligning has a pretty no-duh, almost-no-indels answer most of the time. There are about 10% of the lyrics that have metadata in them that require a bit bigger bandwidth to get the optimal alignment.

In terms of the time, the smaller the bandwidth, the faster the banded alignment is. The unbanded alignment is obviously unaffected by the bandwidth. The difference between the banded and the optimized banded is that the banded uses linear time but polynomial space; the optimized uses both linear time and linear space (p.s., this was a lot harder to implement than I expected :)). The reason that the optimized runs a little faster than the banded or (even the unbanded when bandwith ratio is greater than .6) has less to do with the time savings of linear space usage and more to do with the fact that I greased up the optimized aligner to frontload all possible repetitive calculations. Thus the red (and blue) line could also be greased to appropriately match the green line.

Results are averages of 10 iterations on 225 alignments of Billy Joel lyrics (no tabs).

So, my plan then will be to do a crude MSA of all lyrics to get a “gold standard” lyric for each song (presumably with no metadata); then align the gold standard lyric to the tabs to distinguish the actual song content in the tab from the metadata (and simultaneously check the completeness in terms of song coverage of the tablature). Graphs:

mind/alignment.txt · Last modified: 2016/05/13 13:26 by norkish

Back to top