Pop*

Pop* (pronounced Pop-Star) is a system developed by Paul Bodily, starting in Winter 2016, to automate the composition of pop songs. A fundamental element of the project is automating the annotation of tablature data found in publicly accessible online databases such as ultimate-guitar.com. This venture is titled Tab-Complete. The purpose of this wiki is to document the research for both the Pop* system and the Tab-Complete dataset.

Project Directory

Potential Publications

  • Sequence alignment methods for identifying structure in music
    • Pairwise MSA for identifying consensus of several lyric sources
    • Lyrical sequence alignment for identifying musical content of tab sources
    • Aligning pairwise alignments of lyric lines for chorus identification (alignment of alignments)
    • Aligning pairwise alignments of chord lines for verse identification (alignment of alignments)
    • Phoneme sequence alignment for rhyme scheme detection
    • Pairwise alignment of intra-segmental chord lines for substructure identification
  • How do we get complete tabs? Compare MSA with pairwise Smith-Waterman techniques.
  • Using rhymes and a distribution metric to pull structure out of unstructured tabs
  • Comparison of key normalization techniques for chord sequences
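
As a point of reference for the pairwise techniques listed above, here is a minimal sketch of Smith-Waterman local alignment over word tokens, e.g. to locate a lyric passage inside a noisier tab line. The function name, the scores (match 2, mismatch -1, gap -1), and the use of a single linear gap penalty instead of separate gap open/extend costs are all illustrative assumptions, not the project's Aligner or its settings.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    a and b are sequences of comparable tokens (characters, words, chords).
    Gaps are represented by None in the aligned outputs.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; local alignment never lets a cell drop below 0.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(None)
            i -= 1
        else:
            out_a.append(None)
            out_b.append(b[j - 1])
            j -= 1
    return best, out_a[::-1], out_b[::-1]
```

Because the tokens only need to support equality, the same sketch applies to phoneme or chord sequences; a graduated score (e.g. partial credit for related chords) would replace the match/mismatch test.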

Upcoming Conferences

Assumptions Made

  • The number of chords per line doesn't equal the number of measures per line
  • All lines in a segment will have the same number of measures per line.
  • All Markov models (MMs) are first-order (single-order) right now
  • Lyric transition models are based on word→word transition frequencies, not syllable→syllable transition frequencies (which we may want to try)
  • We select the number of words (not syllables) per line, and find a way to make it work syllable-wise
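
To make the word→word assumption concrete, here is a minimal sketch of a first-order transition model trained on lyric lines. The function names, the `<s>`/`</s>` boundary tokens, and the whitespace tokenization are assumptions made for illustration; the project's own models may be structured differently.

```python
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical line-boundary tokens


def train_transitions(lines):
    """Count word -> next-word transitions, one lyric line at a time."""
    counts = defaultdict(Counter)
    for line in lines:
        tokens = [START] + line.lower().split() + [END]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def transition_prob(counts, prev, nxt):
    """Relative frequency of nxt following prev (0.0 if prev was never seen)."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0


def sample_line(counts, rng, max_words=12):
    """Sample words by walking the transition counts from START."""
    words, prev = [], START
    while len(words) < max_words:
        options = counts[prev]
        if not options:
            break
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        if nxt == END:
            break
        words.append(nxt)
        prev = nxt
    return words


counts = train_transitions(["let it be", "let it go", "let me be"])
line = sample_line(counts, random.Random(0))
```

Switching to syllable→syllable transitions would only change the tokenization step, which is why that step is kept separate here.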

Todo List

  • Make sure we're excluding tab info in opening lines, like in https://tabs.ultimate-guitar.com/e/eagles/its_your_world_now_ukulele_crd.htm
  • The lyricsnet scraper isn't scraping the Beatles
  • Include figures in manuscript
  • Get some gold-standard data to validate with and insert results into manuscript
  • Get a way for Ben to access verses/choruses on demand, etc.
  • Need better segmentation of songs to avoid a bunch of 1-line segments
  • Need better rhyme scheme analysis to get good rhyme constraints
  • Chords are being grouped only by root and minorness. Needs to be expanded to get more variety.
  • Need to check if rhyme constraints are working properly
  • Need to add constraints
    • on chorus to indicate which lines are fixed and which aren't (lower priority)
    • for subrhyme schemes
    • repetitive subsequences of chords
    • repetitive subsequences of lyrics
  • Start drafting paper on data process and alignment
  • Chorus must also match chords
  • Create gold standard labeled dataset to test various methods. Label
    • Key
    • Rhyme scheme
    • Internal rhymes
    • Segment structure
  • Normalize keys
    • Use neural network with one input feature for
      • each of 24 major/minor chords (double 0 to 1 representing frequency of chord)
      • first chord
      • last chord
    • and for output one node for each possible key
  • Use tab contents rather than lyric contents in finalized tab (use lyrics just for identification of song body and completeness)
  • When eliciting rhyme scheme, currently it's getting all pronunciations of the line and then taking the last few syllables of each (highly redundant). Fix it.
  • Compare different rhyme scoring algorithms:
    • Pat Patterson's rules
    • Hirjee Matrix simple scoring
    • Pat Patterson's rules w/ Hirjee Matrix
    • Alignment
    • With vs. without considering the penultimate syllable
    • Normalizing the Hirjee Matrix
    • Penalize for distance
  • Start parsing the scraped data
  • Get rid of blocks in tab with no chords && no matching lyrics
  • Explicitly set costs in Aligner before it is called.
  • Have TabComplete print out a tab-delimited file with columns holding values indicative of the quality at each step (don't filter initially, choose later)
    • May want to make this a function of how many sequences are being aligned?
  • Do thorough manual verification of pipeline up through tab parser
    • Check how complete the lyric consensuses are
    • Check that all the fields are being correctly populated
    • Check how many lyric sheets per song are being aligned
  • Create or find gold standard dataset
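
The input side of the key-normalization network described in the list above can be sketched as follows. This only builds the feature vector; it does not define or train a network. The chord grouping (root plus minorness) follows the note above about how chords are currently grouped. The function names and the choice to encode the first and last chord as a class index in 0..23 are assumptions; a one-hot encoding would be a reasonable alternative.

```python
from collections import Counter

PITCH = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def chord_class(chord):
    """Map a chord name to 0..23: majors are 0..11 by root, minors are 12..23."""
    root = PITCH[chord[0]]
    rest = chord[1:]
    if rest[:1] == "#":
        root, rest = root + 1, rest[1:]
    elif rest[:1] == "b":
        root, rest = root - 1, rest[1:]
    # "m" marks a minor chord unless it is the start of "maj".
    is_minor = rest.startswith("m") and not rest.startswith("maj")
    return root % 12 + (12 if is_minor else 0)


def key_features(chords):
    """Return 24 chord frequencies in [0, 1] plus first- and last-chord classes."""
    if not chords:
        raise ValueError("need at least one chord")
    classes = [chord_class(c) for c in chords]
    counts = Counter(classes)
    freq = [counts[k] / len(classes) for k in range(24)]
    return freq + [classes[0], classes[-1]]


features = key_features(["G", "Em", "C", "D", "G"])
```

The output layer (one node per possible key) would then be trained against the gold-standard key labels once that dataset exists.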

Should-do List

  • There are a number of known issues with the MSA consensus calling. It may be better to just use the text from the tab, rather than whatever mish-mash the consensus pulls out (e.g., mindeah).
    • Right now any alphabetic character will take precedence over a non-alphabetic character if there's a tie
    • If there are three different sequences with three different characters, one will get picked at random or may not get picked at all, depending on where gaps are inserted.
    • So much depends on the scoring costs
  • Find threshold for rhyming vs no rhyming
  • Try running on just good tabbers' tabs or on Pro tabs (requires subscription?) or those with videos?
  • Implement and give graduated scoring for relative major chord matching in chord alignment
  • Find best cost matrices for MSA (i.e., match cost, mismatch cost, gap open/extend costs)
  • For speed,
    • consider nixing the all v all pairwise alignments and just do pairwise MSA in random order
    • consider not using a map for the storage of MSA aligned sequences
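
The tie-breaking behavior noted above can be illustrated with a small column-wise consensus caller. This is a sketch under stated assumptions, not the project's consensus caller: the `-` gap symbol and the function name are hypothetical, counts are case-insensitive, and the random pick among tied characters is deliberately swapped for a deterministic one (alphabetic first, then character order).

```python
from collections import Counter

GAP = "-"  # hypothetical gap symbol in the aligned sequences


def consensus(aligned):
    """Call a consensus string from equal-length aligned sequences."""
    if len({len(s) for s in aligned}) > 1:
        raise ValueError("aligned sequences must have equal length")
    out = []
    for column in zip(*aligned):
        # Count case-insensitively so "H" and "h" vote for the same character.
        counts = Counter(ch.lower() for ch in column)
        # Highest count wins; on ties prefer alphabetic characters,
        # then fall back to character order so the result is reproducible.
        best = min(counts, key=lambda ch: (-counts[ch], not ch.isalpha(), ch))
        if best != GAP:
            out.append(best)
    return "".join(out)


print(consensus(["hey jude", "Hey-jude", "hey jud-"]))
```

Note that the result is lowercased as a side effect of case-insensitive counting; restoring the original case would need an extra pass.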

Completed

  • Use conditional probability distributions to model number of chords per line, and constraints relating to number of chords per line and repetition of chords in a line
    • We have a distribution of rhyme schemes conditioned on SegmentType
    • We need a distribution of subrhyme schemes conditioned on rhyme schemes (could also be conditioned on SegType)
    • We need a distribution of the number of chords per line conditioned on rhyme schemes (and possibly conditioned on SegType)
    • We need a distribution of repetitive subsequences of chords per line conditioned on rhyme scheme and SegType
    • We need a distribution of the number of words per line conditioned on subrhyme scheme
    • We need a distribution of repetitive subsequences of lyrics per line conditioned on rhyme scheme and SegType
    • We need a distribution of the variation in words per line between paired lines (as per rhyme scheme)
    • We need a distribution of chord transitions, including intersegmental chord transitions, starting chords, ending chords, etc.
  • Instead of keeping just counts, keep pointers to actual songs to be able to later reference source of decision
  • Test: Rhyme Structure sampler with actual constraints
  • Filter explicit songs
  • Complete tab alignment with assumption that chord indices (within line) are not correct and that blocks are not maintained
  • Find matching sheets to do MSA - we will start by only aligning those with matching names
  • Words not in the CMU dictionary?
    • Integrate G2PConverter into phoneticizer
  • Fix: Right now a one-line bridge between verses is screwing up distributions for lines per bridge, etc. Solution: treat such as interlude
  • Fix: When getting phonemes per line, let the phoneticizer (w/ rhyme stop words) decide which words to get phonemes for (right now it just gets the last two words, even if they’re stop words)
  • Account for different pronunciations in the CMU dictionary
  • Discovered the refactoring takes the same time for both string alignment and MSA. I'm a little concerned the MSA is going to take a long time to complete.
  • Test refactored code to see if alignments are the same before and after refactoring and the change in speed from generalizing
    • Compare the two MSA alignment algorithms on simple inputs: compare their scoring matrices and figure out why they compute differently, specifically considering a subsequence from a real case.
    • I discovered I'd never updated the fix for computing left costs for the old aligner AND simultaneously stumbled on a bug in the consensus caller that overestimated the count for a character that appeared in both upper and lower cases.
  • Verified that the refactored alignment code computes the same alignment for strings
  • Refactor sequence alignment code to easily allow alignment of non-char sequences (e.g., phonemes, chords)
  • Find out how unbanded and banded with minPercOverlap of 1.0 are different and fix it.
  • Can we grease up the banded?
  • n-dimensional MSA? or Pairwise?
  • Computer?
  • Implemented and compared alignment algorithms for aligning lyrics and tabs
  • Fix issues with scrapers:
    • UG: raw_tab, url, title, difficulty, key, provider, contributor, type all look good
    • eChords: raw_tab, url, title, difficulty, key, provider, contributor, type all look good
    • SongLyrics: url, title, provider, artist all look good
    • Metrolyrics: url, title, provider, artist all look good
    • LyricsNet: url, title, provider, artist all look good
  • Map role values to actual roles
  • Parse untagged chords in tabs
    • Legitimate Chords:
      • “Intro:G” or “Bm…” or “C|” or “Gmaj7 –” or “~D7” (1,2,3,4)
      • “C+” or “Bb5+”
      • “Em-Dm-C” (1)
      • “Ebsus” (1)
      • “Dm/C” or “F6/A” (1)
      • “A4” or “Am2” (1)
      • “D9sus4”, “Bm7b5”, “F#m7b5”,“Am7add11” (1,2)
      • “Dm13” (1)
    • Inferable Chords:
      • “/F” (1)
      • “-7” (from “C -7”, i.e., space) (1)
      • “Em7-5” where Em7 is labeled
      • “Δ7” or “º7” (1,2)
      • “Do” (1)
      • “Am^D7” (1)
    • Repeat Information:
      • “x4” or “repeat” or “(2x)” (1,2)
      • “||: :||” (1)
    • Salvageable?
    • Removable elements:
      • “(hold)” or anything in parens? (1)
    • Things to check:
      • Impact on embedded? (1)
      • Still seeing “=”?
      • Still seeing “de de da__ da da__ C” (1)?
  • In lyrics, replace runs of more than three consecutive identical letters with just two of them (solution)
  • Fix songlyrics.com scraper to not just get Billy Joel songs
  • Fix metrolyrics.com scraper to also get songs and artists with numbers in their name
    • Added “0-9” to the two LinkExtractors…
  • Check tab scrapers and lyrics.net to see if they're getting all the artists
    • Looks okay
  • Switch the chord/lyric tab parser to Java
  • Switch the lowerlevel loader/cleaner to Java
    • Fix missing artist in E-chords tabs
    • Remove extra column for lyrics
    • Fix difficulty and contributor parsing for UG

Answer questions:

  • Are there regularities in the tabs titles or artists that can be removed?
    • Removed all UG Guitar Pro Tabs
    • Removed type from UG song titles
    • Removed (),[] everywhere they appear in titles or artist names
    • Used lowercase, alphanumeric-only keys for artist and song names for indexing
    • Removed “ Tab” from UG artist name
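
A minimal sketch of the title and artist normalization listed above. The function names are hypothetical, and restricting keys to ASCII `a-z` and `0-9` is an assumption about what "alphanumeric-only" means here; accented letters would be dropped under this reading.

```python
import re


def strip_ug_artist_suffix(artist):
    """Drop the trailing " Tab" that UG appends to artist names."""
    suffix = " Tab"
    return artist[:-len(suffix)] if artist.endswith(suffix) else artist


def index_key(name):
    """Lowercase, alphanumeric-only key for matching songs and artists."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


print(index_key(strip_ug_artist_suffix("Eagles Tab")))   # eagles
print(index_key("It's Your World Now (Ukulele) [crd]"))  # itsyourworldnowukulelecrd
```

Keeping digits in the key matters for artists and songs with numbers in their names, which is the same case the metrolyrics scraper fix addressed.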

  • Are there regularities in the tabs content that can be removed?

  • Are there regularities in the lyric titles or artists that can be removed?
    • Removed (),[] everywhere they appear in titles or artist names

  • Are there regularities in the lyric content that can be removed?
    • For lyricsnet, the content always starts with the tag “<pre id="lyric-body-text" class="lyric-body" dir="ltr" data-lang="en">” and ends with the tag “</pre>”
    • For songlyrics, the content almost always starts with the tag “<p id="songLyricsDiv" class="songLyricsV14 iComment-text">” and ends with “</p>” (1186539/1186581); otherwise it has no lyrics at all.
    • For metrolyrics, the content always starts with the tag “<div id="lyrics-body-text">” and ends with the tag “</div>” (815749/815749)
    • Turns out all of these are consumed in the JSoup parsing
mind/pop.txt · Last modified: 2017/09/13 11:56 by edoyle91