Table of Contents

Down by the Bay

Code Base

Github Repo

Project Overview

Pachet creates rhythmic templates from existing lyrics and then uses Markov models with stress, rhyme, and POS constraints over lyrics to find other suitable replacements. However, this solution is quite limited. Take the rhythmic template [100, 1, 1, 10, 1, 1, 1, 01] shown above: although the phrase “innocence of a story i could leave today” satisfies the rhythmic constraints, the phrase “innocence of a story in an alleyway” ([100, 1, 1, 10, 1, 1, 101]) would not, despite being a very suitable lyric. This is one of many examples showing that the word-level rhythmic template is too restrictive in terms of what we might like to consider. Really we'd like to be able to consider any phrase that matches the syllable-level rhythmic template [100111011101].
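
To make the contrast concrete, here's a minimal sketch (templates hard-coded purely as an illustration) of why the two phrases differ at the word level but agree at the syllable level:

```python
# Word-level templates are lists of per-word stress strings; flattening
# them to the syllable level removes the word-boundary restriction.
def flatten(word_template):
    return "".join(word_template)

original  = ["100", "1", "1", "10", "1", "1", "1", "01"]  # "innocence of a story i could leave today"
candidate = ["100", "1", "1", "10", "1", "1", "101"]      # "innocence of a story in an alleyway"

print(original == candidate)                    # False: word boundaries differ
print(flatten(original) == flatten(candidate))  # True:  both flatten to '100111011101'
```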

How do we do this? We create a constrained Markov model over syllables. We can still constrain for rhyming (perhaps even more effectively since we're looking at phonemes). We can consider a greater breadth of solutions, but would need to sacrifice the POS template, which would break the Markov assumption in our case.
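
As a rough illustration of the idea (not the actual code base), here's a naive order-1 sketch of a Markov model over (syllable, stress) tokens whose transitions are filtered at generation time by a syllable-level stress template; a proper constrained Markov model à la Pachet would additionally propagate the constraints backwards so that dead ends cannot occur:

```python
import random
from collections import defaultdict

def train(corpus):
    """corpus: list of syllable sequences; each syllable is a (text, stress) pair."""
    transitions = defaultdict(list)
    for seq in corpus:
        for prev, nxt in zip(seq, seq[1:]):
            transitions[prev].append(nxt)
    return transitions

def generate(transitions, template, start):
    """template: stress string like '100111011101'; start: seed syllable matching template[0]."""
    seq = [start]
    for want in template[1:]:
        # keep only continuations whose stress matches the template position
        options = [s for s in transitions[seq[-1]] if s[1] == want]
        if not options:            # dead end; the naive version just gives up
            return None
        seq.append(random.choice(options))
    return [text for text, _ in seq]

# toy corpus: the example phrase split into (syllable, stress) pairs
corpus = [[("in", "1"), ("no", "0"), ("cence", "0"), ("of", "1"), ("a", "1"),
           ("sto", "1"), ("ry", "0"), ("i", "1"), ("could", "1"),
           ("leave", "1"), ("to", "0"), ("day", "1")]]

print(generate(train(corpus), "100111011101", ("in", "1")))
```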

The order would have to be sufficiently high to ensure that complete words are formed from the syllables, and syllables would probably need to be marked with their POS tag, and possibly their position within a word, to encourage grammatical cohesion and word cohesion respectively. We could then optionally constrain on certain POS tags at specific syllable positions if we wanted. The reality is that if we want to maintain the Markovian property and have internal rhymes, the order has to be long enough that both rhyming positions fall within the scope of the order.
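
One possible syllable-token representation along these lines (names are illustrative, not from the code base):

```python
from dataclasses import dataclass

# A syllable token carrying its stress, the POS tag of its parent word,
# and its position within that word, so the model is nudged toward
# finishing the words it starts.
@dataclass(frozen=True)
class Syllable:
    text: str         # e.g. "bea"
    stress: str       # "1" (stressed) or "0" (unstressed)
    pos: str          # POS tag of the parent word, e.g. "NN"
    word_index: int   # 0-based position of this syllable within its word
    word_length: int  # total number of syllables in the parent word

# "beaver" -> two tokens the model should only emit in order
bea = Syllable("bea", "1", "NN", 0, 2)
ver = Syllable("ver", "0", "NN", 1, 2)
```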

Take as a simple example “Down by the Bay”. This famous song requires the singer to ad-lib words that follow the phrase “Did you ever see…” such that the rhythm matches the template [01010101010] or a derived form of this template where certain stresses are optionally omitted:

[01010101010] ("a llama eating polka dot pajamas")
[01---10-01-] ("a bear . . . combing . his hair .")
[-1---10-01-] (". us . . . riding . the bus .")
[01-10101-1-] ("a moose . with a pair of new . shoes .")

This sequence is at most 11 syllables long. The constraints for this problem would be (using 1-based coordinates):

  1. the syllable at position 1 must either be null or have POS tag DT
  2. the syllable at position 2 must have POS tag NN (can't be null)
  3. the syllable at position 10 must have POS tag NN, ADJ, or ADV (can't be null)
  4. the syllables at positions 2 and 10 must rhyme
  5. the syllables at positions 3 and 11 must either both be null or must rhyme
  6. OPTIONAL: the syllable at position 6 (and maybe 7) must be non-null
  7. for every position, if the syllable is not null, it must have the stress indicated by the template

Note that to ensure that the constrained model only produces solutions that meet the rhyme constraint (#4), the Markov order has to be at least 8, since positions 2 and 10 are eight apart and both need to fall within a single Markov context (keep in mind we're talking about syllables here, so that's not too bad).
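
For illustration, the constraints above could be encoded as a single predicate over an 11-slot sequence, reusing the Syllable tokens sketched earlier; `rhymes` is a hypothetical stand-in for a phoneme-based test built on a pronunciation dictionary such as CMUdict and is not implemented here:

```python
# Sketch: constraints 1-7 as a predicate over an 11-slot sequence
# (1-based positions; None marks an omitted slot).
def satisfies(seq, rhymes, template="01010101010"):
    assert len(seq) == 11
    def pos(i):
        return seq[i - 1]                                  # 1-based access

    return (
        (pos(1) is None or pos(1).pos == "DT")             # constraint 1
        and pos(2) is not None and pos(2).pos == "NN"      # constraint 2
        and pos(10) is not None
        and pos(10).pos in ("NN", "ADJ", "ADV")            # constraint 3
        and rhymes(pos(2), pos(10))                        # constraint 4
        and ((pos(3) is None and pos(11) is None)
             or (pos(3) is not None and pos(11) is not None
                 and rhymes(pos(3), pos(11))))             # constraint 5
        and pos(6) is not None                             # constraint 6 (optional)
        and all(s is None or s.stress == want              # constraint 7
                for s, want in zip(seq, template))
    )
```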

Why do we care? I'm generating a melody without lyrics and I need to add lyrics. Whatever melody (specifically the rhythm of the melody) I generate suggests a syllable-level rhythmic template, not a word-level rhythmic template. I need to be able to generate lyrics according to that rhythm template.
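
One crude mapping (my assumption, not something established above) from a generated melody's rhythm to a syllable-level stress template would be to mark on-the-beat notes as stressed:

```python
# Assumption: on-the-beat notes -> stressed syllables, off-beat notes -> unstressed.
def rhythm_to_template(onsets):
    """onsets: note onset times in beats, one note per syllable."""
    return "".join("1" if float(t).is_integer() else "0" for t in onsets)

print(rhythm_to_template([0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5]))  # '10101010'
```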

This would work great for shorter lyrics (e.g., poetry, maybe kids' books), but longer songs might require some adaptation. For example, songwriters often choose the words they want to rhyme first (or a set of possible rhymes) and then try to figure out how to weave them together syntactically and grammatically. The reality is that the farther apart the rhyming positions are, the more likely it seems that you'll be able to find a way to make the words fit together, so you could likely pick a pair of rhyming words, constrain the respective positions for that rhyming pair accordingly, and then generate using a Markov model.
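
A sketch of that workflow, where `rhyme_groups` and the `generate` callable are hypothetical stand-ins for a rhyming dictionary and the constrained generator discussed above:

```python
import random

def pick_rhyme_pair(rhyme_groups):
    """rhyme_groups: list of lists of mutually rhyming syllables/words."""
    group = random.choice([g for g in rhyme_groups if len(g) >= 2])
    return random.sample(group, 2)

def couplet(rhyme_groups, line_length, generate):
    a, b = pick_rhyme_pair(rhyme_groups)
    # pin the final syllable of each of the two lines (0-based positions),
    # then let the constrained Markov generator fill in everything else
    fixed = {line_length - 1: a, 2 * line_length - 1: b}
    return generate(length=2 * line_length, fixed_positions=fixed)
```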

Question: is there a way to use parts of speech in the training data to augment the Markov transition model with examples it hasn't seen? For example, if the Markov model has only ever been trained on “skin a beaver” (or the equivalent sequence of syllable tokens), then the transition matrix would look like this:

          skin    a      beaver
skin      0.0     1.0    0.0
a         0.0     0.0    1.0
beaver    0.0     0.0    0.0

But what if we represented “skin a beaver” as “single_syllable_infinitive_verb single_syllable_determiner 2_syllable_noun_1 2_syllable_noun_2”? Suddenly our model (equipped with a dictionary of suitable words and their POS tags) has more expressive power. Perhaps instead of generating a sequence of syllables, we simply generate a sequence of POS-tagged syllables. We couldn't handle rhymes directly this way, but it would increase expressive power.
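
A toy sketch of that abstraction, with illustrative class names and a tiny hand-made lexicon:

```python
import random

# "skin a beaver" abstracted into class tokens (one per syllable)
abstract = ["1syl_verb_inf", "1syl_det", "2syl_noun_syl1", "2syl_noun_syl2"]

# a tiny illustrative lexicon mapping classes back to surface words
lexicon = {
    "1syl_verb_inf": ["skin", "catch", "paint"],
    "1syl_det": ["a", "the"],
    "2syl_noun": ["beaver", "llama", "otter"],
}

def realize(class_sequence):
    """Replace each class token with a word drawn from the lexicon."""
    out, i = [], 0
    while i < len(class_sequence):
        cls = class_sequence[i]
        if cls.startswith("2syl_noun"):
            out.append(random.choice(lexicon["2syl_noun"]))
            i += 2                     # a two-syllable noun fills two slots
        else:
            out.append(random.choice(lexicon[cls]))
            i += 1
    return " ".join(out)

print(realize(abstract))   # e.g. "catch the otter" -- never seen in training
```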

There are connections to prosody here. For example, given a melody (think of the first half of the whistling part in “Don't Worry, Be Happy”) that suggests a rhythmic template [1111011010101], and a phrase like “Ring around the rosies a pocketful of rye” that has rhythmic template [101011010101], how do you match the phrase to the melody? Here are two solutions (note again how this is an alignment problem):

[1111011010101]
[1-01011010101]
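
Here's a small sketch that enumerates such alignments by choosing which melody slots stay empty ('-') and scoring each alignment by how many slots keep the stress the melody suggests; the alignment shown above comes out among the top-scoring candidates under this (admittedly crude) scoring:

```python
from itertools import combinations

def alignments(melody, phrase):
    """Yield (score, aligned) for every way to place the phrase's syllables
    into the melody's slots in order, marking skipped slots with '-'."""
    n_rests = len(melody) - len(phrase)
    for rests in combinations(range(len(melody)), n_rests):
        syllables = iter(phrase)
        aligned = "".join("-" if i in rests else next(syllables)
                          for i in range(len(melody)))
        score = sum(a == m for a, m in zip(aligned, melody))
        yield score, aligned

melody = "1111011010101"   # the whistled rhythm
phrase = "101011010101"    # "Ring around the rosies a pocketful of rye"
best = max(score for score, _ in alignments(melody, phrase))
for score, aligned in alignments(melody, phrase):
    if score == best:
        print(score, aligned)          # includes '1-01011010101' from above
```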

To Dos

To Do

Future To Do

Done