In this lab we will use maps and vectors to make relatively quick work of a problem that would normally take quite a bit of code. The kind authors of map did most of the work for us!
In this lab we will generate new, unique text that bears a strong resemblance to text that was given as input. But first, some explanation:
Most modern natural language processing (the processing of human language by computer) is founded on the statistical properties of text. We will collect simple statistics on a text and then use those statistics in an amusing way. Along the way we will learn how to process text data, collect statistics, and work with C++ maps.
Suppose you were to take the text of the US constitution (you used this in an earlier lab) and randomly select words from it, one after the other. If you do so you get something like:
noted Laws Congress affect Gorham, Treason States, of as the shall accordingly number may Compact 2 without of the Bill, President both disorderly of ourselves what or a freedom right be States Power In each to of shall the Tonnage, the 1. executive twelfth be be claiming whatsoever, of establish claims at 27 shall nor the No the in of Case present. all Legislature to the published Consequence William the been 14 the the in the States is Judicial shall Amendment not process Amendment United be Section Forces; Congress writs State 26 reduced But two disparage each the of to Emoluments the in term.
This text has the right words and in the right frequencies to look like the constitution, but it does not come close to making any sense.
Suppose instead we were to keep a table (this is called a “map” in C++) of the context in which each word appeared, like this (if a word occurs more than once in a context we list it more than once):
Context      Subsequent words
---------    ----------------
the          president president congress congress senate executive first ...
executive    branch order branch ...
of           the the all the two the ...
...          ...
The first line of this map means that the word “president” was found directly following the word “the” twice, as in “the president” and then later “the president” again. The word “congress” was found directly following the word “the” twice as well. The words “senate”, “executive” and “first” were also found after the word “the” somewhere in the document. These are all reasonable words to see after the word “the” in English text. Note that the word “an” would probably never appear in this list because in English it does not make sense to say “the an”. Likewise the second line lists words that follow the word “executive” in the constitution and the third line lists words that follow the word “of” in the input.
This structure is sometimes called a “language model”, because it tells us about how words are related to each other.
Given a language model we can generate text that follows it. This can be done as follows: Suppose that instead of picking words completely at random from the file we're mimicking, we start with a word (the “seed”) and for the next word, we randomly pick from the list of words that followed the seed word in the original text. That is, if we start with the word “of” we then pick the next word (randomly) from the list “the, the, all, the, two, the, …”. For example, suppose we were to pick “the” from this list (after all it is pretty likely).
Next we randomly pick from the list of words that followed “the”, that is from the list “president, president, congress, congress, senate, executive, first, …”. Assume for this example that we were to choose “executive” (not very likely but not impossible).
“executive” is now our context, so we pick from “branch, order, branch, …”. Suppose again, for the purposes of this example, that we randomly selected the word “branch”. Thus in this example we have generated the string “of the executive branch”, which is reasonable text.
We can continue this process for as many words as we wish.
If we do this with the constitution (and start or “seed” the process with the word “Provided”) we could get:
Provided that in rebellion, shall extend to those voted for raising Revenue to support Armies, but the Congress shall in session. If he shall be vested in the witnesses in which a Compensation, which may by law provide, transmit sealed to the office unless on their respective Houses, or Ships of Representatives shall have engaged in Payment of the United States. The validity of the President and Felonies committed within four days after they reside.
Some may find it a little juvenile, but in a very simple way we have modeled some sense of how the founding fathers used language. You can see that most of this text uses at least pairs of words in a way consistent with English grammar, even though my code has no notion of grammar. The (obviously limited) sense of grammar was “learned” from the input data.
My favorite input text is poetry from William Blake. If I use a collection of his text as input I get:
Sleep, And into my wrath, my foe: I told her languish'd head, Whose modest tresses are cover'd with tears of gold: And into my happy pipe; Sing thy perfumed garments; let our winds Kiss thy little heart doth move Silently, invisibly. I guard o'er the Bard, Who present, past, and put Thy bright mane for thee. Thou fair-hair'd angel of thine influence! Whether in the evening star does shine; The lake; speak silence with smiles, Little sorrows sit and while thou with soft kisses on Him who bore thy fair fingers; pour Thy bright In what dread grasp Dare frame thy west wind doth rest. O Earth, return! Arise from harm: If they see any weeping That we may joy to thy little heart asleep! When wolves and future, sees; Whose modest tresses are driven: They pour blessing And I made a song again;' So I see My foe outstretched beneath the Blue curtains of day.' and did grow. And the morning, turn my back to bear, The fleeces of day.' and gives His voice, Saying, 'Wrath by His light, and sleep, Or think on earth a cloud, and kissed me, And, by His voice, Saying, 'Wrath by His heat
Which is better than any poetry I ever wrote!
Write a program that reads in the words in a file and generates a table like the one shown above. You can hardcode the filename rather than prompting the user for it. You must use a C++ map for this table. You will need to print out the table so that the TA can see it. Once you have the table, use it to generate 100 words of text “like” the text you read in (you should prompt the user for a seed string).
This will be the file that the TA will use to pass-off your program (right-click to download):
Here are some other sample files you can download (if you find others, please refer them to a TA to add to the website):
It is very important in this lab to break the problem down into smaller problems and solve them one at a time. Trying to solve all the problems at once is a recipe for frustration and failure. Consider, for example, that you might solve each of the following problems in order; some of these you solved in the earlier lab:
The rule is to simply consider (preferably before coding) each step, the simpler the better, that the algorithm needs to perform. Make sure that it performs the first step before moving to the second. That way, when you get frustrated on step two, you know that you can always go back to a version that works through step one. This also helps you identify specific objectives for what you need to code next.
If you get stuck:
You may find this tutorial helpful: http://www.dreamincode.net/forums/topic/33631-c-vector-tutorial/
When you have your program working, you will need to show it to a TA.
Put yourself in the help request queue to pass off: http://taohelpqueue.appspot.com/requesthelp/YPK364KMSA5U5QF55UCL8LQ3JHCS5C.
The TA will evaluate your code based on the following criteria:
Extra credit (3 points): You can actually use more than one word of context. With two or more words of context the generated text becomes more and more realistic, up to a point. Change your code so that you can specify the number of words of context you want the program to use. Generate some text at various levels of context and comment on the differences in the generated text. I used a “queue” for the context because as I process words I need to add a new word to the front of the context and drop a word from the back of the context.