In this lab we will use maps and vectors to make relatively quick work of a problem that would normally take quite a bit of code. The kind authors of map did most of the work for us!
In this lab we will generate new, unique text that bears a strong resemblance to text that was given as input. But first, some explanation:
Most modern natural language processing (the processing of human language by computer) is founded on the statistical properties of text. We will collect simple statistics on a text and then use those statistics in an amusing way. Along the way we will learn how to process text data, collect statistics, and work with C++ maps.
Suppose you were to take the text of the US constitution (you used this in an earlier lab) and randomly select words from it, one after the other. If you do so you get something like:
noted Laws Congress affect Gorham, Treason States, of as the shall accordingly number may Compact 2 without of the Bill, President both disorderly of ourselves what or a freedom right be States Power In each to of shall the Tonnage, the 1. executive twelfth be be claiming whatsoever, of establish claims at 27 shall nor the No the in of Case present. all Legislature to the published Consequence William the been 14 the the in the States is Judicial shall Amendment not process Amendment United be Section Forces; Congress writs State 26 reduced But two disparage each the of to Emoluments the in term.
This text has the right words and in the right frequencies to look like the constitution, but it does not come close to making any sense.
Suppose instead we were to keep a table (this is called a “map” in C++) of the context in which each word appeared, like this (if a word occurs more than once in a context we list it more than once):
Context      Subsequent words
---------    ----------------
the          president president congress congress senate executive first ...
executive    branch order branch ...
of           the the all the two the ...
...          ...
The first line of this map means that the word “president” was found directly following the word “the” twice, as in “the president” and then later “the president” again. The word “congress” was found directly following the word “the” twice as well. The words “senate”, “executive” and “first” were also found after the word “the” somewhere in the document. These are all reasonable words to see after the word “the” in English text. Note that the word “an” would probably never appear in this list because in English it does not make sense to say “the an”. Likewise the second line lists words that follow the word “executive” in the constitution and the third line lists words that follow the word “of” in the input.
This structure is sometimes called a “language model”, because it tells us about how words are related to each other.
Given a language model we can generate text that follows it. This can be done as follows: Suppose that instead of picking words completely at random from the file we're mimicking, we start with a word (the “seed”) and for the next word, we randomly pick from the list of words that followed the seed word in the original text. That is, if we start with the word “of” we then pick the next word (randomly) from the list “the, the, all, the, two, the, …”. For example, suppose we were to pick “the” from this list (after all it is pretty likely).
Next we randomly pick from the list of words that followed “the”, that is from the list “president, president, congress, congress, senate, executive, first, …”. Assume for this example that we were to choose “executive” (not very likely but not impossible).
“executive” is now our context, so we pick from “branch, order, branch, …”. Suppose again, for the purposes of this example, that we randomly selected the word “branch”. Thus in this example we have generated the string “of the executive branch”, which is reasonable text.
We can continue this process for as many words as we wish.
If we do this with the constitution (and start or “seed” the process with the word “Provided”) we could get:
Provided that in rebellion, shall extend to those voted for raising Revenue to support Armies, but the Congress shall in session. If he shall be vested in the witnesses in which a Compensation, which may by law provide, transmit sealed to the office unless on their respective Houses, or Ships of Representatives shall have engaged in Payment of the United States. The validity of the President and Felonies committed within four days after they reside.
Some may find it a little juvenile, but in a very simple way we have modeled some sense of how the founding fathers used language. You can see that most of this text uses at least pairs of words in a way consistent with English grammar, even though my code has no notion of grammar. The (obviously limited) sense of grammar was “learned” from the input data.
My favorite input text is poetry from William Blake. If I use a collection of his text as input I get:
Sleep, And into my wrath, my foe: I told her languish'd head, Whose modest tresses are cover'd with tears of gold: And into my happy pipe; Sing thy perfumed garments; let our winds Kiss thy little heart doth move Silently, invisibly. I guard o'er the Bard, Who present, past, and put Thy bright mane for thee. Thou fair-hair'd angel of thine influence! Whether in the evening star does shine; The lake; speak silence with smiles, Little sorrows sit and while thou with soft kisses on Him who bore thy fair fingers; pour Thy bright In what dread grasp Dare frame thy west wind doth rest. O Earth, return! Arise from harm: If they see any weeping That we may joy to thy little heart asleep! When wolves and future, sees; Whose modest tresses are driven: They pour blessing And I made a song again;' So I see My foe outstretched beneath the Blue curtains of day.' and did grow. And the morning, turn my back to bear, The fleeces of day.' and gives His voice, Saying, 'Wrath by His light, and sleep, Or think on earth a cloud, and kissed me, And, by His voice, Saying, 'Wrath by His heat
Which is better than any poetry I ever wrote!
Write a program that reads in the words in a file and generates a table like the one shown above. You can hardcode the filename rather than prompting the user for it. You must use a C++ map for this table. You will need to print out the table so that the TA can see it. Once you have the table, use it to generate 100 words of text “like” the text you read in (you should prompt the user for a seed string).
This will be the file that the TA will use to pass-off your program (right-click to download):
Here are some other sample files you can download (if you find others, please refer them to a TA to add to the website):
It is very important in this lab to break the problem down into smaller problems and solve them one at a time. Trying to solve all the problems at once is a recipe for frustration and failure. Consider, for example, that you might solve each of the following problems in order; some of these you solved in the earlier lab:
The rule is to simply consider (preferably before coding) each step, the simpler the better, that the algorithm needs to perform. Make sure that it performs the first step before moving to the second. That way, when you get frustrated on step two, you know that you can always go back to a version that works through step one. This also helps you identify specific objectives for what you need to code next.
If you get stuck:
You may find this tutorial helpful: http://www.dreamincode.net/forums/topic/33631-c-vector-tutorial/
When you have your program working, you will need to show it to a TA.
Put yourself in the help request queue to pass off: http://taohelpqueue.appspot.com/requesthelp/YPK364KMSA5U5QF55UCL8LQ3JHCS5C.
The TA will evaluate your code based on the following criteria:
Extra credit (3 points): You can actually use more than one word of context. With two or more words of context the generated text becomes more and more realistic, up to a point. Change your code so that you can specify the number of words of context you want the program to use. Generate some text at various levels of context and comment on the differences in the generated text. I used a “queue” for the context because as I process words I need to add a new word to the front of the context and drop a word from the back of the context.