Purpose

You can often tell who the author for a book is by the selections of words used in the book. In this laboratory, you will count the number of times each word is used in a document and will print out the number of times that each of the highest used words occur.

Requirements

When you read a long document, there is a good chance that many words occur multiple times. Instead of storing each word, it may be beneficial to only store unique words, and to represent the document as a vector of pointers to the unique words.

Write a program that implements this strategy. Read a word at a time from the file (you can start by reading from cin at first). For our purposes a word is any sequence of characters delimited by space and excluding all beginning and all final punctuation marks (thus “vice-president” should be counted as “vice-president” and not “vice” and “president” and “!!hello!!” should be counted as “hello”). Words should be case insensitive (meaning “Vice” and “vice” should be counted as the same word). Keep a vector<char*> of words. If the new word is not contained in this vector, allocate memory, copy the word into it, and add a pointer in your vector<char *> to the new memory. You should also allocate a parallel vector<int> that will have the counts for each of these words. This vector<int> will contain the count for the word at that same position in the word vector.

If the word is already present, then you need to increment the count vector element at the same offset.

This will be the file that the TA will use to pass-off your program. Leave it unaltered so that your results will match the expected results (right-click to download):

U.S. Constitution

Here are some other sample files you can download (if you find other's please refer them to a TA to add to the website):

Think before programming

Inputs

The file name that you want the program to operate on. The program should read this file name and the document becomes the input.

Here is an example of reading a file into your program.

Outputs

The counts for the top 10 words in the document. For example:

Word	Count
and	512
thou	312
ye	265
..	…

It is very important in this lab to break down the problem into smaller problems and solve them one at a time. Trying to solve all the problems at once is a recipe for frustration and failure. Consider for example that you might solve each of the following problems in order:

Isolate each word in the file
Allocate memory for each word and store pointer to memory in vector
Check if word already exists in vector
etc…

The rule is to simply consider (preferably before coding) each step–the simpler the better–that the algorithm needs to perform. Make sure that it performs the first step before moving to the second. That way when you get frustrated on step two, you know that you can always go back to a model that works with step one. Also, this helps you identify specific objectives in what you need to code next.

Getting Help

If you get stuck:

First get your vector of words working. Make sure that you can add an additional word and have it appear in the correct place.
Then create a parallel vector of counts and initialize the count vector with known values to make sure that the word occurs at the same offset as the corresponding count.
Once you have your code basically working, try it with a small document where you can manually count the number of occurrences of each word.

I need more than this

You may find this tutorial helpful http://www.dreamincode.net/forums/topic/33631-c-vector-tutorial/

Pass-off procedure

When you have your program working, you will need to show it to a TA.

Put yourself in the help request queue to pass off: http://taohelpqueue.appspot.com/requesthelp/YPK364KMSA5U5QF55UCL8LQ3JHCS5C.

The TA will evaluate your code based on the following criteria:

Did you read in the file correctly? (5 points)
Did you allocate memory and add the words to the vector ? (5 points)
You have the correct counts for the top ten words in the file. (10 points)
Did you format your code as shown in the book and in class, with comment blocks? (5 points)
Test cases(10 points):
1. 3pts - test cases
2. 3pts - Correctly deal with capitalization (word counts are case insensitive)
3. 4pts - Correctly deal with surrounding punctuation

Extra credit: Change your program so that it can be given the filename for two documents and compute the top 10 words for both and then compute the difference between the two files by summing the difference in position between the same words in the two documents (3 points). For example, if the word “ubiquitous” is the 5th most frequent word in document A and the 7th most frequent word in document B, then the difference in position would be 5-7=2 for the word “ubiquitous. If “ubiquitous” does not appear in one of the documents, treat it as if it is at position 11. Sum these word differences to determine how different the two documents are. See if you can show that 2 documents from the same author have a smaller difference than two documents by different authors.

cs-142/unique-words.txt · Last modified: 2015/01/07 08:57 by ryancha

Back to top

Table of Contents