You can often tell who the author for a book is by the selections of words used in the book. In this laboratory, you will count the number of times each word is used in a document and will print out the number of times that each of the highest used words occur.
When you read a long document, there is a good chance that many words occur multiple times. Instead of storing each word, it may be beneficial to only store unique words, and to represent the document as a vector of pointers to the unique words.
Write a program that implements this strategy. Read a word at a time from the file (you can start by reading from cin at first). For our purposes a word is any sequence of characters delimited by space and excluding all beginning and all final punctuation marks (thus “vice-president” should be counted as “vice-president” and not “vice” and “president” and “!!hello!!” should be counted as “hello”). Words should be case insensitive (meaning “Vice” and “vice” should be counted as the same word). Keep a vector<char*> of words. If the new word is not contained in this vector, allocate memory, copy the word into it, and add a pointer in your vector<char *> to the new memory. You should also allocate a parallel vector<int> that will have the counts for each of these words. This vector<int> will contain the count for the word at that same position in the word vector.
If the word is already present, then you need to increment the count vector element at the same offset.
This will be the file that the TA will use to pass-off your program. Leave it unaltered so that your results will match the expected results (right-click to download):
Here are some other sample files you can download (if you find other's please refer them to a TA to add to the website):
The file name that you want the program to operate on. The program should read this file name and the document becomes the input.
Here is an example of reading a file into your program.
The counts for the top 10 words in the document. For example:
Word | Count |
---|---|
and | 512 |
thou | 312 |
ye | 265 |
.. | … |
It is very important in this lab to break down the problem into smaller problems and solve them one at a time. Trying to solve all the problems at once is a recipe for frustration and failure. Consider for example that you might solve each of the following problems in order:
The rule is to simply consider (preferably before coding) each step–the simpler the better–that the algorithm needs to perform. Make sure that it performs the first step before moving to the second. That way when you get frustrated on step two, you know that you can always go back to a model that works with step one. Also, this helps you identify specific objectives in what you need to code next.
If you get stuck:
You may find this tutorial helpful http://www.dreamincode.net/forums/topic/33631-c-vector-tutorial/
When you have your program working, you will need to show it to a TA.
Put yourself in the help request queue to pass off: http://taohelpqueue.appspot.com/requesthelp/YPK364KMSA5U5QF55UCL8LQ3JHCS5C.
The TA will evaluate your code based on the following criteria:
Extra credit: Change your program so that it can be given the filename for two documents and compute the top 10 words for both and then compute the difference between the two files by summing the difference in position between the same words in the two documents (3 points). For example, if the word “ubiquitous” is the 5th most frequent word in document A and the 7th most frequent word in document B, then the difference in position would be 5-7=2 for the word “ubiquitous. If “ubiquitous” does not appear in one of the documents, treat it as if it is at position 11. Sum these word differences to determine how different the two documents are. See if you can show that 2 documents from the same author have a smaller difference than two documents by different authors.