# Part of Speech Tagging with Hidden Markov Models

## Objectives

The objectives of this assignment are to:

• build models of sequences of random variables, in particular Markov chains and Hidden Markov Models (HMMs)
• give you an opportunity to think about and work with the sequential nature of language
• provide experience with the Viterbi algorithm

## Overview

In this project you will build a system that assigns part-of-speech tags (POS tags) to words in arbitrary English text. Your method will involve learning sequence models from labeled training data instead of coding up explicit rules.

Lecture #22 introduces the problem of sequence labeling, including part-of-speech tagging. Wikipedia gives the following summary of part-of-speech (POS) tagging:

> Schools commonly teach that there are 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. However, there are clearly many more categories and sub-categories. For nouns, the plural, possessive, and singular forms can be distinguished. In many languages words are also marked for their “case” (role as subject, object, etc.), grammatical gender, and so on, while verbs are marked for tense, aspect, and other things. Linguists distinguish parts of speech to various fine degrees, reflecting a chosen “tagging system”. In part-of-speech tagging by computer, it is typical to distinguish from 50 to 150 separate parts of speech for English. For example, NN for singular common nouns, NNS for plural common nouns, NNP for singular proper nouns (see the POS tags used in the Brown Corpus).

Slide #19 of Lecture #25 will serve as a useful reference for the language modeling task. Lecture #22 and Lecture #23 will serve as useful references for the tagging task. Furthermore, sections 15.1-15.3 of Russell and Norvig's Artificial Intelligence textbook can serve as a helpful reference for these topics. We recommend consulting these references or any other textbook or online tutorial on the topics of Markov chains and hidden Markov models (there are many good ones).

You will complete two tasks in order to build your POS tagger. First, you will build a simple language model. Second, you will build the POS tagger itself.

## Language Modeling

• Train an n-gram language model as a collection of categorical distributions; this is a learning task, involving maximum likelihood estimates of the parameters of the categorical distributions.
• Generate English-like text; this is a prediction task, involving sampling repeatedly from the categorical distributions in your model.

### Training

First, train your model by estimating the needed probabilities. Referring to R&N's ch. 15, the words in your training data are represented by the variable $X_i$ as shown in figure 15.1 part (a). Each word token depends on the word token before it (in a bigram) or, as shown in figure 15.1 part (b), on the two word tokens before it, or in general, on the $n$ words before it. In the bigram case you will need to compute the probability $P(w_i | w_{i-1})$ with which each word type appears after each word type. As a made-up example, you might estimate from data that after the word “a” the word type “dog” appears 60% of the time, “cat” 30% of the time, and “professor” 10% of the time (those numbers are invented, for illustration only). Similarly, after the word “aardvark”, the word “is” appears 70% of the time and “was” 30%. The collection of all of these conditional (categorical) distributions, one per conditioning word type, constitutes the transitions for the first-order Markov model (as shown in equation 15.1).
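As a sketch, the bigram training step can be done with nested count tables that are then normalized into categorical distributions. The function name, the nested-dictionary format, and the `<s>`/`</s>` start and stop symbols below are illustrative assumptions, not a required interface:

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """MLE estimate of P(w_i | w_{i-1}) from tokenized sentences.

    The "<s>"/"</s>" start and stop symbols are an illustrative choice.
    """
    counts = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    # Normalize each row of counts into a categorical distribution.
    model = {}
    for prev, counter in counts.items():
        total = sum(counter.values())
        model[prev] = {w: c / total for w, c in counter.items()}
    return model

model = train_bigram_model([["a", "dog", "barks"], ["a", "cat", "sleeps"]])
# On this toy corpus, P("dog" | "a") = P("cat" | "a") = 0.5.
```

Each value of `model` is one of the conditional categorical distributions described above, so its probabilities sum to 1.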

### Generation


Second, generate sentences. This is the generation phase. To begin, start with a seed word $X_0$, preferably the “start” symbol, indicating the beginning of a sentence. Then consult the conditional distributions in your Markov model to generate the next word randomly. Use your code for sampling a categorical distribution (i.e., “spin the spinner”) to choose the next word in a manner that is consistent with the transitions you learned in the training phase. Repeat this process, using each randomly chosen word as the context for choosing the next, until you choose a “stop” symbol.
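A minimal sketch of the generation loop, assuming the model is a dictionary mapping each context word to a categorical distribution `{word: probability}` (an illustrative format, not a requirement):

```python
import random

def sample_categorical(dist, rng=random):
    """'Spin the spinner': draw one outcome from a {value: probability} dict."""
    r = rng.random()
    cumulative = 0.0
    for value, p in dist.items():
        cumulative += p
        if r < cumulative:
            return value
    return value  # guard against floating-point round-off

def generate(model, start="<s>", stop="</s>", max_len=50):
    """Walk the Markov chain from the start symbol until the stop symbol.

    The "<s>"/"</s>" symbols and the max_len cutoff are illustrative choices.
    """
    words, context = [], start
    for _ in range(max_len):
        word = sample_categorical(model[context])
        if word == stop:
            break
        words.append(word)
        context = word
    return words
```

With a deterministic toy model such as `{"<s>": {"a": 1.0}, "a": {"</s>": 1.0}}`, `generate` returns `["a"]`; on a real trained model each call produces different random text.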

This task is not hard. It has been used in the introduction to programming class (CS 142). If you know how to program using maps and queues, you can do it in about 30 lines of code. It is a great context to sort out some ideas that you will need for POS tagging.

You can use the text from the Penn Treebank data, or you can find your own data. You may have fun using poetry.

Now repeat this task but for a second-order Markov process, where the transition probabilities depend on two words of context instead of just one. (See Figure 15.1 part (b).)
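One way to sketch the second-order extension is to key the count tables on the *pair* of preceding tokens. Padding with two start symbols, so that every word has a full two-token context, is an assumption here rather than something the handout prescribes:

```python
from collections import Counter, defaultdict

def train_trigram_model(sentences):
    """Second-order model: condition each word on the two preceding tokens.

    The double "<s>" padding and "</s>" stop symbol are illustrative choices.
    """
    counts = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(tokens)):
            counts[(tokens[i - 2], tokens[i - 1])][tokens[i]] += 1
    # Normalize each context's counts into a categorical distribution.
    model = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        model[context] = {w: c / total for w, c in counter.items()}
    return model
```

Generation works exactly as before, except the context carried through the loop is a tuple of the last two words rather than a single word.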

## Part of Speech Tagging

The second task is choosing the best POS tags for the tokens in arbitrary English text. This task is more challenging.

In the POS tagging task we will employ a Hidden Markov Model (HMM). See Chapter 15 again. In the case of part-of-speech tagging, the part-of-speech tag for word token $E_i$ is the hidden variable $X_i$. Thus, the token $E_i$ is the emission, observation, or evidence. (See equations 15.2 and 15.3.) This part of the project also has two parts: a training part and a labeling or “tagging” part. For the labeling part, instead of generating text our task is to predict the part-of-speech tags for all of the tokens in new, previously unseen text.

### Training

Figure 15.2 shows the basic structure of an HMM; in our case, however, the hidden states (the sequence of random variables along the top) will not be the weather but parts of speech. The evidence (the random variables hanging at the bottom) will not be a binary umbrella variable but the sequence of words in our text. Just as in the language modeling task, you will need to learn transition probabilities, but over parts of speech rather than words. The provided training data gives ordered pairs of part of speech and word token; the transition probabilities depend only on the part-of-speech half of each pair. You will also need to learn the “sensor” or “emission” model (Equation 15.2): for each part of speech, the probability of seeing each possible word. For example, if $X_i$ is “Noun”, the word token is “dog” 40% of the time, “cat” 30%, “fish” 20%, and “snake” 10%.
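A hedged sketch of the HMM training step, assuming each training sentence arrives as a list of `(word, tag)` pairs; the `<s>`/`</s>` boundary tags and the nested-dictionary output format are illustrative assumptions:

```python
from collections import Counter, defaultdict

def _normalize(table):
    """Turn nested count tables into conditional categorical distributions."""
    out = {}
    for key, counter in table.items():
        total = sum(counter.values())
        out[key] = {v: c / total for v, c in counter.items()}
    return out

def train_hmm(tagged_sentences):
    """MLE estimates of transition P(tag_i | tag_{i-1}) and emission
    P(word | tag) from sentences given as lists of (word, tag) pairs."""
    trans = defaultdict(Counter)  # previous tag -> next-tag counts
    emit = defaultdict(Counter)   # tag -> word counts
    for sent in tagged_sentences:
        prev = "<s>"              # illustrative start-of-sentence tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
        trans[prev]["</s>"] += 1  # illustrative end-of-sentence tag
    return _normalize(trans), _normalize(emit)
```

Note that the word tokens never enter the transition table and the tag sequence never enters the emission table, mirroring the division of labor described above.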

### Labeling

Once you have these probabilities you can use the Viterbi algorithm to find the most likely sequence of tags given an unlabeled sequence of text. The Viterbi algorithm is described in section 15.2.3 (for some reason, they do not tell you it is called Viterbi until the end of the section!). Then repeat this task for a second-order Markov process over the transitions, where the transition probabilities depend on two POS tags of context instead of just one. Lecture #23 includes some helpful explanation about how to extend the method to Markov order two.
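A first-order Viterbi sketch in log space, assuming nested-dictionary `trans[prev][tag]` and `emit[tag][word]` probability tables with `<s>`/`</s>` boundary tags, and a small floor probability for unseen events (a smoothing choice this handout does not prescribe):

```python
import math

def viterbi(words, tags, trans, emit, start="<s>", stop="</s>"):
    """Most likely tag sequence for `words` under a first-order HMM."""
    def logp(table, key, value, floor=1e-10):
        # Fall back to a tiny floor probability for unseen transitions/emissions.
        return math.log(table.get(key, {}).get(value, floor))

    # best[i][tag] = (log prob of the best path ending in tag, backpointer)
    best = [{t: (logp(trans, start, t) + logp(emit, t, words[0]), None)
             for t in tags}]
    for word in words[1:]:
        best.append({t: max(((best[-1][p][0] + logp(trans, p, t)
                              + logp(emit, t, word)), p) for p in tags)
                     for t in tags})
    # Fold in the transition to the stop symbol, then backtrack.
    last = max(tags, key=lambda t: best[-1][t][0] + logp(trans, t, stop))
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))
```

Working in log space avoids the numerical underflow that plain probability products suffer on long sentences; the second-order version is the same idea with pairs of tags as states.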

## Data

Use part-of-speech tagged sentences from a flattened version of the Penn Treebank Wall Street Journal data set. This data is available here. This data is split according to the convention in the NLP research community, into training, validation, and test sets.

For your reference, the set of tags used in this task is documented here: http://www.comp.leeds.ac.uk/amalgam/tagsets/upenn.html

IMPORTANT: The Penn Treebank is used under license by BYU from the Linguistic Data Consortium. We provide access to students according to the terms of the license. Please use the data only for the purposes of the project and do not share the data with anyone outside of the class.

ALSO IMPORTANT: Do not test your HMM on the same data you trained on! You MUST learn the probabilities from the “allTraining.txt” data. Test the resultant POS tagger on the test data in “devtest.txt” and report your results on the test data.

## Coding Requirements

You are required to train on a significant portion of the training set. Training on all of the data can be computationally intensive and time-consuming. Be sure to validate that your code is working on a small portion of the training data; once you are satisfied that it is working, proceed to the full training set, or at least a larger portion of it.

## Report Requirements

Please limit the non-code portion of your report to about 6 pages.

Write a report on the work you have performed. You should describe what you built, what choices you had to make, why you made the choices you did, how well they worked out, and what you might do to improve things further.

• [10 points] Clear writing is an important aspect of your report. Also, label the sections of your report. Include an introduction and conclusion. Structure your report in such a way that it is easy to read and follow.
• [5 points] Show that you obtained reasonable transition probabilities in the first order language model. Here and below, when I ask for probabilities I expect to see numbers for the training data or for some test case that you provide.
• [5 points] Show that you can generate reasonable looking random text from the first order model. Again, here and below, I expect to see generated text.
• [5 points] Show that you obtained reasonable transition probabilities in the second order language model.
• [5 points] Show that you can generate better looking random text from the second order model.
• [5 points] Show that you obtained reasonable transition and emission probabilities in the first order HMM.
• [15 points] Show that you can predict reasonable looking POS tags for new (test) text using the first order HMM. Show some sentences and their tags.
• [5 points] Show that you obtained reasonable transition and emission probabilities in the second order HMM.
• [10 points] Show that you can predict better POS tags for new (test) text using the second order HMM.
• [15 points] Include confusion matrices (described in class, or see http://en.wikipedia.org/wiki/Confusion_matrix) to evaluate your ability to predict POS tags for the first and second order models (above). Since the full tag confusion matrices are large, you could display only the most interesting parts of a confusion matrix. One possible way to do this is to implement a total threshold to filter rows or columns (so that rows and columns which are confused less than a given number of times will not be displayed). I still expect to see a matrix, not just a few isolated values.
• [10 points] Throughout, provide evidence that you looked at the specific behavior of your models, thought about what they are doing and that when you found something that was wrong, that you took appropriate action.
• [10 points] Working code.

You are also required to include at the top of page 1 of your report a clear measure (in hours) of how long it took you (each) to complete this project. Please also include in your report a short section titled “Feedback”. Reflect on your experience in the project and provide any concrete feedback you have for us that would help make this project a better learning experience.

## Submission

Submit your code and report on learning suite. Your report should be a .pdf document.