To facilitate research, we will create an open-source, redistributable library that is well documented and follows sound object-oriented design, including reusability, modularity, and uniformity. In the long run, we will spend less time coding and more time publishing. We will also use the library in teaching: students can learn the principles of Statistical Natural Language Processing by implementing the basic algorithms from first principles on top of the primitives the library provides.
Option 1:
public interface Learner<F, L> {
    Model<F, L> trainModel(
            Collection<? extends Datum<F>> trainingData,
            Collection<? extends Datum<F>> validationData);
}
Advantages
Instantiating any learner by class name is simple: no instanceof checks or special cases are necessary.
Intuitively, these three learner types are pretty much the same. Given training data, all three produce a Model.
SupervisedLearner and UnsupervisedLearner have exactly the same structure, and SemisupervisedLearner may simply be a concrete class that uses a SupervisedLearner, in which case it is difficult to see the need for different Learner interfaces.
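The instantiation-by-class-name advantage can be sketched as follows. This is a minimal, self-contained illustration, not the library's actual API: the Datum and Model bodies and the DummyLearner class are hypothetical stand-ins, and in a real codebase the class name passed in would be fully qualified.

```java
import java.util.Collection;
import java.util.List;

interface Datum<F> {}
interface Model<F, L> {}

interface Learner<F, L> {
    Model<F, L> trainModel(
            Collection<? extends Datum<F>> trainingData,
            Collection<? extends Datum<F>> validationData);
}

// A trivial learner used only to demonstrate instantiation by class name.
class DummyLearner<F, L> implements Learner<F, L> {
    public Model<F, L> trainModel(
            Collection<? extends Datum<F>> trainingData,
            Collection<? extends Datum<F>> validationData) {
        return new Model<F, L>() {};
    }
}

public class LearnerFactoryDemo {
    // With a single Learner interface, any learner can be created from a
    // class name with no instanceof checks or special cases.
    @SuppressWarnings("unchecked")
    static <F, L> Learner<F, L> newLearner(String className) throws Exception {
        return (Learner<F, L>) Class.forName(className)
                .getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        Learner<String, String> learner = newLearner("DummyLearner");
        Model<String, String> model =
                learner.trainModel(List.of(), List.of());
        System.out.println(model != null);  // prints "true"
    }
}
```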
Disadvantages
vs.
Option 2:
public interface SupervisedLearner<F, L> {
    Model<F, L> trainModel(
            Collection<? extends LabeledDatum<F, L>> trainingData,
            Collection<? extends LabeledDatum<F, L>> validationData);
}

public interface SemisupervisedLearner<F, L> {
    Model<F, L> trainModel(
            Collection<? extends LabeledDatum<F, L>> labeledData,
            Collection<? extends Datum<F>> unlabeledData,
            Collection<? extends LabeledDatum<F, L>> validationData);
}

public interface UnsupervisedLearner<F, L> {
    Model<F, L> trainModel(
            Collection<? extends Datum<F>> trainingData,
            Collection<? extends Datum<F>> validationData);
}
NOTE: do we need an L on unsupervised learning? Don't think so! –rah67
Advantages
You typically can't turn a SupervisedLearner into an UnsupervisedLearner just by sending it unlabeled data, which suggests they should be different interfaces.
If it turns out we need a semi-supervised Learner interface, it would be more convenient to split the training data into labeled and unlabeled sets.
Disadvantages
We are currently using option 1 in the CS 401R codebase, but note that all projects are supervised learning (we have taken out the word alignment project).
Is semi-supervised learning predictable enough that we could create a single concrete SemisupervisedLearner that takes a SupervisedLearner, training data, and either the percentage of data to use as labeled data or a set of mixed labeled/unlabeled data (along with, possibly, other parameters representing how much we should trust the unlabeled data)?
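One way such a concrete class could look is a self-training wrapper: train the underlying SupervisedLearner on the labeled data, use the resulting model to guess labels for the unlabeled data, and retrain on the union. This is only a sketch under assumed shapes for Datum, LabeledDatum, and Model (none of which are fully specified above); the class name SelfTrainingLearner and the single self-training round are hypothetical choices, and a trust parameter for the guessed labels is omitted.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

interface Datum<F> { F getFeatures(); }
interface LabeledDatum<F, L> extends Datum<F> { L getLabel(); }
interface Model<F, L> { L predict(Datum<F> d); }

interface SupervisedLearner<F, L> {
    Model<F, L> trainModel(
            Collection<? extends LabeledDatum<F, L>> trainingData,
            Collection<? extends LabeledDatum<F, L>> validationData);
}

// A single concrete semi-supervised learner built on any SupervisedLearner,
// via one round of self-training.
class SelfTrainingLearner<F, L> {
    private final SupervisedLearner<F, L> base;

    SelfTrainingLearner(SupervisedLearner<F, L> base) { this.base = base; }

    Model<F, L> trainModel(
            Collection<? extends LabeledDatum<F, L>> labeledData,
            Collection<? extends Datum<F>> unlabeledData,
            Collection<? extends LabeledDatum<F, L>> validationData) {
        // Train an initial model on the labeled data only.
        Model<F, L> initial = base.trainModel(labeledData, validationData);
        // Label the unlabeled data with the initial model's predictions.
        List<LabeledDatum<F, L>> union = new ArrayList<>(labeledData);
        for (Datum<F> d : unlabeledData) {
            L guess = initial.predict(d);
            union.add(new LabeledDatum<F, L>() {
                public F getFeatures() { return d.getFeatures(); }
                public L getLabel() { return guess; }
            });
        }
        // Retrain on labeled plus self-labeled data.
        return base.trainModel(union, validationData);
    }
}

public class SelfTrainingDemo {
    public static void main(String[] args) {
        // A trivial learner whose model always predicts "X", just to
        // exercise the wrapper end to end.
        SupervisedLearner<String, String> constant = (train, valid) -> d -> "X";
        LabeledDatum<String, String> seed = new LabeledDatum<String, String>() {
            public String getFeatures() { return "f1"; }
            public String getLabel() { return "X"; }
        };
        Datum<String> unlabeled = () -> "f2";
        Model<String, String> model = new SelfTrainingLearner<>(constant)
                .trainModel(List.of(seed), List.of(unlabeled), List.of());
        System.out.println(model.predict(unlabeled));  // prints "X"
    }
}
```

Whether this delegation pattern covers enough real semi-supervised methods to justify skipping a SemisupervisedLearner interface is exactly the open question above.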
Aligned sentences (SentencePair). Aligned sentences are stored as a list of English words and a list of French words. Perhaps each feature should be a pairing between an English word and a French word, i.e. Pair<String, String> (what are the performance implications?).
ARFF files with continuous (Double) and discrete (String/Enum) features. Would we create an object explicitly for this?
SpeechNBestLists (is a single feature one of the N-best hypotheses, i.e. List<String>?).
In the CS 401R codebase, we use the following weirdness: class SpeechNBestList implements Datum<SpeechNBestList>, class SentencePair implements Datum<List<String>>, etc.
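The word-pairing idea for SentencePair could look like the sketch below. It assumes a hypothetical Datum shape that exposes its features as a list (the CS 401R Datum may differ), and a minimal Pair record; note that it materializes the full cross product, |E| x |F| pairs per sentence pair, which is the performance concern raised above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal Pair type.
record Pair<A, B>(A first, B second) {}

// Hypothetical Datum shape: a datum exposes its features as a list of F.
interface Datum<F> {
    List<F> getFeatures();
}

// Each feature is a pairing between an English word and a French word,
// avoiding the Datum<SpeechNBestList> / Datum<List<String>> weirdness.
class SentencePair implements Datum<Pair<String, String>> {
    private final List<String> englishWords;
    private final List<String> frenchWords;

    SentencePair(List<String> englishWords, List<String> frenchWords) {
        this.englishWords = englishWords;
        this.frenchWords = frenchWords;
    }

    public List<Pair<String, String>> getFeatures() {
        // Full cross product: |E| x |F| features per sentence pair.
        List<Pair<String, String>> pairs = new ArrayList<>();
        for (String e : englishWords)
            for (String f : frenchWords)
                pairs.add(new Pair<>(e, f));
        return pairs;
    }
}

public class SentencePairDemo {
    public static void main(String[] args) {
        SentencePair sp = new SentencePair(
                List.of("the", "house"), List.of("la", "maison"));
        System.out.println(sp.getFeatures().size());  // prints "4"
    }
}
```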