Table of Contents

Mission Statement

To facilitate research, we will create an open-source, redistributable library that is well documented and follows sound object-oriented design principles, including reusability, modularity, and uniformity. In the long run, we will spend less time coding and more time publishing. We will also use the library for teaching: students will learn the principles of statistical natural language processing by implementing the basic algorithms from first principles on top of the primitives the library provides.

Goals

Plans

TODO

Packages

Latest Unfiled Thoughts

Differences between the codebases

Ideas

Issues

Option 1:

public interface Learner<F, L> {
	Model<F, L> trainModel(
			Collection<? extends Datum<F>> trainingData,
			Collection<? extends Datum<F>> validationData);
}

vs.

Option 2:

public interface SupervisedLearner<F, L> {
	Model<F, L> trainModel(
			Collection<? extends LabeledDatum<F, L>> trainingData,
			Collection<? extends LabeledDatum<F, L>> validationData);
}

public interface SemisupervisedLearner<F, L> {
	Model<F, L> trainModel(
			Collection<? extends LabeledDatum<F, L>> labeledData,
			Collection<? extends Datum<F>> unlabeledData,
			Collection<? extends LabeledDatum<F, L>> validationData);
}

public interface UnsupervisedLearner<F, L> {
	Model<F, L> trainModel(
			Collection<? extends Datum<F>> trainingData,
			Collection<? extends Datum<F>> validationData);
}

NOTE: do we need an L on unsupervised learning? Don't think so! –rah67

We currently use Option 1 in the CS 401R codebase, but note that all of its projects are supervised learning (we have taken out the word alignment project).
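For comparison, here is a minimal sketch of how Option 2's SupervisedLearner might be implemented and called. Datum, LabeledDatum, and Model are not defined on this page, so the method names getFeatures, getLabel, and predict are assumptions, and SimpleLabeledDatum, MajorityLabelLearner, and Option2Demo are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Assumed shapes: this page does not define Datum, LabeledDatum, or Model.
interface Datum<F> {
    Collection<F> getFeatures();
}

interface LabeledDatum<F, L> extends Datum<F> {
    L getLabel();
}

interface Model<F, L> {
    L predict(Datum<F> datum);
}

// Option 2's supervised interface, as written above.
interface SupervisedLearner<F, L> {
    Model<F, L> trainModel(
            Collection<? extends LabeledDatum<F, L>> trainingData,
            Collection<? extends LabeledDatum<F, L>> validationData);
}

// Hypothetical concrete datum holding a feature list and a label.
final class SimpleLabeledDatum<F, L> implements LabeledDatum<F, L> {
    private final List<F> features;
    private final L label;

    SimpleLabeledDatum(List<F> features, L label) {
        this.features = features;
        this.label = label;
    }

    @Override
    public Collection<F> getFeatures() {
        return features;
    }

    @Override
    public L getLabel() {
        return label;
    }
}

// Hypothetical baseline: always predicts the most frequent training label.
// The validation data is ignored by this learner.
final class MajorityLabelLearner<F, L> implements SupervisedLearner<F, L> {
    @Override
    public Model<F, L> trainModel(
            Collection<? extends LabeledDatum<F, L>> trainingData,
            Collection<? extends LabeledDatum<F, L>> validationData) {
        Map<L, Integer> counts = new HashMap<>();
        L best = null;
        int bestCount = 0;
        for (LabeledDatum<F, L> datum : trainingData) {
            // merge returns the updated count for this label.
            int count = counts.merge(datum.getLabel(), 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = datum.getLabel();
            }
        }
        final L majority = best;
        return datum -> majority;
    }
}

final class Option2Demo {
    public static void main(String[] args) {
        List<SimpleLabeledDatum<String, String>> train = Arrays.asList(
                new SimpleLabeledDatum<String, String>(Arrays.asList("great", "film"), "pos"),
                new SimpleLabeledDatum<String, String>(Arrays.asList("dull", "plot"), "neg"),
                new SimpleLabeledDatum<String, String>(Arrays.asList("loved", "it"), "pos"));
        SupervisedLearner<String, String> learner = new MajorityLabelLearner<>();
        Model<String, String> model = learner.trainModel(train, train);
        System.out.println(model.predict(train.get(1))); // prints "pos"
    }
}
```

Under these assumptions, the learner reads labels through getLabel() without a cast, and passing a collection of plain Datum<F> as training data does not compile. With Option 1, the same learner would receive Collection<? extends Datum<F>> and would have to downcast each datum to reach its label.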

  1. Aligned sentences (SentencePair). Aligned sentences are stored as a list of English words and a list of French words. Perhaps each feature should be a pairing between an English word and a French word, i.e. Pair<String, String> (what are the performance implications?).
  2. ARFF files with continuous (Double) and discrete (String/Enum) features. Would we create an object explicitly for this?
  3. SpeechNBestLists (is a single feature one of the N best, i.e. List<String>?).
    • In the CS 401R codebase, we use the following weirdness: class SpeechNBestList implements Datum<SpeechNBestList>, class SentencePair implements Datum<List<String>>, etc.
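To put a number on the performance question in item 1, here is a minimal sketch of SentencePair exposing Pair<String, String> features. It assumes that "pairing" means every English word paired with every French word, which is only one possible reading. The Datum method getFeatures, the Pair class, and SentencePairDemo are assumptions for illustration and are not taken from the CS 401R codebase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Assumed shape: this page does not define Datum.
interface Datum<F> {
    Collection<F> getFeatures();
}

// Hypothetical immutable pair.
final class Pair<A, B> {
    final A first;
    final B second;

    Pair(A first, B second) {
        this.first = first;
        this.second = second;
    }
}

// Sketch of the alternative floated in item 1: one feature per
// (English word, French word) combination.
final class SentencePair implements Datum<Pair<String, String>> {
    private final List<String> englishWords;
    private final List<String> frenchWords;

    SentencePair(List<String> englishWords, List<String> frenchWords) {
        this.englishWords = englishWords;
        this.frenchWords = frenchWords;
    }

    @Override
    public Collection<Pair<String, String>> getFeatures() {
        // Materializes englishWords.size() * frenchWords.size() pairs.
        List<Pair<String, String>> pairs =
                new ArrayList<>(englishWords.size() * frenchWords.size());
        for (String english : englishWords) {
            for (String french : frenchWords) {
                pairs.add(new Pair<>(english, french));
            }
        }
        return pairs;
    }
}

final class SentencePairDemo {
    public static void main(String[] args) {
        SentencePair pair = new SentencePair(
                Arrays.asList("the", "black", "cat"),
                Arrays.asList("le", "chat"));
        System.out.println(pair.getFeatures().size()); // prints 6
    }
}
```

Under this reading, a sentence pair with n English words and m French words yields n * m feature objects on every call, so the cost grows with the product of the sentence lengths. Computing the pairs lazily or caching them would reduce allocation, at the price of a more complicated Datum implementation.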

Interfaces

Classes

Util

Feature Request

Future Work

NLP Lab