To facilitate research, we will create an open-source, redistributable library that is well documented and follows sound object-oriented design, including reusability, modularity, and uniformity. In the long run, we will spend less time coding and more time publishing. We will also use the library in teaching: students can learn the principles of Statistical Natural Language Processing by implementing the basic algorithms from first principles on top of the primitives the library provides.
Option 1:
public interface Learner<F, L> {
    Model<F, L> trainModel(
            Collection<? extends Datum<F>> trainingData,
            Collection<? extends Datum<F>> validationData);
}
Advantages
Instantiating any learner by class name is simple: no instanceof checks or special cases are necessary.
Intuitively, these three learner types are pretty much the same. Given training data, all three produce a Model.
SupervisedLearner and UnsupervisedLearner have exactly the same structure, and SemisupervisedLearner may simply be a concrete class that uses a SupervisedLearner, in which case it is difficult to see the need for different Learner interfaces.
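The instantiation-by-class-name advantage can be sketched as follows. This is a minimal, self-contained illustration, not the library's actual API: the Datum and Model bodies and the DummyLearner class are hypothetical stand-ins, and in a real codebase the class name passed in would be fully qualified.

```java
import java.util.Collection;
import java.util.List;

interface Datum<F> {}
interface Model<F, L> {}

interface Learner<F, L> {
    Model<F, L> trainModel(
            Collection<? extends Datum<F>> trainingData,
            Collection<? extends Datum<F>> validationData);
}

// A trivial learner used only to demonstrate instantiation by class name.
class DummyLearner<F, L> implements Learner<F, L> {
    public Model<F, L> trainModel(
            Collection<? extends Datum<F>> trainingData,
            Collection<? extends Datum<F>> validationData) {
        return new Model<F, L>() {};
    }
}

public class LearnerFactoryDemo {
    // With a single Learner interface, any learner can be created from a
    // class name with no instanceof checks or special cases.
    @SuppressWarnings("unchecked")
    static <F, L> Learner<F, L> newLearner(String className) throws Exception {
        return (Learner<F, L>) Class.forName(className)
                .getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        Learner<String, String> learner = newLearner("DummyLearner");
        Model<String, String> model =
                learner.trainModel(List.of(), List.of());
        System.out.println(model != null);  // prints "true"
    }
}
```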
Disadvantages
vs.
Option 2:
public interface SupervisedLearner<F, L> {
    Model<F, L> trainModel(
            Collection<? extends LabeledDatum<F, L>> trainingData,
            Collection<? extends LabeledDatum<F, L>> validationData);
}

public interface SemisupervisedLearner<F, L> {
    Model<F, L> trainModel(
            Collection<? extends LabeledDatum<F, L>> labeledData,
            Collection<? extends Datum<F>> unlabeledData,
            Collection<? extends LabeledDatum<F, L>> validationData);
}

public interface UnsupervisedLearner<F, L> {
    Model<F, L> trainModel(
            Collection<? extends Datum<F>> trainingData,
            Collection<? extends Datum<F>> validationData);
}
NOTE: do we need an L on unsupervised learning? Don't think so! –rah67
Advantages
You typically can't turn a SupervisedLearner into an UnsupervisedLearner just by sending it unlabeled data, which suggests they should be different interfaces.
If it turns out we need a semi-supervised Learner interface, it would be more convenient to split the training data into labeled and unlabeled sets.
Disadvantages
We are currently using option 1 in the CS 401R codebase, but note that all projects are supervised learning (we have taken out the word alignment project).
Is semi-supervised learning predictable enough that we could create a single concrete SemisupervisedLearner that takes a SupervisedLearner, training data, and either the percentage of data to use as labeled data or a set of mixed labeled/unlabeled data (along with, possibly, other parameters representing how much we should trust the unlabeled data)?
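One way such a concrete class could look is a self-training wrapper: train the underlying SupervisedLearner on the labeled data, use the resulting model to guess labels for the unlabeled data, and retrain on the union. This is only a sketch under assumed shapes for Datum, LabeledDatum, and Model (none of which are fully specified above); the class name SelfTrainingLearner and the single self-training round are hypothetical choices, and a trust parameter for the guessed labels is omitted.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

interface Datum<F> { F getFeatures(); }
interface LabeledDatum<F, L> extends Datum<F> { L getLabel(); }
interface Model<F, L> { L predict(Datum<F> d); }

interface SupervisedLearner<F, L> {
    Model<F, L> trainModel(
            Collection<? extends LabeledDatum<F, L>> trainingData,
            Collection<? extends LabeledDatum<F, L>> validationData);
}

// A single concrete semi-supervised learner built on any SupervisedLearner,
// via one round of self-training.
class SelfTrainingLearner<F, L> {
    private final SupervisedLearner<F, L> base;

    SelfTrainingLearner(SupervisedLearner<F, L> base) { this.base = base; }

    Model<F, L> trainModel(
            Collection<? extends LabeledDatum<F, L>> labeledData,
            Collection<? extends Datum<F>> unlabeledData,
            Collection<? extends LabeledDatum<F, L>> validationData) {
        // Train an initial model on the labeled data only.
        Model<F, L> initial = base.trainModel(labeledData, validationData);
        // Label the unlabeled data with the initial model's predictions.
        List<LabeledDatum<F, L>> union = new ArrayList<>(labeledData);
        for (Datum<F> d : unlabeledData) {
            L guess = initial.predict(d);
            union.add(new LabeledDatum<F, L>() {
                public F getFeatures() { return d.getFeatures(); }
                public L getLabel() { return guess; }
            });
        }
        // Retrain on labeled plus self-labeled data.
        return base.trainModel(union, validationData);
    }
}

public class SelfTrainingDemo {
    public static void main(String[] args) {
        // A trivial learner whose model always predicts "X", just to
        // exercise the wrapper end to end.
        SupervisedLearner<String, String> constant = (train, valid) -> d -> "X";
        LabeledDatum<String, String> seed = new LabeledDatum<String, String>() {
            public String getFeatures() { return "f1"; }
            public String getLabel() { return "X"; }
        };
        Datum<String> unlabeled = () -> "f2";
        Model<String, String> model = new SelfTrainingLearner<>(constant)
                .trainModel(List.of(seed), List.of(unlabeled), List.of());
        System.out.println(model.predict(unlabeled));  // prints "X"
    }
}
```

Whether this delegation pattern covers enough real semi-supervised methods to justify skipping a SemisupervisedLearner interface is exactly the open question above.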
Aligned sentences (SentencePair). Aligned sentences are stored as a list of English words and a list of French words. Perhaps each feature should be a pairing between an English word and a French word, i.e. Pair<String, String> (what are the performance implications?).
ARFF files with continuous (Double) and discrete (String/Enum) features. Would we create an object explicitly for this?
SpeechNBestLists (is a single feature one of the N-best hypotheses, i.e. List<String>?).
In the CS 401R codebase, we use the following weirdness: class SpeechNBestList implements Datum<SpeechNBestList>, class SentencePair implements Datum<List<String>>, etc.
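The word-pairing idea for SentencePair could look like the sketch below. It assumes a hypothetical Datum shape that exposes its features as a list (the CS 401R Datum may differ), and a minimal Pair record; note that it materializes the full cross product, |E| x |F| pairs per sentence pair, which is the performance concern raised above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal Pair type.
record Pair<A, B>(A first, B second) {}

// Hypothetical Datum shape: a datum exposes its features as a list of F.
interface Datum<F> {
    List<F> getFeatures();
}

// Each feature is a pairing between an English word and a French word,
// avoiding the Datum<SpeechNBestList> / Datum<List<String>> weirdness.
class SentencePair implements Datum<Pair<String, String>> {
    private final List<String> englishWords;
    private final List<String> frenchWords;

    SentencePair(List<String> englishWords, List<String> frenchWords) {
        this.englishWords = englishWords;
        this.frenchWords = frenchWords;
    }

    public List<Pair<String, String>> getFeatures() {
        // Full cross product: |E| x |F| features per sentence pair.
        List<Pair<String, String>> pairs = new ArrayList<>();
        for (String e : englishWords)
            for (String f : frenchWords)
                pairs.add(new Pair<>(e, f));
        return pairs;
    }
}

public class SentencePairDemo {
    public static void main(String[] args) {
        SentencePair sp = new SentencePair(
                List.of("the", "house"), List.of("la", "maison"));
        System.out.println(sp.getFeatures().size());  // prints "4"
    }
}
```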