Mission Statement
In an effort to facilitate research, we will create an open-source, redistributable library that is well documented and follows sound object-oriented design, including reusability, modularity, and uniformity. In the long run, we will spend less time coding and more time publishing. We will also use the library to help students learn the principles of Statistical Natural Language Processing: the library provides the primitives, while students implement the basic algorithms from first principles.
Goals
Facilitate collaboration within groups in the lab
Scalability
Facilitate research
Be pedagogically useful
Provide clear, concise documentation
Provide the framework for scriptable, reproducible experiments
Enforce portability such that all code is both buildable and runnable on all lab machines (including Windows boxes, esp. Perplexity) and Marylou 4
Facilitate error analysis and feature engineering
Allow for the possibility of releasing open source projects based on the library
Clearly delineate Berkeley and BYU libraries, minimizing the dependency on the Berkeley library for production code
Plans
Strict SVN usage and code reviews
Follow sound OO design, including reusability, modularity, and uniformity
Distribute the library to students as jar(s) in which the actual implementations are hidden behind abstract classes and/or interfaces. Additionally, single-project “reference implementations” should be available as “add-on” jars for students who do not finish labs that later labs depend on.
Code distributed to students will match the lectures
Make the javadocs publicly and easily accessible
Code originating from Dan Klein's original base will be placed in its own namespace, will include proper attribution, and will house the official GPL license
An XML file will be used to manage configuration details
All code will be buildable and runnable through one or more ANT scripts
Javadocs will be required for every type and method
All applications will be built around the MVC pattern, to facilitate the use of GUIs and command-line programs
GUI-based tools will be built, especially to facilitate error analysis and feature engineering
TODO
Branch Statistical NLP repository
Write code for Datum, Dataset, Data Manager in said branch
Refactor/Adapt old code in Stat NLP repository involving vectors to use new classes
Push Clustering into experimental branch
Retire Clustering repository
Adapt remaining classes in Stat NLP
Create/Adapt ANT build script that extracts appropriate jars for each course
Packages
Experimentation
Data & Datasets
ML Primitives
Features
Optimization
Math
Util (Data Structs)
Stat Primitives
Applications
POS Tagger
PCFG Parser
NP classifier
Word Alignment
Latest Unfiled Thoughts
I wonder whether making some abstract base learner that takes datums one-by-one is a good idea (it would eliminate the need to write the same for loop over and over)
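The idea above might look like the following sketch; the class and method names (AbstractOnlineLearner, observe, train) are hypothetical, not part of the current codebase:

```java
import java.util.Collection;

// Hypothetical sketch: a base class that owns the "for each datum" loop,
// so concrete learners only implement the per-datum update.
abstract class AbstractOnlineLearner<D> {
    // Subclasses define how a single datum updates the learner's state.
    protected abstract void observe(D datum);

    // The shared loop that would otherwise be rewritten in every learner.
    public final void train(Collection<? extends D> trainingData) {
        for (D datum : trainingData) {
            observe(datum);
        }
    }
}

// Toy concrete learner: counts datums, standing in for a real update rule.
class CountingLearner extends AbstractOnlineLearner<String> {
    int seen = 0;

    @Override
    protected void observe(String datum) {
        seen++;
    }
}
```

The open question is whether enough learners fit this one-datum-at-a-time shape to justify the extra class in the hierarchy.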
Here's a good solution to the debate over whether trainModel should include a validationData parameter. Usually you don't need it. Frequently, when it is present, you want to fold it into the training data. Some algorithms require it, and for those you almost always want to retrain on the full data set while keeping the tuned parameters fixed. The proposed solution: the user specifies the amount of data to hold out for the validation split (default: 0). All Learners are required to implement trainModel(trainingData, validationData), but algorithms that do not require validation data should ignore it (they should not fold it in, because there should never be validation data for them). We will also create an interface, “retrainModel”, which takes a full training set and a previous model and is called when validation data is present. Finally, we can provide an abstract base class that always concatenates the training and validation data and passes the result to the abstract method trainModel(trainingData)
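The abstract base class mentioned in the last sentence could be sketched as follows; the names (ValidationFoldingLearner, the D/M type parameters) are hypothetical placeholders for whatever Learner/Model types we settle on:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical sketch: learners that have a sensible behavior for folding
// validation data back in extend this class and implement only the
// single-collection method; the two-argument trainModel concatenates the
// validation split into the training data before delegating.
abstract class ValidationFoldingLearner<D, M> {
    // Subclasses implement training over a single combined collection.
    protected abstract M trainModel(Collection<? extends D> trainingData);

    // The two-argument form required of all Learners.
    public final M trainModel(Collection<? extends D> trainingData,
                              Collection<? extends D> validationData) {
        List<D> all = new ArrayList<>(trainingData);
        if (validationData != null) {
            all.addAll(validationData);
        }
        return trainModel(all);
    }
}

// Toy subclass for illustration: the "model" is just the data size.
class SizeLearner extends ValidationFoldingLearner<String, Integer> {
    @Override
    protected Integer trainModel(Collection<? extends String> trainingData) {
        return trainingData.size();
    }
}
```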
It would be nice if the DataManager or Dataset class had a lazy “getVocab()” (getSupport???). This is particularly useful in the EMAble interface.
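One way to make getVocab() lazy is to compute and cache the set on first request. This sketch treats the vocabulary as the distinct datums for simplicity; the real method would likely collect distinct features, and the class name SimpleDataset is hypothetical:

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of a lazy getVocab() (getSupport?): the vocabulary
// is computed on first call and cached for subsequent calls.
class SimpleDataset<F> {
    private final Collection<F> data;
    private Set<F> vocab; // null until first getVocab() call

    SimpleDataset(Collection<F> data) {
        this.data = data;
    }

    public Set<F> getVocab() {
        if (vocab == null) {
            vocab = new LinkedHashSet<>(data); // computed once, lazily
        }
        return vocab;
    }
}
```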
I wonder if smoothing and/or open vocab models are particular to multinomial models rather than finite discrete distributions.
Our dataset class should (optionally) be indexed, or at least have a “getIndex” method
Data that is clustered should be a special case of LabeledDatum where we have the “goldLabel”, “assignedLabel” (and possibly distributions over labels, but this is probably best obtained through the model, rather than permanently stored with the datum for most practical purposes).
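The clustered-data special case above could take the following shape; the interface and class names (ClusteredDatum, BasicClusteredDatum) are hypothetical and would need to align with whatever LabeledDatum ends up looking like:

```java
import java.util.List;

// Hypothetical sketch: a clustered datum carries both the gold label and
// the label assigned by clustering; distributions over labels would come
// from the model rather than being stored on the datum.
interface ClusteredDatum<F, L> {
    List<F> getFeatures();
    L getGoldLabel();
    L getAssignedLabel();
}

// Minimal immutable implementation for illustration.
class BasicClusteredDatum<F, L> implements ClusteredDatum<F, L> {
    private final List<F> features;
    private final L goldLabel;
    private final L assignedLabel;

    BasicClusteredDatum(List<F> features, L goldLabel, L assignedLabel) {
        this.features = features;
        this.goldLabel = goldLabel;
        this.assignedLabel = assignedLabel;
    }

    public List<F> getFeatures() { return features; }
    public L getGoldLabel() { return goldLabel; }
    public L getAssignedLabel() { return assignedLabel; }
}
```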
I question the need for Dan's interface that gives back indexed data and the full data set. We should return a model and use the model to label the data
Differences between the codebases
Dan's DataTransformer = Robbie's FeatureSelector + FeatureExtractor
Recommendation: separate the processes. (1) First do feature selection (this is your state). (2) Then implement the transform for a SINGLE datum, given the state. (3) The transform of a dataset will then be the same across all “DataTransformers”
The data transformer removes all words, so
Data sets are Lists in Dan's code, Collections in Robbie's
Having a factory where you manually have to add each type of classifier is bad design (not extensible when used as a library)
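An extensible alternative to a hand-maintained factory is to instantiate learners by fully qualified class name via reflection, so library users can plug in their own classes without editing a central switch. The helper name LearnerLoader below is hypothetical:

```java
// Hypothetical sketch: instantiate any class with a no-arg constructor by
// name, checked against an expected supertype. Users of the library can
// then add new learners without touching any central factory.
class LearnerLoader {
    static <T> T newInstance(String className, Class<T> type) {
        try {
            Class<?> clazz = Class.forName(className);
            // Requires a public no-arg constructor on the target class.
            return type.cast(clazz.getDeclaredConstructor().newInstance());
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + className, e);
        }
    }
}
```

This also supports the point under Option 1 below: a uniform Learner interface makes instantiation by class name simple, with no special cases per learner.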
Ideas
Have an interface (or other marker) to the effect of “parameterized” or “tunable” which has hooks to obtain the relevant parameters. The basic idea is:
(1) Why re-implement Powell's method or other tuning algorithms in every learner that needs them? Do it outside
(2) Why do cross-validation within every learner that needs it? We'd rather have the data manager repeat this process
One consideration is how to summarize statistics for n-way cross-fold validation (e.g. “average number of iterations until convergence”)
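The “tunable” marker idea could be as small as a pair of hooks for reading and writing the free parameters, so an external tuner (Powell's method, grid search, etc.) never needs to know the learner's internals. Interface and class names here (Tunable, SmoothedCounter) are hypothetical:

```java
// Hypothetical sketch of the "tunable" marker interface: expose the free
// parameters as a flat vector so tuning lives outside the learner.
interface Tunable {
    double[] getTunableParameters();
    void setTunableParameters(double[] params);
}

// Toy learner with a single smoothing parameter, for illustration only.
class SmoothedCounter implements Tunable {
    double smoothing = 1.0;

    public double[] getTunableParameters() {
        return new double[] { smoothing };
    }

    public void setTunableParameters(double[] params) {
        smoothing = params[0];
    }
}
```

An external tuner would loop: propose a parameter vector, call setTunableParameters, train and evaluate, repeat; summarizing statistics across cross-validation folds remains the open question noted above.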
Issues
Option 1:
public interface Learner<F, L> {
Model<F, L> trainModel(
Collection<? extends Datum<F>> trainingData,
Collection<? extends Datum<F>> validationData);
}
Advantages
Instantiating any learner by class name is simple: no instanceof or special cases are necessary.
Intuitively, these three are pretty much the same. Given training data, all three types produce a Model.
SupervisedLearner and UnsupervisedLearner have exactly the same structure. SemisupervisedLearner may simply be a concrete class that uses a SupervisedLearner, in which case it is difficult to see the need for different interfaces for the Learners.
Disadvantages
vs.
Option 2:
public interface SupervisedLearner<F, L> {
Model<F, L> trainModel(
Collection<? extends LabeledDatum<F, L>> trainingData,
Collection<? extends LabeledDatum<F, L>> validationData);
}
public interface SemisupervisedLearner<F, L> {
Model<F, L> trainModel(
Collection<? extends LabeledDatum<F, L>> labeledData,
Collection<? extends Datum<F>> unlabeledData,
Collection<? extends LabeledDatum<F, L>> validationData);
}
public interface UnsupervisedLearner<F, L> {
Model<F, L> trainModel(
Collection<? extends Datum<F>> trainingData,
Collection<? extends Datum<F>> validationData);
}
NOTE: do we need an L on unsupervised learning? Don't think so! –rah67
Advantages
You typically can't turn a SupervisedLearner into an UnsupervisedLearner just by sending it unlabeled data, which suggests they should be different interfaces
If it turns out we need a semi-supervised Learner interface, it would be more convenient to split the training data into labeled and unlabeled sets.
Disadvantages
We are currently using option 1 in the CS 401R codebase, but note that all projects are supervised learning (we have taken out the word alignment project).
Is semi-supervised learning predictable enough to create a single concrete SemisupervisedLearner that takes a SupervisedLearner, trainingData, and either a percentage of the data to use as labeled data or a set of mixed labeled/unlabeled data (along with possibly other parameters representing how much to trust the unlabeled data)?
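One shape the single concrete wrapper could take is self-training: train on the labeled data, use the resulting model to label the unlabeled data, then retrain on the union. The sketch below is generic over a supervised training function and a labeling function; all names (SelfTrainingLearner, the D/P/M type parameters) are hypothetical, and a real version would add the trust parameters mentioned above:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical self-training sketch. Types: D = unlabeled datum,
// P = labeled datum, M = model. Any SupervisedLearner whose model can
// label a datum could be plugged in via the two function arguments.
class SelfTrainingLearner<D, P, M> {
    private final Function<Collection<P>, M> train; // supervised learner
    private final BiFunction<M, D, P> label;        // model labels one datum

    SelfTrainingLearner(Function<Collection<P>, M> train,
                        BiFunction<M, D, P> label) {
        this.train = train;
        this.label = label;
    }

    M trainModel(Collection<P> labeledData, Collection<D> unlabeledData) {
        M initial = train.apply(labeledData);     // 1. train on labeled data
        List<P> all = new ArrayList<>(labeledData);
        for (D d : unlabeledData) {
            all.add(label.apply(initial, d));     // 2. label the unlabeled data
        }
        return train.apply(all);                  // 3. retrain on the union
    }
}
```

If this shape holds for the semi-supervised methods we care about, it supports the Option 1 view that SemisupervisedLearner need not be a separate interface.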
Aligned sentences (SentencePair). Aligned sentences are stored as a list of English words and a list of French words. Perhaps each feature should be a pairing between English and French words, i.e. Pair&lt;String, String&gt; (what are the performance implications?).
ARFF files with continuous (Double) and discrete (String/Enum) features. Would we create an object explicitly for this?
SpeechNBestLists (is a single feature one of the N best, i.e. List<String>?).
In the CS 401R codebase, we use the following weirdness: class SpeechNBestList implements Datum&lt;SpeechNBestList&gt;, class SentencePair implements Datum&lt;List&lt;String&gt;&gt;, etc.
Interfaces
Classes
SparseDatum<F>
ArrayListDatum<F>, LabeledArrayListDatum<F> (better performance than BasicLabeledDatum)
SimpleDatum<F>?, SimpleLabeledDatum<F>?
Util
Feature Request
Future Work