## Mission Statement

In an effort to facilitate research, we will create an open-source, redistributable library that is well-documented and follows sound object-oriented design principles, including reusability, modularity, and uniformity. In the long run, we will spend less time coding and more time publishing. We will also use the library to help students learn the principles of Statistical Natural Language Processing: they will use the primitives provided by the library while implementing the basic algorithms from first principles.

## Goals

• Facilitate collaboration within groups in the lab
• Scalability
• Facilitate research
• Be pedagogically useful
• Provide clear, concise documentation
• Provide the framework for scriptable, reproducible experiments
• Enforce portability such that all code is both buildable and runnable on all lab machines (including Windows boxes, esp. Perplexity) and Marylou 4
• Facilitate error analysis and feature engineering
• Allow for the possibility of releasing open source projects based on the library
• Clearly delineate Berkeley and BYU libraries, minimizing the dependency on the Berkeley library for production code

## Plans

• Strict SVN usage and code reviews
• Distribute the library to students as jar(s) in which the actual implementations are hidden through the use of abstract classes and/or interfaces. Furthermore, additional single-project “reference implementations” should be available as “add-on” jars for students who do not finish labs that later labs depend on.
• Code distributed to students will match the lectures
• Make the javadocs publicly and easily accessible
• Code originating from Dan Klein's original base will be placed in its own namespace, will include proper attribution, and will house the official GPL license
• An XML file will be used to manage configuration details
• All code will be buildable and runnable through one or more Ant scripts
• Javadocs will be required for every type and method
• All applications will be built around the MVC pattern, to facilitate the use of GUIs and command-line programs
• GUI-based tools will be built, especially to facilitate error analysis and feature engineering

## TODO

• Branch Statistical NLP repository
• Write code for Datum, Dataset, Data Manager in said branch
• Refactor/Adapt old code in Stat NLP repository involving vectors to use new classes
• Push Clustering into experimental branch
• Retire Clustering repository
• Adapt remaining classes in Stat NLP
• Create/Adapt ANT build script that extracts appropriate jars for each course

## Packages

• Experimentation
• Data & Datasets
• Transforms
• ML Primitives
• Features
• Feature Extraction/Transform
• Feature Selection & Reduction
• Dimensionality Reduction
• Optimization
• Math
• Util (Data Structs)
• Stat Primitives
• Applications
• POS Tagger
• PCFG Parser
• NP classifier
• Word Alignment

## Latest Unfiled Thoughts

• I wonder whether making some abstract base learner that takes datums one-by-one is a good idea or not (it would eliminate the need to write the same for loop over and over)
• Here's a good solution to the debate over whether trainModel should include a validationData parameter. Usually you don't need it; frequently, you want to fold it in if it is present; and some algorithms require it. For those algorithms, you almost always want to retrain on the full data set afterward, with the parameters fixed. The solution: the user specifies the amount of data for the validation split (default: 0). All Learners are required to implement trainModel(trainingData, validationData), but algorithms that do not require validation data should ignore it (don't slurp it in, because there should never be validation data in that case). We will also create an interface, retrainModel, which takes the full training set and a previous model, and which is called when there is validation data. Finally, we can provide an abstract base class that always concatenates the training and validation data and passes the result to the abstract method trainModel(trainingData)
• It would be nice if the DataManager or Dataset class had a lazy “getVocab()” (getSupport???). This is particularly useful in the EMAble interface.
• I wonder if smoothing and/or open vocab models are particular to multinomial models rather than finite discrete distributions.
• Our dataset class should (optionally) be indexed, or at least have a “getIndex” method
• Data that is clustered should be a special case of LabeledDatum where we have the “goldLabel”, “assignedLabel” (and possibly distributions over labels, but this is probably best obtained through the model, rather than permanently stored with the datum for most practical purposes).
• I question the need for Dan's interface that gives back indexed data and the full data set. We should return a model and use the model to label the data
• Why does this interface return a map to a set??? This necessitates a dataset that uses a collection (I also had to add ? extends Datum because his is specified to be LabeledDatum)
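The abstract base class floated in the trainModel/validationData bullet might look like the following minimal sketch. The interface shapes (Datum, Model, Learner) and all method names here are assumptions for illustration, not settled API:

```java
import java.util.*;

// Minimal stand-ins for the interfaces under discussion (shapes assumed).
interface Datum<F> { List<F> getFeatures(); }
interface Model<F, L> { L getLabel(Datum<F> datum); }
interface Learner<F, L> {
    Model<F, L> trainModel(Collection<? extends Datum<F>> trainingData,
                           Collection<? extends Datum<F>> validationData);
}

// Proposed base class: learners that have no use for validation data extend
// this, and the base class folds the validation split back into training
// before calling the one-argument trainModel.
abstract class AbstractLearner<F, L> implements Learner<F, L> {
    public Model<F, L> trainModel(Collection<? extends Datum<F>> trainingData,
                                  Collection<? extends Datum<F>> validationData) {
        Collection<Datum<F>> all = new ArrayList<>(trainingData);
        if (validationData != null) all.addAll(validationData);
        return trainModel(all);
    }

    protected abstract Model<F, L> trainModel(Collection<? extends Datum<F>> data);
}
```

Learners that genuinely need the validation split would bypass this base class and implement the two-argument method directly.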

## Differences between the codebases

• Dan's DataTransformer = Robbie's FeatureSelector + FeatureExtractor
• Recommendation: separate the processes. (1) First do feature selection (this is your state). (2) Then implement the transform for a SINGLE datum, given the state. (3) The transform of a dataset will then be the same across all “DataTransformers”
• The data transformer removes all words, so
• Data sets are Lists in Dan's code, Collections in Robbie's
• Collection reflects the real world (order doesn't matter). For managing the dataset, a List is probably more convenient
• Having a factory where you manually have to add each type of classifier is bad design (not extensible when used as a library)
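The recommended selection/transform split might look like this toy sketch, where the selector's state is a retained vocabulary; the keep-features-seen-at-least-minCount-times criterion and all names are assumptions for illustration:

```java
import java.util.*;

// Sketch of the recommended split: (1) feature selection fixes the state,
// (2) the transform is defined for a single datum given that state, and
// (3) the dataset transform then falls out uniformly.
class FeatureSelector {
    private final Set<String> kept;

    // (1) Feature selection: keep features seen at least minCount times
    // (an assumed toy criterion).
    FeatureSelector(Collection<List<String>> data, int minCount) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> datum : data)
            for (String f : datum)
                counts.merge(f, 1, Integer::sum);
        kept = new HashSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() >= minCount) kept.add(e.getKey());
    }

    // (2) Transform of a SINGLE datum, given the state.
    List<String> transform(List<String> datum) {
        List<String> out = new ArrayList<>();
        for (String f : datum) if (kept.contains(f)) out.add(f);
        return out;
    }

    // (3) The dataset transform is identical for every selector.
    List<List<String>> transformAll(Collection<List<String>> data) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> d : data) out.add(transform(d));
        return out;
    }
}
```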

## Ideas

• Have an interface (or other marker) to the effect of “parameterized” or “tunable” which has hooks to obtain the relevant parameters. The basic idea is:
• (1) Why re-implement Powell's method (or other tuning algorithms) in every learner that needs it? Do it outside.
• (2) Why do cross-validation within every learner that needs it? We'd rather have the data manager repeat this process
• One consideration is how to summarize statistics for n-way cross-fold validation (e.g. “average number of iterations until convergence”)
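The “parameterized”/“tunable” marker could be as small as a pair of hooks; the interface name, the string-keyed parameter map, and the SmoothedLearner example are all assumptions, not settled design:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical "tunable" marker: hooks for an external tuner (e.g. Powell's
// method) to read and write a learner's parameters without the learner
// re-implementing the tuning loop itself.
interface Tunable {
    Map<String, Double> getParameters();
    void setParameters(Map<String, Double> params);
}

// Toy learner exposing a single smoothing parameter through the hooks.
class SmoothedLearner implements Tunable {
    private double lambda = 0.5;

    public Map<String, Double> getParameters() {
        Map<String, Double> p = new LinkedHashMap<>();
        p.put("lambda", lambda);
        return p;
    }

    public void setParameters(Map<String, Double> params) {
        if (params.containsKey("lambda")) lambda = params.get("lambda");
    }
}
```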

## Issues

• Do we need one Learner or three:

Option 1:

```java
public interface Learner<F, L> {
    Model<F, L> trainModel(
        Collection<? extends Datum<F>> trainingData,
        Collection<? extends Datum<F>> validationData);
}
```
• Instantiating any learner by class name is simple–no instanceof or special cases necessary.
• Intuitively, these three are pretty much the same. Given training data, all three types produce a Model.
• SupervisedLearner and UnsupervisedLearner have exactly the same structure. SemisupervisedLearner may simply be a concrete class that uses a SupervisedLearner, in which case it is difficult to see the need for different interfaces for the Learners.
• Inside EVERY supervised learner, a type cast is necessary to obtain the labels on the Labeled data.

vs.

Option 2:

```java
public interface SupervisedLearner<F, L> {
    Model<F, L> trainModel(
        Collection<? extends LabeledDatum<F, L>> trainingData,
        Collection<? extends LabeledDatum<F, L>> validationData);
}

public interface SemisupervisedLearner<F, L> {
    Model<F, L> trainModel(
        Collection<? extends LabeledDatum<F, L>> labeledData,
        Collection<? extends Datum<F>> unlabeledData,
        Collection<? extends LabeledDatum<F, L>> validationData);
}

public interface UnsupervisedLearner<F, L> {
    Model<F, L> trainModel(
        Collection<? extends Datum<F>> trainingData,
        Collection<? extends Datum<F>> validationData);
}
```

NOTE: do we need an L on unsupervised learning? Don't think so! –rah67

• You typically can't turn a SupervisedLearner into an UnsupervisedLearner just by sending it unlabeled data, which suggests they should be different interfaces
• If it turns out we need a semi-supervised Learner interface, it would be more convenient to split the training data into labeled and unlabeled sets.
• Instantiation requires special cases–at least for semisupervised learning.

We are currently using option 1 in the CS 401R codebase, but note that all projects are supervised learning (we have taken out the word alignment project).
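To make the Option 1 downside concrete, the cast inside a supervised learner looks roughly like the following toy sketch (the “most frequent label” learner and all class names here are assumptions for illustration):

```java
import java.util.*;

interface Datum<F> { List<F> getFeatures(); }
interface LabeledDatum<F, L> extends Datum<F> { L getLabel(); }

// Simple concrete datum (cf. BasicLabeledDatum).
class BasicLabeledDatum<F, L> implements LabeledDatum<F, L> {
    private final List<F> features;
    private final L label;
    BasicLabeledDatum(List<F> features, L label) { this.features = features; this.label = label; }
    public List<F> getFeatures() { return features; }
    public L getLabel() { return label; }
}

// Toy supervised "learner" that returns the most frequent label.
class MostFrequentLabel<F, L> {
    L train(Collection<? extends Datum<F>> trainingData) {
        Map<L, Integer> counts = new HashMap<>();
        for (Datum<F> d : trainingData) {
            // The cast Option 1 forces on every supervised learner: the
            // static type only promises Datum, so labels must be recovered
            // by an unchecked downcast.
            @SuppressWarnings("unchecked")
            LabeledDatum<F, L> labeled = (LabeledDatum<F, L>) d;
            counts.merge(labeled.getLabel(), 1, Integer::sum);
        }
        L best = null;
        int bestCount = -1;
        for (Map.Entry<L, Integer> e : counts.entrySet())
            if (e.getValue() > bestCount) { best = e.getKey(); bestCount = e.getValue(); }
        return best;
    }
}
```

If an unlabeled datum slips into the collection, the cast fails only at runtime, which is the type-safety cost Option 2 avoids.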

• Is semi-supervised learning predictable enough to create a single concrete SemisupervisedLearner that takes a SupervisedLearner, trainingData, and either the % of data to use as labeled data or a set of mixed labeled/unlabeled data (along with possibly other parameters representing how much we should trust the unlabeled data)?
• What is the best way to handle Datums with non-uniform features? Here are several examples (more are possible):
1. Aligned sentences (SentencePair). Aligned sentences are stored as a list of English words and a list of French words. Perhaps each feature should be a pairing between English and French words (what are the performance implications?), i.e. Pair<String, String>.
2. ARFF files with continuous (Double) and discrete (String/Enum) features. Would we create an object explicitly for this?
3. SpeechNBestLists (is a single feature one of the N best, i.e. List<String>?).
• In the CS 401R codebase, we use the following weirdness: class SpeechNBestList implements Datum<SpeechNBestList>, class SentencePair implements Datum<List<String>>, etc.
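The single concrete SemisupervisedLearner floated above could be a self-training wrapper around any SupervisedLearner. This is only a sketch: the interfaces are minimal stand-ins, and a one-round scheme with no trust parameter is assumed:

```java
import java.util.*;

interface Datum<F> { List<F> getFeatures(); }
interface LabeledDatum<F, L> extends Datum<F> { L getLabel(); }
interface Model<F, L> { L getLabel(Datum<F> datum); }
interface SupervisedLearner<F, L> {
    Model<F, L> trainModel(Collection<? extends LabeledDatum<F, L>> trainingData);
}

// Simple concrete datum (cf. SimpleLabeledDatum).
class SimpleLabeledDatum<F, L> implements LabeledDatum<F, L> {
    private final List<F> features;
    private final L label;
    SimpleLabeledDatum(List<F> features, L label) { this.features = features; this.label = label; }
    public List<F> getFeatures() { return features; }
    public L getLabel() { return label; }
}

// Self-training wrapper: train on the labeled data, use the resulting model
// to label the unlabeled data, then retrain on everything.
class SelfTrainingLearner<F, L> {
    private final SupervisedLearner<F, L> base;
    SelfTrainingLearner(SupervisedLearner<F, L> base) { this.base = base; }

    Model<F, L> trainModel(Collection<? extends LabeledDatum<F, L>> labeledData,
                           Collection<? extends Datum<F>> unlabeledData) {
        Model<F, L> first = base.trainModel(labeledData);
        List<LabeledDatum<F, L>> all = new ArrayList<>(labeledData);
        for (Datum<F> d : unlabeledData)
            all.add(new SimpleLabeledDatum<>(d.getFeatures(), first.getLabel(d)));
        return base.trainModel(all);
    }
}
```

Parameters like the number of self-training rounds or a confidence threshold for accepting model-assigned labels would be the natural place for the “how much to trust the unlabeled data” knobs.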

## Interfaces

• Learner<F,L> (see above)
• Model<F,L>
• ProbabilisticModel<F,L> extends Model<F,L>
• GenerativeModel<F,L> extends Model<F,L>
• Datum<F>
• LabeledDatum<F,L> extends Datum<F>
• MetricCalculator (see above)
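One plausible reading of the hierarchy above, in code; the method names and probability semantics are assumptions, not decided API:

```java
import java.util.Collection;
import java.util.List;

interface Datum<F> { List<F> getFeatures(); }
interface LabeledDatum<F, L> extends Datum<F> { L getLabel(); }

interface Model<F, L> { L getLabel(Datum<F> datum); }

// Adds conditional probabilities P(label | datum) on top of plain labeling.
interface ProbabilisticModel<F, L> extends Model<F, L> {
    double getProbability(Datum<F> datum, L label);
}

// Adds joint probabilities P(datum, label).
interface GenerativeModel<F, L> extends Model<F, L> {
    double getJointProbability(Datum<F> datum, L label);
}

interface Learner<F, L> {
    Model<F, L> trainModel(Collection<? extends Datum<F>> trainingData,
                           Collection<? extends Datum<F>> validationData);
}
```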

## Classes

• SparseDatum<F>
• ArrayListDatum<F>, LabeledArrayListDatum<F> (better performance than BasicLabeledDatum)
• SimpleDatum<F>?, SimpleLabeledDatum<F>?

## Util

• Counter<E>
• CounterMap<K, V>
• LogCounter<E>
• LogCounterMap<K, V>??
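A minimal sketch of the Counter<E> utility, in the spirit of the Berkeley Counter; the exact method set here is assumed, not final:

```java
import java.util.HashMap;
import java.util.Map;

// Map from keys to double counts, with the handful of operations NLP code
// needs constantly: increment, lookup with 0.0 default, total, and argmax.
class Counter<E> {
    private final Map<E, Double> counts = new HashMap<>();

    void incrementCount(E key, double amount) {
        counts.merge(key, amount, Double::sum);
    }

    double getCount(E key) {
        return counts.getOrDefault(key, 0.0);
    }

    double totalCount() {
        double total = 0.0;
        for (double c : counts.values()) total += c;
        return total;
    }

    E argMax() {
        E best = null;
        double bestCount = Double.NEGATIVE_INFINITY;
        for (Map.Entry<E, Double> e : counts.entrySet())
            if (e.getValue() > bestCount) { best = e.getKey(); bestCount = e.getValue(); }
        return best;
    }
}
```

CounterMap<K, V> would then be a thin wrapper around Map<K, Counter<V>>, and the LogCounter variants would store log-space counts with log-add in place of addition.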