Formatting Lexicon Information


This page tracks potential candidates for a storage format for the lexicon. Currently the goal is simply to store the part-of-speech tag for each token. However, the design should be extensible and allow additional types of information in the future. We cannot predict everything a project might want to store in its dictionary/lexicon, so we are considering a storage format that can easily be manipulated and modified on a per-dictionary basis. This page collects the candidates that could be used.


It would be helpful to discuss these possibilities, even if that just means adding a list of pros and cons to each candidate.

General Formats

These formats would allow a project to define its own list of attributes to include. With the exception of Protocol Buffers, where a schema change means editing and recompiling a .proto definition, a project admin could easily add to or update the list of attributes.
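To make the idea of a project-defined attribute list concrete, here is a minimal sketch. It assumes JSON as the serialization purely for illustration and is not a recommendation of any candidate. The names PROJECT_ATTRIBUTES, make_entry, dumps_lexicon and loads_lexicon are hypothetical; only the "pos" attribute corresponds to the current goal.

```python
import json

# Hypothetical per-project attribute list. Only "pos" (part-of-speech tag)
# reflects the current goal; a project admin would extend this set.
PROJECT_ATTRIBUTES = {"pos"}


def make_entry(token, allowed=PROJECT_ATTRIBUTES, **attributes):
    """Build one lexicon entry, rejecting attributes the project has not defined."""
    unknown = set(attributes) - set(allowed)
    if unknown:
        raise ValueError(
            f"attributes not defined for this project: {sorted(unknown)}"
        )
    return {"token": token, "attributes": dict(attributes)}


def dumps_lexicon(entries):
    """Serialize a list of entries to text (JSON is an assumption here)."""
    return json.dumps(entries, sort_keys=True)


def loads_lexicon(text):
    """Read entries back from the serialized text."""
    return json.loads(text)
```

With this shape, adding a new kind of information is a change to the attribute list rather than to the storage code: for example, make_entry("ran", allowed={"pos", "lemma"}, pos="VERB", lemma="run") is accepted, while an attribute outside the list raises ValueError.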

Specific Formats

These formats are standardized to some extent, which would allow easy exchange with other programs. However, they tend to be very complicated, and they place some restrictions on what can be stored.

