Formatting Lexicon Information

Details

This page tracks candidate storage formats for the lexicon. The current goal is simply to store the part-of-speech tag for each token. However, the design should be extensible so that additional types of information can be stored in the future. Because we cannot predict everything a project might want to store in its dictionary/lexicon, we are considering a storage format that can easily be modified on a per-dictionary basis. The candidates are collected below.

Candidates

It would be helpful to discuss these possibilities, even if that only means adding a list of pros and cons to each candidate.

General Formats

These formats would allow a project to define its own list of attributes to store. With the exception of Protocol Buffers, a project admin could easily add to or update that list of attributes.

XML
Protocol Buffers
JSON
Plain Text (Values separated by some delimiter)
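
To make the comparison more concrete, here is a minimal Python sketch of how a lexicon entry might look in two of the formats above (JSON and delimited plain text). It is only an illustration under stated assumptions, not a proposed design: the tag values, the extra "lemma" attribute, the tab delimiter, and all function names are hypothetical and do not come from any existing project code.

```python
import json

# Hypothetical in-memory lexicon: token -> attribute map.
# Only "pos" is needed today; "lemma" stands in for a future, project-specific attribute.
lexicon = {
    "run": {"pos": "VB"},
    "dog": {"pos": "NN", "lemma": "dog"},
}


def to_json(lex):
    """Serialize the lexicon as JSON; new attributes need no format change."""
    return json.dumps(lex, sort_keys=True, indent=2)


def from_json(text):
    """Parse a JSON lexicon back into a dict."""
    return json.loads(text)


def to_delimited(lex, fields, sep="\t"):
    """One line per token: the token, then each field in order (empty if absent)."""
    lines = []
    for token in sorted(lex):
        row = [token] + [lex[token].get(field, "") for field in fields]
        lines.append(sep.join(row))
    return "\n".join(lines)


def from_delimited(text, fields, sep="\t"):
    """Parse delimited lines back into a dict, skipping empty values."""
    lex = {}
    for line in text.splitlines():
        token, *values = line.split(sep)
        lex[token] = {f: v for f, v in zip(fields, values) if v}
    return lex
```

In this sketch the JSON form carries its attribute names with every entry, so a project can add an attribute without touching the reader. The delimited form is more compact, but the reader and writer must agree on the field order (here passed in as `fields`), and the delimiter must not occur inside any value.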

Specific Formats

These formats are standardized to some extent, which would allow easy exchange with other programs. However, they tend to be complicated, and they place some restrictions on what can be stored.

TBX or TBX-Basic
OLIF

nlp-private/lexiconformat.txt · Last modified: 2015/04/22 14:57 by ryancha
