CCash Version 0.1

Up to this point, most of the work has been done on the GUI side. Work has now started on what will hopefully become the production version of the web client. Jeremy Sandberg is working on the Database/Server side and Paul Felt is working on the GUI/client side. Version 0.1 is intended to implement a subset of the desired functionality. In other words, there will be plenty of chances to add features. We are starting with a pared-down version so that we can actually finish and then move forward iteratively.

Features

  • one annotation per word - no multiple annotations per word
  • one project - only one project available
  • one document - only one document available to annotate
  • prefix and suffix - just identify, don't annotate them
  • annotation consists of: lexeme, top-level part of speech, and all attributes of verbs; all other annotation is deferred
  • log in, log out
  • dummy dictionary - add words to dictionary, no review of added words
  • localization

Dictionary

  • Marc Carmen came on board late in the game, so he didn't have time to do much with the dictionary.

GUI/Client Side

Project Navigation

  • Shows a tree of projects/documents. The tree may only be one level deep; that is, projects cannot have subprojects. Each project has a flat list of documents inside of it.

User Management

  • Allow login/logout
  • No permission role enforcement yet

Active Learning Mode

  • Segment words
  • Annotate words with a tagset delivered by the server
  • Keyboard shortcuts allow for completing the whole annotation process (so far) without using the mouse.

Stats

  • Show hard-coded stats placeholder on login (before a project is chosen)

Localization

  • The basic framework for translating CCash into other languages (including right-to-left languages).

Database/Server Side

Server RPC

Account Management

  • Log in
  • Log out

GWT sessions will be used to keep the state of the current users.

Utilities

  • Get Project List

Returns a list of all the current projects that can be worked on (e.g., Arabic, English, Syriac).

  • Get Document List

Returns all the documents that are in the current project. Basically used to populate the file browser once a project is selected.

  • Annotation Timer - in light of multiple annotations per user, I don't know if we need this.

Polls the server once every 30 seconds to let the server know that the client is still active. In this way we can avoid caching problems on the client side. For example, we will always load the next 3 best annotation targets (what the active learning model suggests) into the client to improve response time. Each time an annotation target is loaded, it is marked as being annotated so that no other users will try to annotate the same word. If the user who holds a word loses his connection to the internet, his annotation session will expire once the server has not received this heartbeat message for three minutes. When the session expires, the annotation targets will be marked as not taken, since the user never had a chance to annotate them. The client will also flush its cache and ask for a new session when internet connectivity is restored.
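The heartbeat and expiry rules above can be sketched in a few lines. This is only an illustration under assumptions: the class name, method names, and in-memory dictionaries are hypothetical, and only the 30-second poll interval and three-minute timeout come from the text.

```python
import time

HEARTBEAT_INTERVAL = 30   # seconds between client polls (from the text)
SESSION_TIMEOUT = 180     # seconds without a heartbeat before expiry (from the text)


class AnnotationSessions:
    """Hypothetical in-memory tracker for heartbeats and reserved targets."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_seen = {}   # user_id -> time of last heartbeat
        self._reserved = {}    # token_id -> user_id currently holding it

    def heartbeat(self, user_id):
        """Record that the client for user_id is still active."""
        self._last_seen[user_id] = self._clock()

    def reserve(self, user_id, token_ids):
        """Mark targets as taken; return only those that were actually free."""
        self.heartbeat(user_id)
        granted = [t for t in token_ids if t not in self._reserved]
        for t in granted:
            self._reserved[t] = user_id
        return granted

    def expire_stale(self):
        """Drop sessions that missed the timeout and free their targets."""
        now = self._clock()
        stale = {u for u, seen in self._last_seen.items()
                 if now - seen > SESSION_TIMEOUT}
        for u in stale:
            del self._last_seen[u]
        self._reserved = {t: u for t, u in self._reserved.items()
                          if u not in stale}
        return stale
```

The clock is injectable so that the expiry rule can be exercised without waiting three minutes; a real server would call expire_stale on a timer.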

  • Get Statistics

This will probably be broken down into a few different methods, but the basic idea is to show some statistics on who has annotated the most words, who is the “most correct” annotator, and the personal stats for the current user. Hopefully something like this will encourage all users to do more.

Browsing

  • Get Sentence

Returns a specific sentence. The client will send the sentence number that it needs from a specific file. This way each time a user selects a file to view he won't need to download the complete file in order to see it. He will see the file appear sentence by sentence.

Dictionary

  • Get Entry List

Returns a very simple list of all the words in the dictionary (if this isn't massive).

  • Get Entry Details

Gets the dictionary details of the specific word.

  • Add Entry

As an annotator goes along doing his/her job, they might come upon a word that is not in the current dictionary. This RPC will allow the GUI to notify the database that a word has been added by that annotator.

  • Get All Entries To Be Reviewed

Returns a list of all the words that are currently in the dictionary but have not been reviewed by an authorized user.

  • Mark As Reviewed

Each word that is added by an annotator will be considered provisional until it is reviewed. This RPC allows the GUI to notify the database that the selected word was accepted, including any changes made by the reviewer.
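The provisional/reviewed life cycle of a dictionary entry could look roughly like the sketch below. The class, field, and method names are assumptions chosen to mirror the RPC names; the real entry details and storage are not specified here.

```python
from dataclasses import dataclass


@dataclass
class DictionaryEntry:
    word: str
    details: str
    added_by: str
    reviewed: bool = False   # provisional until an authorized user reviews it


class Dictionary:
    """Hypothetical in-memory stand-in for the dictionary RPCs."""

    def __init__(self):
        self._entries = {}

    def add_entry(self, word, details, added_by):
        # Keeps the first entry for a word (a simplifying assumption).
        return self._entries.setdefault(
            word, DictionaryEntry(word, details, added_by))

    def entries_to_review(self):
        """Words that are in the dictionary but not yet reviewed."""
        return [e.word for e in self._entries.values() if not e.reviewed]

    def mark_as_reviewed(self, word, details=None):
        entry = self._entries[word]
        if details is not None:
            entry.details = details   # reviewer accepted the word with changes
        entry.reviewed = True
        return entry
```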

Annotate

  • Get Next Annotation

Returns the next sentence that needs to be annotated. If we are dropping to a word granularity, then it will still return a sentence object but with only one word that can be annotated.

  • Get Context Before

Takes in the sentence identifier and returns the previous sentence in read-only mode.

  • Get Context After

Same as above, except that it returns the following sentence.

  • Commit annotation

Takes a sentence that has been annotated. If we are in word granularity, the whole sentence will be sent back but only the specifically indicated word will be recorded as annotated. This will also be used by the reviewer, in which case the “reviewed” flag will be flipped for the sentence and each word.
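A minimal sketch of the word-granularity rule for committing, assuming a hypothetical Sentence object that carries an optional index of the one word to record. The names and the dictionary used as a store are illustrative only, and the reviewer path is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    text: str
    annotation: Optional[str] = None


@dataclass
class Sentence:
    sentence_id: int
    words: list
    target_index: Optional[int] = None   # set only in word granularity


def commit_annotation(store, sentence, user_id):
    """Record annotations from a returned sentence into store (a dict).

    In word granularity only the targeted word is recorded, even though
    the whole sentence is sent back.
    """
    if sentence.target_index is None:
        indices = range(len(sentence.words))
    else:
        indices = [sentence.target_index]
    for i in indices:
        word = sentence.words[i]
        if word.annotation is not None:
            store[(sentence.sentence_id, i)] = (user_id, word.annotation)
    return store
```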

Database

General Tables

[Figure: database schema diagram (DatabaseSchema.png)]

  • General Document Configuration

A project links to multiple documents; each document links to multiple sentences; each sentence links to multiple tokens.
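The one-to-many chain described above can be sketched with SQLite. The table and column names below are assumptions for illustration; the actual names are the ones in the schema diagram.

```python
import sqlite3

# Hypothetical table and column names; the real schema is in the diagram.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite leaves enforcement off by default
conn.executescript("""
    CREATE TABLE project  (project_id  INTEGER PRIMARY KEY,
                           name        TEXT NOT NULL);
    CREATE TABLE document (document_id INTEGER PRIMARY KEY,
                           project_id  INTEGER NOT NULL REFERENCES project(project_id));
    CREATE TABLE sentence (sentence_id INTEGER PRIMARY KEY,
                           document_id INTEGER NOT NULL REFERENCES document(document_id));
    CREATE TABLE token    (token_id    INTEGER PRIMARY KEY,
                           sentence_id INTEGER NOT NULL REFERENCES sentence(sentence_id),
                           text        TEXT NOT NULL);
""")


def tokens_in_project(conn, project_id):
    """Walk project -> document -> sentence -> token and return token texts."""
    rows = conn.execute("""
        SELECT t.text
        FROM document d
        JOIN sentence s ON s.document_id = d.document_id
        JOIN token    t ON t.sentence_id = s.sentence_id
        WHERE d.project_id = ?
        ORDER BY t.token_id
    """, (project_id,)).fetchall()
    return [r[0] for r in rows]
```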

  • Document Table

Each document has a status_id that indicates where in the workflow the document is: annotating, under review, or complete. In later versions this needs to be expanded to allow finer-grained control of the status of a document: multiple annotations, annotate complete document, select specific annotators, select specific reviewers, complete. A document will be able to go from any status to any other status at any time. Some decisions need to be made about this. For example, if a document is partially reviewed and is sent back to annotation, will the annotation also go over the reviewed parts of the document, or will those reviewed parts stay reviewed? Several other issues crop up with a workflow this fluid.

  • Sentence Table

Sentences have a sentence index which indicates where in the document they fall. In this way when we need to load a complete document, sentences can be assured of their order regardless of their sentence_id and they can be loaded dynamically by the GUI for annotating purposes. For example, if the annotated sentence is the 33rd sentence in the document, the GUI can send a request asking for the 32nd and the 34th and thus give context for the sentence very quickly and the ordering of sentence_ids doesn't matter.
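To illustrate why the sentence index makes the ordering of sentence_ids irrelevant, here is a small SQLite sketch. Only sentence_id comes from the text; the sentence_index and document_id column names, and the two helper functions, are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sentence (
        sentence_id    INTEGER PRIMARY KEY,
        document_id    INTEGER NOT NULL,
        sentence_index INTEGER NOT NULL,   -- position within the document
        text           TEXT NOT NULL,
        UNIQUE (document_id, sentence_index)
    )
""")
# The sentence_id values are deliberately out of document order.
conn.executemany(
    "INSERT INTO sentence VALUES (?, ?, ?, ?)",
    [(907, 1, 32, "before"), (12, 1, 33, "target"), (455, 1, 34, "after")],
)


def get_sentence(conn, document_id, sentence_index):
    """Fetch one sentence by its position in the document, or None."""
    row = conn.execute(
        "SELECT text FROM sentence"
        " WHERE document_id = ? AND sentence_index = ?",
        (document_id, sentence_index),
    ).fetchone()
    return row[0] if row else None


def get_context(conn, document_id, sentence_index):
    """Return the (previous, next) sentences around the given index."""
    return (get_sentence(conn, document_id, sentence_index - 1),
            get_sentence(conn, document_id, sentence_index + 1))
```

For the 33rd sentence, get_context looks up indexes 32 and 34 directly, so a missing neighbor (the start or end of a document) simply comes back as None.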

  • AnnotatedToken Table

Each time an annotation is completed, an entry is created in the AnnotatedToken table. Each entry must have an annotation, which for this version will be stored in the “little annotation language”. The entry may also have a corrected_text, which indicates that the annotator thinks there is an error in the OCR. A user_id is associated with each annotation so we can keep track of which users annotated which words. The reviewed column is a simple bit. If this bit is 0, the reviewer will see each annotation that has been done per word. Once the reviewer selects an annotation he wants to keep, the chosen annotation's reviewed bit is set to 1. There is another option that might be preferred for reviewed words, which is explained below. Each annotation is also connected to the dictionary, and here we currently have a small conflict. See the next item in the list for an explanation.

  • Dictionary-AnnotatedToken Conflict

As the schema is currently shown, the dictionary has an annotation element that could conflict with the annotation done by the users. The reason is that if each annotated word must be in the dictionary, then it doesn't make sense to have an annotation field in the AnnotatedToken table, because the dictionary will already have one. In that case we can reference the annotation field in the dictionary instead. I think this decision needs to be made, but it is not critical.

  • Reviewed Annotations

Another option, rather than the reviewed bit in the AnnotatedToken table, is to have a separate ReviewedToken table. This ReviewedToken table would have a reviewed_id as its PK. It would also include a user_id (the user who reviewed that token) and an annotation_id (which maps to the reviewed token that was selected as correct).
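A sketch of the separate-table option, using SQLite. The reviewed_id, user_id, and annotation_id columns come from the description above; the lowercase table names, the token_id column, and the helper functions are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE annotated_token (
        annotation_id INTEGER PRIMARY KEY,
        token_id      INTEGER NOT NULL,
        user_id       INTEGER NOT NULL,   -- the annotator
        annotation    TEXT NOT NULL
    );
    CREATE TABLE reviewed_token (
        reviewed_id   INTEGER PRIMARY KEY,
        user_id       INTEGER NOT NULL,   -- the reviewer
        annotation_id INTEGER NOT NULL
                      REFERENCES annotated_token(annotation_id)
    );
""")


def accept(conn, reviewer_id, annotation_id):
    """Record that a reviewer selected this annotation as correct."""
    conn.execute(
        "INSERT INTO reviewed_token (user_id, annotation_id) VALUES (?, ?)",
        (reviewer_id, annotation_id),
    )


def accepted_annotation(conn, token_id):
    """Return the accepted annotation for a token, or None if unreviewed."""
    row = conn.execute("""
        SELECT a.annotation
        FROM reviewed_token r
        JOIN annotated_token a ON a.annotation_id = r.annotation_id
        WHERE a.token_id = ?
    """, (token_id,)).fetchone()
    return row[0] if row else None
```

Compared with a reviewed bit, this keeps a record of who made the review decision without changing the annotator's row.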

  • Dictionary Table

As an annotator annotates, he will be given the opportunity to add words to the dictionary on the fly. If each word that is annotated must be linked back to the dictionary, then all annotation data should be contained in the dictionary. That way, if a word is already in the dictionary when someone is annotating, everything should be completely annotated for them. The reviewed column simply indicates whether the entry in the dictionary has been reviewed by an authority. Only specific users will be allowed to review the dictionary, but all users will be able to see all words whether reviewed or not.

  • TagSet Table

Contains the general categories by which a word will be tagged. It has two foreign keys: one to the project table and one to the file table. This is to allow admins to specify different tag sets per file but have a default tag set for a project.

  • TagValue Table

This table holds each value for each tag set.
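One way the per-file override with a project default might be resolved is sketched below. It assumes that a tag set with no file key is the project default, which the text does not state explicitly; the class and function names are likewise hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TagSet:
    name: str
    project_id: int
    file_id: Optional[int] = None                 # None marks the project default (assumed)
    values: list = field(default_factory=list)    # rows of the TagValue table


def resolve_tag_sets(tag_sets, project_id, file_id):
    """Prefer tag sets bound to the file; otherwise use the project default."""
    in_project = [t for t in tag_sets if t.project_id == project_id]
    for_file = [t for t in in_project if t.file_id == file_id]
    return for_file or [t for t in in_project if t.file_id is None]
```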

Communication With Model

It seems that the easiest way to communicate between the model and the server side is to use two really simple tables. In this way, the model can change all it wants but the server side can rely on the fact that these two tables will always exist and thus the changes to the code base should be minimal when changes occur to these tables.

  • AnnotatedItems Table

Each time an item is annotated, a record will be added to this table. The model can then grab however many annotated tokens it wants to train on and train away at its leisure.

  • SelectedItems Table

Once the model has trained and selected items, each item will be added as a new row. Then, when an annotator wants to annotate, the client side can ask for the next selected items and the server will simply grab the top ones in this table and give them to the annotator.
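The two-table hand-off between the server and the model could be sketched as follows, using SQLite. The column names and helper functions are assumptions. Treating "top" as insertion order and deleting rows once they are handed out are also assumptions, since the text does not say how served items are tracked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE annotated_items (id INTEGER PRIMARY KEY,
                                  token_id INTEGER NOT NULL,
                                  annotation TEXT NOT NULL);
    CREATE TABLE selected_items  (id INTEGER PRIMARY KEY,
                                  token_id INTEGER NOT NULL);
""")


def record_annotation(conn, token_id, annotation):
    """Server side: add a row each time an item is annotated."""
    conn.execute(
        "INSERT INTO annotated_items (token_id, annotation) VALUES (?, ?)",
        (token_id, annotation))


def training_batch(conn, limit):
    """Model side: grab however many annotated items it wants to train on."""
    return conn.execute(
        "SELECT token_id, annotation FROM annotated_items ORDER BY id LIMIT ?",
        (limit,)).fetchall()


def model_select(conn, token_ids):
    """Model side: add each selected item as a new row."""
    conn.executemany("INSERT INTO selected_items (token_id) VALUES (?)",
                     [(t,) for t in token_ids])


def next_selected(conn, n=3):
    """Server side: hand out the top n selected items and remove them."""
    rows = conn.execute(
        "SELECT id, token_id FROM selected_items ORDER BY id LIMIT ?",
        (n,)).fetchall()
    conn.executemany("DELETE FROM selected_items WHERE id = ?",
                     [(r[0],) for r in rows])
    return [r[1] for r in rows]
```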

nlp-private/ccash_0.1.txt · Last modified: 2015/04/22 14:56 by ryancha