Up to this point the main work has been done on the GUI side. Work has now started on what hopefully will become the production version of the web client. Jeremy Sandberg is working on the Database/Server side and Paul Felt is working on the GUI/client side. Version 0.1 is intended to implement a subset of the desired functionality. In other words, there will be plenty of chances to add features. We are starting with a pared down version so that we can actually finish and then go forward iteratively.
GWT sessions will be used to keep the state of the current users.
Returns a list of all the current projects that can be worked on (e.g., Arabic, English, Syriac).
Returns all the documents that are in the current project. Basically used to populate the file browser once a project is selected.
Polls the server once every 30 seconds to let the server know that the client is still active. In this way we can avoid caching problems on the client side. For example, we will always load the next 3 best annotation targets (what the active learning model suggests) into the client to improve response time. Each time an annotation target is loaded, it is marked as taken so that no other user will try to annotate the same word. If a user who holds a word loses his internet connection, his annotation session will expire once the server has not received the heartbeat message for three minutes. When the session expires, its annotation targets are marked as not taken, since the user never had a chance to annotate them. The client will also flush its cache and ask for a new session when internet connectivity is restored.
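The expiry rule described above can be sketched in plain Java as follows. This is only an illustration of the bookkeeping, not the actual server code: the class name HeartbeatRegistry, its method names, and the use of integer target ids are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical server-side bookkeeping for heartbeats; all names are illustrative.
class HeartbeatRegistry {
    // A session expires after three minutes without a heartbeat.
    static final long TIMEOUT_MILLIS = 3 * 60 * 1000L;

    // sessionId -> time of the last heartbeat, in milliseconds
    private final Map<String, Long> lastSeen = new HashMap<>();
    // sessionId -> ids of the annotation targets currently held by that session
    private final Map<String, Set<Integer>> claimed = new HashMap<>();

    // Called whenever the client's 30-second poll arrives.
    synchronized void heartbeat(String sessionId, long nowMillis) {
        lastSeen.put(sessionId, nowMillis);
    }

    // Marks a target as taken so that no other user is offered it.
    synchronized void claim(String sessionId, int targetId, long nowMillis) {
        lastSeen.putIfAbsent(sessionId, nowMillis);
        claimed.computeIfAbsent(sessionId, k -> new HashSet<>()).add(targetId);
    }

    synchronized boolean isTaken(int targetId) {
        for (Set<Integer> ids : claimed.values()) {
            if (ids.contains(targetId)) {
                return true;
            }
        }
        return false;
    }

    // Drops silent sessions and returns the target ids that are free again.
    synchronized List<Integer> expireStale(long nowMillis) {
        List<Integer> released = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastSeen.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= TIMEOUT_MILLIS) {
                Set<Integer> ids = claimed.remove(entry.getKey());
                if (ids != null) {
                    released.addAll(ids);
                }
                it.remove();
            }
        }
        return released;
    }
}
```

Passing the current time in as a parameter, rather than reading the clock inside the class, is a choice made here so the timeout can be exercised without waiting three minutes.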
This will probably be broken down into a few different methods, but the basic idea is to show some statistics on who has annotated the most words, who is the “most correct” annotator, and the personal stats for the current user. Hopefully something like this will encourage all users to do more.
Returns a specific sentence. The client will send the sentence number that it needs from a specific file. This way each time a user selects a file to view he won't need to download the complete file in order to see it. He will see the file appear sentence by sentence.
Returns a very simple list of all the words in the dictionary (if this isn't massive).
Gets the dictionary details of the specific word.
As an annotator goes along doing his/her job, they might come upon a word that is not in the current dictionary. This RPC will allow the GUI to notify the database that a word has been added by that annotator.
Returns a list of all the words that are currently in the dictionary but have not been reviewed by an authorized user.
Each word that is added by an annotator will be considered in provisional status until it is reviewed. This allows the GUI to notify the database that the selected word was accepted, with changes included.
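The provisional-word life cycle covered by the three dictionary calls above (add a word, list unreviewed words, accept a word with changes) might look roughly like the sketch below. It is an in-memory stand-in for the database; ProvisionalDictionary and its method names are hypothetical and do not correspond to the real RPC names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the dictionary table; names are illustrative.
class ProvisionalDictionary {
    static final class Entry {
        String details;            // dictionary details for the word
        final int addedByUserId;   // annotator who added the word
        boolean reviewed;          // false while the word is provisional

        Entry(String details, int addedByUserId) {
            this.details = details;
            this.addedByUserId = addedByUserId;
            this.reviewed = false;
        }
    }

    private final Map<String, Entry> entries = new LinkedHashMap<>();

    // An annotator adds a word on the fly; it starts out provisional.
    boolean addByAnnotator(String word, String details, int userId) {
        if (entries.containsKey(word)) {
            return false; // already in the dictionary
        }
        entries.put(word, new Entry(details, userId));
        return true;
    }

    // Words that no authorized user has reviewed yet, in insertion order.
    List<String> unreviewedWords() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Entry> e : entries.entrySet()) {
            if (!e.getValue().reviewed) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    // A reviewer accepts the word, possibly with changed details.
    boolean accept(String word, String finalDetails) {
        Entry entry = entries.get(word);
        if (entry == null) {
            return false;
        }
        entry.details = finalDetails;
        entry.reviewed = true;
        return true;
    }

    String detailsOf(String word) {
        Entry entry = entries.get(word);
        return entry == null ? null : entry.details;
    }
}
```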
Returns the next sentence that needs to be annotated. If we are dropping to a word granularity, then it will still return a sentence object but with only one word that can be annotated.
Takes in the sentence identifier and returns the previous sentence in read only mode.
Same as above except the sentence after.
Takes a sentence that has been annotated. If we are in word granularity then the whole sentence will be sent back but only the specifically indicated word will be recorded as annotated. Also, this will be used by the reviewer and the “reviewed” flag will be flipped for the sentence and each word.
A project links to multiple documents; each document links to multiple sentences; each sentence links to multiple tokens.
Each document has a status_id that indicates where in the workflow the document is: annotating, under review, or complete. In later versions this needs to be expanded to allow for finer-grained control of the status of a document: multiple annotations, annotate complete document, select specific annotators, select specific reviewers, complete. A document will be able to go from any status to any other status at any time. Some decisions need to be made about this. For example, if a document is partially reviewed and is sent back to annotation, will the annotation also go over the reviewed parts of the document, or will those reviewed parts stay reviewed? There are several other issues that crop up with a workflow this fluid.
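A minimal sketch of the three version 0.1 statuses, with the "any status to any other status" rule, could look like this. The numeric status_id values (1, 2, 3) and the class name DocumentStatusTracker are assumptions made for the example; the schema does not specify them.

```java
// Hypothetical model of a document's workflow status; the ids are assumed values.
class DocumentStatusTracker {
    enum Status {
        ANNOTATING(1), UNDER_REVIEW(2), COMPLETE(3);

        final int statusId;

        Status(int statusId) {
            this.statusId = statusId;
        }

        // Maps a status_id column value back to the enum constant.
        static Status fromId(int statusId) {
            for (Status s : values()) {
                if (s.statusId == statusId) {
                    return s;
                }
            }
            throw new IllegalArgumentException("Unknown status_id: " + statusId);
        }
    }

    private Status status = Status.ANNOTATING;

    Status current() {
        return status;
    }

    // No transition table: every status may follow every other status.
    void moveTo(Status next) {
        status = next;
    }
}
```

If later versions restrict the workflow, moveTo is the one place where a transition check would go.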
Sentences have a sentence index which indicates where in the document they fall. In this way when we need to load a complete document, sentences can be assured of their order regardless of their sentence_id and they can be loaded dynamically by the GUI for annotating purposes. For example, if the annotated sentence is the 33rd sentence in the document, the GUI can send a request asking for the 32nd and the 34th and thus give context for the sentence very quickly and the ordering of sentence_ids doesn't matter.
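The lookup by sentence index can be illustrated with a small sketch. SentenceStore and its methods are hypothetical names, and a nested in-memory map stands in for the sentence table; the point is only that neighbours are found by index arithmetic, independent of sentence_id.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical stand-in for the sentence table, keyed by document and sentence index.
class SentenceStore {
    // documentId -> (sentenceIndex -> text); TreeMap keeps sentences in index order
    private final Map<Integer, TreeMap<Integer, String>> documents = new HashMap<>();

    void put(int documentId, int sentenceIndex, String text) {
        documents.computeIfAbsent(documentId, k -> new TreeMap<>()).put(sentenceIndex, text);
    }

    Optional<String> get(int documentId, int sentenceIndex) {
        TreeMap<Integer, String> sentences = documents.get(documentId);
        if (sentences == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(sentences.get(sentenceIndex));
    }

    // Context for the sentence being annotated: the one just before it.
    Optional<String> previous(int documentId, int sentenceIndex) {
        return get(documentId, sentenceIndex - 1);
    }

    // Context for the sentence being annotated: the one just after it.
    Optional<String> next(int documentId, int sentenceIndex) {
        return get(documentId, sentenceIndex + 1);
    }

    // Loads the complete document in sentence-index order.
    List<String> fullDocument(int documentId) {
        TreeMap<Integer, String> sentences = documents.get(documentId);
        return sentences == null ? new ArrayList<String>() : new ArrayList<>(sentences.values());
    }
}
```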
Each time an annotation is completed, an entry is created in the annotation table. Each entry must have an annotation, which for this version will be stored in the “little annotation language”. The entry may also have a corrected_text, which indicates that the annotator thinks there is an error in the OCR. A user_id is associated with each annotation so we can keep track of which users annotated which words. The reviewed column is a simple bit. If this bit is 0, the reviewer will see each annotation that has been done per word. Once the reviewer selects the annotation he wants to keep, the chosen annotation's reviewed bit is set to 1. There is another option that might be preferred for reviewed words, which will be explained below. Each annotation is also connected to the dictionary, and here we currently have a small conflict. See the next item in the list for an explanation.
As the schema is currently shown, the dictionary has an annotation element that could conflict with the annotation done by the users. The reason is that if every annotated word must be in the dictionary, then it doesn't make sense to have an annotation field in the AnnotatedToken table, because the dictionary already has one. In that case we can simply reference the annotation field in the dictionary. I think this decision needs to be made, but it is not critical.
Another option, rather than the reviewed bit in the AnnotatedToken table, is to have a separate ReviewedToken table. This table would have a reviewed_id as its PK. It would also include a user_id (the user who reviewed that token) and an annotation_id (which maps to the annotated token that was selected as correct).
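To make the ReviewedToken alternative concrete, here is a rough sketch of the row and of how "is this annotation reviewed?" would be answered without a reviewed bit. The three fields mirror the columns named above; the ReviewedTokenTable class and its methods are hypothetical, and a list stands in for the database table.

```java
import java.util.ArrayList;
import java.util.List;

// One row of the proposed ReviewedToken table.
final class ReviewedToken {
    final int reviewedId;   // primary key
    final int userId;       // the user who reviewed the token
    final int annotationId; // the annotation that was selected as correct

    ReviewedToken(int reviewedId, int userId, int annotationId) {
        this.reviewedId = reviewedId;
        this.userId = userId;
        this.annotationId = annotationId;
    }
}

// Hypothetical in-memory stand-in for the table.
class ReviewedTokenTable {
    private final List<ReviewedToken> rows = new ArrayList<>();
    private int nextId = 1;

    // Records that a reviewer chose this annotation as the correct one.
    ReviewedToken accept(int reviewerUserId, int annotationId) {
        ReviewedToken row = new ReviewedToken(nextId++, reviewerUserId, annotationId);
        rows.add(row);
        return row;
    }

    // Replaces the reviewed bit: an annotation is reviewed if a row points at it.
    boolean isReviewed(int annotationId) {
        for (ReviewedToken row : rows) {
            if (row.annotationId == annotationId) {
                return true;
            }
        }
        return false;
    }
}
```

Compared with the single bit, this layout also records who did the review, at the cost of an extra lookup.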
As an annotator annotates, he will be given the opportunity to add words to the dictionary on the fly. If each word that is annotated must be linked back to the dictionary, then all annotation data should be contained in the dictionary. That way, if a word being annotated is already in the dictionary, everything should be completely annotated for the annotator already. The reviewed column simply indicates whether the entry in the dictionary has been reviewed by an authority. Only specific users will be allowed to review the dictionary, but all users will be able to see all words whether reviewed or not.
Contains the general categories by which a word will be tagged. It has two foreign keys: one to the project table and one to the file table. This is to allow admins to specify different tag sets per file but have a default tag set for a project.
This has each value for each set.
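The "per-file tag set with a project default" rule can be sketched as a simple fallback lookup. TagSetResolver and its method names are hypothetical, and plain lists of strings stand in for the tag set and tag value tables.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical resolver: a file-specific tag set wins over the project default.
class TagSetResolver {
    private final Map<Integer, List<String>> projectDefaults = new HashMap<>();
    private final Map<Integer, List<String>> fileOverrides = new HashMap<>();

    void setProjectDefault(int projectId, List<String> tags) {
        projectDefaults.put(projectId, tags);
    }

    void setFileOverride(int fileId, List<String> tags) {
        fileOverrides.put(fileId, tags);
    }

    // Returns the file's own tag set if an admin defined one, else the project default.
    List<String> resolve(int projectId, int fileId) {
        List<String> override = fileOverrides.get(fileId);
        if (override != null) {
            return override;
        }
        List<String> projectDefault = projectDefaults.get(projectId);
        return projectDefault != null ? projectDefault : Collections.<String>emptyList();
    }
}
```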
It seems that the easiest way to communicate between the model and the server side is to use two really simple tables. In this way, the model can change all it wants but the server side can rely on the fact that these two tables will always exist and thus the changes to the code base should be minimal when changes occur to these tables.
Each time an item is annotated, a record will be added to this table. The model can then grab however many annotated tokens it wants to train on and train away at its leisure.
Once the model has trained and selected items, each item will be added as a new row. Then, when an annotator wants to annotate, the client side can ask for the next selected items, and the server will simply grab the top ones in this table and give them to the annotator.
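The hand-off through the two tables can be sketched as follows, with an in-memory list and queue standing in for the tables. ModelExchange, its method names, and the use of integer token ids are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for the two tables shared by the server and the model.
class ModelExchange {
    // Table 1: tokens that have been annotated, in the order they were recorded.
    private final List<Integer> annotatedTokenIds = new ArrayList<>();
    // Table 2: targets selected by the model, best first.
    private final Deque<Integer> selectedTokenIds = new ArrayDeque<>();

    // Server side: called each time an item is annotated.
    synchronized void recordAnnotated(int tokenId) {
        annotatedTokenIds.add(tokenId);
    }

    // Model side: grab up to max annotated tokens to train on.
    synchronized List<Integer> trainingBatch(int max) {
        int end = Math.max(0, Math.min(max, annotatedTokenIds.size()));
        return new ArrayList<>(annotatedTokenIds.subList(0, end));
    }

    // Model side: after training, add each selected item as a new row.
    synchronized void publishSelected(List<Integer> rankedTokenIds) {
        selectedTokenIds.addAll(rankedTokenIds);
    }

    // Server side: hand the top rows to an annotator and remove them from the table.
    synchronized List<Integer> nextTargets(int count) {
        List<Integer> result = new ArrayList<>();
        while (result.size() < count && !selectedTokenIds.isEmpty()) {
            result.add(selectedTokenIds.poll());
        }
        return result;
    }
}
```

A client that preloads three targets at a time would call nextTargets(3); the model never needs to know how the server distributes them.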