nlp:syriac-user-study [CS Wiki]

Introduction to the Project

The desirability of an electronic corpus of Syriac texts has long been recognized (most recently in Lucas Van Rompay's January 2007 Hugoye article). Several localized and limited steps have been taken in this direction, most significantly with the Peshitta and as part of the Comprehensive Aramaic Lexicon project. However, no coordinated, large-scale effort has yet been attempted. Since 2001, scholars at the Center for the Preservation of Ancient Religious Texts (CPART) at Brigham Young University (BYU) have been working toward a comprehensive electronic corpus of Syriac texts. In 2004 they were joined in this effort by Dr. David G.K. Taylor of the Oriental Institute at the University of Oxford. Working from both printed editions and manuscripts, the project aims to systematically acquire accurate electronic copies of all Syriac literature.

Furthermore, the Syriac Electronic Corpus will include a morphological annotation of each word. Because of the size of the undertaking, some parts of the corpus will be annotated automatically, while the more crucial parts will be annotated by human annotators.

The Natural Language Processing (NLP) Lab at BYU is developing tools and cost-efficient methods of annotation in order to make the Corpus’s construction feasible.

The Value of Corpora

Linguistic annotations of text offer many substantial benefits. Annotations can be used to find linguistic patterns more reliably, explore language usage, track how usage changes over time, and discover rare forms. For a morphologically rich language like Syriac, annotations can also be a practical help to language learners. For example, Semitic language dictionaries tend to list conjugated verb forms under a base form (the dictionary headword). This arrangement is convenient for dictionary users who are already familiar with verb conjugations, but it can be very frustrating for language learners, who must identify the base form before they can look a word up. Annotations allow language learners to interact more naturally with the text they are learning.

The Syriac morphological annotations we are collecting involve segmenting each word into prefix, stem, and suffix. The stem and suffix are then tagged with morphological information, and the stem is further annotated with the corresponding dictionary headword. The end result will be a morphologically annotated electronic corpus of the Syriac texts: a body of texts where every word is linked to a dictionary entry, and a dictionary where every entry is linked to each of its usages in the corpus.
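The structure described above can be pictured with a minimal sketch. It is not the project's actual data model: the class names, field names, tag values, and the placeholder strings standing in for Syriac text are all assumptions made for illustration. It shows only the two-way link, from each word to its headword and from each headword to its usages.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedWord:
    """One corpus word, segmented and tagged (all field names are hypothetical)."""
    prefix: str
    stem: str
    suffix: str
    stem_tags: tuple    # morphological tags for the stem, e.g. (("pos", "verb"),)
    suffix_tags: tuple  # morphological tags for the suffix
    headword: str       # dictionary entry that the stem belongs to

    @property
    def surface(self) -> str:
        # The word as it appears in the text is the three segments joined.
        return self.prefix + self.stem + self.suffix


class Corpus:
    """Words in text order, plus an index from headword to word positions."""

    def __init__(self):
        self.words = []
        self._usages = defaultdict(list)

    def add(self, word: AnnotatedWord) -> int:
        position = len(self.words)
        self.words.append(word)
        self._usages[word.headword].append(position)
        return position

    def headword_of(self, position: int) -> str:
        # Text -> dictionary direction.
        return self.words[position].headword

    def usages_of(self, headword: str) -> list:
        # Dictionary -> text direction.
        return list(self._usages.get(headword, []))


# Placeholder strings only; these are not real Syriac forms or tags.
corpus = Corpus()
corpus.add(AnnotatedWord("p", "stemA", "s", (("pos", "verb"),), (("number", "sg"),), "HEAD_A"))
corpus.add(AnnotatedWord("", "stemB", "", (("pos", "noun"),), (), "HEAD_B"))
corpus.add(AnnotatedWord("", "stemA2", "s", (("pos", "verb"),), (("number", "pl"),), "HEAD_A"))
print(corpus.usages_of("HEAD_A"))
```

Keeping the headword index alongside the word list means that a lookup in either direction does not require scanning the whole corpus.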

How Can Machines Help?

Traditionally, annotated corpora have been laboriously labeled by hand. For under-resourced languages like Syriac in particular, this approach is cost-prohibitive. Research in the fields of Natural Language Processing (NLP) and Machine Learning has introduced several possible solutions to this problem. For example, annotation can be cheaper and more accurate when annotators are asked to correct machine predictions rather than annotate from scratch. The BYU NLP Lab has developed a model capable of making such predictions with high accuracy for Syriac. Additionally, it has been shown in some domains that annotation efficiency can be increased by automatically selecting which examples are annotated first, a technique known as Active Learning. The NLP Lab has been involved in improving Active Learning’s efficiency and usability in real applications and in extending the methodology to better handle unexplored domains such as Syriac morphological tagging.
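To make the idea of selecting examples concrete, here is a minimal sketch of one common Active Learning heuristic, least-confidence sampling. This page does not say which selection strategy the NLP Lab uses, so the heuristic, the function name, and the example probabilities are all assumptions for illustration.

```python
def least_confident_first(predictions):
    """Order item ids so the model's least confident predictions come first.

    predictions: dict mapping an item id to a list of label probabilities
    produced by some model (hypothetical input format).
    """
    def confidence(item_id):
        # The model's confidence is the probability of its top label.
        return max(predictions[item_id])

    # Ascending sort: items the model is least sure about are annotated first.
    return sorted(predictions, key=confidence)


# Made-up probabilities for three unannotated items.
example = {
    "item_a": [0.9, 0.1],
    "item_b": [0.5, 0.5],
    "item_c": [0.7, 0.3],
}
print(least_confident_first(example))
```

The intuition is that a human label is worth most where the model is unsure. A cost-conscious variant would also weigh how long each item is expected to take to annotate.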

We have created a web-based annotation tool called CCASH (Cost Conscious Annotation Supervised by Humans) that takes advantage of these methods. As CCASH matures, it will be used to develop the annotated Syriac corpus. In addition, it will be made available to the public for other annotation projects.

The Importance of the User Studies

CCASH and the Syriac Electronic Corpus are both planned as open access resources intended to benefit the field of Syriac studies. The results of the user studies and feedback from users will directly impact the development and the functionality of the annotation tools. These tools will be used to computationally annotate the entirety of the corpus. Moreover, user study participants will also be able to use CCASH to efficiently and completely annotate Syriac texts that they are interested in.

User Study #1

In this user study we are gathering data on the effectiveness of having annotators correct machine predictions rather than annotate from scratch. This technique is called automatic pre-annotation.

As you take the study you will encounter machine predictions of varying quality. Bear in mind that most of these annotations will be of intentionally poor quality, and are not representative of the best our model can do.

We aim to determine how accurate machine predictions need to be before they begin to be useful to annotators. This knowledge will help us apply pre-annotations appropriately to difficult texts like poetry.

We are also exploring an enhancement to pre-annotation in which machine predictions are updated in response to an annotator's actions. As you annotate, future pre-annotations will sometimes be updated based on decisions you have already made.
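As a rough illustration of this kind of updating, the sketch below uses a deliberately simple rule: when an annotator corrects one word, later unreviewed occurrences of the same word form receive the same label. This rule is an assumption made for illustration and is not a description of how CCASH or our model updates its predictions. The function name, data layout, and labels are hypothetical.

```python
def propagate_correction(pre_annotations, reviewed, corrected_index, new_label):
    """Return pre-annotations updated after one annotator correction.

    pre_annotations: list of (word_form, predicted_label) pairs in text order
    reviewed: set of indices the annotator has already confirmed or corrected
    corrected_index: index of the word the annotator just corrected
    new_label: the label the annotator chose
    """
    word_form, _ = pre_annotations[corrected_index]
    updated = list(pre_annotations)  # leave the caller's list untouched
    updated[corrected_index] = (word_form, new_label)

    # Only later, not-yet-reviewed occurrences of the same form are changed;
    # decisions the annotator has already made are never overwritten.
    for i in range(corrected_index + 1, len(updated)):
        form, _ = updated[i]
        if form == word_form and i not in reviewed:
            updated[i] = (form, new_label)
    return updated


# Placeholder word forms and labels, not real Syriac data.
predictions = [("formA", "noun"), ("formB", "verb"), ("formA", "noun")]
print(propagate_correction(predictions, reviewed={0}, corrected_index=0, new_label="verb"))
```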

Getting Started

Fonts

The Syriac Computing Center (SyrCOM) of Beth Mardutho provides excellent free Syriac fonts that are compatible with the Windows Operating System.

If you are running an up-to-date browser, your browser should automatically load and use these fonts to display Syriac text in the Serto script. If you encounter any font-related problems, however, try downloading and installing the fonts manually. If Syriac text still doesn't render correctly, try using a different browser or setting your browser font manually.

Unfortunately, the Beth Mardutho fonts are not entirely compatible with Mac OS X or Linux. For this user study, we recommend using Windows XP, Vista, or 7.

Compatible Browsers

Please use a browser that is compatible with CCASH. The most recent versions of the following browsers work well with CCASH:

Firefox
Chrome
Safari

The following browsers do NOT work with CCASH:

Internet Explorer
Opera

Create an account with CCASH

The study is currently closed. <!-- Navigate to http://cash.cs.byu.edu/Ccash and click the button that says “Register”. When you are done registering, you will be sent a verification email message. Open that email message using your favorite mail reader and click on the link to activate your account. Then you will be ready to annotate. -->

Start Annotating

The study is currently closed. <!-- Navigate to http://cash.cs.byu.edu/Ccash and log in with your newly created username and password. You will see a list of all the projects you have been assigned to annotate. For now, that list contains only the user study. Click the button that says “Annotate”. -->

Resources

Resources you may want to use while annotating include:

A reference summary of the training you will receive inside CCASH
- PDF (14 pages)
A Compendious Syriac Dictionary by Robert Payne Smith
- Downloadable copy (approx. 635 pages)
- Online version with basic navigation
Compendious Syriac Grammar by Theodor Nöldeke
- PDF courtesy of CPART (371 pages)
Syriac Verb Tables by David Taylor
- PDF (22 pages)

Results

Good news! A paper analyzing the results of the study has been published in the proceedings of LREC 2012:

First Results in a Study Evaluating Pre-annotation and Correction Propagation for Machine-assisted Syriac Morphological Analysis

More information

For more information about our methodologies, see the Active Learning For Annotation (ALFA) Project page.

Questions?

Please contact Paul Felt (paul DOT lewis DOT felt AT gmail DOT com) if you need further clarification or information about the project.

nlp/syriac-user-study.txt · Last modified: 2015/04/21 16:08 by ryancha
