TL;DR
If you just want to run the models on some data, you can do so easily with access to the BYU supercomputer and to the fslg_handwriting group.
SSH into the RHEL 7 nodes of Mary Lou with:
ssh rhel7ssh.rc.byu.edu
This gets around issues with old versions of GLIBC when loading PyTorch.
Dependencies are taken care of through a conda virtual environment; make sure you have conda installed. Load the virtual environment using:
conda activate /fslgroup/fslg_handwriting/compute/death/env/death_env
Run the following command on a directory of images:
cd /fslgroup/fslg_handwriting/compute/death/software/maskrcnn-benchmark/
python demo/practice_dir.py configs/e2e_mask_rcnn_R_50_FPN_1x_death_nosfr.yaml <input imgs dir> <output images dir>
Run the following command on the directory of segmented images output from the last step:
cd /fslgroup/fslg_handwriting/compute/death/software/start_follow_read
python hw_pred.py sample_config_iam_hwr.yaml <segmented images dir>
Results will be output to stdout.
Overall Workflow
0. Preliminaries
We use 2 data sources to train a segmenter to identify pertinent lines in the death records. The 2 data sources are:
Labelme was used to label the images. We label the images with 2 labels: the 'Medical Certificate of Death' (MCD) region and the lines of text of interest.
These images and annotations can be found on the BYU supercomputer in COCO format at the following path:
/fslgroup/fslg_handwriting/compute/death/data/segmentation/combined_coco
We use 3 data sources of transcribed lines of text in order to train a handwriting recognition model for the Ohio death records. The 3 data sources are:
1. a) 13,353 training images
2. a) 1,390 training images b) 817 validation images
3. a) 7,973 training images b) 1,754 validation images
The Ohio and North Carolina images were obtained by segmenting death records and using the transcriptions collected by student transcribers. These images and annotations can be found, in the format expected by Curtis Wigington's SFR code, on the BYU supercomputer at the following path:
/fslgroup/fslg_handwriting/compute/death/data/transcription/transcribed_iam_combined
We needed to use the IAM dataset for training because of the irregularities present in the death record transcriptions. The main irregularity is whether or not contributory factors are present in the transcription.
1. Load Virtual Environment
Making sure you have the right software dependencies is awful. To ease this, we provide a conda virtual environment with all the appropriate software dependencies already installed. Please don't install any additional modules while using this environment or you might break its dependencies. Load the environment with the following command:
conda activate /fslgroup/fslg_handwriting/compute/death/env/death_env
2. Preprocess Death Records
All images should be deskewed first. We used ImageMagick to accomplish this. Deskewing is important because it makes labeling lines of text for creating segmentation ground truth significantly easier. The following command will deskew an image:
convert <path to image> -deskew 80% <save path for deskewed image>
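To deskew a whole directory, a loop like the following works. This is a minimal sketch, assuming JPEG inputs, ImageMagick's convert on the PATH, and hypothetical directory names:

# Sketch: batch-deskew a directory of images with ImageMagick's `convert`.
# The directory names and the *.jpg extension are examples only.
import subprocess
from pathlib import Path

in_dir = Path("raw_images")        # hypothetical input directory
out_dir = Path("deskewed_images")  # hypothetical output directory
out_dir.mkdir(exist_ok=True)

for img in sorted(in_dir.glob("*.jpg")):
    subprocess.run(
        ["convert", str(img), "-deskew", "80%", str(out_dir / img.name)],
        check=True,
    )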
3. Label Images for Segmentation
Images are labeled using labelme. We label both the 'Medical Certificate of Death' (MCD) region and each line of text we are interested in.
After labeling images, format the dataset in COCO format by running:
cd /fslgroup/fslg_handwriting/compute/death/software/labelme
python examples/instance_segmentation/labelme2coco_nocrowd_instance_all_cod.py --labels <labels text file> <input images dir> <output images dir>
The <labels text file> can be replaced with /fslgroup/fslg_handwriting/compute/death/data/segmentation/labels_nosfr.txt
4. Train Segmentation Model
The Facebook MaskRCNN model is used for segmenting single lines from death records. We use maskrcnn-benchmark, a PyTorch implementation from Facebook. We used the pretrained ResNet-50 architecture during training. We find that this model provides very good performance (0.969 AP at IoU=0.50:0.95) with minimal training time (<15 hours). We believe that the ResNet architecture works better than Start-Follow-Read's start-of-line finder for this task because it has a larger receptive field and we are only interested in a small number of lines of text instead of every line of text.
We use a modified version of the maskrcnn-benchmark library that does not flip images or do random crops.
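For reference, the change amounts to zeroing the flip probability in build_transforms (maskrcnn_benchmark/data/transforms/build.py). The sketch below approximates that file's structure rather than reproducing our exact diff:

# Sketch of disabling horizontal flips in maskrcnn-benchmark's
# build_transforms (approximate structure, not our exact modification).
from maskrcnn_benchmark.data.transforms import transforms as T

def build_transforms(cfg, is_train=True):
    min_size = cfg.INPUT.MIN_SIZE_TRAIN if is_train else cfg.INPUT.MIN_SIZE_TEST
    max_size = cfg.INPUT.MAX_SIZE_TRAIN if is_train else cfg.INPUT.MAX_SIZE_TEST
    flip_prob = 0.0  # stock code uses 0.5 when training; death records must not be mirrored

    normalize_transform = T.Normalize(
        mean=cfg.INPUT.PIXEL_MEAN, std=cfg.INPUT.PIXEL_STD, to_bgr255=cfg.INPUT.TO_BGR255
    )
    return T.Compose([
        T.Resize(min_size, max_size),
        T.RandomHorizontalFlip(flip_prob),
        T.ToTensor(),
        normalize_transform,
    ])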
To train the model, run the following command:
cd /fslgroup/fslg_handwriting/compute/death/software/maskrcnn-benchmark/
python tools/train_net.py --config-file configs/e2e_mask_rcnn_R_50_FPN_1x_death_nosfr.yaml SOLVER.IMS_PER_BATCH 2 SOLVER.MAX_ITER 100000 TEST.IMS_PER_BATCH 2
The config file configs/e2e_mask_rcnn_R_50_FPN_1x_death_nosfr.yaml contains settings for training the model. We note that we allow a maximum image size of 3000 pixels. Most images should remain about the same size during training and segmentation.
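The relevant settings live in the INPUT section of the config, approximately as follows (the exact keys and values in our config may differ):

# Illustrative INPUT overrides for a 3000-pixel maximum image size;
# check configs/e2e_mask_rcnn_R_50_FPN_1x_death_nosfr.yaml for the real values.
INPUT:
  MAX_SIZE_TRAIN: 3000
  MAX_SIZE_TEST: 3000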
5. Segment Images
Run the following command on a directory of images:
cd /fslgroup/fslg_handwriting/compute/death/software/maskrcnn-benchmark/
python demo/practice_dir.py configs/e2e_mask_rcnn_R_50_FPN_1x_death_nosfr.yaml <input imgs dir> <output images dir>
6. Train Handwriting Recognition Model
We use just the handwriting recognition (HWR) module from the Start-Follow-Read (SFR) pipeline.
7. Perform Handwriting Recognition
Run the following command on the directory of segmented images output from the last step:
cd /fslgroup/fslg_handwriting/compute/death/software/start_follow_read
python hw_pred.py sample_config_iam_hwr.yaml <segmented images dir>
Results will be output to stdout.
8. Post-processing
Spell checking, text correction, and normalization techniques are still needed.
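One possible direction is fuzzy-matching predicted tokens against a medical vocabulary. A minimal sketch, where VOCAB is a placeholder for a real medical term list:

# Sketch: normalize predicted cause-of-death tokens against a fixed
# vocabulary using difflib's similarity matching. VOCAB is a stand-in.
import difflib

VOCAB = ["pneumonia", "tuberculosis", "myocarditis", "nephritis"]  # placeholder list

def normalize_token(token, cutoff=0.8):
    """Return the closest vocabulary word, or the token unchanged."""
    matches = difflib.get_close_matches(token.lower(), VOCAB, n=1, cutoff=cutoff)
    return matches[0] if matches else token

def normalize_line(line):
    return " ".join(normalize_token(t) for t in line.split())

print(normalize_line("Lobar Pneumonla"))  # -> "Lobar pneumonia"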
Deprecated
The following information is kept for historical reasons to document previous attempts at automated processing of the Ohio death records.
This document contains instructions and insights into the machine learning pipeline used to automatically process the Ohio death records. This includes tasks such as:
Our approach is to use machine learning as a way to automatically extract regions of interest from the death records for use in text (machine-printed and handwritten) recognition. First a labeled training dataset is created, then a model is trained. Unlabeled images can then be processed with the trained model. After regions are extracted, we use text recognition trained on manual transcriptions to automatically transcribe the cause of death. Next, the transcribed cause of death is mapped to an ICD code to bin the different causes of death.
Scraping Data
We have scraped many thousands of death records from FamilySearch's website while we wait for them to give them all to us. This is accomplished using Selenium and 34,530 death record IDs that were hand-collected from FamilySearch's website. The Selenium script creates a session with FamilySearch and attempts to download images given a death record ID. FamilySearch will block access after an unknown number of queries. The block is released after about an hour. If the script is blocked during download, it sleeps for 60 minutes, creates a new session, and begins downloading again.
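In outline, the download loop looks like the sketch below. It is only illustrative: the URL is a placeholder, and the real script drives a browser via Selenium rather than using requests directly:

# Sketch of the retry-on-block download loop; URL and IDs are placeholders.
import time
import requests  # the real script uses Selenium; requests keeps this sketch short

RECORD_IDS = ["12345", "67890"]  # stand-ins for the 34,530 hand-collected IDs

def download_all(ids):
    session = requests.Session()
    for record_id in ids:
        while True:
            resp = session.get(f"https://example.org/records/{record_id}")  # placeholder URL
            if resp.status_code == 200:
                with open(f"{record_id}.jpg", "wb") as f:
                    f.write(resp.content)
                break
            # Blocked: sleep ~60 minutes, start a fresh session, retry.
            time.sleep(60 * 60)
            session = requests.Session()

download_all(RECORD_IDS)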
Data Labeling
Labeling death record forms is accomplished by using the application labelme. It can be downloaded at:
https://github.com/wkentaro/labelme
Data Formatting
Once data is labeled it must be converted from the labelme format to the commonly used COCO dataset format. labelme provides a conversion script but it does not work for various reasons outlined below. In the meantime, a custom script has been written which accomplishes the task:
An example of this command is:
python labelme2coco_nocrowd.py --labels labels.txt ohio_death_images_combined/ ohio_coco
The custom script is necessary because of the data format: the Mask-RCNN implementation we use (maskrcnn-benchmark) requires segmentation information recorded in polygon format for training. Segmentation data stored as a run-length encoding (RLE) cannot currently be used for training. The supplied conversion script from labelme uses RLE; the custom script uses polygons.
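The core of the fix is small: COCO polygons are flat [x1, y1, x2, y2, ...] lists, while labelme shapes store [[x1, y1], [x2, y2], ...] point pairs. A sketch of the conversion (the label name and points below are made up):

# Sketch: flatten a labelme polygon shape into a COCO-style segmentation list.
def labelme_shape_to_coco_polygon(shape):
    points = shape["points"]                 # [[x, y], ...] from the labelme JSON
    flat = [coord for point in points for coord in point]
    return [flat]                            # COCO expects a list of polygons

shape = {"label": "cause_of_death", "points": [[10, 20], [110, 20], [110, 60], [10, 60]]}
print(labelme_shape_to_coco_polygon(shape))  # [[10, 20, 110, 20, 110, 60, 10, 60]]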
Segmentation
We utilize Facebook AI Research’s Mask-RCNN implementation provided here:
https://github.com/facebookresearch/maskrcnn-benchmark
We find it to be flexible and powerful. It trains faster than Detectron and is more flexible than Matterport's TensorFlow Mask-RCNN implementation.
In order to use this network, your data should be formatted into the COCO dataset format, with configuration information noted in the repository's catalog script here:
maskrcnn_benchmark/config/paths_catalog.py
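Registering a dataset amounts to adding an entry to the DatasetCatalog in that file. The entry below is illustrative; the "death_records_*" names and paths are placeholders, not the exact entries we use:

# Approximate shape of dataset entries in maskrcnn_benchmark/config/paths_catalog.py.
class DatasetCatalog(object):
    DATA_DIR = "datasets"
    DATASETS = {
        "death_records_train": {
            "img_dir": "death/train_images",
            "ann_file": "death/annotations/instances_train.json",
        },
        "death_records_val": {
            "img_dir": "death/val_images",
            "ann_file": "death/annotations/instances_val.json",
        },
    }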
A configuration script should also be generated for training. We suggest using the ResNet-50 config file that is provided:
maskrcnn_benchmark/configs/e2e_mask_rcnn_R_50_FPN_1x.yaml
Training
We have found that the ResNet-50 network trains quickly and has low memory usage while providing excellent results. An example training command is:
python tools/train_net.py --config-file configs/e2e_mask_rcnn_R_50_FPN_1x_death.yaml SOLVER.IMS_PER_BATCH 4 SOLVER.MAX_ITER 10000 TEST.IMS_PER_BATCH 4
Inference/Segment Extraction
The inference process is accomplished using maskrcnn-benchmark's prediction code snippet from its README.md.
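The snippet is essentially the following (the config and image paths are illustrative; run it from the repo's demo/ directory so predictor.py is importable):

# Inference sketch adapted from the maskrcnn-benchmark README/demo.
import cv2
from maskrcnn_benchmark.config import cfg
from predictor import COCODemo  # lives in the repo's demo/ directory

cfg.merge_from_file("configs/e2e_mask_rcnn_R_50_FPN_1x_death.yaml")  # illustrative path

coco_demo = COCODemo(cfg, min_image_size=800, confidence_threshold=0.7)

image = cv2.imread("record.jpg")                    # placeholder image path
predictions = coco_demo.run_on_opencv_image(image)  # image with predicted masks drawn
cv2.imwrite("record_segmented.jpg", predictions)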
Preliminary Segmentation Results
~1 hour training, 294 training images (2019-02-27). With minimal training data we are able to achieve good baseline results. Greater diversity in death record formats would improve results significantly. Below are results from images that the segmenter has never seen.
Handwriting Recognition
Handwriting recognition is accomplished using Start-Follow-Read (SFR) by Curtis Wigington. SFR is composed of three components: a start-of-line (SOL) detector, a line follower (LF), and handwriting recognition (HWR).
Start-of-Line Detection
We replace the provided SOL detector with our own based on MaskRCNN. We originally tried using the provided SOL detector but found that it had difficulties identifying the lines of text we were interested in. We believe this is due to it being based on the shallow VGG-11 network architecture, which works when identifying every line of text but does not capture enough context to identify specific lines of text.
We use MaskRCNN with a ResNet-50 backbone and find that this much deeper network is able to identify the lines of interest very well. The MaskRCNN model is the same model that was used in identifying the 'Medical Certificate of Death' (MCD) region mentioned previously. In this approach, we process the entire image (no excising of specific portions from the image) to retain contextual information.
When labeling images, we label the MCD and then highlight any line of text that we’re interested in. MaskRCNN performs instance segmentation to identify all SOLs of text that we’re interested in.
Line Follower
We use the built-in LF module that was trained on the ICDAR 2017 READ dataset. This module performs adequately but would benefit from additional fine-tuning.
Handwriting Recognition
We use the built-in HWR module that was trained on the ICDAR 2017 READ dataset. Performance is quite poor because it was trained on a German dataset. We believe that providing an English language model may be enough to correct the predicted text. If necessary, we will retrain the HWR model using the transcriptions provided from the COD records.
Language Model
A language model is necessary for quality text recognition. SFR's pretrained model was trained on German data. We use INSERT CORPUS as our corpus to build a language model for English that is specialized for medical vocabulary.
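Once a corpus is chosen, even a simple smoothed bigram model can rescore HWR candidates. A minimal sketch; the corpus filename and candidate strings are placeholders:

# Minimal word-bigram language model sketch for rescoring HWR output.
import math
from collections import Counter

text = open("medical_corpus.txt").read().lower().split()  # placeholder corpus file
unigrams = Counter(text)
bigrams = Counter(zip(text, text[1:]))

def log_prob(sentence, alpha=1.0):
    """Additively smoothed bigram log-probability of a candidate transcription."""
    words = sentence.lower().split()
    vocab = len(unigrams)
    score = 0.0
    for prev, word in zip(words, words[1:]):
        num = bigrams[(prev, word)] + alpha
        den = unigrams[prev] + alpha * vocab
        score += math.log(num / den)
    return score

# Pick the candidate the language model prefers.
candidates = ["lobar pneumonia", "lobar pneumonla"]
print(max(candidates, key=log_prob))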
Nuances of Training SFR
The sample config files provided by the SFR repo for building the run environment and for running the actual models may not be optimized for your setup. I modified the environment to use pytorch==0.3.1 (the version pinned in the config file is wrong). Updating to the most current version of opencv is also recommended to avoid a bug in the repo-specified version. I also modified the run config by increasing the batch size and changing the dataloader objects in the training scripts to use 16 worker threads. This reduced training time from 800+ seconds to ~10 seconds.
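The dataloader change amounts to constructing the DataLoader with more workers. A minimal runnable sketch, with a stand-in dataset and an illustrative batch size:

# Sketch: the relevant change is just the DataLoader construction in the
# SFR training scripts; the dataset and batch size here are stand-ins.
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    train_dataset = TensorDataset(torch.randn(256, 1, 32, 128))  # stand-in for SFR's HWR dataset
    train_loader = DataLoader(
        train_dataset,
        batch_size=16,    # illustrative; we raised it from the repo default
        shuffle=True,
        num_workers=16,   # 16 worker threads for data loading
    )
    for (batch,) in train_loader:
        pass  # training step goes here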
Eventually, we updated the code to work with pytorch==1.0 so that we could use it in conjunction with the MaskRCNN library we have.
Current Results
The MaskRCNN model was trained using 625 images for about 2 hours.
Cause of Death to ICD Code
Labeled training data for this task comes from
Questions/Future Approaches
a) It is unclear how much segmenting the image first actually helps.
b) There is a great deal of noise in the images. This may be caused by wear to the microfiche they were stored on. There may be a way to normalize the images in a way that helps to reduce these issues.
c) Using ResNet instead of VGG may provide benefits to SFR and could increase the performance of the start-of-line finder when only certain lines are of interest instead of all lines.