Ohio Death Records Automated Processing Pipeline Documentation

'''TL;DR'''
If you just want to run the models on some data, you can do so easily with access to the BYU supercomputer and to fslg_handwriting.
== Log Into a Recent Version of Linux ==
ssh into RHEL 7 nodes of Mary Lou by doing:
  
  
This gets around issues of old versions of GLIBC when loading pytorch.
== Load Virtual Environment ==
Dependencies can be taken care of through a conda virtual environment; make sure you have conda installed. Load the virtual environment using:
  
conda activate /fslgroup/fslg_handwriting/compute/death/env/death_env
== Segmenting ==
Run the following command on a directory of images:
  
  
python demo/practice_dir.py configs/e2e_mask_rcnn_R_50_FPN_1x_death_nosfr.yaml <input imgs dir> <output images dir>
== Handwriting Recognition ==
Run the following command on the directory of segmented images output from the last step:
  
  
  
'''Overall Workflow'''

#Load virtual environment
#Preprocess death record images by deskewing
#Label images for segmentation
#Train segmentation model
#Segment death records and pair with cause of death transcriptions
#Train HWR model
#Perform handwriting recognition
#Post-processing

'''0. Preliminaries'''

== Data for Segmentation ==
We use 2 data sources to train a segmenter to identify pertinent lines in the death records. The 2 data sources are:
#Ohio death records
#North Carolina death records

Labelme was used to label the images. We label the images with 2 labels:

#medcert - Medical Certificate of Death portion of death record
#cod - any line of text pertinent to why a person died
These images and annotations can be found on the BYU supercomputer in COCO format at the following path:
  
''/fslgroup/fslg_handwriting/compute/death/data/segmentation/combined_coco''

== Data for Handwriting Recognition ==
We use 3 data sources of transcribed lines of text in order to train a handwriting recognition model for the Ohio death records. The 3 data sources are:
#IAM dataset
  a) 13353 training images
#Ohio death records
  a) 1390 training images
  b) 817 validation images
#North Carolina death records
  a) 7973 training images
  b) 1754 validation images
The Ohio and North Carolina images were obtained by segmenting death records and using the transcriptions collected by student transcribers. These images and annotations can be found in the format for Curtis Wiggington's SFR code on the BYU supercomputer at the following path:
  
''/fslgroup/fslg_handwriting/compute/death/data/transcription/transcribed_iam_combined''
  
We needed to use the IAM dataset for training because of the irregularities present in the death record transcriptions. The main irregularity is whether or not contributory factors are present in the transcription.
'''1. Load Virtual Environment'''
Making sure you have the right software dependencies is awful. To ease this, we provide a conda virtual environment with all the appropriate software dependencies. Please don't install any additional modules while using this environment or you might break software dependencies. Load the environment with the following command:
  
''conda activate /fslgroup/fslg_handwriting/compute/death/env/death_env''

'''2. Preprocess Death Records'''
All images should be deskewed first. We used ImageMagick to accomplish this. Deskewing is important because it makes labeling lines of text for creating segmentation ground truth significantly easier. The following command will deskew an image:
  
''convert <path to image> -deskew 80% <save path for deskewed image>''
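To deskew a whole directory, a small wrapper around the same command can help. This is a minimal sketch, assuming the inputs are .jpg files and that `convert` is on PATH; adjust the glob and paths for your data.

```python
# Batch-deskew a directory by shelling out to ImageMagick's convert.
# Assumptions: .jpg inputs, `convert` on PATH (adjust for your setup).
import subprocess
from pathlib import Path

def deskew_cmd(src, dst, threshold="80%"):
    # Same invocation as the single-image command above
    return ["convert", str(src), "-deskew", threshold, str(dst)]

def deskew_dir(src_dir, dst_dir):
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for img in sorted(Path(src_dir).glob("*.jpg")):
        subprocess.run(deskew_cmd(img, dst / img.name), check=True)
```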
'''3. Label Images for Segmentation'''
Images are labeled using labelme. We label both the
  
  
The <labels text file> can be replaced with /fslgroup/fslg_handwriting/compute/death/data/segmentation/labels_nosfr.txt
'''4. Train Segmentation Model'''
The Facebook MaskRCNN model is used for segmenting single lines from death records. We use maskrcnn-benchmark, a pytorch implementation made by Facebook. We used the pretrained ResNet-50 architecture during training. We find that this model provides very good performance (0.969 IoU=0.50:0.95) with minimal training time (<15 hours). We believe that the ResNet architecture works better than Start-Follow-Read's Start-of-Line finder for this task because it has a larger receptive field and we are only interested in a small number of lines of text instead of every line of text.
  
  
The config file configs/e2e_mask_rcnn_R_50_FPN_1x_death_nosfr.yaml contains settings for training the model. We note that we allow a maximum image size of 3000 pixels. Most images should remain about the same size during training and segmentation.
'''5. Segment Images'''
Run the following command on a directory of images:
  
  
python demo/practice_dir.py configs/e2e_mask_rcnn_R_50_FPN_1x_death_nosfr.yaml <input imgs dir> <output images dir>
'''6. Train Handwriting Recognition Model'''
We use just the handwriting module from Start-Follow-Read.
'''7. Perform Handwriting Recognition'''
Run the following command on the directory of segmented images output from the last step:
  
  
Results will be output to stdout.
'''8. Post-processing'''

Spell checker/text correction/normalization techniques needed
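As one possible direction, noisy HWR output could be snapped to a known cause-of-death vocabulary by string similarity. This is a minimal sketch using Python's difflib; the vocabulary below is a tiny illustrative stand-in, not an actual medical term list.

```python
# Sketch: snap noisy HWR words to a known vocabulary with difflib.
# VOCAB is an illustrative stand-in; a real run would load a medical term list.
import difflib

VOCAB = ["pneumonia", "bronchitis", "tuberculosis", "peritonitis", "cardiac"]

def correct_word(word, vocab=VOCAB, cutoff=0.7):
    """Return the closest vocabulary term, or the word unchanged."""
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def correct_line(line):
    return " ".join(correct_word(w) for w in line.split())
```

The cutoff trades recall against false corrections; too low a cutoff will rewrite unrelated words, so it would need tuning on transcribed validation data.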
  
'''Deprecated'''
The following information is kept for historical reasons to record previous attempts at automated document processing of the Ohio death records.
  
This document contains instructions and insights into the machine learning pipeline used to automatically process the Ohio death records. This includes tasks such as:
#Scraping data from Family Search
#Data labeling
#Segmentation
#Handwriting recognition
#Cause of death to ICD code
  
Our approach is to use machine learning as a way to automatically extract regions of interest from the death records for use in text (machine printed and handwritten) recognition. First a labeled training dataset is created, then a model is trained. Unlabeled images can then be processed with the trained model. After regions are extracted, we then use text recognition trained on hand-made transcriptions to automatically transcribe the cause of death. Next the transcribed cause of death is mapped to an ICD code to bin the different causes of death.
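The data flow just described amounts to three stages. The sketch below only illustrates that flow; every function is a stand-in, not the project's real API, and the ICD lookup table is a hypothetical one-entry example.

```python
# Illustrative data flow only: segment -> recognize -> map to ICD code.
# All three stage functions are stand-ins for the real models.

def segment(image):
    # stand-in: treat the whole image as one cause-of-death region
    return [image]

def recognize(region):
    # stand-in for the handwriting recognition model
    return region

def to_icd(text):
    # stand-in lookup from transcribed cause of death to an ICD code
    table = {"pneumonia": "J18"}  # hypothetical single-entry table
    return table.get(text, "unmapped")

def process_record(image):
    regions = segment(image)                  # 1. extract regions of interest
    texts = [recognize(r) for r in regions]   # 2. transcribe each region
    return [to_icd(t) for t in texts]         # 3. bin by ICD code
```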
'''Scraping Data'''
We have scraped many thousands of death records from Family Search's website while we wait for them to give them all to us. This is accomplished using Selenium and 34,530 death record IDs that were hand scraped from Family Search's website. The Selenium script creates a session with Family Search and attempts to download images given a death record ID. Family Search will block access after an unknown number of queries. The block is released after about an hour. If the script is blocked during download, it sleeps for 60 minutes, creates a new session and begins downloading again.
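The block-and-retry behavior described above amounts to a simple loop. In this sketch, `download_record` and `make_session` stand in for the Selenium logic, and `BlockedError` is a hypothetical signal raised when a download is refused.

```python
# Sketch of the scrape-with-backoff loop. The callables and BlockedError
# are stand-ins for the real Selenium session logic.
import time

class BlockedError(Exception):
    """Hypothetical signal: the site refused further downloads."""

def scrape_all(record_ids, download_record, make_session, wait_minutes=60):
    session = make_session()
    remaining = list(record_ids)
    while remaining:
        try:
            download_record(session, remaining[0])
            remaining.pop(0)          # success: move on to the next ID
        except BlockedError:
            # Blocked: sleep out the block, then start a fresh session
            time.sleep(wait_minutes * 60)
            session = make_session()
```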
'''Data Labeling'''
Labeling death record forms is accomplished by using the application labelme. It can be downloaded at:
  
https://github.com/wkentaro/labelme
'''Data Formatting'''
Once data is labeled it must be converted from the labelme format to the commonly used COCO dataset format. labelme provides a conversion script but it does not work for various reasons outlined below. In the meantime, a custom script has been written which accomplishes the task:
  
  
The custom script is necessary because the Mask-RCNN implementation we use (maskrcnn-benchmark) requires segmentation information recorded in polygon format for training. Segmentation data stored as a run-length encoding (RLE) cannot currently be used for training. The supplied conversion script from labelme uses RLE; the custom script uses polygons.
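As an illustration of the polygon requirement, converting one labelme shape to a COCO-style annotation looks roughly like this. The category ids are assumptions, and a real converter also builds the images and categories sections of the COCO file.

```python
# Sketch: one labelme shape -> one COCO-style annotation with a polygon
# (not RLE) segmentation. LABEL_IDS is an assumed category mapping.

LABEL_IDS = {"medcert": 1, "cod": 2}

def labelme_shape_to_coco(shape, image_id, ann_id):
    # labelme stores points as [[x, y], ...]; COCO wants one flat list
    flat = [coord for point in shape["points"] for coord in point]
    xs, ys = flat[0::2], flat[1::2]
    x, y = min(xs), min(ys)
    w, h = max(xs) - x, max(ys) - y
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": LABEL_IDS[shape["label"]],
        "segmentation": [flat],  # polygon format, usable for training
        "bbox": [x, y, w, h],
        "area": w * h,           # rough area from the bounding box
        "iscrowd": 0,
    }
```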
'''Segmentation'''
We utilize Facebook AI Research's Mask-RCNN implementation provided here:
  
  
maskrcnn_benchmark/configs/e2e_mask_rcnn_R_50_FPN_1x.yaml
'''Training'''
We have found that the ResNet-50 network trains quickly and has low memory usage while providing excellent results. An example training command is:
  
python tools/train_net.py --config-file configs/e2e_mask_rcnn_R_50_FPN_1x_death.yaml SOLVER.IMS_PER_BATCH 4 SOLVER.MAX_ITER 10000 TEST.IMS_PER_BATCH 4
'''Inference/Segment Extraction'''
The inference process is accomplished using maskrcnn-benchmark's prediction code snippet in their README.md.
'''Preliminary Segmentation Results'''
 ~1 hour training, 294 training images (20190227) ~1 hour training, 294 training images (20190227)
With minimal training data we are able to achieve good baseline results. Greater diversity in death record formats would improve results significantly. Below are results from images that the segmenter has never seen.
  
  
'''Handwriting Recognition'''
Handwriting recognition is accomplished using Start-Follow-Read (SFR) by Curtis Wiggington. SFR is composed of three components: start-of-line (SOL) detector, line follower (LF) and handwriting recognition (HWR).
'''Start-of-Line Detection'''
We replace the provided SOL detector with our own based on MaskRCNN. We originally tried using the provided SOL detector but found that it had difficulties identifying the lines of text we were interested in. We believe that this is due to it being based on the shallow VGG11 network architecture, which works when identifying every line of text but does not capture enough context to identify specific lines of text.
  
  
When labeling images, we label the MCD and then highlight any line of text that we're interested in. MaskRCNN performs instance segmentation to identify all SOLs of text that we're interested in.
'''Line Follower'''
We use the built-in LF module that was trained on the ICDAR2017 READ dataset. This module performs adequately but would benefit from additional fine-tuning.
'''Handwriting Recognition'''
We use the built-in HWR module that was trained on the ICDAR2017 READ dataset. Performance is quite poor because it is trained using a German dataset. We believe that providing an English language model may be enough to correct the predicted text. If necessary, we will retrain the HWR model using the transcriptions provided from the COD records.
'''Language Model'''
A language model is necessary for quality text recognition. SFR's pretrained model was trained on German data. We use INSERT CORPUS as our corpus to build a language model for English that is specialized for medical vocabulary.
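For reference, even a tiny word-bigram model captures the idea of rescoring HWR hypotheses toward likely word orders. The training sentences below are placeholders, since the actual corpus is unspecified here.

```python
# Sketch: add-alpha smoothed word-bigram language model for rescoring.
# Training sentences are placeholders for the real (unspecified) corpus.
import math
from collections import Counter

def train_bigrams(sentences):
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.lower().split()   # <s> marks sentence start
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams, alpha=1.0):
    """Add-alpha smoothed log probability of a sentence."""
    words = ["<s>"] + sentence.lower().split()
    vocab = len(unigrams)
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        total += math.log(
            (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab)
        )
    return total
```

A higher-scoring hypothesis under such a model would be preferred among the recognizer's candidate transcriptions.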
'''Nuances of Training SFR'''
The sample config files provided by the SFR repo for building the run environment and for running the actual models may not be optimized for your setup. I modified the environment to use pytorch==0.3.1 (error in config file). Updating to the most current version of opencv is also recommended to avoid a bug in the repo-specified version. I also modified the run config by upping the batch size and changing the training scripts' dataloader objects to use 16 worker threads. This reduced training time from 800+ seconds to ~10 seconds.
  
Eventually, we updated the code to work with pytorch==1.0 so that we could use it in conjunction with the MaskRCNN library we have.
'''Current Results'''
The MaskRCNN model was trained using 625 images for about 2 hours.
  
'''Cause of Death to ICD Code'''
Labeled training data for this comes from
'''Questions/Future Approaches'''

#Do we need to segment the death record first to get the “medical certificate of death”?
  a) It is unclear how much segmenting the image first actually helps
#How should the images be normalized?
  a) There is a great deal of noise in the images. This may be caused by wear to the microfiche they were stored on. There may be a way to normalize the images in a way that helps to reduce these issues.
#Could we use active learning to reduce the number of labeled samples required for processing?
#Start-Follow-Read could be enhanced with a more powerful underlying network
  a) Using resnet instead of vgg may provide benefits to SFR and could increase the performance of the start-of-line finder when only certain lines are of interest instead of all lines being of interest
  
handwriting/deathrecords.txt · Last modified: 2019/09/06 18:36 by 10.37.241.82