Imputing missing values





The objective of this project is to find an application for unsupervised backpropagation. We have decided to use it to impute missing values in a data set.

Goal: Get it into NIPS

Action Items 4/15/2011

  • Proof of concept - artificially modify a data set and compare with the original (Mike G)
  • Proof of concept with other imputation methods (Mike G.)
    • Mean
    • Median
    • Random
    • kNN
    • K-Means
  • Related works to compare against (Richard)
  • Find data sets/artificially remove values (Mike G)
  • Train other LAs on the imputed data (Mike S)
  • Annotated citations (All)
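The proof-of-concept steps above (artificially removing values, then filling them in with the baseline methods) could be sketched roughly as follows. This is only a minimal illustration, assuming numeric data stored as a list of rows and `None` as the missing-value marker; the function names are hypothetical and are not taken from the project code.

```python
import random
import statistics

MISSING = None  # assumed marker for a removed value


def remove_values(rows, fraction, seed=0):
    """Return a damaged copy of rows plus the list of (row, col) cells removed.

    Cells are chosen uniformly at random, independent of their values.
    """
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    removed = set(rng.sample(cells, int(round(fraction * len(cells)))))
    damaged = [[MISSING if (i, j) in removed else v for j, v in enumerate(row)]
               for i, row in enumerate(rows)]
    return damaged, sorted(removed)


def impute_column_stat(rows, stat=statistics.mean):
    """Fill each missing cell with a statistic (mean, median, ...) of its column."""
    n_cols = len(rows[0])
    fill = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not MISSING]
        # Fall back to 0.0 if a column has no observed values at all.
        fill.append(stat(observed) if observed else 0.0)
    return [[fill[j] if v is MISSING else v for j, v in enumerate(row)]
            for row in rows]


def impute_random(rows, seed=0):
    """Fill each missing cell with a random observed value from the same column."""
    rng = random.Random(seed)
    n_cols = len(rows[0])
    observed = [[row[j] for row in rows if row[j] is not MISSING]
                for j in range(n_cols)]
    return [[(rng.choice(observed[j]) if observed[j] else 0.0)
             if v is MISSING else v
             for j, v in enumerate(row)] for row in rows]
```

Passing `statistics.median` as `stat` gives the median baseline. kNN and K-Means would need a distance measure that copes with missing cells, so they are left out of this sketch.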

We also decided that we are going to do two things:

  • Show that it gets closer than other methods via RMSE
  • Show that it increases classification accuracy compared with leaving the missing values in place and with using other imputation methods.
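For the RMSE comparison, one reasonable choice is to score only the cells that were artificially removed, since the untouched cells are identical in both tables. A minimal sketch, assuming numeric tables stored as lists of rows and a list of removed `(row, column)` positions (the function name is hypothetical):

```python
import math


def rmse_on_removed(original, imputed, removed_cells):
    """RMSE between original and imputed values over the removed cells only."""
    if not removed_cells:
        raise ValueError("no removed cells to score")
    squared = [(original[i][j] - imputed[i][j]) ** 2 for i, j in removed_cells]
    return math.sqrt(sum(squared) / len(squared))
```

Computing this for each imputation method on the same set of removed cells gives directly comparable numbers.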

Action Items 4/22/2011

  • Implement or get implementations of other methods (Richard)
  • Get Condor on other computers (Richard)
  • Start writing the paper (pending)
  • Throw data set comparison on the cluster (Mike S)

Action Items 4/29/2011

  • Condor computers (Richard)
  • Automating KEEL (Richard)
  • Algorithm description (Mike G.)
  • Fix error function (Mike S.)
  • Throw it on the cluster (Mike S.)

Action Items 5/6/2011

  • Redo SVN (Richard)
  • Find state of the art (Richard)
  • Are the results from the paper usable? (Richard)
  • Look into KEEL (Mike S)
  • Re-run NI and calculate Errors (Mike G)
  • Would it be better to not put missing values in the class as far as comparing ourselves with other people?

To Do List

Mike Smith's code ToDo list:

  • Add options for the NI filter.
  • Make the code accept data that contains missing values and calculate (impute) those values

Experiments to run:

  • Data set comparison
    • data sets to use:
      • teachingAssistant
      • Sonar
      • Vowel
      • mnist
      • hypothyroid
    • % of missing values
    • Hidden nodes (0, 2, 4)
    • # of intrinsic dimensions (1, 2, 3) (never have more intrinsic dims than number of hidden nodes, unless 0 hidden nodes)
    • random seed
  • Classification Accuracy
    • Other imputation methods
      • Baseline
      • K-NN
      • Matrix Factorization
      • K-Means
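The parameter sweep for the data set comparison could be enumerated along these lines. The data set names, hidden-node counts, and intrinsic-dimension counts come from the list above; the missing-value fractions and random seeds are not specified there, so they are left as arguments. The function and key names are hypothetical.

```python
from itertools import product

DATA_SETS = ["teachingAssistant", "Sonar", "Vowel", "mnist", "hypothyroid"]
HIDDEN_NODES = [0, 2, 4]
INTRINSIC_DIMS = [1, 2, 3]


def is_valid(hidden_nodes, intrinsic_dims):
    """Never more intrinsic dims than hidden nodes, unless there are 0 hidden nodes."""
    return hidden_nodes == 0 or intrinsic_dims <= hidden_nodes


def experiment_grid(missing_fractions, seeds):
    """Yield one configuration dict per valid combination of settings."""
    for name, fraction, hidden, dims, seed in product(
            DATA_SETS, missing_fractions, HIDDEN_NODES, INTRINSIC_DIMS, seeds):
        if is_valid(hidden, dims):
            yield {
                "data_set": name,
                "missing_fraction": fraction,
                "hidden_nodes": hidden,
                "intrinsic_dims": dims,
                "seed": seed,
            }
```

With this constraint there are 8 valid (hidden nodes, intrinsic dims) pairs per data set, missing fraction, and seed.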

Change the code to convert nominal values back and use Hamming distance for the error on nominal attributes.
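One possible reading of this change, as a sketch only: it assumes each nominal attribute is represented internally by one score per category, so "converting back" means picking the highest-scoring category, and the Hamming error is the fraction of nominal values that do not match. The actual encoding in the project code may differ, and both function names are hypothetical.

```python
def decode_nominal(scores, categories):
    """Map per-category scores back to the category with the highest score."""
    best = max(range(len(scores)), key=lambda k: scores[k])
    return categories[best]


def hamming_error(true_values, predicted_values):
    """Fraction of positions where the predicted nominal value is wrong."""
    if len(true_values) != len(predicted_values):
        raise ValueError("value lists must have the same length")
    if not true_values:
        raise ValueError("no values to compare")
    mismatches = sum(t != p for t, p in zip(true_values, predicted_values))
    return mismatches / len(true_values)
```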

Related Works

  • Here is a link to a number of different imputation strategies including articles.
  • Here is a link to a website discussing different imputation strategies. This is less of a list and more of a discussion and how, what, and why.
  • Multiple Imputation ( link )
    • Basically using an imputation technique to create m versions of the imputed value. The final value that is used is a combination of the m estimated values (generally averaged).
  • Using a trained LA to predict the missing attribute values (I didn't find an original citation).
 author = {Farhangfar, Alireza and Kurgan, Lukasz and Dy, Jennifer},
 title = {Impact of imputation of missing values on classification error for discrete data},
 journal = {Pattern Recognition},
 volume = {41},
 issue = {12},
 month = {December},
 year = {2008},
 issn = {0031-3203},
 pages = {3692--3705},
 numpages = {14},
 url = {},
 doi = {10.1016/j.patcog.2008.05.019},
 acmid = {1405217},
 publisher = {Elsevier Science Inc.},
 address = {New York, NY, USA},
 keywords = {Classification, Imputation of missing values, Missing values, Multiple imputations, Single imputation},

In addition to the trained LAs, they also compare other methods. I have not yet read the article.

  • Clustering based
 author = {Zhang, Shichao and Zhang, Jilian and Zhu, Xiaofeng and Qin, Yongsong and Zhang, Chengqi},
 chapter = {Missing value imputation based on data clustering},
 title = {Transactions on computational science I},
 editor = {Gavrilova, Marina L. and Tan, C. J. Kenneth},
 year = {2008},
 isbn = {3-540-79298-8, 978-3-540-79298-7},
 pages = {128--138},
 numpages = {11},
 url = {},
 acmid = {1805829},
 publisher = {Springer-Verlag},
 address = {Berlin, Heidelberg},

The authors first cluster the data and then impute missing values using only values from the same cluster. They aim to address two problems with NN-based solutions: the distance has to be calculated between every pair of instances, and there are only a few random chances of picking the true nearest neighbor. To cluster, the authors use the K-means clustering algorithm, and then apply a kernel function imputation strategy. Their claimed contribution seems to be clustering the data before using an imputation method. The paper also assumes that the only missing values are in the target. I thought that this was a really weak paper; it wasn't a great one to pick to read first.
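The cluster-then-impute idea can be illustrated with a short sketch. Note the substitutions: the paper uses a kernel function imputation strategy inside each cluster, whereas this sketch simply uses the mean of the observed targets in the cluster, and the K-means here is a bare-bones version written for illustration. All names are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Very small K-means; returns the cluster index assigned to each point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest center.
        assignment = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                      for p in points]
        # Move each center to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def impute_target_by_cluster(features, targets, k, seed=0):
    """Fill missing targets (None) with the mean observed target of their cluster.

    Matches the paper's assumption that only the target has missing values.
    """
    assignment = kmeans(features, k, seed=seed)
    observed = [t for t in targets if t is not None]
    overall_mean = sum(observed) / len(observed)
    filled = []
    for idx, t in enumerate(targets):
        if t is not None:
            filled.append(t)
            continue
        same = [u for u, a in zip(targets, assignment)
                if a == assignment[idx] and u is not None]
        # Fall back to the overall mean if the cluster has no observed target.
        filled.append(sum(same) / len(same) if same else overall_mean)
    return filled
```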

  • Imputation Review
  title={{A review of methods for missing data}},
  author={Pigott, T.D.},
  journal={Educational research and evaluation},

This paper reviews different imputation methods and gives an overview of why data ends up missing.

  • Book with a more rigorous treatment
  title={{Statistical analysis with missing data}},
  author={Little, R.J.A. and Rubin, D.B.},
  series={Wiley series in probability and mathematical statistics. Probability and mathematical statistics},

This book is in the library (currently checked out), but is probably worth a look. It is one of the most cited in the imputation literature.

  • More in depth Imputation Review
  title={{Missing data: Our view of the state of the art.}},
  author={Schafer, J.L. and Graham, J.W.},
  journal={Psychological methods},
  publisher={American Psychological Association}

Again, this is a review of the various imputation methods and of the different types of missing data (MAR vs. MNAR vs. MCAR, etc.). It comes from the psychology literature, but is well cited. It is probably worth the time to read the whole thing.

  • Imputation using Machine Learning
  title={{Imputation of missing data using machine learning techniques}},
  author={Lakshminarayan, K. and Harp, S.A. and Goldman, R. and Samad, T. and others},
  booktitle={Proceedings of the Second International Conference on Knowledge Discovery and Data Mining},

This paper discusses the use of two different machine learning methods for imputation. Autoclass (unsupervised clustering, a Bayesian method) provides a probability distribution over the set of classes. Used predictively, this can impute whichever data entries are missing. The C4.5 approach gives an "outcome" for each missing item that is weighted across the possible "outcomes," where each weight is the relative frequency of that "outcome." The paper wasn't terribly clear; not the best paper in the world.
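The relative-frequency weighting described for the C4.5 approach can be illustrated as below. This is only a reading of the summary above, not C4.5 itself and not the paper's code: it computes, for one attribute, the weight each possible outcome would receive for a missing item. The function name is hypothetical.

```python
from collections import Counter


def outcome_weights(observed_values):
    """Weight of each possible outcome = its relative frequency among observed values."""
    counts = Counter(observed_values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}
```

For example, observed values `["a", "a", "b", "c"]` give weights of 0.5 for `"a"` and 0.25 each for `"b"` and `"c"`.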

Imputation Methods for Comparison

Subversion Repository

  • To check out a copy of the repository
svn co svn://
  • To get the latest changes
svn up
  • To submit your changes
svn commit

Things to include in the paper

  • Different types of missingness
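The three standard mechanisms mentioned in the Schafer and Graham review (MCAR, MAR, MNAR) differ in what the probability of a value being missing depends on. A minimal sketch of how each could be simulated when artificially removing values; the threshold-based rules and function names are assumptions chosen for illustration, not something taken from the project or the cited papers.

```python
import random


def mcar_mask(values, rate, seed=0):
    """MCAR: every value is missing with the same probability, whatever the data."""
    rng = random.Random(seed)
    return [rng.random() < rate for _ in values]


def mar_mask(values, other, threshold, rate, seed=0):
    """MAR: missingness depends only on another, fully observed attribute."""
    rng = random.Random(seed)
    return [o > threshold and rng.random() < rate for o in other]


def mnar_mask(values, threshold, rate, seed=0):
    """MNAR: missingness depends on the (unobserved) value itself."""
    rng = random.Random(seed)
    return [v > threshold and rng.random() < rate for v in values]
```

Each function returns a list of booleans, where `True` marks a value to remove.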