Imputing missing values
From NNML
Objective
The objective of this project is to find an application for unsupervised backpropagation. We have decided to use it to impute missing values in a data set.
Goal: Get it into NIPS
Action Items 4/15/2011
 Proof of concept: artificially modify a data set and compare with the original (Mike G)

Proof of concept with other imputation methods (Mike G.)
 Mean
 Median
 Random
 kNN
 KMeans
 Related works to compare against (Richard)
 Find data sets/artificially remove values (Mike G)
 Train other LAs on the imputed data (Mike S)
 Annotated citations (All)
We also decided that we are going to do two things:
 Show that it gets closer to the true values than other methods, measured via RMSE
 Show that it increases classification accuracy compared with leaving the missing values in place and compared with using other imputation methods.
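The RMSE comparison could be measured as in this minimal sketch (the function name and the mean-imputation baseline are illustrative, not the project's actual code): hide some known values, impute, and score only the hidden entries.

```python
import numpy as np

def imputation_rmse(original, imputed, mask):
    """RMSE between imputed and true values, computed only over the
    entries that were artificially removed (mask == True)."""
    diff = original[mask] - imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy example: hide one entry, impute it with the column mean.
original = np.array([[1.0, 2.0], [3.0, 4.0], [7.0, 6.0]])
mask = np.zeros_like(original, dtype=bool)
mask[1, 0] = True                               # pretend this entry is missing
imputed = original.copy()
imputed[1, 0] = np.mean(original[[0, 2], 0])    # mean imputation: (1 + 7) / 2 = 4
print(imputation_rmse(original, imputed, mask)) # 1.0 (the true value was 3)
```

The same mask-based scoring works for any of the candidate methods, so all of them can be ranked on identical hidden entries.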
Action Items 4/22/2011
 Implement or get implementations of other methods (Richard)
 Get Condor on other computers (Richard)
 Start writing the paper (pending)
 Throw data set comparison on the cluster (Mike S)
Action Items 4/29/2011
 Condor computers (Richard)
 Automating KEEL (Richard)
 Algorithm description (Mike G.)
 Fix error function (Mike S.)
 Throw it on the cluster (Mike S.)
Action Items 5/6/2011
 Redo SVN (Richard)
 Find state of the art (Richard)
 Determine whether the results from the paper are usable (Richard)
 Look into KEEL (Mike S)
 Rerun NI and calculate Errors (Mike G)
 Would it be better not to put missing values in the class attribute when comparing ourselves with other published results?
To Do List
Mike Smith's code ToDo list:
 Add options for the NI filter.
 Modify the code so that it both uses instances with missing values and calculates (imputes) those values
Experiments to run:

Data set comparison

data sets to use:
 teachingAssistant
 Sonar
 Vowel
 mnist
 hypothyroid
 % of missing values
 Hidden nodes (0, 2, 4)
 # of intrinsic dimensions (1, 2, 3) (never have more intrinsic dims than number of hidden nodes, unless 0 hidden nodes)
 random seed
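The parameter sweep above, including the constraint that there are never more intrinsic dimensions than hidden nodes (unless there are 0 hidden nodes), might be enumerated like this. The missing-value percentages and the number of seeds are placeholders; the notes don't fix them.

```python
from itertools import product

datasets = ["teachingAssistant", "sonar", "vowel", "mnist", "hypothyroid"]
missing_pcts = [0.05, 0.10, 0.25]   # assumed values, not specified in the notes
hidden_nodes = [0, 2, 4]
intrinsic_dims = [1, 2, 3]
seeds = range(5)                     # assumed number of repetitions

runs = [
    (d, p, h, i, s)
    for d, p, h, i, s in product(datasets, missing_pcts,
                                 hidden_nodes, intrinsic_dims, seeds)
    # never more intrinsic dims than hidden nodes, unless 0 hidden nodes
    if h == 0 or i <= h
]
print(len(runs))
```

Enumerating the grid up front makes it easy to hand each tuple to a separate Condor job.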


Classification Accuracy

Other imputation methods
 Baseline
 KNN
 Matrix Factorization
 KMeans

Other imputation methods
Change the code to convert nominal values back and to use Hamming distance for error values on the nominal attributes
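A sketch of what that mixed error could look like (the function name and layout are mine, not the project's code): squared error on numeric attributes, 0/1 Hamming mismatch on nominal ones.

```python
import numpy as np

def mixed_error(true_row, imputed_row, nominal):
    """Per-attribute error for a mixed row: squared difference for numeric
    attributes, Hamming (0/1 mismatch) for nominal ones.
    `nominal` is a boolean mask marking which attributes are nominal."""
    errs = []
    for t, p, is_nom in zip(true_row, imputed_row, nominal):
        if is_nom:
            errs.append(0.0 if t == p else 1.0)  # Hamming distance on nominal values
        else:
            errs.append((t - p) ** 2)            # squared error on numeric values
    return float(np.mean(errs))

# numeric, nominal, numeric
print(mixed_error([1.0, "red", 4.0], [1.5, "blue", 4.0], [False, True, False]))
```

Whether the per-attribute errors should be averaged or summed, and whether numeric attributes should be normalized first, are open design choices.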
Related Works
 Here is a link to a number of different imputation strategies including articles.
 Here is a link to a website discussing different imputation strategies. This is less of a list and more of a discussion of the how, what, and why.

Multiple Imputation (link)
 Basically, an imputation technique is used to create m versions of the imputed value. The final value that is used is a combination of the m estimates (generally their average).
 I didn't find an original citation, but the idea is to use a trained LA to predict the missing attribute values.
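A toy version of multiple imputation, under the assumption that the m estimates come from random draws over the observed values and are combined by averaging (a real implementation would draw from a fitted model instead):

```python
import random

def multiple_impute(observed, m=5, seed=0):
    """Multiple-imputation sketch: draw m plausible values for a missing
    entry (here, simple random draws from the observed values) and combine
    them, typically by averaging."""
    rng = random.Random(seed)
    estimates = [rng.choice(observed) for _ in range(m)]
    return sum(estimates) / m

print(multiple_impute([2.0, 4.0, 6.0, 8.0], m=5))
```

The point of keeping m separate draws (rather than one point estimate) is that their spread also reflects the uncertainty of the imputation.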
@article{Farhangfar:2008:IIM:1405194.1405217,
  author    = {Farhangfar, Alireza and Kurgan, Lukasz and Dy, Jennifer},
  title     = {Impact of imputation of missing values on classification error for discrete data},
  journal   = {Pattern Recognition},
  volume    = {41},
  number    = {12},
  month     = {December},
  year      = {2008},
  issn      = {0031-3203},
  pages     = {3692--3705},
  numpages  = {14},
  url       = {http://portal.acm.org/citation.cfm?id=1405194.1405217},
  doi       = {10.1016/j.patcog.2008.05.019},
  acmid     = {1405217},
  publisher = {Elsevier Science Inc.},
  address   = {New York, NY, USA},
  keywords  = {Classification, Imputation of missing values, Missing values, Multiple imputations, Single imputation},
}
In addition to the trained LAs, they also compare other methods. I have not yet read the article.
 Clustering based
@incollection{Zhang:2008:MVI:1805820.1805829,
  author    = {Zhang, Shichao and Zhang, Jilian and Zhu, Xiaofeng and Qin, Yongsong and Zhang, Chengqi},
  chapter   = {Missing value imputation based on data clustering},
  title     = {Transactions on Computational Science I},
  editor    = {Gavrilova, Marina L. and Tan, C. J. Kenneth},
  year      = {2008},
  isbn      = {978-3-540-79298-7},
  pages     = {128--138},
  numpages  = {11},
  url       = {http://portal.acm.org/citation.cfm?id=1805820.1805829},
  acmid     = {1805829},
  publisher = {Springer-Verlag},
  address   = {Berlin, Heidelberg},
}
The authors first cluster the data, and missing values are then imputed using values only from within the same cluster. They aim to address two problems with NN-based solutions: the distance has to be calculated between every pair of instances, and there is only a small chance of picking the true nearest neighbor. To cluster, the authors use the K-means algorithm and then apply a kernel-function imputation strategy. Their claimed contribution seems to be clustering the data before applying an imputation method. It also assumes that the only missing values are in the target. I thought this was a really weak paper; I guess it wasn't a great one to pick first.
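The cluster-then-impute idea can be sketched roughly as follows. This is a minimal K-means with deterministic initialization and simple centroid fill-in; the paper's kernel-function imputation step is omitted, and all names are mine.

```python
import numpy as np

def cluster_impute(X, k=2, iters=10):
    """Sketch of cluster-then-impute: run K-means on the complete rows,
    then fill each missing entry (NaN) with the value from the nearest
    centroid, measuring distance on the observed attributes only."""
    complete = X[~np.isnan(X).any(axis=1)]
    centroids = complete[:k].copy()       # deterministic init: first k complete rows
    for _ in range(iters):                # a few Lloyd iterations
        dists = ((complete[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = complete[labels == j].mean(axis=0)
    X_imp = X.copy()
    for row in X_imp:
        miss = np.isnan(row)
        if miss.any():
            d = ((centroids[:, ~miss] - row[~miss]) ** 2).sum(axis=1)
            row[miss] = centroids[d.argmin()][miss]
    return X_imp

X = np.array([[0.0, 0.0],
              [10.0, 10.0],
              [0.1, 0.1],
              [9.9, 10.0],
              [np.nan, 10.0]])
print(cluster_impute(X))
```

Restricting distance computations to k centroids instead of all instance pairs is exactly the cost saving over NN-based imputation the paper argues for.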
 Imputation Review
@article{pigott2001review,
  title     = {{A review of methods for missing data}},
  author    = {Pigott, T. D.},
  journal   = {Educational Research and Evaluation},
  volume    = {7},
  number    = {4},
  pages     = {353--383},
  issn      = {1380-3611},
  year      = {2001},
  publisher = {Routledge},
}
This is a paper that gives a review of different imputation methods as well as an overview for "why" data ends up missing.
 Book with a more rigorous treatment
@book{little2002statistical,
  title     = {{Statistical Analysis with Missing Data}},
  author    = {Little, R. J. A. and Rubin, D. B.},
  isbn      = {978-0-471-18386-0},
  lccn      = {2002027006},
  series    = {Wiley Series in Probability and Mathematical Statistics},
  url       = {http://books.google.com/books?id=aYPwAAAAMAAJ},
  year      = {2002},
  publisher = {Wiley},
}
This book is in the library (currently checked out), but is probably worth a look. It is one of the most cited in the imputation literature.
 More in depth Imputation Review
@article{schafer2002missing,
  title     = {{Missing data: Our view of the state of the art}},
  author    = {Schafer, J. L. and Graham, J. W.},
  journal   = {Psychological Methods},
  volume    = {7},
  number    = {2},
  pages     = {147--177},
  issn      = {1939-1463},
  year      = {2002},
  publisher = {American Psychological Association},
}
Again, this is a review on the various imputation methods, review of different types of missing data (MAR vs. MNAR vs. MCAR, etc.). Comes from the Psychology literature, but is well cited. Is probably worth the time to look into and read the whole thing.
 Imputation using Machine Learning
@conference{lakshminarayan1996imputation,
  title     = {{Imputation of missing data using machine learning techniques}},
  author    = {Lakshminarayan, K. and Harp, S. A. and Goldman, R. and Samad, T. and others},
  booktitle = {Proceedings of the Second International Conference on Knowledge Discovery and Data Mining},
  pages     = {140--145},
  year      = {1996},
}
This paper discusses the use of two different machine learning methods for imputation. AutoClass (an unsupervised Bayesian clustering method) provides a probability distribution over the set of classes; used predictively, this can impute whichever data entries are missing. The C4.5 approach gives an "outcome" for each missing item that is weighted across the possible "outcomes," where each weight is the relative frequency of that "outcome." Wasn't terribly clear; not the best paper in the world.
Imputation Methods for Comparison
 Mean/Mode
 kNN/K-means
 Linear Regression?
 Multiple Imputation?
 EM?
 KEEL Algorithms?
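The two simplest baselines above, mean for numeric attributes and mode for nominal ones, can be sketched as follows (illustrative only; function names are mine):

```python
import numpy as np
from collections import Counter

def mean_impute(col):
    """Numeric baseline: replace NaNs with the mean of the observed values."""
    col = np.asarray(col, dtype=float)
    fill = np.nanmean(col)                    # mean over the non-missing entries
    return np.where(np.isnan(col), fill, col)

def mode_impute(col, missing=None):
    """Nominal baseline: replace missing entries with the most frequent value."""
    observed = [v for v in col if v is not missing]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is missing else v for v in col]

print(mean_impute([1.0, np.nan, 3.0]))     # [1. 2. 3.]
print(mode_impute(["a", None, "a", "b"]))  # ['a', 'a', 'a', 'b']
```

These make useful floors for the RMSE and classification-accuracy comparisons: any serious method should beat them.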
Subversion Repository
 To check out a copy of the repository
 svn co svn://lobe.cs.byu.edu/ni
 To get the latest changes
 svn up
 To submit your changes
 svn commit
Things to include in the paper
 Different types of missingness