Title: | Weighted Nearest Neighbor Imputation of Missing Values using Selected Variables |
---|---|
Description: | New tools for the imputation of missing values in high-dimensional data are introduced using the non-parametric nearest neighbor methods. It includes weighted nearest neighbor imputation methods that use specific distances for selected variables. It includes an automatic procedure of cross validation and does not require prespecified values of the tuning parameters. It can be used to impute missing values in high-dimensional data when the sample size is smaller than the number of predictors. For more information see Faisal and Tutz (2017) <doi:10.1515/sagmb-2015-0098>. |
Authors: | Shahla Faisal |
Maintainer: | Shahla Faisal <[email protected]> |
License: | GPL-2 |
Version: | 0.1 |
Built: | 2025-02-06 03:24:37 UTC |
Source: | https://github.com/cran/wNNSel |
This package introduces new non-parametric tools for the imputation of missing values in high-dimensional data.
It includes weighted nearest neighbor imputation methods that use distances for selected covariates.
The careful selection of distances that carry information about the missing values yields the imputation tool.
Unlike other kNN methods, it does not require a pre-specified number of neighbors k.
It can be used to impute missing values in high-dimensional data when the sample size is smaller than the number of predictors.
Package: | wNNSel |
Version: | 0.1 |
Date: | 2017-11-08 |
Depends: | R (>= 2.10) |
License: | GPL (>= 2) |
The main function of the package is wNNSel, which implements the nonparametric procedure of nearest neighbors imputation.
See wNNSel for more details.
*Author's Last name changed to Faisal from Ramzan in 2016.
Shahla Faisal <[email protected]>
Tutz, G. and Ramzan, S.* (2015). Improved methods for the imputation of missing data by nearest neighbor methods. Computational Statistics & Data Analysis, Vol. 90, pp. 84-99.
Faisal, S.* and Tutz, G. (2017). Missing value imputation for gene expression data by tailored nearest neighbors. Statistical Applications in Genetics and Molecular Biology, Vol. 16(2), pp. 95-106.
This function artificially introduces missing values in a data matrix under a missing completely at random (MCAR) mechanism.
artifNA(x, miss.prop = 0.1)
x |
a matrix, in which missing values are to be created. |
miss.prop |
proportion of missing values |
a matrix with missing values
set.seed(3)
x <- matrix(rnorm(100), 10, 10)
## create 10% missing values in x
artifNA(x, 0.10)
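To make the MCAR mechanism concrete, the following base R sketch deletes a fixed proportion of cells, with every cell equally likely to be chosen. The helper name `make_mcar` is hypothetical, and the code is only an illustration of the idea, not the package's implementation of `artifNA`.

```r
# Hypothetical helper (not the package's artifNA): set a given proportion
# of cells to NA, each cell being equally likely to be chosen (MCAR).
make_mcar <- function(x, miss.prop = 0.1) {
  stopifnot(is.matrix(x), miss.prop >= 0, miss.prop < 1)
  n.miss <- round(miss.prop * length(x))  # number of cells to delete
  idx <- sample(length(x), n.miss)        # positions drawn uniformly at random
  x[idx] <- NA
  x
}

set.seed(3)
x <- matrix(rnorm(100), 10, 10)
x.miss <- make_mcar(x, 0.10)
sum(is.na(x.miss))  # 10 of the 100 cells are now NA
```

Because the positions are drawn without regard to the data values, the missingness does not depend on either observed or unobserved entries, which is what MCAR means.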
This function artificially introduces additional missing values in a data matrix that already contains missing values. The missing values are introduced under a missing completely at random (MCAR) mechanism.
artifNA.cv(x, testNA.prop = 0.1)
x |
a matrix, in which missing values are to be created. |
testNA.prop |
proportion of missing values |
a list containing a matrix with artificial missing values, the removed indices, and the provided x matrix
set.seed(3)
x <- matrix(rnorm(100), 10, 10)
## create 10% missing values in x
x.miss <- artifNA(x, 0.10)
## create another 10% missing values in x
x.miss.cv <- artifNA.cv(x, 0.10)
summary(x.miss)
summary(x.miss.cv)
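The idea of hiding extra cells for cross validation can be sketched in base R as follows. This is an illustration under assumptions, not the package's code: the helper name `make_cv_na`, the element names of the returned list, and the choice to take the proportion relative to the currently observed cells are all hypothetical.

```r
# Hypothetical helper (not the package's artifNA.cv): hide a proportion of
# the currently observed cells and remember which ones were hidden.
make_cv_na <- function(x, testNA.prop = 0.1) {
  obs <- which(!is.na(x))                       # linear indices of observed cells
  n.test <- round(testNA.prop * length(obs))    # assumption: proportion of observed cells
  removed <- obs[sample.int(length(obs), n.test)]
  x.cv <- x
  x.cv[removed] <- NA
  list(x.cv = x.cv, removed = removed, x = x)   # element names are illustrative
}

set.seed(3)
x <- matrix(rnorm(100), 10, 10)
x.miss <- x
x.miss[sample(100, 10)] <- NA       # a matrix that already has missing values
res <- make_cv_na(x.miss, 0.10)
sum(is.na(res$x.cv))                # original NAs plus the newly hidden cells
```

Keeping the removed indices alongside the original matrix is what lets a cross validation routine compare imputed values against the values that were hidden.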
This function computes the mean absolute imputation error for a given complete/true data matrix, imputed data matrix and the data matrix with missing values.
computeMAIE(x.miss, x.impute, x.true)
x.miss |
a data matrix with missing values |
x.impute |
an imputed data matrix |
x.true |
complete/true data matrix |
value of MAIE
set.seed(3)
x.true <- matrix(rnorm(100), 10, 10)
## create 10% missing values in x
x.miss <- artifNA(x.true, 0.10)
## impute using wNNSel method
x.impute <- wNNSel.impute(x.miss)
computeMAIE(x.miss, x.impute, x.true)
This function computes the mean squared imputation error for a given complete/true data matrix, imputed data matrix and the data matrix with missing values.
computeMSIE(x.miss, x.impute, x.true)
x.miss |
a data matrix with missing values |
x.impute |
an imputed data matrix |
x.true |
complete/true data matrix |
value of MSIE
set.seed(3)
x.true <- matrix(rnorm(100), 10, 10)
## create 10% missing values in x
x.miss <- artifNA(x.true, 0.10)
## impute using wNNSel method
x.impute <- wNNSel.impute(x.miss)
computeMSIE(x.miss, x.impute, x.true)
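The mean absolute and mean squared imputation errors differ only in how each deviation is penalized. The base R sketch below assumes both are averaged over the cells that are missing in x.miss; the helper names `maie` and `msie` are hypothetical and the code is not taken from the package.

```r
# Hypothetical helpers, not the package's code: errors are averaged over
# the cells that are NA in x.miss (an assumption about the definition).
maie <- function(x.miss, x.impute, x.true) {
  na.idx <- is.na(x.miss)
  mean(abs(x.impute[na.idx] - x.true[na.idx]))
}
msie <- function(x.miss, x.impute, x.true) {
  na.idx <- is.na(x.miss)
  mean((x.impute[na.idx] - x.true[na.idx])^2)
}

x.true <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)  # filled column by column
x.miss <- x.true
x.miss[1, 2] <- NA      # true value 3
x.miss[2, 3] <- NA      # true value 6
x.impute <- x.miss
x.impute[1, 2] <- 2     # error of -1
x.impute[2, 3] <- 9     # error of +3

maie(x.miss, x.impute, x.true)  # (1 + 3) / 2 = 2
msie(x.miss, x.impute, x.true)  # (1 + 9) / 2 = 5
```

The squared version weights the larger error more heavily, which is why the two measures can rank imputation methods differently.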
This function computes the normalized root mean squared imputation error for a given complete/true data matrix, imputed data matrix and the data matrix with missing values.
computeNRMSE(x.miss, x.impute, x.true)
x.miss |
a data matrix with missing values |
x.impute |
an imputed data matrix |
x.true |
complete/true data matrix |
value of NRMSE
set.seed(3)
x.true <- matrix(rnorm(100), 10, 10)
## create 10% missing values in x
x.miss <- artifNA(x.true, 0.10)
## impute using wNNSel method
x.impute <- wNNSel.impute(x.miss)
computeNRMSE(x.miss, x.impute, x.true)
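The exact normalization used by the package is not stated here. As an illustration only, the sketch below uses one common definition: the mean squared error over the missing cells divided by the variance of the true values in those cells, followed by a square root. The helper name `nrmse` and this choice of normalization are assumptions.

```r
# Hypothetical helper: one common NRMSE definition (an assumption; the
# package may normalize differently).
nrmse <- function(x.miss, x.impute, x.true) {
  na.idx <- is.na(x.miss)
  mse <- mean((x.impute[na.idx] - x.true[na.idx])^2)
  sqrt(mse / var(x.true[na.idx]))   # normalize by the variance of the true values
}

x.true <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
x.miss <- x.true
x.miss[1, 2] <- NA      # true value 3
x.miss[2, 3] <- NA      # true value 6
x.impute <- x.miss
x.impute[1, 2] <- 2
x.impute[2, 3] <- 9

nrmse(x.miss, x.impute, x.true)   # imperfect imputation gives a positive value
nrmse(x.miss, x.true, x.true)     # perfect imputation gives 0
```

Under this definition the error is unit free, so results on differently scaled data sets can be compared.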
This function aims to search for optimal values of the tuning parameters for the wNNSel imputation.
cv.wNNSel(x, kernel = "gaussian", x.dist = "euclidean", method = "2",
          m.values = seq(2, 8, by = 2), c.values = seq(0.1, 0.5, by = 0.1),
          lambda.values = seq(0, 0.6, by = 0.01)[-1], times.max = 5,
          testNA.prop = 0.05)
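x |
a matrix containing missing values. |
kernel |
kernel function to be used in nearest neighbors imputation. The default kernel function is "gaussian". |
x.dist |
distance to compute. The default is "euclidean". |
method |
convex function that performs the selection of variables. If "1", the linear function is used; if "2" (the default), the power function is used. |
m.values |
a vector of candidate values for the tuning parameter m. |
c.values |
a vector of candidate values for the tuning parameter c. |
lambda.values |
a vector of candidate values for the tuning parameter lambda. |
times.max |
maximum number of repetitions for the cross validation procedure. |
testNA.prop |
proportion of values to be deleted artificially for cross validation in the missing matrix. |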
Some values are artificially deleted and wNNSel is run multiple times, varying lambda and m.
For each pair of lambda and m, the MSIE is computed on the subset of the data matrix x for which the values were deleted artificially (see References for more detail).
a list containing
lambda.opt |
optimal parameter selected by cross validation |
m.opt |
optimal parameter selected by cross validation |
MSIE.cv |
cross validation error |
Shahla Faisal <[email protected]>
Tutz, G. and Ramzan, S. (2015). Improved methods for the imputation of missing data by nearest neighbor methods. Computational Statistics & Data Analysis, Vol. 90, pp. 84-99.
Faisal, S. and Tutz, G. (2017). Missing value imputation for gene expression data by tailored nearest neighbors. Statistical Applications in Genetics and Molecular Biology, Vol. 16(2), pp. 95-106.
set.seed(3)
x.true <- matrix(rnorm(100), 10, 10)
## create 10% missing values in x
x.miss <- artifNA(x.true, 0.10)
## use cross validation to find optimal values
result <- cv.wNNSel(x.miss)
## optimal values are
result$lambda.opt
result$m.opt
## Now use these values to get final imputation
x.impute <- wNNSel.impute(x.miss, lambda = result$lambda.opt, m = result$m.opt)
## and final MSIE
computeMSIE(x.miss, x.impute, x.true)
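The cross validation procedure described for cv.wNNSel (hide some observed values, impute for each candidate setting, score on the hidden values) can be sketched in base R. This is a simplified illustration, not the package's code: `kernel_impute` is a plain stand-in imputer without variable selection, `cv_lambda` tunes only lambda, and both names are hypothetical.

```r
# Stand-in imputer (an assumption, not wNNSel): each missing cell becomes a
# Gaussian-kernel weighted mean of its column over the rows that observe it.
kernel_impute <- function(x, lambda) {
  out <- x
  for (i in seq_len(nrow(x))) {
    for (j in which(is.na(x[i, ]))) {
      donors <- which(!is.na(x[, j]))
      d <- vapply(donors, function(l) {
        shared <- !is.na(x[i, ]) & !is.na(x[l, ])   # jointly observed columns
        if (!any(shared)) return(NA_real_)
        sqrt(mean((x[i, shared] - x[l, shared])^2))
      }, numeric(1))
      w <- exp(-0.5 * (d / lambda)^2)
      w[is.na(w)] <- 0
      out[i, j] <- if (sum(w) > 0) sum(w * x[donors, j]) / sum(w)
                   else mean(x[, j], na.rm = TRUE)  # fallback: column mean
    }
  }
  out
}

# Grid search over lambda, scored by MSIE on artificially hidden cells.
cv_lambda <- function(x, lambda.values, times.max = 5, testNA.prop = 0.05) {
  obs <- which(!is.na(x))
  err <- matrix(NA_real_, times.max, length(lambda.values))
  for (rep.i in seq_len(times.max)) {
    test <- obs[sample.int(length(obs), max(1, round(testNA.prop * length(obs))))]
    x.cv <- x
    x.cv[test] <- NA                                # hide some observed cells
    for (a in seq_along(lambda.values)) {
      x.imp <- kernel_impute(x.cv, lambda.values[a])
      err[rep.i, a] <- mean((x.imp[test] - x[test])^2)
    }
  }
  MSIE.cv <- colMeans(err)
  list(lambda.opt = lambda.values[which.min(MSIE.cv)], MSIE.cv = MSIE.cv)
}

set.seed(3)
x.true <- matrix(rnorm(100), 10, 10)
x.miss <- x.true
x.miss[sample(100, 10)] <- NA
res <- cv_lambda(x.miss, lambda.values = c(0.2, 0.5, 1, 2))
res$lambda.opt   # candidate with the smallest average cross validation error
```

Averaging the error over several random deletions makes the choice of lambda less dependent on any single set of hidden cells. The same loop extends to a grid over pairs of tuning parameters.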
wNNSel is used to impute missing values, particularly in high-dimensional data.
It uses a cross validation procedure for selecting the best values of the tuning parameters.
It also works when the number of samples is smaller than the number of covariates.
wNNSel(x, x.initial = NULL, x.true = NULL, k, useAll = TRUE,
       x.dist = "euclidean", kernel = "gaussian", method = "2",
       impute.fn, convex = TRUE, m.values = seq(2, 8, by = 2),
       c.values = seq(0.1, 0.5, by = 0.1),
       lambda.values = seq(0, 0.6, by = 0.01)[-1], times.max = 5,
       testNA.prop = 0.05, withinFolds = FALSE, folds, verbose = TRUE)
x |
a numeric data matrix containing missing values. |
x.initial |
optional. A complete data matrix, e.g. obtained by mean imputation of x. |
x.true |
a matrix of true or complete data. If provided, the true error (MSIE) is computed. |
k |
optional. The number of nearest neighbors to use for imputation. |
useAll |
|
x.dist |
distance to compute. The default is "euclidean". |
kernel |
kernel function to be used in nearest neighbors imputation. The default kernel function is "gaussian". |
method |
convex function that performs the selection of variables. If "1", the linear function is used; if "2" (the default), the power function is used. |
impute.fn |
the imputation function to run on the length k vector of values for a missing feature. Defaults to a weighted mean of the neighboring values, weighted by the specified kernel. |
convex |
logical. If |
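m.values |
a vector of candidate values for the tuning parameter m. |
c.values |
a vector of candidate values for the tuning parameter c. |
lambda.values |
a vector of candidate values for the tuning parameter lambda. |
times.max |
maximum number of repetitions for the cross validation procedure. |
testNA.prop |
proportion of values to be deleted artificially for cross validation in the missing matrix. |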
withinFolds |
|
folds |
a |
verbose |
logical. If |
For each sample, identify missing features. For each missing feature, find the nearest neighbors which have that feature. Impute the missing value using the imputation function on the selected vector of values found from the neighbors.
By default, the wNNSel method automatically searches for optimal values of the tuning parameters for a given data matrix.
The default method uses x.dist="euclidean" including selected covariates. The specific distances are computed using important covariates only.
If method="1", the linear function in the absolute value of $r$ is used, defined by $\frac{|r|}{1-c} - \frac{c}{1-c}$ for $|r| > c$, and 0 otherwise.
By default, the power function $|r|^m$ is used when method="2". For a more detailed discussion, see the references.
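The two convex functions can be compared numerically with a short base R sketch. It assumes the linear function is $\frac{|r|}{1-c} - \frac{c}{1-c}$ for $|r| > c$ and 0 otherwise, and that the power function is $|r|^m$; the helper names `linear_weight` and `power_weight` are hypothetical and the code is not taken from the package.

```r
# Hypothetical helpers illustrating the two convex functions (assumed forms).
linear_weight <- function(r, c = 0.3) {
  a <- abs(r)
  ifelse(a > c, a / (1 - c) - c / (1 - c), 0)  # zero at or below the threshold c
}
power_weight <- function(r, m = 2) {
  abs(r)^m                                     # shrinks small |r| towards zero
}

r <- c(-0.9, -0.2, 0, 0.3, 0.65, 1)
linear_weight(r, c = 0.3)
power_weight(r, m = 2)
```

Under these assumed forms, both functions map $|r| = 1$ to a weight of 1. The linear function removes covariates with $|r| \le c$ entirely, while the power function only downweights them, more strongly as m grows.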
a list containing imputed data matrix, and cross validation results
x.impute |
imputed data matrix |
MSIE |
True error. Note it is only available when x.true is provided. |
lambda.opt |
optimal parameter selected by cross validation |
m.opt |
optimal parameter selected by cross validation |
MSIE.cv |
cross validation error |
Tutz, G. and Ramzan, S. (2015). Improved methods for the imputation of missing data by nearest neighbor methods. Computational Statistics & Data Analysis, Vol. 90, pp. 84-99.
Faisal, S. and Tutz, G. (2017). Missing value imputation for gene expression data by tailored nearest neighbors. Statistical Applications in Genetics and Molecular Biology, Vol. 16(2), pp. 95-106.
set.seed(3)
x.true <- matrix(rnorm(100), 10, 10)
## create 10% missing values in x
x.miss <- artifNA(x.true, 0.10)
## imputed matrix
result <- wNNSel(x.miss)
result$x.impute
## cross validation result can be accessed using
result$cross.val
This function imputes the missing values using user-specified values of the tuning parameters. It also works when the number of samples is smaller than the number of covariates.
wNNSel.impute(x, k, useAll = TRUE, x.initial = NULL, x.dist = "euclidean",
              kernel = "gaussian", lambda = 0.3, impute.fn, convex = TRUE,
              method = "2", m = 2, c = 0.3, withinFolds = FALSE, folds,
              verbose = TRUE, verbose2 = FALSE)
x |
a matrix containing missing values. |
k |
optional. The number of nearest neighbors to use for imputation. |
useAll |
|
x.initial |
optional. A complete data matrix, e.g. obtained by mean imputation of x. |
x.dist |
distance to compute. The default is "euclidean". |
kernel |
kernel function to be used in nearest neighbors imputation. Default kernel function is "gaussian". |
lambda |
|
impute.fn |
the imputation function to run on the length k vector of values for a missing feature. Defaults to a weighted mean of the neighboring values, weighted by the specified kernel. |
convex |
logical. If |
method |
convex function that performs the selection of variables. If "1", the linear function is used; if "2" (the default), the power function is used. |
m |
|
c |
|
withinFolds |
|
folds |
a |
verbose |
logical. If |
verbose2 |
logical. If |
For each sample, identify missing features. For each missing feature, find the nearest neighbors which have that feature. Impute the missing value using the imputation function on the selected vector of values found from the neighbors.
imputed data matrix
set.seed(3)
x <- matrix(rnorm(100), 10, 10)
x.miss <- x > 1
x[x.miss] <- NA
wNNSel.impute(x)
wNNSel.impute(x, lambda = 0.5, m = 2)
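The per-cell procedure described above (find the rows that have the feature, measure how close they are, then combine their values) can be sketched in base R for a single missing cell. This is an illustration under assumptions, not the package's implementation: the helper name `impute_cell` is hypothetical, covariates are assumed to be weighted by the absolute correlation with the target column raised to the power m, and the weighted distance is normalized by the sum of the weights.

```r
# Hypothetical helper: impute one missing cell x[i, j] with a kernel-weighted
# mean of donor values, using distances that emphasize selected covariates.
impute_cell <- function(x, i, j, lambda = 0.3, m = 2) {
  donors <- which(!is.na(x[, j]))            # rows that have feature j
  # Assumed covariate weights: |correlation with column j|^m
  r <- suppressWarnings(cor(x, x[, j], use = "pairwise.complete.obs"))[, 1]
  wt <- abs(r)^m
  wt[j] <- 0                                 # the target column is not a predictor
  wt[is.na(wt)] <- 0
  d <- vapply(donors, function(l) {
    shared <- !is.na(x[i, ]) & !is.na(x[l, ]) & wt > 0
    if (!any(shared)) return(NA_real_)
    sqrt(sum(wt[shared] * (x[i, shared] - x[l, shared])^2) / sum(wt[shared]))
  }, numeric(1))
  k <- exp(-0.5 * (d / lambda)^2)            # Gaussian kernel weights
  k[is.na(k)] <- 0
  if (sum(k) > 0) sum(k * x[donors, j]) / sum(k) else mean(x[donors, j])
}

set.seed(3)
x <- matrix(rnorm(100), 10, 10)
x[2, 4] <- NA
impute_cell(x, i = 2, j = 4)   # weighted mean of the observed values in column 4
```

Because the result is a weighted mean with non-negative weights, the imputed value always lies within the range of the donor values. A smaller lambda concentrates the weight on the closest rows.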