Hi,

I am trying to understand the objective for sub-challenge 1.

The 0s in data_true are clearly simulated using Bernoulli p = .1 for each observation (protein, sample pair). Also, the NAs in the observed data sets include all the 0s, and each NA seems clearly simulated with a Bernoulli p = 1/3 chance of being a true 0.

For each protein ID, the non-zero expression values have low variance across samples, so the mean of the non-zero expression values across samples will give a very good imputation. So, to minimize the imputation criterion, the key is to predict the 0 values. But the 0 values in data_true are clearly simulated with a Bernoulli p = .1. They are completely random! It is futile to try to build a model to predict them.

Please let me know how I am misunderstanding this sub-challenge.

Thank you,
Joe

R code

data_true = read.table("data_true.txt", header=T, sep="\t")
par(mfrow = c(2,2))
hist(rbinom( n = 80, size = 7927, p = .1), breaks = 10, main = "80 Random Samples from Binomial p=.1, n = 7927", xlim = c(700, 900), xlab = "Number of 0s")
hist(apply(data_true[,-1] == 0, 2, sum), breaks = 10, main = "Zero values by Sample", xlim = c(700, 900), xlab = "Number of 0s")
hist(rbinom( n = 7927, size = 80, p = .1), breaks = 100, main = "7927 Random Samples from Binomial p=.1, n = 80", xlim = c(0,20))
hist(apply(data_true[,-1] == 0, 1, sum), breaks = 100, main = "Zero value by protein", xlim = c(0,20))
### See that data are generated from binomial 0 ###
data_obs_2 = read.table("data_obs_2.txt", header=T, sep="\t")
number_nas_2 <- sum(is.na(data_obs_2[,-1]))
number_0s <- sum(data_true[,-1] == 0)
## About exactly 1/3, similar for each data_obs
number_0s/number_nas_2
##0.3330392

Created by Joseph Usset mojusset
The missing events (including true zeros) are not independent, and the missingness mechanism is a mixture of MAR (missing at random) and MNAR (missing not at random). Mean imputation generally does not give optimal performance in these scenarios. Participants are encouraged to make use of the dependence among proteins in their imputation algorithms.
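To illustrate why marginal counts alone cannot settle this, here is a minimal R sketch on simulated data (not the challenge data). It assumes a purely hypothetical mechanism in which proteins come in pairs that often share a latent zero indicator; the pairing, the 0.7 sharing probability, and all variable names are illustrative assumptions, not a description of how the challenge data were generated. Every entry is still zero with marginal probability 0.1, so the overall zero rate matches an independent Bernoulli(0.1) draw, yet the zeros of one protein become informative about its partner.

```r
set.seed(1)
n_prot <- 200   # hypothetical number of proteins (must be even)
n_samp <- 80    # number of samples, as in the question
p      <- 0.1   # marginal probability that an entry is zero

# Independent zeros: what the question assumes.
zero_ind <- matrix(rbinom(n_prot * n_samp, 1, p), nrow = n_prot)

# Dependent zeros (illustrative assumption): rows 2k-1 and 2k form a pair.
# Each entry copies the pair's latent indicator with probability 0.7,
# otherwise it uses its own independent draw. Marginal rate stays 0.1.
latent     <- matrix(rbinom(n_prot / 2 * n_samp, 1, p), nrow = n_prot / 2)
shared     <- latent[rep(seq_len(n_prot / 2), each = 2), ]
own        <- matrix(rbinom(n_prot * n_samp, 1, p), nrow = n_prot)
use_latent <- matrix(rbinom(n_prot * n_samp, 1, 0.7), nrow = n_prot)
zero_dep   <- ifelse(use_latent == 1, shared, own)

odd  <- seq(1, n_prot, by = 2)
even <- odd + 1

# Fraction of (pair, sample) cells where both partners are zero.
# Under independence this is expected to be about p^2 = 0.01.
joint_rate <- function(z) mean(z[odd, ] & z[even, ])
joint_ind  <- joint_rate(zero_ind)
joint_dep  <- joint_rate(zero_dep)

# Estimated P(partner is zero | protein is zero).
cond_ind <- joint_ind / mean(zero_ind[odd, ])
cond_dep <- joint_dep / mean(zero_dep[odd, ])

cat("marginal zero rate, independent:", mean(zero_ind), "\n")
cat("marginal zero rate, dependent:  ", mean(zero_dep), "\n")
cat("joint zero rate, independent:   ", joint_ind, "\n")
cat("joint zero rate, dependent:     ", joint_dep, "\n")
cat("P(zero | partner zero), independent:", cond_ind, "\n")
cat("P(zero | partner zero), dependent:  ", cond_dep, "\n")
```

In this toy setup the two marginal zero rates are nearly identical, while the joint and conditional rates differ substantially. On real data the pairs are not known in advance, so one would have to look at co-occurrence of zeros (or NAs) across many protein pairs, but the same idea applies: dependence among proteins is visible in joint behaviour rather than in per-sample or per-protein totals.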

Sub-challenge 1: data_true 0s simulated using bernoulli p =.1, data_obs NAs simulated with exactly 1/3 being 0