Hi,
I am trying to understand the objective for sub-challenge 1.
The 0s in data_true are clearly simulated as Bernoulli draws with p = 0.1 for each observation (protein, sample pair). Also, the NAs in the observed data sets include all the 0s, and each NA clearly has a 1/3 chance of being a true 0.
For each protein ID, the non-zero expression values have low variance across samples, so the mean of the non-zero expressions across samples gives a very good imputation. So, to minimize the imputation criterion, the key is to predict the 0 values. But the 0 values in data_true are clearly simulated with Bernoulli p = 0.1. They are completely random! It is futile to try to build a model to predict them.
Please let me know how I am misunderstanding this sub-challenge.
Thank you,
Joe
R code
data_true <- read.table("data_true.txt", header = TRUE, sep = "\t")
par(mfrow = c(2, 2))
## Zeros per sample: simulated binomial counts vs. observed counts (first column is the protein ID)
hist(rbinom(n = 80, size = 7927, prob = 0.1), breaks = 10, main = "80 Random Samples from Binomial p=.1, n = 7927", xlim = c(700, 900), xlab = "Number of 0s")
hist(apply(data_true[, -1] == 0, 2, sum), breaks = 10, main = "Zero values by Sample", xlim = c(700, 900), xlab = "Number of 0s")
## Zeros per protein: simulated binomial counts vs. observed counts
hist(rbinom(n = 7927, size = 80, prob = 0.1), breaks = 100, main = "7927 Random Samples from Binomial p=.1, n = 80", xlim = c(0, 20))
hist(apply(data_true[, -1] == 0, 1, sum), breaks = 100, main = "Zero values by protein", xlim = c(0, 20))
### The histograms show that the 0s look binomial with p = 0.1 ###
data_obs_2 <- read.table("data_obs_2.txt", header = TRUE, sep = "\t")
number_nas_2 <- sum(is.na(data_obs_2[, -1]))
number_0s <- sum(data_true[, -1] == 0)
## Ratio of true 0s to NAs: almost exactly 1/3, similar for each data_obs
number_0s / number_nas_2
## 0.3330392
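The mean-imputation baseline described in the post can be sketched as follows. This is only an illustration on a small simulated matrix, not on the challenge files: the toy dimensions, the noise levels, the missing-completely-at-random mask, and the names `toy_true`, `toy_obs`, and `impute_row_mean` are all assumptions. Since every true 0 is already NA in the observed sets, the row mean of the observed values is the mean non-zero expression.

```r
set.seed(1)
n_prot <- 200
n_samp <- 80

# Toy "true" matrix: one level per protein plus small noise (assumed structure)
prot_level <- rnorm(n_prot, mean = 20, sd = 3)
toy_true <- matrix(prot_level, n_prot, n_samp) +
  matrix(rnorm(n_prot * n_samp, sd = 0.3), n_prot, n_samp)

# Hide 20% of the entries completely at random (assumed mechanism)
miss <- matrix(runif(n_prot * n_samp) < 0.2, n_prot, n_samp)
toy_obs <- toy_true
toy_obs[miss] <- NA

# Replace each NA with the mean of the observed values for that protein (row)
impute_row_mean <- function(x) {
  row_means <- rowMeans(x, na.rm = TRUE)
  idx <- which(is.na(x), arr.ind = TRUE)
  x[idx] <- row_means[idx[, "row"]]
  x
}

toy_imp <- impute_row_mean(toy_obs)

# Root mean squared error on the hidden entries only
rmse <- sqrt(mean((toy_imp[miss] - toy_true[miss])^2))
rmse
```

On data like this, where each protein varies little across samples, the row mean is hard to beat; whether that holds for the challenge data depends on how the missing values were actually generated.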
Created by Joseph Usset (mojusset)

The missing events (including true-zeros) are not independent. And the missing mechanism is a mixture of MAR and MNAR. Mean-imputation in general does not give the optimal performance in these scenarios. The participants are encouraged to make use of the dependence among proteins in their imputation algorithm.
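One way to use dependence among proteins, sketched here only as an illustration: predict each missing value from the most correlated proteins in the same sample. The reply does not name a method, so this correlation-weighted neighbour average is a stand-in technique, not the organizers' approach. The toy data (a shared sample-level factor, values missing completely at random rather than a MAR/MNAR mixture) and the names `impute_neighbors`, `toy_true`, and `toy_obs` are assumptions.

```r
set.seed(2)
n_prot <- 100
n_samp <- 80

# Toy data: proteins share a sample-level factor, so they are correlated (assumed)
level    <- rnorm(n_prot, mean = 20, sd = 3)
loading  <- runif(n_prot, 0.5, 1.5)
factor_j <- rnorm(n_samp)
toy_true <- level + outer(loading, factor_j) +
  matrix(rnorm(n_prot * n_samp, sd = 0.3), n_prot, n_samp)

miss <- matrix(runif(n_prot * n_samp) < 0.2, n_prot, n_samp)
toy_obs <- toy_true
toy_obs[miss] <- NA

impute_neighbors <- function(x, k = 10) {
  mu <- rowMeans(x, na.rm = TRUE)
  s  <- apply(x, 1, sd, na.rm = TRUE)
  z  <- (x - mu) / s                         # standardize each protein on observed values
  r  <- cor(t(x), use = "pairwise.complete.obs")
  diag(r) <- 0                               # a protein is not its own neighbour
  out <- x
  for (i in seq_len(nrow(x))) {
    na_j <- which(is.na(x[i, ]))
    if (length(na_j) == 0) next
    nbr <- order(abs(r[i, ]), decreasing = TRUE)[seq_len(k)]
    for (j in na_j) {
      # Correlation-weighted average of the neighbours' standardized values in sample j
      pred_z <- mean(r[i, nbr] * z[nbr, j], na.rm = TRUE)
      if (is.nan(pred_z)) pred_z <- 0        # no neighbour observed: fall back to the row mean
      out[i, j] <- mu[i] + s[i] * pred_z
    }
  }
  out
}

rmse <- function(imp) sqrt(mean((imp[miss] - toy_true[miss])^2))
row_mean_imp <- replace(toy_obs, miss,
                        rowMeans(toy_obs, na.rm = TRUE)[row(toy_obs)[miss]])
nbr_imp <- impute_neighbors(toy_obs)

rmse_mean <- rmse(row_mean_imp)
rmse_nbr  <- rmse(nbr_imp)
c(row_mean = rmse_mean, neighbors = rmse_nbr)
```

In this toy setup the neighbour-based estimate should beat the row mean because the simulated proteins move together across samples. How much it helps on the challenge data depends on how strong that dependence really is there.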
Sub-challenge 1: data_true 0s simulated using bernoulli p =.1, data_obs NAs simulated with exactly 1/3 being 0