I have the following views on these challenges:
In order to participate in these challenges, you need to learn a lot of techniques just to download the data and submit via Docker, which means learning a lot of additional software. In addition, the test and validation files are not available, so code fails during testing/validation even though it works fine on the training dataset locally. CPU time is another major concern: we implemented our method with a Random Forest, which takes a long time in prediction. Since the organisers have limited CPU capacity and many participants are submitting code, our method failed because it could not get sufficient CPU time. In summary, we are focusing more on technical hurdles than on developing scientific algorithms for better prediction. Our group regularly participates in other competitions such as CASP (Critical Assessment of protein Structure Prediction), where the organisers provide protein sequences and we submit the predicted structures. Everything there is so smooth and well defined that we hardly spend any time on technical issues.
In addition, names and studies are changed frequently without informing participants, so users have no option except to guess and proceed by trial and error. The organisers should reconsider releasing the test data to participants so that they can check their code on a local machine and then submit the final code on Synapse. We cannot even create large log files to debug our code on the Synapse machine.
In summary, we are spending most of our time solving technical issues rather than on scientific advancement. The organisers should make these challenges more user-friendly if they wish to utilise the full potential of crowdsourcing; otherwise only a limited part of the scientific community will participate in this type of challenge.
Yuanfang:
1) I'm here to learn, and would be very happy if you could politely provide some examples of what is "completely wrong" with my understanding of the challenge.
2) I'm sorry that the comment about computing resources made you uncomfortable. It was not directed towards you or anyone for using resources; I was only pointing out that cloud solutions are rather expensive, and there have to be limits if the organizers want to make sure everyone has a fair chance to evaluate their code.
Manuel, your understanding of how machine learning works is completely wrong.
But that doesn't matter. What makes me feel **EXTREMELY** uncomfortable is when one participant educates another participant by saying things like 'keep in mind...', 'these things are expensive...', 'you already got enough resources for not paying.'
@ArtemSokolov The goal, as the organizers stated, is to be able to successfully predict new samples with gaps in the RNA coverage. If the test data were available, how would you tell whether a model is simply over-fitting the test data or will genuinely produce good predictions on new samples? They need to hold out some samples for validation, or else anyone could just train on the validation set; personally, I wouldn't be interested in the challenge at all if the test data were available. Which is also why the models have to be "pre-trained": if you always need to re-train on new data, then your software is not actually learning anything.
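(To make the held-out-data argument above concrete: a minimal sketch, assuming scikit-learn and a synthetic dataset, of how a model can look perfect on the data it was fit to while doing much worse on samples it never saw. The dataset and model choices here are hypothetical, not the challenge's actual setup.)

```python
# Minimal sketch (hypothetical data/model): why held-out samples are needed.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the challenge data (not the real CNV/RNA features).
X, y = make_classification(n_samples=500, n_features=50, n_informative=5, random_state=0)
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Score on data the model has seen vs. data it has not.
print("training accuracy:", accuracy_score(y_train, model.predict(X_train)))      # typically ~1.0
print("held-out accuracy:", accuracy_score(y_holdout, model.predict(X_holdout)))  # typically noticeably lower
```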
The IT infrastructure point is valid. I think it would have helped if the organizers had disclosed the hardware limitations or some upper bound earlier, so that they could actually enforce them and kill processes that use too much before the system gets overwhelmed. But at least the organizers have addressed this for the next round (though maybe they should do it a bit more formally, say by e-mail). Also keep in mind that running those hour-long computations on multi-CPU, large-memory instances isn't free, nor is setting up the infrastructure to do all the testing and evaluation. We can't all have dedicated high-performance computing servers that will magically accommodate any kind of software architecture; I think what we're given is pretty generous considering what it would cost out of my own pocket.

It is indeed strange, Artem. It is great that all of the participants could unite and voice our concerns together.

@mbelmadani: Releasing test data means releasing input-space features (CNV & RNA expression in this case) without labels (which are to be predicted). In most real-world scenarios, test data is available and can be used to guide model training in a semi-supervised fashion.
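(On the infrastructure point above about enforcing hardware limits: a minimal sketch, assuming the evaluation host launches each submission via the Docker CLI. The image name, limits, and timeout are hypothetical, not the challenge's actual configuration.)

```python
# Minimal sketch (hypothetical image/limits): cap a submission's memory, CPUs,
# and wall-clock time so a runaway container is killed instead of overwhelming the host.
import subprocess

NAME = "submission-run"  # hypothetical container name

cmd = [
    "docker", "run", "--rm", "--name", NAME,
    "--memory", "10g",          # hard memory cap; exceeding it gets the container OOM-killed
    "--cpus", "2",              # limit to 2 CPU cores
    "submission-image:latest",  # hypothetical participant image
]

try:
    subprocess.run(cmd, check=True, timeout=4 * 3600)  # 4-hour wall-clock limit
except subprocess.TimeoutExpired:
    # The docker client process was killed on timeout; stop the container itself as well.
    subprocess.run(["docker", "kill", NAME], check=False)
    print("Submission exceeded the wall-clock limit and was terminated.")
except subprocess.CalledProcessError as e:
    print(f"Submission exited with non-zero status {e.returncode}")
```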
Docker containers are extremely important for reproducibility. However, I personally think that they should be introduced during the "community phase", where the goal is to fully reconstruct pipelines that go from data to predictions. This way, containers can be more complete (i.e., include training, not just running pre-trained models) and authors can deal with the technical aspects without the pressure of a submission deadline.
The current usage of Docker is a bit strange: it obscures test data, which is normally visible in the real world; it doesn't maintain full reproducibility, since models are pre-trained; and it is highly dependent on the IT infrastructure that runs the containers, as we experienced on Sunday.
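(To make the "include training, not just pre-trained models" idea above concrete: a minimal sketch of a container entrypoint that exposes both a training and a prediction step. The command-line interface, file paths, and defaults are hypothetical, not from the challenge.)

```python
# Minimal sketch (hypothetical CLI/paths): one entrypoint that can either retrain the
# model from raw data or run a previously trained model on new samples, so the
# container reproduces the full pipeline rather than only inference.
import argparse
import pickle


def train(data_path: str, model_path: str) -> None:
    model = ...  # fit a model on the training data at data_path (omitted in this sketch)
    with open(model_path, "wb") as f:
        pickle.dump(model, f)


def predict(model_path: str, input_path: str, output_path: str) -> None:
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    ...  # load input_path, call model.predict, write predictions to output_path (omitted)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Container entrypoint: train or predict.")
    sub = parser.add_subparsers(dest="command", required=True)

    p_train = sub.add_parser("train")
    p_train.add_argument("--data", required=True)
    p_train.add_argument("--model", default="/model/model.pkl")

    p_pred = sub.add_parser("predict")
    p_pred.add_argument("--model", default="/model/model.pkl")
    p_pred.add_argument("--input", required=True)
    p_pred.add_argument("--output", required=True)

    args = parser.parse_args()
    if args.command == "train":
        train(args.data, args.model)
    else:
        predict(args.model, args.input, args.output)
```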
Yeah... they always have their defenders. But based on past experience, the defenders only appear before the challenge ends; after that, they become the complainers...

Just my opinion, but I don't see these as issues. The challenge expects the code to run on minimal CPU and memory (up to 10 GB, which is huge considering the size of the input); even what we're provided is generous, in my opinion, for evaluating a model that should already be trained. Docker is super useful for submissions and iterations; I can't imagine the horror of having another set-up geared to handle all the different versions of R/Python, or even more finicky programming languages, and installing the specific packages for everyone. That would be infeasible. Big log files would allow people to cheat and extract data from the challenge. I agree that I would like more feedback on why things failed (e.g., maybe reporting the row in the prediction that failed validation, or something like that).
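(On the "report which row failed validation" suggestion above: a minimal sketch, assuming a CSV of predictions with hypothetical column names and range checks; this is illustrative, not the challenge's actual validator.)

```python
# Minimal sketch (hypothetical file layout): validate a predictions CSV and report
# the offending row number, without echoing the underlying data back to participants.
import csv
import math


def validate_predictions(path: str) -> list[str]:
    errors = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["sample_id", "prediction"]:  # hypothetical required header
            return [f"unexpected header: {reader.fieldnames}"]
        for i, row in enumerate(reader, start=2):  # line 1 is the header
            try:
                value = float(row["prediction"])
            except ValueError:
                errors.append(f"line {i}: prediction is not numeric")
                continue
            if math.isnan(value) or not (0.0 <= value <= 1.0):
                errors.append(f"line {i}: prediction out of range [0, 1]")
    return errors


if __name__ == "__main__":
    for msg in validate_predictions("predictions.csv"):
        print(msg)
```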
About the data: if I recall, they said they would release the data, but it is embargoed until publication. And how would this challenge even work if the full data were already available? That would make no sense; people could fit the data extremely well while providing solutions that don't generalize to anything else. The performance is not expected to be optimal at this stage; rather, you should find a solution that achieves good scores in spite of the missing-data constraints, which is the real application of the challenge IMO, not the predictions on the actual data. I'm 100% for open data, but I don't think the organizers are doing anything malicious here.
Just my two cents: I understand there have been issues here and there, but organizing a challenge of this scale is not trivial. I think they've made appropriate choices in designing the challenge, and hopefully the things that didn't go as smoothly can be learned from for next time. I appreciate the effort that's been going into organizing this.

I partially agree with this comment.
Typically, Docker has been required in challenges where the contributors are not willing to fully release the data, whether training or testing. I can completely understand and fully support this in cases where the data is so sensitive that releasing it may raise human-subjects issues.
However, in some challenges like this one, I think the whole Docker setup serves to prevent participants from having full access to the data, and thus guarantees that the organizers maintain sole proprietorship of the data.
If the data providers are called data providers, then please provide the data.