Dear Organizers,
I'm re-raising a pertinent question asked by @TaiHsien deep in another thread. The perfect scores being achieved on the leaderboard suggest either data leakage or a tiny validation set. Please clarify whether there is significant overlap in the donor metadata shared by the validation and training data sets, since a small combination of metadata fields uniquely identifies a donor. If there is overlap, should models be expected to use such features? Perhaps it is fine, provided there are (statistically speaking) zero such overlaps in the final test data set.
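To illustrate the concern, here is a minimal sketch of how a handful of metadata fields can re-identify a donor across splits. The field names (age, sex, region) and all values are hypothetical, not the actual challenge schema:

```python
from collections import Counter

def unique_matches(train_meta, val_meta):
    """Return validation metadata combinations that match exactly one training donor."""
    combos = Counter(tuple(row[1:]) for row in train_meta)
    return [tuple(row[1:]) for row in val_meta if combos.get(tuple(row[1:]), 0) == 1]

# Hypothetical rows: (donor_id, age, sex, region) -- illustrative only.
train = [("D1", 67, "F", "MTG"), ("D2", 72, "M", "MTG"), ("D3", 67, "F", "A9")]
val = [("V1", 72, "M", "MTG")]

# The validation row shares a metadata combination with exactly one
# training donor, so a model could memorize donor-specific targets.
print(unique_matches(train, val))  # -> [(72, 'M', 'MTG')]
```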
Thanks a lot in advance!
Created by Nikhil Karthik (nkck)

Hi @ktravaglini,
Thank you so much for your response and the clarification! In case there is metadata overlap, even for Task 2, I just wanted to point out that the CCC between 6e10 in MTG and A9 for common donors is 0.59, with a Pearson R of 0.7. If there is metadata overlap, I hope you will also consider additional robust measures of whether a model depends primarily on gene expression and cell types rather than on secondary metadata details.
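For reference, the two agreement measures I quoted can be computed as below. This is a minimal sketch using Lin's standard CCC formula, not code from the challenge evaluation:

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

def ccc(x, y):
    """Lin's concordance correlation coefficient:
    2*cov / (var_x + var_y + (mean_x - mean_y)^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```

Unlike Pearson R, CCC also penalizes shifts in location and scale, which is why it is the stricter of the two measures for agreement between regions.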
Thanks again for organizing this nice challenge!
Best regards,
Nikhil

Hi @nkck,
We really appreciate your interest in the challenge and attention to detail here. As we previously said, the validation and test data may include new donors and/or data from other regions, though the cell types across regions will be shared. You are correct that if there are overlapping donors, there would be data leakage in Task 1. I am not confirming whether this is the case or not, but I am validating your reasoning. I'll add that Task 2 is designed such that there can be no data leakage, regardless of whether there is donor overlap. In our eyes, the true winning algorithm would perform well in both tasks (and, in the spirit of the challenge, would not be a trivial solution but one that could generalize).
Best,
Kyle
Is validation data set independent of training set?