Dear Organizers, Thanks for posting the results. I was surprised to still find top QWK scores of 1.0 +/- 0.0, making it clear that there is data leakage between the training and final data sets. The data leakage affects not only Task 1 but also Task 2, due to correlations between 6e10 and AT8 values across different brain sections of the same donor (by simply memorizing 6e10 in A9, one can get a CCC score of 0.69 on the corresponding MTG data). Thus, it would be nice to have some comments and clarifications from the organizers on the results page about whether there is data leakage, and whether one has learnt something meaningful about AD prediction from the leaderboard ranks. Thanks!
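To be concrete about the memorization baseline I mean, here is a minimal sketch with made-up, donor-matched 6e10 values purely for illustration (the numbers and loading are placeholders, not the actual challenge tables):

```python
import numpy as np

def ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return 2.0 * cov / (y_true.var() + y_pred.var() + (mu_t - mu_p) ** 2)

# "Model": ignore the transcriptomic input entirely and reuse each donor's
# memorized 6e10 value from A9 as the prediction for the same donor's MTG value.
a9_6e10  = [0.1, 0.4, 0.9, 1.3, 2.0]   # prediction: memorized A9 value per donor (illustrative)
mtg_6e10 = [0.2, 0.5, 0.7, 1.5, 1.8]   # target: the same donor's MTG value (illustrative)
print(round(ccc(mtg_6e10, a9_6e10), 3))
```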

Created by Nikhil Karthik nkck
Hi @hsinyul, Thank you for the clarification. Indeed, the ranking is correct from the challenge perspective, but it doesn't serve any purpose. Please note that it was already clear to everyone from the validation round that there is data leakage, and if winning the challenge were the only motive, one could have just submitted a model consisting of if-else statements (a sketch is below). Thus, I strongly urge the organizers to consider my suggestion. Best, Nikhil
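To spell out what I mean by an if-else model: something along these lines, with a hypothetical donor-level label memorized from the training release and scored with sklearn's quadratic weighted kappa, would score perfectly whenever test donors overlap with training donors (the donor IDs and labels here are invented for illustration):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical donor-level labels memorized from the training release.
memorized = {"D001": 4, "D002": 2, "D003": 6, "D004": 3}

test_donors = ["D002", "D004", "D001"]
y_true = [2, 3, 4]                          # same donors reappear in the test set
y_pred = [memorized[d] for d in test_donors]  # pure lookup, no scRNA-seq used

print(cohen_kappa_score(y_true, y_pred, weights="quadratic"))  # 1.0
```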
Hi Nikhil,

Thank you for your suggestion and contribution. I'll answer your questions separately based on 1) what we can learn from our challenge, and 2) whether the leaderboard is fair.

For question 1, given the significant effort required for data collection and curation, our cohort is small, though we have high-quality data. With that constraint, the question we can answer in this challenge is: how much information can we learn from one region about another region? While this is a different question from how much we can learn from scRNA-seq data alone, it is still a valid question. In addition, the contexts we provided do have biological relevance: sex, age, and APOE are heavily correlated with AD progression. However, due to the size of the cohort, it is likely that a model can overfit on these contexts, limiting our ability to test generalizability. In the write-up, we will try to disentangle these questions and answer them to the best of our ability.

For question 2, we are aware that the leaderboard reflects how well someone can learn from the constrained datasets rather than how well a model can generally predict AD from unseen scRNA-seq data. Since the challenge did not exclude people from using context (and note that team MetFormin-121 has a detailed write-up about their approach, which did not utilize any context information), the current leaderboard is still valid from a challenge perspective, though not necessarily from a generalizability perspective. We are discussing within the organizer team whether to test the models on another cohort to address the generalizability question and, if so, what logistics we should consider. We will discuss your suggestion internally, but since we didn't explicitly state that we are looking for the team with the best generalizability, the leaderboard will remain until the organizing team discusses your suggestion and reaches a joint decision.

In short, we recognize the limitations of our challenge and encourage people to read the write-ups of each team. We will give a more detailed summary in the upcoming write-up.

Best,
Jane
Hi @ktravaglini, Thanks a lot for the detailed response. I look forward to your write-up on the challenge, and to the actual scientific impact of this work. Given your explanation, I hope you agree that the ranking might not be representative of the true performance of the submitted models without further analysis. So I would kindly suggest making the results "unranked" in fairness to all the participants, who have put in a lot of work. Best, Nikhil
Hi @nkck,

Thank you for raising these points. Several of us have been traveling for conferences, so we have not been able to sit down and reply in the way we would want. We plan to describe the challenge structure shortly, which should hopefully provide clarity.

Briefly, the training data were from the middle temporal gyrus and frontal cortex of the full SEA-AD cohort (n=84 donors). We chose these because they were already public/released. The validation and test datasets are from the superior temporal gyrus and inferior temporal gyrus, respectively, from a subset of these donors (those without co-morbidities, n=43 donors each). The quantitative neuropathology data for these regions had not been released, so they offered a reasonably unseen benchmark for Task 2. Though, as you have already noted, there is correlation between the quantitative neuropathology in these regions/donors and that in the MTG/DFC. These will be important caveats to discuss in the manuscript about the challenge.

For Task 1, even though perfect predictions were possible due to donor overlap, we felt there was still value in predicting donor-level variables. We would have loved a completely separate cohort, but unfortunately all of those we have access to are already publicly available.

Hopefully that provides some initial insight that we can expand on further!

Best,
Kyle
cc @SEA-ADDREAMChallengeOrganizers: Here are some possible solutions I can suggest for the likely presence of data leakage:

1. A detailed sensitivity study of all submitted models through random perturbations of various inputs (if a model's predictions change drastically when a donor's age is changed from 85 to 82, it is perhaps meaningless; a sketch of such a check is below).
2. Or, ask participants to redo the analysis on a new training data set with no donor overlap between the training and test sets, trusting participants to submit new models trained only on the new training set.
3. Or, consider releasing just unranked final scores without any "top-performing teams".

Data leakage or not, a value of QWK = 1.0 needs an explanation. Thanks! -Nikhil
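For point 1, a minimal sketch of the kind of perturbation check I have in mind, assuming a generic predict(features) interface and a hypothetical "age" field (the names are placeholders, not any team's actual API):

```python
import copy

def age_sensitivity(predict, donor_features, delta=3):
    """Compare predictions before and after a small, biologically plausible
    perturbation of a single context variable (here: donor age)."""
    perturbed = copy.deepcopy(donor_features)
    perturbed["age"] = perturbed["age"] - delta  # e.g. 85 -> 82
    return abs(predict(donor_features) - predict(perturbed))

# Hypothetical usage: a large jump would suggest the model keys on the context
# (memorization) rather than on the scRNA-seq signal.
# shift = age_sensitivity(model.predict, {"age": 85, "sex": "F", "APOE": "e3/e4"})
```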
