Hello, I had a look at the data provided for this challenge and found several issues with it:

1. As recently mentioned in a discussion post, some CIDs are missing (the same ones as in that post). On top of these, several CIDs are specified in the TASK2 component definition file but are not used (-3 to -9 CIDs) and are not found in CID.csv either.
2. I find the lack of labeling of certain columns a bit confusing. For example, some solvents are abbreviated as "nt", "pg", "dep", but no description is provided explaining these abbreviations. Other solvents are not abbreviated.
3. I found several examples of stimuli that correspond to the same components but do not have the same RATA labels (e.g. G523 and L973) in the TASK2 single RATA file. Could you provide an explanation for this observation?
4. I found what I believe is a typo in the TASK2 multi-component data, where Stimulus ID C852 is associated with components "338;338", but the same Stimulus is found in TASK1 with only one molecule, which corresponds to the molecule of component "338". Also, this C852 datapoint is only found in TASK2 single RATA.
5. I found that the column "components" in the TASK2 single RATA file is misaligned with the Stimulus ID. For example, Stimulus 204 in the TASK2 single RATA file corresponds to "component 344, molecule 7127 and dilution 0.2", but in the TASK2 component definition, Stimulus 204 is actually linked to component 345, which corresponds to "CID 7127, dilution 0.2 and solvent pg".

**Here is the colab link so anyone can reproduce what I did:** [link](https://colab.research.google.com/drive/1cSb67BWyECJOIMedMGVY6B8NJZyw1ido?usp=sharing)

I would say that most of these issues can easily be solved on my side (except understanding 2 and 3), but I would appreciate it if the organizers could take a look and provide explanations, especially for point 3. Thanks for your help!
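For anyone who wants to repeat the check in point 1 without opening the notebook, here is a minimal sketch using only the Python standard library. It is not the code from the colab link. The column names (`component_id`, `CID`, `name`) and all values in the toy tables are assumptions made for illustration; replace them with the real headers and files.

```python
import csv
import io

# Toy stand-ins for the component definition file and CID.csv.
# Column names and values are hypothetical, not taken from the real files.
COMPONENT_DEF = """component_id,CID,dilution,solvent
1,100,0.2,pg
2,-3,0.5,nt
3,200,0.2,pg
"""

CID_TABLE = """CID,name
100,example_a
200,example_b
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def missing_cids(component_rows, cid_rows, cid_col="CID"):
    """Return CIDs used in the component definition but absent from the CID table."""
    known = {row[cid_col] for row in cid_rows}
    used = {row[cid_col] for row in component_rows}
    return sorted(used - known)


missing = missing_cids(read_rows(COMPONENT_DEF), read_rows(CID_TABLE))
print(missing)
```

With the real files, `read_rows` could be swapped for `csv.DictReader` over an open file handle; the set difference stays the same.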

Created by miray
Hi @Songxuebo, when are we going to get the complete, fixed files?
Hi, the mismatch I am referring to is exactly the same as in Q5; it just repeats 55 more times. Some component IDs used for the Stimulus definition in "TASK2 single RATA.csv" are not the same as the component IDs used in the TASK2 Stimulus definitions obtained from the "TASK2 stimulus definition.csv" and "TASK2 component definition.csv" files.

For example:

- Stimulus 204 in the TASK2 single RATA file corresponds to "component 344, molecule 7127 and dilution 0.2", but in the TASK2 Stimulus definition file, Stimulus 204 is actually linked to component 345, which corresponds to "CID 7127, dilution 0.2 and solvent pg" in the TASK2 Component definition file.

**This issue happens for 55 other rows in TASK2 single RATA.** I listed them in the other thread, but here are some examples:

- Stimulus 992 in the TASK2 single RATA file corresponds to "component 339, molecule 6989 and dilution 0.2", but in the TASK2 definition, Stimulus 992 is actually linked to component 340, which corresponds to "CID 6989, dilution 0.2 and solvent pg" in the TASK2 Component definition file.
- Stimulus J622 in the TASK2 single RATA file corresponds to "component 340, molecule 700 and dilution 0.5", but in the TASK2 definition, Stimulus J622 is actually linked to component 341, which corresponds to "CID 700, dilution 0.5 and solvent pg" in the TASK2 Component definition file.
- ...
- Stimulus AF445 in the TASK2 single RATA file corresponds to "component 530, molecule 9589 and dilution 0.2", but in the TASK2 definition, Stimulus AF445 is actually linked to component 531, which corresponds to "CID 9589, dilution 0.2, and solvent pg" in the TASK2 Component definition file.

You mentioned that "A for Q5: The component information in the TASK2 single RATA file is actually correct. I will revise the relevant component definition files to align with it. I really appreciate your careful review and feedback". Therefore, I assume that all of these other component definition mismatches need to be addressed as well.
I hope this makes my concern clearer. Happy to provide some more clarification. Thanks!
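To make the mismatch concrete, here is a minimal sketch of the comparison. The stimulus and component IDs in the first four rows are the examples quoted above; the `X000` row and the field names `stimulus` and `components` are hypothetical and only there to show a row that agrees in both files.

```python
# Component ID per stimulus as listed in the single RATA file
# (first four rows are the examples from this post; X000 is made up).
RATA_ROWS = [
    {"stimulus": "204", "components": "344"},
    {"stimulus": "992", "components": "339"},
    {"stimulus": "J622", "components": "340"},
    {"stimulus": "AF445", "components": "530"},
    {"stimulus": "X000", "components": "1"},
]

# Component ID per stimulus as listed in the stimulus definition file.
DEFINITION_ROWS = [
    {"stimulus": "204", "components": "345"},
    {"stimulus": "992", "components": "340"},
    {"stimulus": "J622", "components": "341"},
    {"stimulus": "AF445", "components": "531"},
    {"stimulus": "X000", "components": "1"},
]


def component_mismatches(rata_rows, definition_rows):
    """List (stimulus, rata_components, definition_components) where the two files disagree."""
    defined = {row["stimulus"]: row["components"] for row in definition_rows}
    mismatches = []
    for row in rata_rows:
        expected = defined.get(row["stimulus"])
        # Stimuli absent from the definition file are skipped, not reported.
        if expected is not None and expected != row["components"]:
            mismatches.append((row["stimulus"], row["components"], expected))
    return mismatches


mismatches = component_mismatches(RATA_ROWS, DEFINITION_ROWS)
for stimulus, in_rata, in_definition in mismatches:
    print(f"{stimulus}: RATA file says {in_rata}, definition file says {in_definition}")
```

In each of the quoted examples the definition file's component ID is one higher than the ID in the RATA file, which this listing makes easy to see.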
Hi, Could you clarify what type of mismatch you're referring to? Do you mean the values differ when matching the component with the corresponding RATA data?
Hi, @dskhanirfan, Task 1 and Task 2 share some of the single-molecule RATA data. You can easily identify the differences by directly comparing the relevant rows between the two. We’ve just updated the data — the corrected file has been uploaded as Task2_single_RATA_fixed_V1.csv. For all data updates and fixed files, please refer to the **Data Correction & Update Notes thread**. Let me know if you have any further questions!
Hi @Songxuebo, can you please explain which Task 2 files depend on Task 1 files, and how? And when will we get the corrected data? The discussion is confusing.
Hi @ranaabarghout and @miray, I am working on that and will post an update today or tomorrow.
Hi there, will there be an updated file for the stimulus definitions uploaded soon?
Hi @Songxuebo, thanks for your reply! I would appreciate if the issue raised in Q5 could be solved quickly. I believe progress in the challenge is harder to make if the data provided is erroneous. Thanks again for taking the time to review my code.
A for Q5: The component information in the TASK2 single RATA file is actually correct. I will revise the relevant component definition files to align with it. I really appreciate your careful review and feedback. I was planning to upload the corrected data files, but to avoid confusion with users downloading different versions, I’ll open a new discussion thread shortly to document and track all data updates moving forward.
A for Q4: C852 is only associated with component "338", and the duplicate entry "338;338" in the TASK2 multi-component file is indeed a typo.
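Until the file is corrected, a quick way to spot entries like "338;338" is to look for a component ID that appears more than once in the same semicolon-separated field. This is only a sketch; the `;` separator is taken from the example above, and the function name is hypothetical.

```python
def repeated_components(field, sep=";"):
    """Return component IDs that occur more than once in a separated field, in first-seen order."""
    ids = [part.strip() for part in field.split(sep) if part.strip()]
    seen = set()
    repeated = []
    for component_id in ids:
        if component_id in seen and component_id not in repeated:
            repeated.append(component_id)
        seen.add(component_id)
    return repeated


print(repeated_components("338;338"))
```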
A for Q3: The two stimuli (G523 and L973) you mentioned were tested in different projects, which involved different participant groups. As a result, variations in the RATA profiles are expected due to differences in participant perception. These duplicates were included to reflect real-world variability, enrich the training data, and support more robust modeling. We encourage you to review them and decide how best to handle them for your approach.
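As one possible way to handle such duplicates (not a recommendation from the organizers), the RATA profiles of stimuli that share the same components could be averaged into a single profile. The sketch below assumes hypothetical descriptor columns `sweet` and `floral`, a made-up component ID, and made-up ratings; only the stimulus names G523 and L973 come from this thread.

```python
from collections import defaultdict
from statistics import mean

# Toy rows: component IDs, descriptor names, and ratings are all hypothetical.
ROWS = [
    {"stimulus": "G523", "components": "10", "sweet": 2.0, "floral": 4.0},
    {"stimulus": "L973", "components": "10", "sweet": 3.0, "floral": 1.0},
    {"stimulus": "Z001", "components": "11", "sweet": 0.0, "floral": 5.0},
]


def average_by_components(rows, descriptors):
    """Average each descriptor over all stimuli that share the same components value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["components"]].append(row)
    return {
        components: {d: mean(member[d] for member in members) for d in descriptors}
        for components, members in groups.items()
    }


averaged = average_by_components(ROWS, ["sweet", "floral"])
print(averaged)
```

Keeping both rows as separate training examples, or using the spread between them as a noise estimate, are equally reasonable alternatives; which is better depends on the model.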
A for Q2: A reference table mapping solvent abbreviations to their full names and descriptions has been posted under the **Solvent abbreviations and full names** thread. Please feel free to refer to it as needed.
Hi @Songxuebo, Thanks for your reply and for updating the data for my first question. I would appreciate if you could also answer Q2/3/4 and 5. Thanks for your help!
A for Q1: Hi, the issue stems from the fact that the version of TASK2Componentdefinition.csv I provided was overcomplete. It included some molecules that aren’t actually used in this challenge. I’ll clean up the file to remove those unused entries and upload a revised version shortly.
