Data Correction & Update Notes

**Statement 1: Solvent abbreviations and full names**

| Solvent | Full Name |
|---|---|
| NT | Neat (undiluted) |
| PG/1,2-Propanediol | Propylene Glycol/ 1,2-Propanediol |
| paraffin oil | Paraffin Oil |
| dep | Diethyl Phthalate |
| 99% ethanol | 99% Ethanol |
| 90% ethanol | 90% Ethanol |
| Mineral oil | Mineral Oil |
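As a rough illustration of how the solvent table could be used in code, the sketch below stores it as a lookup dictionary. The helper name `solvent_full_name` is hypothetical, and treating "PG" and "1,2-Propanediol" as two alternative labels for the same solvent, as well as matching labels case-insensitively, are assumptions rather than documented behaviour of the data files.

```python
# Lookup built from the Statement 1 table. Keys are lower-cased labels.
# Assumption: "PG" and "1,2-Propanediol" are alternative spellings of one solvent.
SOLVENT_FULL_NAMES = {
    "nt": "Neat (undiluted)",
    "pg": "Propylene Glycol / 1,2-Propanediol",
    "1,2-propanediol": "Propylene Glycol / 1,2-Propanediol",
    "paraffin oil": "Paraffin Oil",
    "dep": "Diethyl Phthalate",
    "99% ethanol": "99% Ethanol",
    "90% ethanol": "90% Ethanol",
    "mineral oil": "Mineral Oil",
}


def solvent_full_name(label):
    """Return the full solvent name, or None if the label is not in the table."""
    return SOLVENT_FULL_NAMES.get(str(label).strip().lower())


print(solvent_full_name("pg"))
print(solvent_full_name("DEP"))
```

Returning `None` for unknown labels, instead of raising, makes it easy to list any solvent strings in the data that the table does not cover.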

Created by Xuebo Song Songxuebo
Hi @miray, @ranaabarghout, I’ve uploaded a new components definition file for the stimuli. I hope you find it helpful.
**Statement 5: Description of “TASK2_Components_definition_fixed_V1.csv”** After making corrections to TASK2_Components_definition, a new version has been uploaded.
**Statement 4: Description of “TASK2_Stimulus_definition_fixed_V1.csv”** After making corrections to TASK2_Stimulus_definition, a new version has been uploaded.
**Statement 3: Description of “Task2_single_RATA_fixed_V2.csv”** In the original Task2_single_RATA_fixed_V2.csv file, two modifications were made: The component for stimulus 204 was reverted from 345 back to 344.
**Statement 2: Description of “Task2_single_RATA_fixed_V1.csv”** In the original Task2_single_RATA.csv file, two modifications were made: 1. The component for stimulus 204 was updated from 344 to 345. 2. For all rows where the components field was NA, the value was replaced with the following text: " Refer to TASK1_Stimulus_definition.csv"
Hi there, I just want to ask whether there is an update fixing the issues discussed above, @Songxuebo. As mentioned, this wasn't a one-time issue. Here are some other examples. In _Task2_single_RATA_fixed_V1.csv_, we see the following:

```
992,339,6989,0.2,
```

while in _TASK2_Component_definition.csv_, we see:

```
339,6920,0.36,pg
```

So there is a mismatch between the stimulus definition and the components. For example, for stimulus 992, component 340 is listed in _TASK2_Stimulus_definition.csv_, not 339 as listed in the RATA file. Component 340 seems to have a CID, dilution, and solvent matching stimulus 992. Does this mean the RATA file is the one with the errors? **Please note that this happens in many cases, not just the one listed in this example and the stimulus 204 case which you already fixed.**
Hi, From my understanding of how the "TASK2 Stimulus definition.csv" file works, the goal of associating component ID(s) with a stimulus ID is for each component ID to act as a unique query ID for the following list of attributes: [CID, dilution factor, solvent]. By this, I mean that a component ID matches a unique list of attributes.

Therefore, provided that my understanding is correct, the "molecule" and "dilution" columns in "TASK2 single RATA.csv" do not need to be included to showcase the misalignment between stimulus and component IDs in the files "TASK2 Stimulus definition.csv" and "TASK2 single RATA.csv".

**We should ensure the component IDs given in the "TASK2 single RATA.csv" file match the ones found in the "TASK2 stimulus definition.csv" file.**

If we do this, the columns "component", "molecule" and "dilution" in "TASK2 single RATA.csv" will no longer be needed, and the file format will be unified across the RATA label files "TASK2 single RATA.csv", "TASK2 Train mixture dataset.csv" and "TASK1 Training.csv".
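To make the "component ID as a unique query ID" idea concrete, here is a minimal sketch of resolving a stimulus to its [CID, dilution, solvent] attributes through two joins. The data frames are toy stand-ins, and every column name and value in them is an assumption for illustration only, not the actual schema of the challenge files.

```python
import pandas as pd

# Toy stand-ins for the three files; names and values are illustrative assumptions.
stimulus_def = pd.DataFrame({"stimulus": ["S1", "S2"], "component": [2, 1]})
component_def = pd.DataFrame({
    "component": [1, 2],
    "cid": [111, 222],
    "dilution": [0.5, 0.1],
    "solvent": ["pg", "dep"],
})
rata = pd.DataFrame({"stimulus": ["S1", "S2"], "rating": [3.0, 1.5]})

# The lookup only works if each component ID identifies exactly one attribute list.
assert component_def["component"].is_unique

# stimulus -> component -> (cid, dilution, solvent); validate= raises if the
# right-hand key is duplicated, which would make the lookup ambiguous.
resolved = (
    rata.merge(stimulus_def, on="stimulus", how="left", validate="many_to_one")
        .merge(component_def, on="component", how="left", validate="many_to_one")
)
print(resolved)
```

With this layout the label file only needs the stimulus ID; the molecule and dilution are recovered from the definition files rather than duplicated in each label row.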
Hi @miray, In this challenge, for mixtures and their corresponding single molecules, we’ve provided RATA data for some components at the same concentration, but for others, the data may come from different concentrations or solvents. From reviewing your code, it seems that the current approach may not capture all relevant components under these conditions. In cases where you can't find an exact match for the same component, I’d recommend using the CID information to identify the corresponding single-molecule data.
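The recommendation above (prefer an exact component match, otherwise fall back to the CID) could be sketched as follows. The function name `find_single_rata` and the column names `component` and `cid` are hypothetical, and the table is toy data.

```python
import pandas as pd


def find_single_rata(single_rata, component_id, cid):
    """Return rows for the exact component if any exist, else rows sharing the CID.

    Fallback rows may come from a different concentration or solvent.
    """
    exact = single_rata[single_rata["component"] == component_id]
    if not exact.empty:
        return exact
    return single_rata[single_rata["cid"] == cid]


# Toy single-molecule table; values are illustrative only.
single_rata = pd.DataFrame({
    "stimulus": ["A", "B"],
    "component": [10, 11],
    "cid": [500, 600],
    "dilution": [0.1, 0.01],
})

exact_rows = find_single_rata(single_rata, component_id=10, cid=500)
fallback_rows = find_single_rata(single_rata, component_id=99, cid=600)
print(exact_rows)
print(fallback_rows)
```

Because a CID fallback can return several rows at different dilutions, a caller would still need to decide how to choose among or combine them.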
Hi @Songxuebo, I just downloaded the file you added to the data folder, and the only corrected mismatch was the one for stimulus 204. What about the other 55 mismatches I highlighted above?
Hi @miray Yes, the updated file has been uploaded as **Task2_single_RATA_fixed_V1.csv**. Please feel free to let me know if you notice anything else!
Hi, thanks for your fast reply! All the data mismatches I have highlighted here exclude missing stimulus definition issues, as shown by the following line of code:

```
if not pd.isna(single_rata_component) and int(single_rata_component) != int(component_defined):
```

Do you have any updates regarding the mismatches I found? Thanks!
Hi @miray, thanks for the reminder. I’m double-checking the data now. If the stimulus definition is missing, you can refer to the TASK1_Stimulus_definition.csv file.
Hi @Songxuebo, Unfortunately, the data issue I found in Task 2 seems to extend beyond a single stimulus. I found 56 mismatches in the data file by running the following code:

```
import pandas as pd

task2_single_mol_labels = pd.read_csv("../dataset/raw/Task2_single_RATA.csv")
task2_stimulus = pd.read_csv("../dataset/raw/TASK2_Stimulus_definition.csv")
# Keep only single-component stimuli (no ';' separator in the components field).
task2_stimulus_single = task2_stimulus[~task2_stimulus["components"].str.contains(';')].rename(columns={"id": "stimulus"})

count = 0
task2_single_mol_labels = task2_single_mol_labels.sort_values(by="components")
for stimulus in task2_single_mol_labels["stimulus"]:
    if stimulus in task2_stimulus_single["stimulus"].unique():
        # Component listed in the RATA file vs. the one in the stimulus definition.
        single_rata_component = task2_single_mol_labels.loc[task2_single_mol_labels["stimulus"] == stimulus]["components"].values[0]
        component_defined = task2_stimulus_single.loc[task2_stimulus_single["stimulus"] == stimulus]["components"].values[0]
        # Skip rows with a missing component; report only genuine disagreements.
        if not pd.isna(single_rata_component) and int(single_rata_component) != int(component_defined):
            print("mismatch")
            print("stimulus id:", stimulus)
            print(single_rata_component)
            print(component_defined)
            count += 1
print("Num mistmatch:", count)
```

When a data issue as large as this one appears, I think it is better to check the entire file. Sorry that I did not quantify the issue at the time; doing so would have made it clearer to the organizers that this is not limited to a single data point. When ordering by component number instead of stimulus ID, it seems the issue stems from a mismatch starting at component 339 or earlier.
**Given the potential impact of the data mismatch, I would appreciate it if you addressed this issue _promptly_ and double-checked the mismatches I found below:**

```
mismatch stimulus id: 992 339.0 340
mismatch stimulus id: AQ230 340.0 341
mismatch stimulus id: J622 340.0 341
mismatch stimulus id: D323 341.0 342
mismatch stimulus id: 204 344.0 345
mismatch stimulus id: AF383 345.0 346
mismatch stimulus id: H714 348.0 349
mismatch stimulus id: AI614 356.0 357
mismatch stimulus id: AF408 360.0 361
mismatch stimulus id: H644 361.0 362
mismatch stimulus id: AF925 364.0 365
mismatch stimulus id: AN132 366.0 367
mismatch stimulus id: J497 371.0 372
mismatch stimulus id: B623 380.0 381
mismatch stimulus id: AQ673 385.0 386
mismatch stimulus id: J087 385.0 386
mismatch stimulus id: AF370 388.0 389
mismatch stimulus id: 715 388.0 389
mismatch stimulus id: AI722 391.0 392
mismatch stimulus id: AI094 398.0 399
mismatch stimulus id: A204 401.0 402
mismatch stimulus id: C622 414.0 415
mismatch stimulus id: AF342 415.0 416
mismatch stimulus id: H127 416.0 417
mismatch stimulus id: AO016 423.0 424
mismatch stimulus id: AI856 429.0 430
mismatch stimulus id: AF610 431.0 432
mismatch stimulus id: AF586 432.0 433
mismatch stimulus id: AQ792 432.0 433
mismatch stimulus id: AF552 433.0 434
mismatch stimulus id: AI636 441.0 442
mismatch stimulus id: 199 441.0 442
mismatch stimulus id: A398 443.0 444
mismatch stimulus id: E804 445.0 446
mismatch stimulus id: AI029 457.0 458
mismatch stimulus id: D313 462.0 463
mismatch stimulus id: I666 478.0 479
mismatch stimulus id: D141 483.0 484
mismatch stimulus id: A866 487.0 488
mismatch stimulus id: H092 491.0 492
mismatch stimulus id: 909 493.0 494
mismatch stimulus id: I769 497.0 498
mismatch stimulus id: K853 498.0 499
mismatch stimulus id: D564 498.0 499
mismatch stimulus id: A072 502.0 503
mismatch stimulus id: C887 504.0 505
mismatch stimulus id: J346 515.0 516
mismatch stimulus id: C749 519.0 520
mismatch stimulus id: 588 526.0 527
mismatch stimulus id: AQ692 527.0 528
mismatch stimulus id: AF890 527.0 528
mismatch stimulus id: J093 528.0 529
mismatch stimulus id: AQ892 528.0 529
mismatch stimulus id: AF986 529.0 530
mismatch stimulus id: AF445 530.0 531
mismatch stimulus id: AQ145 530.0 531
Num mistmatch: 56
```

Thanks for your help!
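For anyone who wants to re-run this kind of check on a corrected file, the loop-based comparison could also be written as a single merge. This is only a sketch under assumptions: it reuses the column names `stimulus`, `components` and `id` from the code in this thread, the function name `find_component_mismatches` is hypothetical, and the frames below are toy data (only the 992 / 339 / 340 row mirrors a case reported here). Unlike the loop, it compares every RATA row for a stimulus, not just the first one.

```python
import pandas as pd


def find_component_mismatches(rata, stimulus_def):
    """Return RATA rows whose component differs from the stimulus definition."""
    # Keep only single-component stimuli (no ';' separator).
    is_single = ~stimulus_def["components"].astype(str).str.contains(";")
    single = stimulus_def[is_single].rename(
        columns={"id": "stimulus", "components": "components_defined"}
    )
    merged = rata.merge(single, on="stimulus", how="inner")
    # Coerce to numbers so NA or placeholder text becomes NaN and is skipped.
    listed = pd.to_numeric(merged["components"], errors="coerce")
    defined = pd.to_numeric(merged["components_defined"], errors="coerce")
    comparable = listed.notna() & defined.notna()
    columns = ["stimulus", "components", "components_defined"]
    return merged.loc[comparable & (listed != defined), columns]


# Toy data for illustration.
rata = pd.DataFrame({
    "stimulus": ["992", "X1", "X2"],
    "components": [339.0, 10.0, None],
})
stimulus_def = pd.DataFrame({
    "id": ["992", "X1", "X2", "M1"],
    "components": ["340", "10", "12", "1;2"],
})

mismatches = find_component_mismatches(rata, stimulus_def)
print(mismatches)
print("Num mismatches:", len(mismatches))
```

Coercing with `pd.to_numeric(..., errors="coerce")` means rows whose component is missing, or holds non-numeric placeholder text, are excluded from the comparison rather than raising an error.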
