**Statement 1**: **Solvent abbreviations and full names**

| Solvent | Full Name |
|---|---|
| NT | Neat (undiluted) |
| PG/1,2-Propanediol | Propylene Glycol/1,2-Propanediol |
| paraffin oil | Paraffin Oil |
| dep | Diethyl Phthalate |
| 99% ethanol | 99% Ethanol |
| 90% ethanol | 90% Ethanol |
| Mineral oil | Mineral Oil |

Created by Xuebo Song (Songxuebo)

Hi @miray; @ranaabarghout; I’ve uploaded a new components definition file for the stimuli. I hope you find it helpful.

**Statement 5**: **Description of “TASK2_Components_definition_fixed_V1.csv”**
After making corrections to TASK2_Components_definition, a new version has been uploaded.

**Statement 4**: **Description of “TASK2_Stimulus_definition_fixed_V1.csv”**
After making corrections to TASK2_Stimulus_definition, a new version has been uploaded.

**Statement 3**: **Description of “Task2_single_RATA_fixed_V2.csv”**
In the original Task2_single_RATA_fixed_V2.csv file, two modifications were made:
The component for stimulus 204 was reverted from 345 back to 344.

**Statement 2**: **Description of “Task2_single_RATA_fixed_V1.csv”**
In the original Task2_single_RATA.csv file, two modifications were made:
1. The component for stimulus 204 was updated from 344 to 345.
2. For all rows where the components field was NA, the value was replaced with the following text: " Refer to TASK1_Stimulus_definition.csv"

Hi there,
I just want to ask whether there is an updated file fixing the issues discussed above, @Songxuebo. As mentioned, this wasn't a one-time issue. Here are some other examples:
In _Task2_single_RATA_fixed_V1.csv_, we see the following:
```
992,339,6989,0.2,
```
while in _TASK2_Component_definition.csv_, we see:
```
339,6920,0.36,pg
```
So there is a mismatch between the stimulus definition and the components. For example, for stimulus 992, component 340 is listed in _TASK2_Stimulus_definition.csv_, not 339 as listed in the RATA file. Component 340 seems to have the CID, dilution, and solvent that match stimulus 992. Does this mean the RATA file is the one with the errors? Please **note that this happens in many cases, not just the one listed in this example and the stimulus 204 case, which you already fixed**.

Hi,
From my understanding of how the "TASK2 Stimulus definition.csv" file works, the goal of associating component ID(s) with a stimulus ID is for each component ID to act as a unique query ID for the following list of attributes: [CID, dilution factor, solvent]. By this, I mean that a component ID matches a unique list of attributes. Therefore, provided that my understanding is correct, the "molecule" and "dilution" columns in "TASK2 single RATA.csv" do not need to be included to showcase the misalignment between stimulus and component IDs in the files "TASK2 Stimulus definition.csv" and "TASK2 single RATA.csv".
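One way to test that assumption is to check that no component ID maps to more than one [CID, dilution factor, solvent] triple. Below is a minimal sketch, assuming the components definition file has columns named `id`, `CID`, `dilution` and `solvent`. These column names, the helper name `ambiguous_component_ids`, and the toy rows are assumptions for illustration and are not taken from the real file.

```python
import pandas as pd

# Toy rows for illustration only; values and column names are assumed.
components = pd.DataFrame(
    {
        "id": [1, 2, 3, 3],
        "CID": [100, 200, 300, 300],
        "dilution": [0.5, 0.2, 0.1, 0.4],
        "solvent": ["pg", "pg", "dep", "dep"],
    }
)


def ambiguous_component_ids(df):
    """Return component IDs that map to more than one attribute triple."""
    # Collapse exact duplicates so repeated identical rows are not flagged.
    triples = df.drop_duplicates(["id", "CID", "dilution", "solvent"])
    counts = triples.groupby("id").size()
    return sorted(counts[counts > 1].index.tolist())


# Component 3 appears with two different dilutions, so it is reported.
print(ambiguous_component_ids(components))
```

If the list comes back empty on the real file, the component ID can safely be used as the only key when comparing files.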
**We should ensure the component IDs given in the "TASK2 single RATA.csv" file match the ones found in the "TASK2 stimulus definition.csv" file.** If we do this, the columns "component", "molecule" and "dilution" in "TASK2 single RATA.csv" will no longer be needed, and the file format will be unified across the RATA label files "TASK2 single RATA.csv", "TASK2 Train mixture dataset.csv" and "TASK1 Training.csv".

Hi @miray, In this challenge, for mixtures and their corresponding single molecules, we’ve provided RATA data for some components at the same concentration, but for others, the data may come from different concentrations or solvents. From reviewing your code, it seems that the current approach may not capture all relevant components under these conditions. In cases where you can't find an exact match for the same component, I’d recommend using the CID information to identify the corresponding single-molecule data.

Hi @Songxuebo,
I just downloaded the file you added to the data folder, and the only corrected mismatch was the one for stimulus 204. What about the other 55 mismatches I highlighted above?

Hi @miray, yes, the updated file has been uploaded as **Task2_single_RATA_fixed_V1.csv**. Please feel free to let me know if you notice anything else!

Hi, thanks for your fast reply! All the data mismatches I have highlighted here exclude missing stimulus definition issues, as demonstrated by the following line of code:
```
if not pd.isna(single_rata_component) and int(single_rata_component) != int(component_defined):
```
Do you have any updates regarding the mismatches I found?
Thanks!

Hi @miray, thanks for the reminder. I’m double-checking the data now. If the stimulus definition is missing, you can refer to the TASK1_Stimulus_definition.csv file.

Hi @Songxuebo,
Unfortunately, the data issue I found in Task2 seems bigger than a single stimulus. I found 56 mismatches in the data file by running the following code:
```
import pandas as pd

task2_single_mol_labels = pd.read_csv("../dataset/raw/Task2_single_RATA.csv")
task2_stimulus = pd.read_csv("../dataset/raw/TASK2_Stimulus_definition.csv")
# Keep only single-component stimuli (mixtures list several components separated by ';')
task2_stimulus_single = task2_stimulus[~task2_stimulus["components"].str.contains(';')].rename(columns={"id": "stimulus"})
count = 0
task2_single_mol_labels = task2_single_mol_labels.sort_values(by="components")
for stimulus in task2_single_mol_labels["stimulus"]:
    if stimulus in task2_stimulus_single["stimulus"].unique():
        # Component ID according to the RATA file vs. the stimulus definition file
        single_rata_component = task2_single_mol_labels.loc[task2_single_mol_labels["stimulus"] == stimulus]["components"].values[0]
        component_defined = task2_stimulus_single.loc[task2_stimulus_single["stimulus"] == stimulus]["components"].values[0]
        # Skip rows with no component in the RATA file; compare the IDs as integers
        if not pd.isna(single_rata_component) and int(single_rata_component) != int(component_defined):
            print("mismatch")
            print("stimulus id:", stimulus)
            print(single_rata_component)
            print(component_defined)
            count += 1
print("Num mistmatch:", count)
```
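The same comparison can also be sketched without the explicit loop, using a merge. This is only an illustration under assumptions: it reuses the `stimulus`, `components` and `id` column names from the code above, the function name `find_component_mismatches` is hypothetical, and the toy rows stand in for the real CSV files. The 992 and 204 pairs are copied from this thread; `X1`, `X2` and `M1` are made up.

```python
import pandas as pd


def find_component_mismatches(rata, stimulus_def):
    """Return RATA rows whose component differs from the defined one."""
    # Keep single-component stimuli only (mixtures contain ';').
    is_mixture = stimulus_def["components"].astype(str).str.contains(";")
    single = stimulus_def[~is_mixture].rename(
        columns={"id": "stimulus", "components": "defined"}
    )
    merged = rata.merge(single[["stimulus", "defined"]], on="stimulus", how="inner")
    # Coerce to numbers; NA or text placeholders become NaN and are skipped.
    merged["rata"] = pd.to_numeric(merged["components"], errors="coerce")
    merged["defined"] = pd.to_numeric(merged["defined"], errors="coerce")
    merged = merged.dropna(subset=["rata", "defined"])
    differs = merged["rata"] != merged["defined"]
    return merged.loc[differs, ["stimulus", "rata", "defined"]]


# Toy stand-ins for the two CSV files.
rata = pd.DataFrame(
    {"stimulus": ["992", "204", "X1", "X2"], "components": [339, 344, 10, None]}
)
stimulus_def = pd.DataFrame(
    {"id": ["992", "204", "X1", "X2", "M1"], "components": ["340", "345", "10", "11", "1;2"]}
)

result = find_component_mismatches(rata, stimulus_def)
print(result)
print("Num mismatch:", len(result))
```

Unlike the loop, this compares every RATA row with its own component value rather than the first one found for a stimulus, so the counts could differ if a stimulus appears more than once with different components.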
When a data issue as large as this one appears, I think it is better to check the entire file. Sorry that I did not quantify the data issue at the time; it would have made it clearer to the organizers that this does not affect just a single data point. When ordering by component number instead of stimulus ID, it seems like the issue stems from a mismatch starting at component 339 or earlier. **Given the potential impact of the data mismatch, I would appreciate it if you could address this issue _promptly_ and double-check the mismatches I found below:**
```
mismatch
stimulus id: 992
339.0
340
mismatch
stimulus id: AQ230
340.0
341
mismatch
stimulus id: J622
340.0
341
mismatch
stimulus id: D323
341.0
342
mismatch
stimulus id: 204
344.0
345
mismatch
stimulus id: AF383
345.0
346
mismatch
stimulus id: H714
348.0
349
mismatch
stimulus id: AI614
356.0
357
mismatch
stimulus id: AF408
360.0
361
mismatch
stimulus id: H644
361.0
362
mismatch
stimulus id: AF925
364.0
365
mismatch
stimulus id: AN132
366.0
367
mismatch
stimulus id: J497
371.0
372
mismatch
stimulus id: B623
380.0
381
mismatch
stimulus id: AQ673
385.0
386
mismatch
stimulus id: J087
385.0
386
mismatch
stimulus id: AF370
388.0
389
mismatch
stimulus id: 715
388.0
389
mismatch
stimulus id: AI722
391.0
392
mismatch
stimulus id: AI094
398.0
399
mismatch
stimulus id: A204
401.0
402
mismatch
stimulus id: C622
414.0
415
mismatch
stimulus id: AF342
415.0
416
mismatch
stimulus id: H127
416.0
417
mismatch
stimulus id: AO016
423.0
424
mismatch
stimulus id: AI856
429.0
430
mismatch
stimulus id: AF610
431.0
432
mismatch
stimulus id: AF586
432.0
433
mismatch
stimulus id: AQ792
432.0
433
mismatch
stimulus id: AF552
433.0
434
mismatch
stimulus id: AI636
441.0
442
mismatch
stimulus id: 199
441.0
442
mismatch
stimulus id: A398
443.0
444
mismatch
stimulus id: E804
445.0
446
mismatch
stimulus id: AI029
457.0
458
mismatch
stimulus id: D313
462.0
463
mismatch
stimulus id: I666
478.0
479
mismatch
stimulus id: D141
483.0
484
mismatch
stimulus id: A866
487.0
488
mismatch
stimulus id: H092
491.0
492
mismatch
stimulus id: 909
493.0
494
mismatch
stimulus id: I769
497.0
498
mismatch
stimulus id: K853
498.0
499
mismatch
stimulus id: D564
498.0
499
mismatch
stimulus id: A072
502.0
503
mismatch
stimulus id: C887
504.0
505
mismatch
stimulus id: J346
515.0
516
mismatch
stimulus id: C749
519.0
520
mismatch
stimulus id: 588
526.0
527
mismatch
stimulus id: AQ692
527.0
528
mismatch
stimulus id: AF890
527.0
528
mismatch
stimulus id: J093
528.0
529
mismatch
stimulus id: AQ892
528.0
529
mismatch
stimulus id: AF986
529.0
530
mismatch
stimulus id: AF445
530.0
531
mismatch
stimulus id: AQ145
530.0
531
Num mistmatch: 56
```
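In every pair in the log above, the defined component ID is exactly one higher than the RATA one, which is consistent with a shifted index rather than isolated typos. A rough sketch for confirming this from a saved copy of the log is shown below. The parser name `parse_mismatch_log` is hypothetical, and the sample is just three blocks copied from the log.

```python
from collections import Counter


def parse_mismatch_log(text):
    """Parse 'mismatch / stimulus id / rata / defined' blocks into tuples."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    records = []
    i = 0
    while i < len(lines):
        # A full block is four lines long, starting with the word 'mismatch'.
        if lines[i] == "mismatch" and i + 3 < len(lines):
            stimulus = lines[i + 1].split(":", 1)[1].strip()
            rata = int(float(lines[i + 2]))  # printed as a float, e.g. 339.0
            defined = int(lines[i + 3])
            records.append((stimulus, rata, defined))
            i += 4
        else:
            i += 1
    return records


sample = """
mismatch
stimulus id: 992
339.0
340
mismatch
stimulus id: AQ230
340.0
341
mismatch
stimulus id: 204
344.0
345
"""

records = parse_mismatch_log(sample)
# Count how often each (defined - rata) offset occurs.
offsets = Counter(defined - rata for _, rata, defined in records)
print(offsets)
```

A single offset value across all blocks would suggest the fix is a systematic shift of the component IDs, not a row-by-row correction.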
Thanks for your help!
**Statement 2**: **Description of “Task2_single_RATA_fixed_V1.csv”**
In the original Task2_single_RATA.csv file, two modifications were made:
1. The component for stimulus 204 was updated from 344 to 345.
2. For all rows where the components field was NA, the value was replaced with the following text: " Refer to TASK1_Stimulus_definition.csv"
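Because the components field now mixes numeric IDs with that placeholder text, a plain `int()` cast on the column would raise a `ValueError` on the placeholder rows. Here is a minimal sketch of one way to read such a column, assuming it is named `components` as in the code earlier in this thread. The toy rows and the `component_id` column are illustrative only; `X1` is a made-up stimulus ID.

```python
import pandas as pd

PLACEHOLDER = "Refer to TASK1_Stimulus_definition.csv"

# Toy rows: one numeric component and one placeholder (with its leading space).
rata = pd.DataFrame(
    {"stimulus": ["204", "X1"], "components": ["345", " " + PLACEHOLDER]}
)

# Flag placeholder rows, ignoring surrounding whitespace.
is_placeholder = rata["components"].astype(str).str.strip() == PLACEHOLDER

# Convert the remaining values to a nullable integer column.
numeric = pd.to_numeric(rata["components"].where(~is_placeholder), errors="coerce")
rata["component_id"] = numeric.astype("Int64")

print(rata)
```

Rows flagged by `is_placeholder` are the ones whose definition would then be looked up in TASK1_Stimulus_definition.csv.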