Hi, can you provide a method/code to generate the step1_submission file from the test data using a trained model? @mschapira
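Not an official answer, but a minimal sketch of one way to build such a file. It assumes your trained model has already produced a score per compound (e.g. via `predict_proba`); the helper name and the column names (`RandomID`, `Sel_200`, `Sel_500`, `Score`) are assumptions based on later replies in this thread, so verify them against the submission example on the challenge wiki.

```python
# Hypothetical sketch, not the organizers' official method.
# Assumes you already have per-compound scores from your trained
# model (e.g. model.predict_proba(X_test)[:, 1]); column names are
# guesses based on this thread -- verify against the wiki example.
import csv

def write_step1_submission(ids, scores, out_path, n200=200, n500=500):
    """Write rows sorted by descending score, labelling the top
    200 / top 500 compounds with 1 in Sel_200 / Sel_500."""
    order = sorted(range(len(ids)), key=lambda i: scores[i], reverse=True)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["RandomID", "Sel_200", "Sel_500", "Score"])
        for rank, i in enumerate(order):
            writer.writerow([ids[i],
                             int(rank < n200),
                             int(rank < n500),
                             scores[i]])
```

Called with the full set of ~339K test IDs and their scores, this writes one ranked row per compound, which matches the format the organizers describe further down in this thread.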
Created by dskhanirfan

Yes, confirmed. @vchung, this is unfortunate. Now the website says that the submission is out of the time range and therefore I cannot submit, even though it is only 10 PM PST. Absolutely nothing is provided in a straightforward manner. The instructions are all over the place, and it takes time to fish out the right information from the challenge website. After spending so much time outside of my day job on this project, I am left with a message that I cannot submit. When the deadline is 27th May, I expect 27th May at 12 midnight (my local time). If not, it should have been stated clearly as 27th, 12 PM (EST) or similar, so that we know that after 7 PM PST we cannot submit. The whole thing is a waste of my time. I don't expect to win, but I sure wanted to submit to see where I stand.

Hi @ARTD ,
Unfortunately, we do not see two submissions under your team name, TeamARTD. However, we do have one submission under your username, ARTD, made on May 26th (ID: 9752403)!
As the submitter for your team, you can always check the status of your submissions in the [Submission Dashboard here](https://www.synapse.org/Synapse:syn65660836/wiki/632249). Please note that uploading files into your Project/Files tab is not considered a submission; you will also need to submit those files to the challenge queues, just as you did on the 26th.
Hope this helps!

I have submitted two files (TeamARTD). Could you confirm that you have received those? I uploaded the two files from the Project/Files tab, but I don't know if the evaluation team has received them. Thanks.

Hi @khanirfan ,
Confirming that we have your 3 submissions under your team name in the system. If you are the submitter for your team, you can also check the status of your submissions in the [Submission Dashboard here](https://www.synapse.org/Synapse:syn65660836/wiki/632249). Hello,
Some fingerprint vectors in the Test file have different lengths compared to the data from the training dataset. For the ECFP4 fingerprint, for example, 18,972 molecules have a different length!
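A quick, dependency-free way to quantify such a mismatch; `train_fps` and `test_fps` are stand-ins for the fingerprint columns loaded from the parquet files, not names from the challenge code:

```python
# Sketch for diagnosing fingerprint-length mismatches between the
# train and test sets; the variable names are placeholders.
from collections import Counter

def fingerprint_length_report(train_fps, test_fps):
    """Compare fingerprint vector lengths between train and test.
    Returns the dominant training length and the indices of test
    fingerprints whose length differs from it."""
    expected = Counter(len(fp) for fp in train_fps).most_common(1)[0][0]
    bad = [i for i, fp in enumerate(test_fps) if len(fp) != expected]
    return expected, bad
```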

@jeriscience @vchung I submitted 3 files under 2025PakWin; could you kindly confirm that you have received the submission? @jeriscience
Thank you very much for your response. That's really helpful.

@khanirfan I don't see your submission; @vchung can you please check?

I submitted files without ranking, but it's really easy to rank; can the organisers do it to score the participants?

(1) Should we fill in the labels in sel200 and sel500 columns for all the 339k molecules?
Labels not filled in will be considered zeros
(2) Should we order the 339k molecules by descending order of score (4th column)?
yes
(3) Should we only submit the prediction csv file? Is there anything else e.g. the code/model that we need to submit?
Please submit a brief write-up; no code is needed at this point
Thanks for the interest!

Hi! I have a couple of questions regarding the submission:
(1) Should we fill in the labels in sel_200 and sel_500 columns for all the 339k molecules?
(2) Should we order the 339k molecules by descending order of score (4th column)?
(3) Should we only submit the prediction csv file? Is there anything else e.g. the code/model that we need to submit?
Thanks,
Shakun

The last column is generated by the model trained on the training data, but what does ranking mean?

The last column "score" is a probability (i.e. score) that the compound binds the protein target. A probability needs to be assigned to each of the 339,257 compounds.

What does ranking mean? The submission file has 4 columns with ~340,000 rows of random IDs; the last column is y_score, but there is no explanation of the basis for ranking. Does it mean reordering rows in descending order of y_score in the 4th column? If so, it can easily be done by the evaluation team.
Probabilities assigned to all compounds may be used if multiple teams are tied after the primary evaluation steps. You are therefore asked to rank compounds. Thanks.

Hi, is it OK if we do not rank the compounds for the submission in Task 1?

You decide how to assign the labels to the molecules: you could give the 1 label to the top 200/500 molecules, or to the top 200/500 molecules after removing similar molecules using clustering. There are no specific guidelines on how you should assign them.
Yes, the column Sel_200 should have only 200 rows with the value 1, and similarly for 500.

I understand the score column, but I do not understand on what basis Sel_200 and Sel_500 are assigned 1 or 0. And in Sel_200, only 200 rows out of ~339K compounds are supposed to have 1?

The submission file should contain IDs for ~339K compounds, the score given by your machine learning model in the Score column, and a binary label (0,1) in the Sel_200 and Sel_500 columns indicating whether a molecule is selected (1) or not (0). See here for an example: https://www.synapse.org/Synapse:syn65660836/wiki/632256

OK, for clarity: the submission file should contain IDs for ~339K compounds and the y-score, the top 200 by y-score should be in cluster200, and the top 500 by y-score in cluster500. Do I understand correctly?

The file should contain scores and labels (selected in the 200, selected in the 500) for all ~339K compounds. This is necessary to calculate AUROC and PR-AUC, as well as for future analysis and comparison of the challenge results.

Hi @LucaChiesa, thanks for explaining. Is it acceptable if the total number of rows in the submission file is just 500?

Hello, this is Luca, one of the notebook authors.
The example notebook available in the repository should not be considered an authoritative guideline for the submission, but rather a collection of useful code snippets that could be used for the challenge. The code was released to give a baseline for people who may not be familiar with one or more of the described steps (parquet file reading, model training, cross-validation, clustering based on chemical similarity).
For the submission you just need to select 200 and 500 molecules which are likely to be binders of the target protein. One could simply select the top 200 or 500 molecules based on their confidence score, and this would represent a perfectly valid submission.
In the notebook we suggest running clustering on the top N predicted molecules, since the main evaluation criterion for the challenge is the number of identified diverse molecules. By clustering the molecules and selecting the best-ranking one from each cluster, it is possible to account for chemical diversity in the selection, but this is not guaranteed to improve the final results. In the example, clustering was run on the top-ranking 5000 molecules; this number was chosen arbitrarily to represent a sufficiently large sample size. For the challenge one might want to use a different number of top-ranking molecules, or only molecules whose confidence score is above a certain threshold, or skip clustering entirely.

Am I supposed to generate clusters for x_test[top_5000] as in the example notebook, or for all the rows? Because we can find out the number of hits found in the selected 200/500 molecules as in the example notebook.

That is your decision, or maybe I don't understand your question?
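For readers who want to experiment with the diversity idea without the notebook's dependencies, here is a rough, dependency-free variant of the same concept: a greedy leader-style pass over the score-ranked compounds. The notebook itself uses proper clustering (e.g. RDKit's Butina algorithm); here `fps` are hypothetical fingerprints represented as sets of on-bit indices, and the 0.6 similarity cutoff is an arbitrary choice, not a challenge recommendation.

```python
# Leader-style diverse selection sketch; a simplification of the
# notebook's cluster-then-pick-best approach, not a substitute for it.
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def diverse_top_n(fps, scores, n, cutoff=0.6):
    """Walk compounds in descending score order, keeping one only if it
    is not too similar (Tanimoto >= cutoff) to any already-kept compound."""
    order = sorted(range(len(fps)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(tanimoto(fps[i], fps[j]) < cutoff for j in kept):
            kept.append(i)
            if len(kept) == n:
                break
    return kept
```

As with the notebook's clustering, one would typically run this only on the top few thousand predictions rather than all ~339K compounds, since the pairwise similarity checks grow quickly with the number of kept molecules.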
generating submission file from trained model