Requesting a clarification on #clusters found in our submission

We were reviewing our performance in Phases 1 and 2 and evaluating the number of clusters found in the top 200 and top 500. Our best submission in Phase 1 (submission #3, ROW_ID 9752483) contained 11 hits in the top 500, which we found could be partitioned into 5 clusters following the procedure suggested in the guidelines. The same submission contains 5 hits in the top 200, which could be partitioned into 3 clusters. However, the published leaderboard table shows only 3 clusters in the top 500 and 2 clusters in the top 200 for this submission (the number of hits matches in both cases). Is there a discrepancy here? Could the organizers please share the code used for the evaluation? Also, we would like to submit the write-up for our submissions, please. Thank you.
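For concreteness, the quantity in question can be sketched as below. This is a minimal illustration, assuming each active molecule carries one fixed cluster label and that "clusters in top N" means the number of distinct labels among the hits ranked in the first N positions. The function name, the IDs, and the labels are all hypothetical, not taken from the evaluation code.

```python
def clusters_in_top_n(ranked_ids, active_to_cluster, n):
    """Return (hit count, distinct cluster count) among the first n ranked IDs."""
    # A "hit" is any ranked ID that appears in the map of active molecules.
    hits = [mol_id for mol_id in ranked_ids[:n] if mol_id in active_to_cluster]
    # Distinct cluster labels among those hits.
    return len(hits), len({active_to_cluster[h] for h in hits})


# Hypothetical ranking and cluster labels, for illustration only.
ranked_ids = ["ID_a", "ID_x", "ID_b", "ID_c", "ID_y", "ID_d"]
active_to_cluster = {"ID_a": 0, "ID_b": 0, "ID_c": 1, "ID_d": 2}

top3 = clusters_in_top_n(ranked_ids, active_to_cluster, 3)
top6 = clusters_in_top_n(ranked_ids, active_to_cluster, 6)
print(top3, top6)
```

Under this definition the hit count depends only on the ranking, while the cluster count also depends on how the labels were assigned, which is consistent with the hits matching while the clusters differ.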

Created by Ashok Palaniappan apalania
Thanks @LucaChiesa , the writeup is now submitted. Ashok
Hello @apalania, the submission queue has been reopened.
Hi @LucaChiesa , thanks so much for sharing the script and the clarification. Please let us know if we could submit the wiki write-up. The same is available [here](https://www.synapse.org/Synapse:syn66721024/wiki/635865) Thanks !
Thank you @apalania for the detailed description. The script you used is mostly correct (below is an extract from the original script, for reproducibility using the data uploaded to Synapse); the issue stems from the distance threshold parameter. I just realized that the description on the evaluation page is misleading: 0.32 is the similarity threshold used for clustering, which translates to a distance threshold of 0.68. With the current settings you should get 56 clusters from the 138 active molecules.

```
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AgglomerativeClustering

data = pd.read_parquet("Step3_TestData_Target2035.parquet", columns=['RandomID', 'Label', 'ECFP4'])

def process_data(X, column_name):
    # Parse each comma-separated fingerprint string into a float32 row
    return np.stack(X[column_name].apply(lambda x: np.fromstring(x, sep=',', dtype=np.float32)))

# Keep only the active molecules
fps = process_data(data.query("Label == 1"), 'ECFP4')
# Condensed Jaccard distances on the binarized fingerprints
dists_binary = pdist(fps != 0, metric='jaccard')

th = 0.68  # distance threshold = 1 - 0.32 similarity threshold
ac = AgglomerativeClustering(n_clusters=None, distance_threshold=th, metric="precomputed", linkage='complete')
# squareform expands the condensed distances into the full square matrix
labels = ac.fit_predict(squareform(dists_binary))
```
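To illustrate how much the threshold convention matters, here is a naive pure-Python sketch of complete-linkage clustering on Jaccard distances. It is not the evaluation script: the function names and the toy fingerprints (sets of on-bit indices) are hypothetical, and the stopping rule assumes clusters are merged only while their linkage distance stays below the threshold.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - Tanimoto similarity for two sets of on-bit indices."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def complete_linkage_clusters(fingerprints, distance_threshold):
    """Merge clusters greedily until the closest pair is at or above the threshold."""
    clusters = [[i] for i in range(len(fingerprints))]

    def linkage(c1, c2):
        # Complete linkage: the largest pairwise distance between two clusters.
        return max(jaccard_distance(fingerprints[i], fingerprints[j])
                   for i in c1 for j in c2)

    while len(clusters) > 1:
        d, x, y = min((linkage(clusters[x], clusters[y]), x, y)
                      for x, y in combinations(range(len(clusters)), 2))
        if d >= distance_threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because x < y
    return clusters


# Hypothetical toy fingerprints, for illustration only.
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 6, 7}, {10, 11, 12}]

similarity_threshold = 0.32
distance_threshold = 1 - similarity_threshold  # about 0.68

n_if_used_as_distance = len(complete_linkage_clusters(toy_fps, similarity_threshold))
n_converted = len(complete_linkage_clusters(toy_fps, distance_threshold))
print(n_if_used_as_distance, n_converted)
```

On this toy set, passing 0.32 directly as a distance threshold leaves every molecule in its own cluster, while the converted threshold merges the three related fingerprints, so the cluster count drops.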
We have also prepared the writeup for submission, if you will open up the submission queue. @vchung @LucaChiesa Thanks !
@LucaChiesa I can now confirm that we have analyzed with full molecules. This has only confirmed the intuition that **full molecules are not as similar as scaffolds** (despite the conserved R-groups you have mentioned). In fact, on the 138 active molecules, we obtained 137 clusters! Just one cluster contains two active molecules, ID_30905 & ID_33124. All the rest are singleton clusters. With respect to the hits that we found in Phase 1, as noted in my original post, all 11 appear to be clusters by themselves when working with full molecules. Obviously we are doing something wrong. I would appreciate your clarification on this issue. Also @mschapira , @jeriscience , @okko: unless you share some evaluation code and some statistics about the clustering, it is hard for us to evaluate what we have accomplished. I apologize for my oversight if this information is already available in the Challenge pages. For reference, I'm posting the clustering analysis code we used, with respect to full molecules:

```
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Draw, AllChem, DataStructs
from collections import Counter, defaultdict
import os
from sklearn import cluster

df = pd.read_csv(r"Hits_wSMILES.csv")  # File to be analyzed with Molecule ID and its SMILES
smiles_list = df['SMILES'].to_list()
mols = []
for i, smiles in enumerate(smiles_list):
    mol = Chem.MolFromSmiles(smiles)
    mols.append(mol)
n = len(mols)
fps = [AllChem.GetMorganFingerprintAsBitVect(s, 2, nBits=2048) for s in mols]
sim_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(i, n):
        sim = DataStructs.TanimotoSimilarity(fps[i], fps[j])
        sim_matrix[i, j] = sim
        sim_matrix[j, i] = sim
distmat = [1 - sim_matrix[i, j] for i in range(n) for j in range(n)]
distmat = np.array(distmat).reshape(n, n)
css = cluster.AgglomerativeClustering(distance_threshold=0.32, linkage='complete', n_clusters=None).fit(distmat)
print(css.labels_)
print(css.n_clusters_)
print(Counter(css.labels_.tolist()))
clusters = Counter(css.labels_.tolist())
clustsizes = [val for key, val in clusters.items()]
print(Counter(clustsizes))
```
Hi @apalania , The submission queue for writeups is currently closed, but if @LucaChiesa agrees, I can open it briefly so that you may submit your writeup.
Yes, thanks so much.
For the writeup @vchung should be able to help you. If the submission portal is now closed you can send me the writeup via email.
Hi, yes we used ECFP4 for the similarity scoring. It was not clear to us that clustering was to be performed on the full molecule - since the challenge was to obtain diverse scaffolds, the connection here is missing. In any case, I will let you know if further doubts remain. We would also like to know if the writeup could be submitted in our team's synapse project folder. Thanks
Thank you @apalania for the detailed response. Clustering for evaluation was performed on the full molecule, rather than on the Murcko scaffold only, to account for the role of conserved R-groups. The rest of the clustering procedure was the same as you reported, assuming you used Morgan fingerprints with radius=2. Let me know if this clarifies the issue.
Hello, this is the procedure that we followed: (1) we found the Murcko scaffold, (2) then computed Morgan fingerprints as bit vectors (nBits = 2048) for these scaffolds, (3) then built the distance matrix based on Tanimoto similarity, and (4) finally ran agglomerative clustering with complete linkage at distance threshold = 0.32. We computed the clusters based on the assignment for the full dataset of 138 actives. For the full dataset, we obtained a total of 100 clusters, of which 89 were singletons. How does this sound? Thank you
Hello, sorry for the delay in the response. The cluster assignment was calculated on the full dataset of 138 active molecules before the beginning of the challenge, based on Tanimoto similarity on binary ECFP4. Could you please share the exact procedure you used to perform clustering on your hit compounds?
Some more details with respect to the above submission:

Hits in Top 500: {'ID_123456', _'ID_304803'_, 'ID_164150', 'ID_3499', 'ID_115187', 'ID_267452', 'ID_1480', 'ID_122643', 'ID_259106', 'ID_56915', 'ID_22930'}

Their scaffolds: {'O=C(Nc1c(-c2cccs2)nc2ccccn12)c1ccccc1': 5, 'O=C(Nc1c(-c2cccs2)nc2ncccn12)c1ccccc1': 2, 'O=C(Nc1c(-c2ccccc2)nc2ccccn12)c1ccccc1': 2, 'O=C(Nc1c(-c2ccccc2)nc2ncccn12)c1ccccc1': 1, 'O=C(Nc1c(-c2ccco2)nc2ccccn12)c1ccccc1': 1}

Their cluster assignments based on agglomerative clustering (dist = 0.32), respectively: [1, 0, 2, 0, 3, 1, 0, 4, 0, 2, 0]

Thanks in advance for your clarification.
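As a quick sanity check on the numbers above, the following sketch tallies the posted cluster assignments and compares the cluster sizes with the scaffold multiplicities. The variable names are illustrative; the two lists are copied from the values listed in this post.

```python
from collections import Counter

# Cluster label per hit, in the order the hits are listed in this post.
assignments = [1, 0, 2, 0, 3, 1, 0, 4, 0, 2, 0]
# Number of hits per distinct scaffold, as listed in this post.
scaffold_counts = [5, 2, 2, 1, 1]

cluster_sizes = Counter(assignments)
n_clusters = len(cluster_sizes)
sizes_sorted = sorted(cluster_sizes.values(), reverse=True)

print(n_clusters)
print(sizes_sorted)
print(sizes_sorted == sorted(scaffold_counts, reverse=True))
```

The cluster sizes have the same multiset as the scaffold counts, which is what one would expect if, at this threshold, scaffold-based clustering groups together only hits that share an identical scaffold.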
