Hello,

Thank you for releasing the training data! I have a question about the structure of the training dataset. Could you please confirm the following?

1. Are patches in each .tar shard drawn from disjoint sets of slides/cases, or do patches from the same slide appear across multiple shards?
2. If the latter, is a manifest available that maps each patch key (e.g. train_<16hex>) to its source slide or case ID? This would allow running slide-grouped cross-validation on the training data and producing a meaningful internal validation estimate before the submission window opens.
3. If no per-slide mapping is available, do you have a recommended way to construct a slide-respecting train/val split from the released training data?

Thanks!

Best regards,
Lingyi

Created by Lingyi Zhao lzhao
Hi @jaydenyou, No problem! Thanks for the update. Best, Lingyi
Hi @lzhao, The update will take some time, and the file should be in the same place. Apologies for the delay. Cheers, Jayden
Hi @jaydenyou, Thanks for your detailed answers! They are very helpful. Will the mapping CSV file be released in the same place where the training dataset is? Best regards, Lingyi
Hi @lzhao, Thanks for the questions. During our patching process, we found that each slide yielded a different number of patches for different labels, and sometimes no ROI patches were extracted for a given label. For that reason, we did not restrict patches from the same slide to a single tar shard, so patches from one slide can appear in different shards, and we treated each patch as a "case". For question 2, we have uploaded the mapping CSV file that contains the patient (anonymized ID) and slide (anonymized ID) mapping. It should be released soon. Cheers, Jayden
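
For question 3, once the mapping CSV is available, one possible way to build a split that respects slides is to assign whole groups (patients or slides) to folds rather than individual patches. The following is only a minimal sketch and not the organizers' method: the column names `patch_key`, `patient_id`, and `slide_id`, the helper `grouped_folds`, and the demo rows are all assumptions made for illustration, since the actual CSV layout has not been published in this thread. Grouping by patient is the stricter choice if a patient can contribute more than one slide.

```python
import csv
import io
from collections import defaultdict


def grouped_folds(rows, group_key="patient_id", n_folds=5):
    """Assign each patch to a fold so that a group never spans two folds.

    Groups are placed largest first into the currently smallest fold,
    which keeps fold sizes roughly balanced. Column names are assumed.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row["patch_key"])

    fold_sizes = [0] * n_folds
    fold_of_patch = {}
    # Sort by descending group size, then by group ID for determinism.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _group_id, patch_keys in ordered:
        target = min(range(n_folds), key=lambda f: fold_sizes[f])
        for key in patch_keys:
            fold_of_patch[key] = target
        fold_sizes[target] += len(patch_keys)
    return fold_of_patch


# Made-up demo data standing in for the real mapping CSV (hypothetical layout).
demo_csv = """patch_key,patient_id,slide_id
train_0000000000000001,P01,S01
train_0000000000000002,P01,S02
train_0000000000000003,P02,S03
train_0000000000000004,P03,S04
train_0000000000000005,P03,S04
train_0000000000000006,P03,S05
"""

demo_rows = list(csv.DictReader(io.StringIO(demo_csv)))
folds = grouped_folds(demo_rows, group_key="patient_id", n_folds=2)

for row in demo_rows:
    print(row["patch_key"], row["patient_id"], "fold", folds[row["patch_key"]])
```

With the real file, `demo_rows` would be read from the released CSV instead, and `group_key="slide_id"` would give a slide-level rather than patient-level split. This sketch does not stratify by label, so per-fold label balance should be checked separately.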

Task 5 training dataset patches structure