Dear challenge organizers,
We have encountered some issues regarding the alignment of segmentations with the corresponding images in the challenge dataset. Specifically, in several cases, the segmentations appear to be significantly misaligned with the anatomy in the LF images, likely due to registration errors between the HF and LF scans.
For example, in the case of **LISA0001** from the training dataset, the segmentation of the basal ganglia appears to partially cover the ventricles (see yellow arrows in the attached image):
[Attached image (syn68685468): yellow arrows mark where the basal ganglia segmentation overlaps the ventricles.]

We assume that the segmentations of the ventricles were also propagated from the HF scans, as the pattern of misalignment (e.g., translation and rotation) is consistent across both ventricles and basal ganglia.
We would greatly appreciate your clarification on the following points:
1. Was any quantitative evaluation of registration accuracy performed? For instance, did you measure the Target Registration Error (TRE) between the HF and LF images using identifiable anatomical landmarks (e.g., along the lines of the sketch below this list)? Do you perhaps have any other internal measures of registration or annotation quality for the data, so that we could filter out well/poorly aligned cases, or so that the results could be weighted accordingly?
2. Do you expect the level of registration error to be consistent across the training/validation and test subsets?
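To illustrate what we mean by a landmark-based TRE, here is a minimal sketch; the landmark arrays and the HF-to-LF affine are hypothetical placeholders, not anything taken from the challenge data:

```python
import numpy as np

def target_registration_error(landmarks_hf_mm, landmarks_lf_mm, hf_to_lf_affine):
    """Landmark-based TRE sketch (hypothetical inputs).

    landmarks_hf_mm, landmarks_lf_mm : (N, 3) arrays of corresponding points
        in world (mm) coordinates of the HF and LF scans.
    hf_to_lf_affine : (4, 4) affine matrix mapping HF world coordinates to
        LF world coordinates (e.g., the estimated registration transform).
    Returns the mean and maximum Euclidean distance in mm.
    """
    hf_hom = np.hstack([landmarks_hf_mm, np.ones((len(landmarks_hf_mm), 1))])
    mapped = (hf_to_lf_affine @ hf_hom.T).T[:, :3]          # HF landmarks mapped into LF space
    errors = np.linalg.norm(mapped - landmarks_lf_mm, axis=1)
    return errors.mean(), errors.max()
```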
Given that evaluation is performed against HF segmentations, could this introduce a bias toward learning misaligned predictions (i.e., models learning to replicate the registration error)?
In our own experiments, we observed that pre-training on HF datasets with manual annotations helps the model produce predictions that are better aligned with the image contrast and with the LF segmentations of LISA. However, fine-tuning on the LISA HF labels appears to make the model overfit to the registration error; while this improves agreement with the HF segmentations used for evaluation, it raises the question of whether the resulting bias prioritizes registration artifacts over anatomical accuracy.
Also, we have noticed that the segmentations of the left and right lentiform nucleus appear to be cut off quite abruptly (the difference between the HF/LF basal ganglia segmentation and the predictions of our model, pre-trained on an external dataset, is highlighted by the dotted yellow line). Is there a particular reason for this? Which anatomical guidelines were used to decide on the borders of the structures?
Thank you very much for your time and for organizing this amazing challenge. We look forward to your thoughts on these matters.
Created by Vladyslav Zalevskyi (@vzalevskyi24)

Hello @vzalevskyi24 and @rgonl,
Thank you for your insightful question.
1. No further quantitative evaluation of registration accuracy was performed beyond the FSL FLIRT registration itself. To correct for any misregistration, the LF segmentation was derived by overlaying the HF segmentation on the LF image and then editing the result to match what can be seen at low field (a sketch of the label-transfer step is shown after this list). This helps the segmentation match the low-field image more closely.
2. The same processing was applied to the training, validation, and test images.
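For concreteness, here is a minimal sketch of how such a label transfer can be done with FLIRT. The file names and the rigid/nearest-neighbour settings are illustrative placeholders rather than a verbatim copy of our pipeline:

```python
import subprocess

# Estimate the HF -> LF transform from the images themselves.
subprocess.run([
    "flirt",
    "-in", "hf_t1.nii.gz",        # high-field image (placeholder file name)
    "-ref", "lf_t1.nii.gz",       # low-field image (placeholder file name)
    "-omat", "hf2lf.mat",         # estimated transform matrix
    "-dof", "6",                  # rigid; 12 would give a full affine
], check=True)

# Apply the same transform to the HF label map, using nearest-neighbour
# interpolation so that label values are not blended at boundaries.
subprocess.run([
    "flirt",
    "-in", "hf_labels.nii.gz",
    "-ref", "lf_t1.nii.gz",
    "-applyxfm", "-init", "hf2lf.mat",
    "-interp", "nearestneighbour",
    "-out", "lf_labels_resampled.nii.gz",
], check=True)
```

The resampled labels are then edited manually to match the low-field contrast.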
To your question on bias -
The labels transferred from the high-field to the low-field images may differ slightly, typically by a few voxels, due to registration misalignment. However, these small discrepancies do not introduce significant bias during training. We performed linear registration using the best available methods; still, because we do not have access to the specific registration procedures used by the Hyperfine imaging system, perfect linear alignment between the HF and LF labels is not achievable.
For evaluation and final model ranking, the potential misalignment does not affect the relative performance of models. To ensure fairness and robustness, we used an unseen validation cohort and will also evaluate models on an entirely separate unseen test set. Based on our internal experiments, when one model outperforms another on the validation cohort, it consistently shows similar performance trends on the test cohort. This gives us confidence that minor registration differences do not compromise evaluation reliability.
Given the inherent black-box nature of deep learning models, it's difficult to fully interpret how each architecture extracts and prioritizes features during training. Performance can vary depending on the model architecture and its capacity to handle feature variability. For this reason, we allowed participants to submit an unlimited number of entries, enabling iterative improvement and adaptation to domain-specific challenges.
To ensure a fair and balanced evaluation, we did not rely solely on a single metric such as the Dice Similarity Coefficient (DSC). Instead, we considered the average performance across multiple metrics: 1 − DSC, the 95th-percentile Hausdorff Distance (HD95), the Hausdorff Distance (HD), the Relative Volume Error (RVE), and the Average Symmetric Surface Distance (ASSD). This comprehensive approach reduces evaluation bias and provides a more robust assessment of model performance for the final ranking.
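For illustration, here is a simplified numpy/scipy sketch of these metrics on a pair of binary masks. This is not our official evaluation code, and conventions for HD95/ASSD can differ slightly between implementations:

```python
import numpy as np
from scipy import ndimage

def _surface_distances(a, b, spacing):
    """Distances (in mm) from the surface voxels of mask `a` to the surface of mask `b`."""
    surf_a = a ^ ndimage.binary_erosion(a)
    surf_b = b ^ ndimage.binary_erosion(b)
    dist_to_b = ndimage.distance_transform_edt(~surf_b, sampling=spacing)
    return dist_to_b[surf_a]

def segmentation_metrics(pred, gt, spacing=(1.0, 1.0, 1.0)):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dsc = 2.0 * inter / (pred.sum() + gt.sum())
    d_pg = _surface_distances(pred, gt, spacing)   # prediction surface -> ground-truth surface
    d_gp = _surface_distances(gt, pred, spacing)   # ground-truth surface -> prediction surface
    d_all = np.concatenate([d_pg, d_gp])
    return {
        "1-DSC": 1.0 - dsc,
        "HD":    d_all.max(),                      # Hausdorff Distance
        "HD95":  np.percentile(d_all, 95),         # 95th-percentile Hausdorff
        "ASSD":  d_all.mean(),                     # Average Symmetric Surface Distance
        "RVE":   abs(pred.sum() - gt.sum()) / gt.sum(),  # Relative Volume Error
    }
```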
To your question on the cut in the LFN segmentation -
We are aware that manual segmentation involves some subjectivity, and decisions have to be made in areas that are less clear. In our process, we aimed to strike a balance between anatomical knowledge and the actual image contrast available in the scan. Our protocol prioritizes visible contrast boundaries over assumed anatomical extent, to ensure consistency and to avoid over-segmentation in low-contrast areas. As a result, you may observe abrupt cuts in the LFN where the contrast is insufficient to confidently trace the full anatomical boundary.
Please let us know if you have any further questions.
Best,
The LISA 2025 Challenge Organizers

Hi,
We have noticed similar issues. Is there any update on how to approach this?
Again, thank you for your time and for organizing this challenge.