Evaluation, Reproducibility, Benchmarks Meeting 34
Nicholas Heller edited this page Apr 23, 2025
Date: 23rd April, 2025
The MICCAI deadline is coming up, so we might have fewer attendees today.
- Carole
- Annika
- Anne
- Olivier
- Nick
- Michela
- Sat together to work through changes on the live website
- We would like to implement a suggestion box if possible. Carole will chase Michael on this
- Should we list the names of all group members on the website? They already appear in the meeting notes, and other WGs don't seem to list their members, so we guess not
- Two of the links seem to be swapped on the Metrics card. Carole will work on getting access to change this
- Regarding missing data -- should this become NaN, or the worst possible value?
- Could generate two columns: one with the worst possible value for aggregation, another with NaNs
- For MSD, the worst possible would be based on the shape of the object and the size of the image
- What is best practice here when metrics fail?
- If the reference has missing data, usually exclude this case in aggregation
- If the prediction has missing data, usually use the worst possible value
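The two-column idea above can be sketched as follows. This is a minimal illustration, not the group's actual pipeline; the column names, the example Dice scores, and the `ref_missing` flag are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical per-case Dice scores; NaN marks cases where the metric failed.
# 'ref_missing' flags cases whose *reference* annotation is missing.
scores = pd.DataFrame({
    "case": ["a", "b", "c", "d"],
    "dice": [0.91, np.nan, 0.84, np.nan],
    "ref_missing": [False, True, False, False],
})

WORST_DICE = 0.0  # worst possible value for an overlap metric

# Cases with a missing reference are excluded from aggregation entirely.
valid = scores[~scores["ref_missing"]].copy()

# Column 1: substitute the worst possible value where the prediction failed.
valid["dice_worst"] = valid["dice"].fillna(WORST_DICE)

# Column 2: keep the NaNs and let the aggregation skip them.
mean_worst = valid["dice_worst"].mean()  # penalises failed predictions
mean_nan = valid["dice"].mean()          # ignores failed predictions
```

The gap between the two means shows how much the choice matters: penalising failures pulls the aggregate down, while skipping them can make a method look better than it is.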
- Michela is finding lots of edge cases while running these implementations on the full decathlon data
- This would make for a good tutorial
- We now have a dashboard showing how often each metric is recommended. Accuracy is at the very bottom of the list!
- Parametric methods are struggling somewhat on logit values
- Appears to be normal, but there's not much literature on this
- Perhaps this is an artifact of the loss... empirically it seems somewhat reliable
- Would be nice to cite something explaining this phenomenon. Nobody in the group knew of any, though
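The observation about logits appearing approximately normal can at least be checked empirically. A minimal sketch, assuming hypothetical softmax-style confidence scores (the beta distribution and sample size here are illustrative, not from any real model):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-case confidence scores in (0, 1), clipped away from the
# boundaries so the logit transform stays finite.
p = np.clip(rng.beta(8, 2, size=200), 1e-6, 1 - 1e-6)

# Logit transform: log-odds of the predicted probabilities.
logits = np.log(p / (1 - p))

# Shapiro-Wilk test: a small p-value would reject normality of the logits.
stat, pvalue = stats.shapiro(logits)
```

This kind of check says whether normality holds for a given set of outputs, but as noted, it does not explain *why* logits often look normal; that is the citation the group is still missing.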
- Lots of progress has been made on this project. Mostly just simulations and writing remain
- Is there anything we could do to measure/quantify 'implementation noise' in metrics?
- Distance metrics, in particular, have ambiguity in their definitions which can make a difference
- It might be worth exploring later on...
- For now, it's not such a big deal if the above project returns slightly different metric values than the official results of the decathlon
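The definitional ambiguity in distance metrics is easy to demonstrate. A small sketch with two toy 2-D point sets (standing in for segmentation boundaries): two reasonable "Hausdorff distance" implementations, one using only the directed distance and one using the symmetric definition, return different values on the same input.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

# Two hypothetical point sets standing in for segmentation boundaries.
a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
b = np.array([[0.0, 1.0], [5.0, 0.0]])

# Variant 1: directed distance from a to b only.
d_ab = directed_hausdorff(a, b)[0]

# Variant 2: symmetric definition, max over both directions.
d_sym = max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])
```

Here the directed and symmetric variants disagree, and real implementations diverge further on choices like percentile cutoffs and boundary extraction -- exactly the 'implementation noise' discussed above.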
- Could look to the lighthouse challenge criteria as a source of new criteria to add
- The data section is probably the weakest in general -- the data description should be expanded