Evaluation, Reproducibility, Benchmarks Meeting 34
Nicholas Heller edited this page Apr 23, 2025
Date: 23rd April, 2025
The MICCAI deadline is coming up, so we might have fewer attendees today.
- Carole
- Annika
- Anne
- Olivier
- Nick
- Michela
- Sat together to work through changes on the live website
- We would like to implement a suggestion box if possible. Carole will chase Michael on this
- Should we list the names of all group members on the website? They already appear in the meeting notes, and other WGs don't seem to list their members, so we guess not
- Two of the links seem to be swapped on the Metrics card. Carole will work on getting access to change this
- Regarding missing data -- should this become NaN, or the worst possible value?
- Could generate two columns: one with the worst possible value for aggregation, another with NaNs
- For MSD, the worst possible would be based on the shape of the object and the size of the image
- What is best practice here when metrics fail?
- If the reference has missing data, usually exclude this case in aggregation
- If the prediction has missing data, usually use the worst possible value
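The two-column idea above can be sketched as follows. This is a minimal illustration, not the group's actual pipeline; the column names, the example Dice scores, and the `ref_missing` flag are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical per-case Dice scores; NaN marks cases where the metric failed.
# 'ref_missing' flags cases whose *reference* annotation is missing.
scores = pd.DataFrame({
    "case": ["a", "b", "c", "d"],
    "dice": [0.91, np.nan, 0.84, np.nan],
    "ref_missing": [False, True, False, False],
})

WORST_DICE = 0.0  # worst possible value for an overlap metric

# Cases with a missing reference are excluded from aggregation entirely.
valid = scores[~scores["ref_missing"]].copy()

# Column 1: substitute the worst possible value where the prediction failed.
valid["dice_worst"] = valid["dice"].fillna(WORST_DICE)

# Column 2: keep the NaNs and let the aggregation skip them.
mean_worst = valid["dice_worst"].mean()  # penalises failed predictions
mean_nan = valid["dice"].mean()          # ignores failed predictions
```

The gap between the two means shows how much the choice matters: penalising failures pulls the aggregate down, while skipping them can make a method look better than it is.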
- Michela is finding lots of edge cases while running these implementations on the full decathlon data
- This would make for a good tutorial
- We now have a dashboard showing how often each metric is recommended. Accuracy is at the very bottom of the list!
- Parametric methods are struggling somewhat on logit values
- Appears to be normal, but there's not much literature on this
- Perhaps this is an artifact of the loss... empirically it seems somewhat reliable
- Would be nice to cite something explaining this phenomenon. Nobody in the group knew of any, though
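The observation about logits appearing approximately normal can at least be checked empirically. A minimal sketch, assuming hypothetical softmax-style confidence scores (the beta distribution and sample size here are illustrative, not from any real model):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-case confidence scores in (0, 1), clipped away from the
# boundaries so the logit transform stays finite.
p = np.clip(rng.beta(8, 2, size=200), 1e-6, 1 - 1e-6)

# Logit transform: log-odds of the predicted probabilities.
logits = np.log(p / (1 - p))

# Shapiro-Wilk test: a small p-value would reject normality of the logits.
stat, pvalue = stats.shapiro(logits)
```

This kind of check says whether normality holds for a given set of outputs, but as noted, it does not explain *why* logits often look normal; that is the citation the group is still missing.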
- Lots of progress has been made on this project. Mostly just simulations and writing remain
- Is there anything we could do to measure/quantify 'implementation noise' in metrics?
- Distance metrics, in particular, have ambiguity in their definitions which can make a difference
- It might be worth exploring later on...
- For now, it's not such a big deal if the above project returns slightly different metric values than the official results of the decathlon
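The definitional ambiguity in distance metrics is easy to demonstrate. A small sketch with two toy 2-D point sets (standing in for segmentation boundaries): two reasonable "Hausdorff distance" implementations, one using only the directed distance and one using the symmetric definition, return different values on the same input.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

# Two hypothetical point sets standing in for segmentation boundaries.
a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
b = np.array([[0.0, 1.0], [5.0, 0.0]])

# Variant 1: directed distance from a to b only.
d_ab = directed_hausdorff(a, b)[0]

# Variant 2: symmetric definition, max over both directions.
d_sym = max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])
```

Here the directed and symmetric variants disagree, and real implementations diverge further on choices like percentile cutoffs and boundary extraction -- exactly the 'implementation noise' discussed above.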
- Could look to the lighthouse challenge criteria as a source of new criteria to add
- The data section is probably the weakest in general -- the data description should be expanded