
Evaluation, Reproducibility, Benchmarks Meeting 34


Minutes of Meeting 34

Date: 23rd April, 2025

Present

The MICCAI deadline is coming up, so we may have fewer attendees today.

  • Carole
  • Annika
  • Anne
  • Olivier
  • Nick
  • Michela

Website Feedback

  • Sat together to work through changes on the live website
  • We would like to implement a suggestion box if possible. Carole will chase Michael on this
  • Should we list the names of all group members on the website? They are in meeting notes and other WGs don't seem to be listing members, so we guess not
  • Two of the links seem to be swapped on the Metrics card. Carole will work on getting access to change this

Quick Discussion of Metrics Reloaded

  • Regarding missing data -- should this become NaN, or worst possible value?
  • Could generate two columns: one with the worst possible value for aggregation, the other with NaNs (a sketch of this idea follows this list)
  • For MSD, the worst possible would be based on the shape of the object and the size of the image
  • What is best practice here when metrics fail?
    • If the reference has missing data, usually exclude this case in aggregation
    • If the prediction has missing data, usually use the worst possible value
  • Michela is finding lots of edge cases when running these implementations on the full Decathlon dataset
    • This would make for a good tutorial
  • We now have a dashboard showing how often each metric is recommended. Accuracy is at the very bottom of the list!
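
A minimal sketch of the two-column idea discussed above, assuming per-case results arrive as a pandas DataFrame. The column names, the Dice/HD95 metric choice, and the worst-case rule (image diagonal for a distance metric) are illustrative assumptions, not the Metrics Reloaded implementation.

```python
import numpy as np
import pandas as pd

# Illustrative per-case results; NaN marks a case where the metric failed
# (e.g. an empty prediction makes the Hausdorff distance undefined).
results = pd.DataFrame({
    "case": ["c1", "c2", "c3"],
    "dice": [0.91, np.nan, 0.84],
    "hd95": [3.2, np.nan, 5.1],
    "image_shape": [(512, 512, 90), (512, 512, 120), (320, 320, 60)],
})

def worst_hd(shape):
    # Assumed worst case for a distance metric: the image diagonal in voxels.
    return float(np.linalg.norm(shape))

# Keep two columns per metric: NaN for reporting, worst case for aggregation.
results["dice_worst"] = results["dice"].fillna(0.0)  # worst possible Dice is 0
results["hd95_worst"] = results["hd95"].fillna(
    results["image_shape"].apply(worst_hd)
)

print(results[["dice", "dice_worst", "hd95", "hd95_worst"]].mean())
```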

Quick Report on Confidence Intervals Project

  • Parametric methods are struggling somewhat on logit values (see the sketch after this list)
    • The logit-transformed values appear to be normally distributed, but there's not much literature on this
    • Perhaps this is an artifact of the loss... empirically it seems somewhat reliable
    • Would be nice to cite something explaining this phenomenon. Nobody in the group knew of any, though
  • Lots of progress has been made on this project. Mostly just simulations and writing remain
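
A minimal sketch of the parametric approach under discussion, assuming per-case Dice scores and a normal approximation on the logit scale. The clipping constant, the t-based interval, and the example scores are illustrative assumptions, not the project's actual code.

```python
import numpy as np
from scipy import stats

def logit_ci(scores, alpha=0.05, eps=1e-6):
    """Parametric CI for bounded per-case scores via a logit transform.

    Assumes the logit-transformed scores are roughly normal (the empirical
    observation discussed above); scores at exactly 0 or 1 are clipped so
    the transform stays finite.
    """
    s = np.clip(np.asarray(scores, dtype=float), eps, 1.0 - eps)
    z = np.log(s / (1.0 - s))                     # logit transform
    m, se = z.mean(), z.std(ddof=1) / np.sqrt(len(z))
    t = stats.t.ppf(1.0 - alpha / 2.0, df=len(z) - 1)
    expit = lambda x: 1.0 / (1.0 + np.exp(-x))    # back-transform to [0, 1]
    return expit(m - t * se), expit(m + t * se)

# Example: hypothetical per-case Dice scores
dice = np.array([0.88, 0.91, 0.83, 0.95, 0.90, 0.79, 0.93])
print(logit_ci(dice))
```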

Numerical Instability in Metrics/Implementation Differences

  • Is there anything we could do to measure/quantify 'implementation noise' in metrics?
  • Distance metrics, in particular, have ambiguity in their definitions which can make a difference (a sketch of one way to probe this follows this list)
  • It might be worth exploring later on...
  • For now, it's not such a big deal if the above project returns slightly different metric values than the official results of the decathlon
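
As one illustration of how this ambiguity could be quantified, a minimal sketch that extracts the surface of a binary mask under two different erosion connectivities and compares the resulting HD95 values. The spherical masks, the connectivity choices, and the border convention are assumptions for illustration, not any challenge's official implementation.

```python
import numpy as np
from scipy import ndimage
from scipy.spatial import cKDTree

def surface_points(mask, connectivity):
    # Surface = mask voxels removed by one erosion; the structuring element
    # (connectivity 1 vs 3) is one of the ambiguous choices in practice.
    struct = ndimage.generate_binary_structure(mask.ndim, connectivity)
    eroded = ndimage.binary_erosion(mask, structure=struct, border_value=1)
    return np.argwhere(mask & ~eroded)

def hd95(mask_a, mask_b, connectivity):
    pa = surface_points(mask_a, connectivity)
    pb = surface_points(mask_b, connectivity)
    d_ab = cKDTree(pb).query(pa)[0]   # distances from A's surface to B's
    d_ba = cKDTree(pa).query(pb)[0]   # and vice versa
    return np.percentile(np.hstack([d_ab, d_ba]), 95)

# Two illustrative spherical masks, slightly offset
x, y, z = np.ogrid[:64, :64, :64]
gt   = (x - 32) ** 2 + (y - 32) ** 2 + (z - 32) ** 2 < 15 ** 2
pred = (x - 30) ** 2 + (y - 33) ** 2 + (z - 32) ** 2 < 14 ** 2

# 'Implementation noise': same metric, different surface-extraction convention
print(hd95(gt, pred, connectivity=1), hd95(gt, pred, connectivity=3))
```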

Next Generation of BIAS Initiative

  • Could look to the lighthouse challenge criteria to help identify new criteria to add
  • Data section in general is probably the weakest -- data description should be expanded