2 comments

  • anatolecallies 3 hours ago

    We run an LLM system that reads messy medical records and determines clinical trial eligibility.

    We tried eval platforms, LLM-as-judge, and automated prompt optimizers. None helped with what actually mattered: hidden domain policies that weren’t explicitly written anywhere.

    We ended up building our own annotation UI, prompt integration workflow (via Claude Code SDK), and HTML diff-based experiment reports.
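
    For illustration, here's a minimal sketch of what an HTML diff-based experiment report could look like. This uses Python's stdlib `difflib.HtmlDiff`; the commenter doesn't say what they actually built, and the prompt names and outputs below are purely hypothetical.

    ```python
    import difflib

    # Hypothetical outputs for the same patient record under two prompt
    # versions (illustrative strings, not from the original comment).
    baseline = ["Eligible: no", "Reason: age criterion unclear"]
    candidate = ["Eligible: yes", "Reason: age 54 meets 18-75 criterion"]

    # HtmlDiff renders a side-by-side HTML table with changes highlighted,
    # which makes line-level regressions between prompt versions easy to scan.
    html = difflib.HtmlDiff().make_file(
        baseline, candidate, fromdesc="baseline prompt", todesc="revised prompt"
    )

    with open("experiment_report.html", "w") as f:
        f.write(html)
    ```

    Generating one such file per experiment run gives a browsable report without any external tooling.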

    The biggest lesson: off-the-shelf eval/annotation/prompt-optimization tools are subpar because they can only be generic.

    Curious whether others building AI products have reached the same conclusion.

    • consumer451 3 hours ago

      I will be dealing with something along these lines next month. Thanks for sharing. I already had Opus build me an OTLP-compatible tracing system, with annotations and experiments, which took all of 4 hours.