SolidStart - Hacker News

QueensGambit 8 days ago

Hi everyone, I built @promptrepo/score because we’re no longer using generative AI just for suggestions — we’re making decisions with it. But generative AI is probabilistic, not the deterministic systems we’re used to. So when AI makes decisions, we need to know how confident it is, and how much we can trust each field in the output.

This tool looks simple: it just converts OpenAI’s logprobs into field-level confidence scores. But that changes how you use AI in production. It lets you mark low-confidence fields, send them for human review, or retry with better grounding. In high-volume systems, you can also track low-confidence patterns to improve prompts or fine-tune with better data. It’s a lightweight npm package with no dependencies, so it’s easy to integrate into your AI workflows. Would love to hear your thoughts!
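To make the "mark low-confidence fields" idea concrete, here is a minimal sketch of the routing step. It assumes the scores have already been computed and flattened into a map of field name to {value, score}; the helper name fieldsNeedingReview and the 0.8 threshold are illustrative choices, not part of the package.

```javascript
// Hypothetical helper: list the fields whose confidence falls below a threshold.
// `scoredFields` is assumed to look like { fieldName: { value, score } }.
function fieldsNeedingReview(scoredFields, threshold = 0.8) {
  return Object.entries(scoredFields)
    .filter(([, field]) => field.score < threshold)
    .map(([name]) => name);
}

// Example input with made-up scores.
const scoredInvoice = {
  vendor: { value: "Acme Corp", score: 0.97 },
  total: { value: 1249.5, score: 0.62 },
  currency: { value: "USD", score: 0.91 },
};

// Fields returned here could be sent to a human reviewer or retried.
const reviewQueue = fieldsNeedingReview(scoredInvoice);
console.log(reviewQueue);
```

The right threshold depends on the cost of a wrong field versus the cost of a review, so it is probably worth tuning per field rather than globally.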

siva7 8 days ago

Wait, this doesn't work with a current-gen model like 4o? Is this a technical limitation?

QueensGambit 8 days ago

Yes, it does. Updated the doc. Thanks for pointing it out! As long as you have the logprobs, it should work:

const jsonOutput = JSON.parse(response.choices[0].message.content);

const result = calculateConfidenceScores(jsonOutput, response.choices[0].logprobs.content);
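For anyone wondering what "having the logprobs" buys you, here is a rough sketch of the underlying idea, not the package's actual implementation. It assumes the token list has the shape of response.choices[0].logprobs.content (objects with token and logprob), uses a mocked list instead of a real API call, and treats the joint probability of the tokens covering a value as its confidence. The helper name spanConfidence is hypothetical.

```javascript
// Mocked token list in the shape of response.choices[0].logprobs.content.
// The logprob values are made up for illustration.
const tokenLogprobs = [
  { token: '{"', logprob: -0.01 },
  { token: "city", logprob: -0.02 },
  { token: '":"', logprob: -0.01 },
  { token: "Paris", logprob: -0.105 },
  { token: '"}', logprob: -0.01 },
];

// Hypothetical helper: confidence of the first occurrence of `needle`
// in the generated text, computed as exp(sum of logprobs) over every
// token that overlaps the matched character range.
function spanConfidence(tokens, needle) {
  const text = tokens.map((t) => t.token).join("");
  const start = text.indexOf(needle);
  if (start === -1) return null; // value not found in the output
  const end = start + needle.length;

  let pos = 0;
  let logprobSum = 0;
  for (const t of tokens) {
    const tokStart = pos;
    const tokEnd = pos + t.token.length;
    if (tokStart < end && tokEnd > start) logprobSum += t.logprob;
    pos = tokEnd;
  }
  return Math.exp(logprobSum); // back from log space to a probability
}

const parisScore = spanConfidence(tokenLogprobs, "Paris");
console.log(parisScore.toFixed(3));
```

Matching by first occurrence is a simplification: a real implementation would need to track positions while parsing so that repeated values map to the right tokens.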

rboobesh 8 days ago

Can this work with nested JSON objects or arrays?

QueensGambit 8 days ago

Yes, it does. It recursively walks through the JSON structure, calculating a confidence score for each individual field — whether it’s a top-level key, nested inside objects, or part of an array of objects. Each leaf field gets a {value, score} pair, and parent objects get an aggregated score based on the confidence of their children.
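The recursive walk described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's code: the leafScore callback (which would normally look up token logprobs for the field at a given path) is hypothetical, and taking the minimum child score as the parent's aggregate is one possible choice among several (a mean would be another).

```javascript
// Walk a parsed JSON value and wrap every node as { value, score }.
// `leafScore(path, value)` is a hypothetical callback returning a
// confidence in [0, 1] for the leaf at `path`.
function scoreTree(node, leafScore, path = []) {
  // Leaves: strings, numbers, booleans, null.
  if (node === null || typeof node !== "object") {
    return { value: node, score: leafScore(path, node) };
  }

  // Containers: recurse into each child, keeping the array/object shape.
  const value = Array.isArray(node) ? [] : {};
  const childScores = [];
  for (const [key, child] of Object.entries(node)) {
    const scored = scoreTree(child, leafScore, [...path, key]);
    value[key] = scored;
    childScores.push(scored.score);
  }

  // Assumption: a parent is only as trustworthy as its weakest child.
  // An empty container gets 1 because nothing in it can be wrong.
  const score = childScores.length ? Math.min(...childScores) : 1;
  return { value, score };
}

// Usage with made-up leaf scores keyed by dotted path.
const demoScores = { name: 0.99, "items.0.sku": 0.6, "items.1.sku": 0.9 };
const demoTree = scoreTree(
  { name: "Order A", items: [{ sku: "x" }, { sku: "y" }] },
  (path) => demoScores[path.join(".")]
);
console.log(demoTree.score, demoTree.value.items.value[0].value.sku.score);
```

Using the minimum makes a single shaky field pull down the whole object, which suits review workflows; averaging would be more forgiving for large objects.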

manidoraisamy 8 days ago

Does it support Claude?

QueensGambit 8 days ago

Not yet. @promptrepo/score relies on token-level logprobs, which OpenAI exposes for their models. Anthropic’s API currently doesn’t expose token-level confidence (like logprobs) for Claude, so we can’t support it until they do. We’d love to add support if/when that capability becomes available.

Show HN: Calculate confidence score for OpenAI JSON output