<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: scottmu</title><link>https://news.ycombinator.com/user?id=scottmu</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sat, 04 Apr 2026 09:13:35 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=scottmu" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by scottmu in "Show HN: Sup AI, a confidence-weighted ensemble (52.15% on Humanity's Last Exam)"]]></title><description><![CDATA[
<p>If 2 (or more) tokens are synonymous with each other and share high probability (49.9% each, for a total of 99.8%), that's still low entropy. Not as low as a single high-probability token, but low enough for us to consider it a low-entropy token distribution.<p>You can't judge from a single token distribution, though. There are many legitimate high-confidence, high-accuracy cases in which many tokens could come next: the first token of a paragraph, for example. You need to aggregate entropies over segments of the output, or over the whole output sequence.<p>Although uncertainty correlates with hallucinations and inaccuracies, there's no guarantee. This is a challenging area; we follow the latest literature and contribute where we can.</p>
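<p>A minimal sketch (illustrative only, not Sup's actual code) of why two 49.9% synonyms still give low entropy:</p>

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# One dominant token: very low entropy.
single = [0.998] + [0.0001] * 20
# Two synonymous tokens splitting the mass: still low entropy (~1 bit).
synonyms = [0.499, 0.499] + [0.0001] * 20
# Near-uniform over the same 22 tokens: high entropy.
uniform = [1 / 22] * 22

print(entropy(single))    # ~0.03 bits
print(entropy(synonyms))  # ~1.03 bits
print(entropy(uniform))   # ~4.46 bits
```

<p>The two-synonym case sits much closer to the single-token case than to the uniform one, which is why it still counts as low entropy.</p>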
]]></description><pubDate>Fri, 27 Mar 2026 23:00:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=47549494</link><dc:creator>scottmu</dc:creator><comments>https://news.ycombinator.com/item?id=47549494</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47549494</guid></item><item><title><![CDATA[New comment by scottmu in "Show HN: Sup AI, a confidence-weighted ensemble (52.15% on Humanity's Last Exam)"]]></title><description><![CDATA[
<p>I like the direction you're going with this strategy. There are many approaches, nuances, edge cases, and clever tricks to each of these steps, even without taking into account token probability distributions. Very powerful to get it right.</p>
]]></description><pubDate>Fri, 27 Mar 2026 22:20:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=47549144</link><dc:creator>scottmu</dc:creator><comments>https://news.ycombinator.com/item?id=47549144</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47549144</guid></item><item><title><![CDATA[New comment by scottmu in "Show HN: Sup AI, a confidence-weighted ensemble (52.15% on Humanity's Last Exam)"]]></title><description><![CDATA[
<p>Yes, there's a wide variety of use cases that require different accuracy/speed tradeoffs. If you require all 3 responses to be accurate, you have to multiply the three per-response accuracy probabilities, and as you've shown, that can reduce overall accuracy quite a bit. Of course, this assumes the 3 responses are independent of one another.</p>
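<p>The arithmetic, as a quick sketch with an assumed 90% per-response accuracy:</p>

```python
# If each of 3 independent responses is accurate with probability 0.9,
# the probability that all 3 are accurate is the product of the three:
p_single = 0.90
p_all_three = p_single ** 3  # 0.9 * 0.9 * 0.9 ≈ 0.729
# A 90%-accurate model gets all 3 right only about 73% of the time.
```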
]]></description><pubDate>Fri, 27 Mar 2026 20:08:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=47547571</link><dc:creator>scottmu</dc:creator><comments>https://news.ycombinator.com/item?id=47547571</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47547571</guid></item><item><title><![CDATA[New comment by scottmu in "Show HN: Sup AI, a confidence-weighted ensemble (52.15% on Humanity's Last Exam)"]]></title><description><![CDATA[
<p>Great question. What I can say is we experimented a _ton_. If you take a basic approach and simply send the same prompt to a bunch of LLMs and then ask another LLM to combine the results, you'll get a pretty poor answer. At best, you'll get a response that is the average of the ensemble, which by definition is worse than the ensemble's best model (and you'll still want a mechanism for choosing the ensemble effectively). At worst, you'll regurgitate the ensemble's worst model. Plus you've added expense and potentially latency. Not a good solution at all.<p>We didn't experiment with different ensemble mechanisms rigorously enough for a research paper. We will, though.<p>Majority voting was actually how we started, and we came up with good mechanisms for stopping early, which saved token costs and time, along with other interesting things we could do with that simple mechanism. The issue was that the orchestration could already choose a single model beforehand that was almost as good (on simpler benchmarks than HLE that we ran at the time) as what majority voting picked after all the responses were complete. And we tried many voting mechanisms, such as having every model in the ensemble vote on all the others' responses.<p>An ablation study would be great to do now, along with many other ideas we've played with. We have better benchmarks than we did just a few months ago, and it would be useful to understand the tradeoffs of different approaches so there could be alternative options for different use cases.</p>
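<p>One simple form of early stopping, as an illustrative sketch (a hypothetical mechanism, not necessarily the one we built): stop polling models once the leading answer's lead exceeds the number of votes still outstanding.</p>

```python
from collections import Counter

def majority_vote_early_stop(answers):
    """Tally canonicalized answers one model at a time and stop as soon
    as the current leader can no longer be overtaken by the votes left."""
    answers = list(answers)
    counts = Counter()
    for i, ans in enumerate(answers):
        counts[ans] += 1
        leader, lead = counts.most_common(1)[0]
        runner_up = max((c for a, c in counts.items() if a != leader), default=0)
        remaining = len(answers) - (i + 1)
        if lead > runner_up + remaining:
            return leader, i + 1  # decided without waiting for the rest
    return counts.most_common(1)[0][0], len(answers)

print(majority_vote_early_stop(["A", "A", "A", "B", "C"]))  # ('A', 3)
```

<p>Here the vote is decided after 3 of 5 responses, so the remaining 2 models never need to finish generating.</p>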
]]></description><pubDate>Fri, 27 Mar 2026 20:01:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=47547502</link><dc:creator>scottmu</dc:creator><comments>https://news.ycombinator.com/item?id=47547502</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47547502</guid></item><item><title><![CDATA[New comment by scottmu in "Show HN: Sup AI, a confidence-weighted ensemble (52.15% on Humanity's Last Exam)"]]></title><description><![CDATA[
<p>You're right! I could've phrased my comment better. Ken actually wanted to edit his post, but it was too late. So he asked me to write a response explaining what he meant. Of course, he could've commented too. I was just trying to be helpful to him and others wanting an explanation.</p>
]]></description><pubDate>Fri, 27 Mar 2026 19:43:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=47547296</link><dc:creator>scottmu</dc:creator><comments>https://news.ycombinator.com/item?id=47547296</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47547296</guid></item><item><title><![CDATA[New comment by scottmu in "Show HN: Sup AI, a confidence-weighted ensemble (52.15% on Humanity's Last Exam)"]]></title><description><![CDATA[
<p>I wouldn't say it's easy to detect hallucinations. Understanding output token probability distributions is only part of a solution, and we still aren't perfect. Just better than individual models.<p>Hallucinations may seem rarer for a couple of reasons. First, models have become more accurate on certain prompts. Second, models are more convincing when they do hallucinate: they may get the overall idea right but hallucinate the details. Hallucinations are still a major problem and are fundamental to how modern LLMs work.</p>
]]></description><pubDate>Fri, 27 Mar 2026 19:41:08 +0000</pubDate><link>https://news.ycombinator.com/item?id=47547275</link><dc:creator>scottmu</dc:creator><comments>https://news.ycombinator.com/item?id=47547275</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47547275</guid></item><item><title><![CDATA[New comment by scottmu in "Show HN: Sup AI, a confidence-weighted ensemble (52.15% on Humanity's Last Exam)"]]></title><description><![CDATA[
<p>I've felt your pain. Models aren't always trained well enough on edge cases and configs.<p>Would love to hear how Sup works out for you.</p>
]]></description><pubDate>Fri, 27 Mar 2026 06:36:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=47539660</link><dc:creator>scottmu</dc:creator><comments>https://news.ycombinator.com/item?id=47539660</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47539660</guid></item><item><title><![CDATA[New comment by scottmu in "Show HN: Sup AI, a confidence-weighted ensemble (52.15% on Humanity's Last Exam)"]]></title><description><![CDATA[
<p>I want to clarify what Ken meant by "entropy in the output token probability distributions." Whenever an LLM outputs a token, it chooses that token out of all possible tokens. The model assigns every candidate token a probability (APIs typically expose these as log-probabilities). Together they form a probability distribution: the output token probabilities sum to 1. Entropy is a measure of uncertainty over that distribution. It's low when the distribution is certain (1 token has a 99.9% probability, and the rest share the leftover 0.1%) and high when it's uncertain (every token has roughly the same probability, so it's pretty much random which token is selected).<p>There is interesting research on the correlation of entropy with accuracy and hallucinations:<p>- <a href="https://www.nature.com/articles/s41586-024-07421-0" rel="nofollow">https://www.nature.com/articles/s41586-024-07421-0</a><p>- <a href="https://arxiv.org/abs/2405.19648" rel="nofollow">https://arxiv.org/abs/2405.19648</a><p>- <a href="https://arxiv.org/abs/2509.04492" rel="nofollow">https://arxiv.org/abs/2509.04492</a> (when only a small number of probabilities are available, which is something we frequently deal with)<p>- <a href="https://arxiv.org/abs/2603.18940" rel="nofollow">https://arxiv.org/abs/2603.18940</a><p>- tons more; happy to chat about them if you're interested</p>
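<p>To make the definition concrete, here's a small sketch (illustrative, not production code) of computing entropy from the log-probabilities an API returns for its top-k candidate tokens:</p>

```python
import math

def entropy_from_logprobs(logprobs):
    """Shannon entropy (nats) from natural-log token probabilities.
    With only the top-k logprobs available, this underestimates the
    true entropy, since the omitted tail terms are all non-negative."""
    probs = [math.exp(lp) for lp in logprobs]
    return -sum(p * math.log(p) for p in probs if p > 0)

# Certain: one token at 99.9% plus two tiny alternatives -> low entropy.
certain = [math.log(0.999), math.log(0.0005), math.log(0.0005)]
# Uncertain: five tokens at 20% each -> high entropy (ln 5 ≈ 1.609 nats).
uncertain = [math.log(0.2)] * 5

print(entropy_from_logprobs(certain))    # ~0.009 nats
print(entropy_from_logprobs(uncertain))  # ~1.609 nats
```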
]]></description><pubDate>Thu, 26 Mar 2026 21:04:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=47535700</link><dc:creator>scottmu</dc:creator><comments>https://news.ycombinator.com/item?id=47535700</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47535700</guid></item></channel></rss>