Hacker News: kostaj

New comment by kostaj in "Disagreement among frontier LLMs on real-world fact-checks"

kostaj — Thu, 28 May 2026 16:14:02 +0000

Some models struggle to combine JSON schema and web search capabilities.

New comment by kostaj in "Disagreement among frontier LLMs on real-world fact-checks"

kostaj — Thu, 28 May 2026 15:46:56 +0000

Good point. In the next version we will also publish the results with a prompt that allows the models to "think out loud" before providing the final verdict.

New comment by kostaj in "Five frontier LLMs disagree on 67% of 1k real-world fact-check claims"

kostaj — Thu, 28 May 2026 15:24:04 +0000

Awesome. We do plan to human-label the 1,000 claims and then compare Lenz' performance vs the 5 models. We've done some limited internal research with 150 claims, but more are needed for statistical significance.

New comment by kostaj in "Disagreement among frontier LLMs on real-world fact-checks"

kostaj — Thu, 28 May 2026 15:21:19 +0000

Agree that some of the claims are forward-looking. That is the messiness of real-world, real-user fact checks. No ground-truth verdicts are provided or used in the study, though. It only measures the level of agreement between the selected models, not which one is right on which claim, i.e. none of the claims is actually labelled.

New comment by kostaj in "Disagreement among frontier LLMs on real-world fact-checks"

kostaj — Thu, 28 May 2026 15:18:35 +0000

Good idea about publishing intra-model variance data! Will include it in the next version. Even if we set aside the two middle buckets (Mostly True and Misleading), which are somewhat subject to interpretation and hedging, on 21% of the claims at least two models still provide polar-opposite verdicts (one model saying True and another saying False).

New comment by kostaj in "Five frontier LLMs disagree on 67% of 1k real-world fact-check claims"

kostaj — Thu, 28 May 2026 15:13:24 +0000

Good point. Processing the substance of the answers might be too labor-intensive (1,000 claims x 5 models), but "thinking out loud" might indeed improve the quality of the answers. And we can still force/ask the models to respond with a clear verdict at the end of their reasoning, as per the chosen rubric.
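
A minimal sketch of how a final verdict could be pulled out of a free-form "think out loud" response. It assumes (this is not in the current prompt) that the models are asked to end with a line of the form "Verdict: <bucket>"; the function name and the regex are illustrative only.

```python
import re
from typing import Optional

# The four buckets of the rubric.
BUCKETS = ("True", "Mostly True", "Misleading", "False")
_CANONICAL = {b.lower(): b for b in BUCKETS}

# Assumption: the verdict sits alone on its own line, e.g. "Verdict: Mostly True".
_VERDICT_RE = re.compile(
    r"^\s*verdict\s*:\s*(mostly true|true|misleading|false)\s*\.?\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def extract_verdict(response: str) -> Optional[str]:
    """Return the last well-formed verdict line, or None if there is none."""
    matches = _VERDICT_RE.findall(response)
    if not matches:
        return None
    return _CANONICAL[matches[-1].lower()]
```

Taking the last match lets a model revise itself mid-reasoning; a response without a clean verdict line returns None and can be retried or counted as an error.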

New comment by kostaj in "Five frontier LLMs disagree on 67% of 1k real-world fact-check claims"

kostaj — Thu, 28 May 2026 15:05:07 +0000

This is in line with my observations and tests as well. It is also supported by the distribution of the verdicts across the 4 buckets: Gemini uses the middle buckets (Mostly True and Misleading) much less often (6% combined for Gemini w/o search), and Opus uses them the most (45% combined). Looks like Gemini is calibrated to be confident and Opus to be careful.

New comment by kostaj in "Five frontier LLMs disagree on 67% of 1k real-world fact-check claims"

kostaj — Thu, 28 May 2026 15:00:57 +0000

Indeed. For algorithms and coding, my personal routine nowadays is to review every detailed plan with Opus 4.7 and GPT-5.5. They tend to find very different types of gaps.

New comment by kostaj in "Five frontier LLMs disagree on 67% of 1k real-world fact-check claims"

kostaj — Thu, 28 May 2026 14:57:30 +0000

Agree that True and Mostly True might be very close and could be a calibration difference. Misleading and False, as well. A better headline number might be the 34% of claims with substantial or polar-opposite verdicts.

New comment by kostaj in "Five frontier LLMs disagree on 67% of 1k real-world fact-check claims"

kostaj — Thu, 28 May 2026 14:54:22 +0000

Agree. Human experts also struggle to agree on this type of claim. The inter-annotator agreement on the verdicts in the AVeriTeC corpus across 50 organizations is κ=0.619: substantial, but well short of perfect.
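
For a chance-corrected number on the model side, something like Cohen's kappa between two models could be computed. This is only a sketch: Cohen's kappa is a two-rater statistic and I am not claiming it is the same kappa variant as the AVeriTeC figure; the function name and input layout are made up for illustration.

```python
from collections import Counter
from typing import Sequence

def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labelling the same claims in the same order."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    # Observed agreement: share of claims with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if both raters labelled independently
    # according to their own label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)
```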

New comment by kostaj in "Five frontier LLMs disagree on 67% of 1k real-world fact-check claims"

kostaj — Thu, 28 May 2026 14:50:34 +0000

Agree with @pjdesno that the 34% substantive or polar disagreement might be a better headline number. Or even the 21% polar disagreement (at least one model True, and at least one model False), which is still high for many real-world applications.

New comment by kostaj in "Disagreement among frontier LLMs on real-world fact-checks"

kostaj — Thu, 28 May 2026 14:47:52 +0000

That's a valid point. During the preliminary research, we also tried more explicit prompts (with an explanation for each of the 4 buckets), as well as a five-bucket rubric (with an Abstain option). Will show in a follow-up paper how the concise vs explicit prompt impacts the distribution of the verdicts and the level of disagreement. One issue to note with the longer prompts is that they open too much room for discussion around the exact prompt used. We should probably preregister the prompt before running any further tests.

New comment by kostaj in "Disagreement among frontier LLMs on real-world fact-checks"

kostaj — Thu, 28 May 2026 14:30:25 +0000

Quick note on the second effect - how LLMs reduce that to a four-category judgment: On 21% of the claims at least two models provide polar-opposite verdicts (at least one model False, and at least one model True). This might be a better measurement of the strict disagreement than the 67% disagreement on the four-bucket rubric.
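
A rough sketch of how the two measures differ, assuming a hypothetical layout with one dict per claim that maps model name to verdict (the names below are illustrative, not the study's actual code):

```python
from typing import Callable, Dict, List

# Hypothetical layout: {"model_a": "True", "model_b": "Misleading", ...} per claim.
Verdicts = Dict[str, str]

def any_disagreement(verdicts: Verdicts) -> bool:
    """True if the models did not all land in the same bucket."""
    return len(set(verdicts.values())) > 1

def polar_disagreement(verdicts: Verdicts) -> bool:
    """True if at least one model said True and at least one said False."""
    labels = set(verdicts.values())
    return "True" in labels and "False" in labels

def share(claims: List[Verdicts], predicate: Callable[[Verdicts], bool]) -> float:
    """Fraction of claims for which the predicate holds."""
    if not claims:
        return 0.0
    return sum(1 for c in claims if predicate(c)) / len(claims)
```

Every polar disagreement is also a disagreement on the four-bucket rubric, so the polar share can never exceed the overall one.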

New comment by kostaj in "Five frontier LLMs disagree on 67% of 1k real-world fact-check claims"

kostaj — Thu, 28 May 2026 14:19:10 +0000

Agree about comparing models with and without search capabilities. Even the two models with search capabilities (Sonar Pro and Gemini) agree on only 58% of the claims.
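
The pairwise number is just exact-match agreement over the same claims. A minimal sketch, assuming two verdict lists in the same claim order (the function name is illustrative):

```python
from typing import Sequence

def pairwise_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Share of claims on which two models return the identical bucket."""
    if len(a) != len(b):
        raise ValueError("verdict lists must cover the same claims")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

The pairwise disagreement rate is one minus this value.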

New comment by kostaj in "Five frontier LLMs disagree on 67% of 1k real-world fact-check claims"

kostaj — Thu, 28 May 2026 14:09:57 +0000

Will add human-labelled expected responses and measure against them in follow-up research. This one only captures the disagreement between the models, not which model is right or wrong.

New comment by kostaj in "Disagreement among frontier LLMs on real-world fact-checks"

kostaj — Thu, 28 May 2026 14:08:11 +0000

The reason for the "No explanations, no qualifiers" in the prompt was to force the models to put the claim in one of the four buckets and answer with the bucket name only. It's a purely quantitative analysis (first in a series) and it does indeed lack the qualitative aspect.

New comment by kostaj in "Disagreement among frontier LLMs on real-world fact-checks"

kostaj — Thu, 28 May 2026 14:03:53 +0000

@john_strinlai @gcr, depends on the application. In many cases an "I don't know" answer is indeed better than a forced answer. But in many production systems, LLMs generate content/response anyway.

Although they inherit the messiness of the real world, the majority of these claims are objective enough to be classifiable by human experts with access to research. We plan to human-label the 1,000 claims and publish follow-up research. Will consider adding an "I don't know" bucket too, as well as clear instructions about the meaning of each of the 4 buckets.

New comment by kostaj in "Disagreement Among Frontier LLMs on Real-World Fact-Checks"

kostaj — Thu, 28 May 2026 13:53:48 +0000

Search was enabled for 2 of the 5 models: Gemini and Sonar Pro. The disagreement between them is still high, with different verdicts on 42% of the claims. Fully agree that some of those claims are hard for a human to classify as well. That is the real-world messiness...

New comment by kostaj in "Disagreement among frontier LLMs on real-world fact-checks"

kostaj — Thu, 28 May 2026 13:46:20 +0000

Two of the five models used (Gemini+Search and Sonar Pro) have retrieval capabilities and used search when classifying the claims. The disagreement between them is still quite significant - 42%.

New comment by kostaj in "Five frontier LLMs disagree on 67% of 1k real-world fact-check claims"

kostaj — Thu, 28 May 2026 13:43:41 +0000

Indeed. I prompted each model once, plus one retry on errors. Very good point to measure the inter-model disagreement! Will add in the next version.

Section "4.2 Agreement w/ peer majority" shows the level of agreement of each model with the majority.
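
One possible reading of "peer majority", sketched below: compare each model's verdict with the most common verdict among the other models. The tie handling (skipping claims where the peers split evenly) is my assumption for this sketch and may differ from what the paper does; names and data layout are illustrative.

```python
from collections import Counter
from typing import Dict, List, Optional

def peer_majority(verdicts: Dict[str, str], model: str) -> Optional[str]:
    """Most common verdict among the other models, or None on a tie for first place."""
    counts = Counter(v for m, v in verdicts.items() if m != model)
    ranked = counts.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

def agreement_with_peers(claims: List[Dict[str, str]], model: str) -> float:
    """Share of non-tied claims where the model matches its peers' majority."""
    hits = total = 0
    for verdicts in claims:
        if model not in verdicts:
            continue
        majority = peer_majority(verdicts, model)
        if majority is None:
            continue  # assumption: claims with tied peers are skipped
        total += 1
        hits += verdicts[model] == majority
    return hits / total if total else 0.0
```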

Yes, we are planning to human-label the same corpus of 1,000 claims and publish a second study measuring the models' performance against the human labels on a corpus that the models have not seen during training.