<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: kostaj</title><link>https://news.ycombinator.com/user?id=kostaj</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Mon, 15 Jun 2026 10:28:22 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=kostaj" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by kostaj in "Disagreement among frontier LLMs on real-world fact-checks"]]></title><description><![CDATA[
<p>Some models struggle to combine JSON schema and web search capabilities.</p>
]]></description><pubDate>Thu, 28 May 2026 16:14:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=48311040</link><dc:creator>kostaj</dc:creator><comments>https://news.ycombinator.com/item?id=48311040</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48311040</guid></item><item><title><![CDATA[New comment by kostaj in "Disagreement among frontier LLMs on real-world fact-checks"]]></title><description><![CDATA[
<p>Good point. The next version will also include results with a prompt that allows the models to "think out loud" before providing the final verdict.</p>
]]></description><pubDate>Thu, 28 May 2026 15:46:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=48310667</link><dc:creator>kostaj</dc:creator><comments>https://news.ycombinator.com/item?id=48310667</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48310667</guid></item><item><title><![CDATA[New comment by kostaj in "Five frontier LLMs disagree on 67% of 1k real-world fact-check claims"]]></title><description><![CDATA[
<p>Awesome. We do plan to human-label the 1,000 claims and then compare Lenz' performance vs the 5 models. We've done some limited internal research with 150 claims, but more are needed for statistical significance.</p>
]]></description><pubDate>Thu, 28 May 2026 15:24:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=48310300</link><dc:creator>kostaj</dc:creator><comments>https://news.ycombinator.com/item?id=48310300</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48310300</guid></item><item><title><![CDATA[New comment by kostaj in "Disagreement among frontier LLMs on real-world fact-checks"]]></title><description><![CDATA[
<p>Agree that some of the claims are forward-looking. That is the messiness of real-world, real-user fact checks. No ground-truth verdicts are provided or used in the study, though. It only measures the level of agreement between the selected models, not which one is right on which claim. I.e., none of the claims is actually labelled.</p>
]]></description><pubDate>Thu, 28 May 2026 15:21:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=48310255</link><dc:creator>kostaj</dc:creator><comments>https://news.ycombinator.com/item?id=48310255</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48310255</guid></item><item><title><![CDATA[New comment by kostaj in "Disagreement among frontier LLMs on real-world fact-checks"]]></title><description><![CDATA[
<p>Good idea about publishing intra-model variance data! Will include it in the next version.
Even if we set aside the two middle buckets (Mostly True and Misleading), which are somewhat subject to interpretation and hedging, at least two models still give polar-opposite verdicts on 21% of the claims (one model saying True and another saying False).</p>
]]></description><pubDate>Thu, 28 May 2026 15:18:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=48310209</link><dc:creator>kostaj</dc:creator><comments>https://news.ycombinator.com/item?id=48310209</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48310209</guid></item><item><title><![CDATA[New comment by kostaj in "Five frontier LLMs disagree on 67% of 1k real-world fact-check claims"]]></title><description><![CDATA[
<p>Good point. Processing the substance of the answers might be too labor-intensive (1,000 claims x 5 models), but "thinking out loud" might indeed improve the quality of the answers. And we can still force/ask them to respond with a clear verdict at the end of their reasoning, as per the chosen rubric.</p>
]]></description><pubDate>Thu, 28 May 2026 15:13:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=48310112</link><dc:creator>kostaj</dc:creator><comments>https://news.ycombinator.com/item?id=48310112</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48310112</guid></item><item><title><![CDATA[New comment by kostaj in "Five frontier LLMs disagree on 67% of 1k real-world fact-check claims"]]></title><description><![CDATA[
<p>This is in line with my observations and tests as well. It is also supported by the distribution of the verdicts across the four buckets: Gemini uses the middle buckets (Mostly True and Misleading) much less often, 6% combined for Gemini w/o search, while Opus uses them the most, 45% combined. It looks like Gemini is calibrated to be confident and Opus to be careful.</p>
]]></description><pubDate>Thu, 28 May 2026 15:05:07 +0000</pubDate><link>https://news.ycombinator.com/item?id=48309987</link><dc:creator>kostaj</dc:creator><comments>https://news.ycombinator.com/item?id=48309987</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48309987</guid></item><item><title><![CDATA[New comment by kostaj in "Five frontier LLMs disagree on 67% of 1k real-world fact-check claims"]]></title><description><![CDATA[
<p>Indeed. For algorithms and coding, my personal routine nowadays is to review every detailed plan with Opus 4.7 and GPT-5.5. They tend to find very different types of gaps.</p>
]]></description><pubDate>Thu, 28 May 2026 15:00:57 +0000</pubDate><link>https://news.ycombinator.com/item?id=48309934</link><dc:creator>kostaj</dc:creator><comments>https://news.ycombinator.com/item?id=48309934</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48309934</guid></item><item><title><![CDATA[New comment by kostaj in "Five frontier LLMs disagree on 67% of 1k real-world fact-check claims"]]></title><description><![CDATA[
<p>Agree that True and Mostly True might be very close and that the difference could be down to calibration. The same goes for Misleading and False. A better headline number might be the 34% of claims with substantial or polar-opposite verdicts.</p>
]]></description><pubDate>Thu, 28 May 2026 14:57:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=48309885</link><dc:creator>kostaj</dc:creator><comments>https://news.ycombinator.com/item?id=48309885</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48309885</guid></item><item><title><![CDATA[New comment by kostaj in "Five frontier LLMs disagree on 67% of 1k real-world fact-check claims"]]></title><description><![CDATA[
<p>Agree. Human experts also struggle to agree on this type of claim. The inter-annotator agreement on the verdicts in the AVeriTeC corpus across 50 organizations is κ=0.619: substantial, but well short of perfect.</p>
]]></description><pubDate>Thu, 28 May 2026 14:54:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=48309839</link><dc:creator>kostaj</dc:creator><comments>https://news.ycombinator.com/item?id=48309839</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48309839</guid></item><item><title><![CDATA[New comment by kostaj in "Five frontier LLMs disagree on 67% of 1k real-world fact-check claims"]]></title><description><![CDATA[
<p>Agree with @pjdesno that the 34% substantive or polar disagreement might be a better headline number. Or even the 21% polar disagreement (at least one model True and at least one model False), which is still high for many real-world applications.</p>
]]></description><pubDate>Thu, 28 May 2026 14:50:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=48309781</link><dc:creator>kostaj</dc:creator><comments>https://news.ycombinator.com/item?id=48309781</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48309781</guid></item><item><title><![CDATA[New comment by kostaj in "Disagreement among frontier LLMs on real-world fact-checks"]]></title><description><![CDATA[
<p>That's a valid point. During the preliminary research, we also tried more explicit prompts (with an explanation for each of the 4 buckets), as well as a five-bucket rubric (with an Abstain option). A follow-up paper will show how the concise vs explicit prompt impacts the distribution of the verdicts and the level of disagreement. One issue to note with the longer prompts is that they open too much room for discussion around the exact prompt used. We should probably preregister the prompt before running any further tests.</p>
]]></description><pubDate>Thu, 28 May 2026 14:47:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=48309739</link><dc:creator>kostaj</dc:creator><comments>https://news.ycombinator.com/item?id=48309739</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48309739</guid></item><item><title><![CDATA[New comment by kostaj in "Disagreement among frontier LLMs on real-world fact-checks"]]></title><description><![CDATA[
<p>Quick note on the second effect - how LLMs reduce that to a four-category judgment: On 21% of the claims at least two models provide polar-opposite verdicts (at least one model False, and at least one model True). This might be a better measurement of the strict disagreement than the 67% disagreement on the four-bucket rubric.</p>
]]></description><pubDate>Thu, 28 May 2026 14:30:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=48309468</link><dc:creator>kostaj</dc:creator><comments>https://news.ycombinator.com/item?id=48309468</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48309468</guid></item><item><title><![CDATA[New comment by kostaj in "Five frontier LLMs disagree on 67% of 1k real-world fact-check claims"]]></title><description><![CDATA[
<p>Agree about comparing models with and without search capabilities. Even the two models with search capabilities (Sonar Pro and Gemini) agree on only 58% of the claims.</p>
]]></description><pubDate>Thu, 28 May 2026 14:19:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=48309313</link><dc:creator>kostaj</dc:creator><comments>https://news.ycombinator.com/item?id=48309313</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48309313</guid></item><item><title><![CDATA[New comment by kostaj in "Five frontier LLMs disagree on 67% of 1k real-world fact-check claims"]]></title><description><![CDATA[
<p>Will add a human-labelled expected response and measure against it in follow-up research. This one only captures the disagreement between the models, not which model is right or wrong.</p>
]]></description><pubDate>Thu, 28 May 2026 14:09:57 +0000</pubDate><link>https://news.ycombinator.com/item?id=48309193</link><dc:creator>kostaj</dc:creator><comments>https://news.ycombinator.com/item?id=48309193</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48309193</guid></item><item><title><![CDATA[New comment by kostaj in "Disagreement among frontier LLMs on real-world fact-checks"]]></title><description><![CDATA[
<p>The reason for the "No explanations, no qualifiers" in the prompt was to force the models to put the claim in one of the four buckets and answer with the bucket name only. It's a purely quantitative analysis (the first in a series) and it does indeed lack the qualitative aspect.</p>
]]></description><pubDate>Thu, 28 May 2026 14:08:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=48309169</link><dc:creator>kostaj</dc:creator><comments>https://news.ycombinator.com/item?id=48309169</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48309169</guid></item><item><title><![CDATA[New comment by kostaj in "Disagreement among frontier LLMs on real-world fact-checks"]]></title><description><![CDATA[
<p>@john_strinlai @gcr, it depends on the application. In many cases an "I don't know" answer is indeed better than a forced answer. But in many production systems, LLMs generate a response anyway.<p>Although they inherit the messiness of the real world, the majority of these claims are objective enough to be classifiable by human experts with access to research. The plan is to human-label the 1,000 claims and publish follow-up research. Will consider adding an "I don't know" bucket too, as well as clear instructions about the meaning of each of the 4 buckets.</p>
]]></description><pubDate>Thu, 28 May 2026 14:03:53 +0000</pubDate><link>https://news.ycombinator.com/item?id=48309117</link><dc:creator>kostaj</dc:creator><comments>https://news.ycombinator.com/item?id=48309117</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48309117</guid></item><item><title><![CDATA[New comment by kostaj in "Disagreement Among Frontier LLMs on Real-World Fact-Checks"]]></title><description><![CDATA[
<p>Search was enabled for 2 of the 5 models: Gemini and Sonar Pro. The disagreement between them is still high, with a different verdict on 42% of the claims. Fully agree that some of those claims are hard for a human to classify as well; that is the real-world messiness.</p>
]]></description><pubDate>Thu, 28 May 2026 13:53:48 +0000</pubDate><link>https://news.ycombinator.com/item?id=48308994</link><dc:creator>kostaj</dc:creator><comments>https://news.ycombinator.com/item?id=48308994</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48308994</guid></item><item><title><![CDATA[New comment by kostaj in "Disagreement among frontier LLMs on real-world fact-checks"]]></title><description><![CDATA[
<p>Two of the five models used (Gemini+Search and Sonar Pro) have retrieval capabilities and used search when classifying the claims. The disagreement between them is still quite significant - 42%.</p>
]]></description><pubDate>Thu, 28 May 2026 13:46:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=48308909</link><dc:creator>kostaj</dc:creator><comments>https://news.ycombinator.com/item?id=48308909</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48308909</guid></item><item><title><![CDATA[New comment by kostaj in "Five frontier LLMs disagree on 67% of 1k real-world fact-check claims"]]></title><description><![CDATA[
<p>Indeed. I prompted each model once, plus one retry on errors. Very good point to measure the inter-model disagreement! Will add it in the next version.<p>Section "4.2 Agreement w/ peer majority" shows the level of agreement of each model with the majority.<p>Yes, the plan is to human-label the same corpus of 1,000 claims and publish a second study measuring the models' performance against the human labels on a corpus that the models have not seen during training.</p>
]]></description><pubDate>Thu, 28 May 2026 13:43:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=48308881</link><dc:creator>kostaj</dc:creator><comments>https://news.ycombinator.com/item?id=48308881</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48308881</guid></item></channel></rss>