<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News - Newest: &#34;LLM evaluation&#34;</title><link>https://news.ycombinator.com/newest</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 14 May 2026 18:37:02 +0000</lastBuildDate><atom:link href="https://hnrss.org/newest?q=LLM+evaluation" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[LLM-eval-kit: Distributed LLM evaluation framework (v0.3.0)]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/benmeryem-tech/llm-eval-kit">https://github.com/benmeryem-tech/llm-eval-kit</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47977901">https://news.ycombinator.com/item?id=47977901</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 01 May 2026 17:59:47 +0000</pubDate><link>https://github.com/benmeryem-tech/llm-eval-kit</link><dc:creator>benmeryem_ai</dc:creator><comments>https://news.ycombinator.com/item?id=47977901</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47977901</guid></item><item><title><![CDATA[A Synthesis of LLM Evaluation]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.aroy.sh/posts/llm-agent-evals/">https://www.aroy.sh/posts/llm-agent-evals/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47417019">https://news.ycombinator.com/item?id=47417019</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 17 Mar 2026 19:23:06 +0000</pubDate><link>https://www.aroy.sh/posts/llm-agent-evals/</link><dc:creator>dpe82</dc:creator><comments>https://news.ycombinator.com/item?id=47417019</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47417019</guid></item><item><title><![CDATA[LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation">https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47346921">https://news.ycombinator.com/item?id=47346921</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 12 Mar 2026 05:40:19 +0000</pubDate><link>https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation</link><dc:creator>vismit2000</dc:creator><comments>https://news.ycombinator.com/item?id=47346921</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47346921</guid></item><item><title><![CDATA[Show HN: Valohai LLM – Track and compare LLM evaluation results in one dashboard]]></title><description><![CDATA[
<p>We built Valohai LLM for tracking and comparing LLM evaluation results. Whether your evals live in notebooks and spreadsheets or in an observability tool that wasn't built for comparison, this gives you a purpose-built eval comparison dashboard.<p>Run evals with a Python library (pip install valohai-llm); results stream in, and you can compare up to six configurations side by side. Group by any dimension (model, category, difficulty) to see where each model excels.<p>It doesn't do tracing or production observability; for now it's just eval tracking and comparison. You can also define the parameters you'd like to test and run a sweep across all of them.<p>Feedback welcome, especially from anyone who compares models and evaluates regularly!</p>
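<p>The grouped comparison described above can be sketched in plain Python. This is a minimal stand-in using only the standard library, not the actual valohai-llm API; the record fields and function name are illustrative assumptions:</p>

```python
from collections import defaultdict

# Each record is one eval result: which configuration produced it,
# which slice of the dataset it belongs to, and its score.
results = [
    {"model": "model-a", "category": "math", "score": 0.9},
    {"model": "model-a", "category": "code", "score": 0.6},
    {"model": "model-b", "category": "math", "score": 0.7},
    {"model": "model-b", "category": "code", "score": 0.8},
]

def group_mean(records, dimension):
    """Average score per (model, dimension-value) pair."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["model"], r[dimension])].append(r["score"])
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

# Grouping by "category" yields one averaged cell per model/category,
# i.e. the rows of a side-by-side comparison table.
table = group_mean(results, "category")
```

<p>Any other field in the records (difficulty, prompt version, temperature) can be passed as the dimension to re-slice the same results without re-running the evals.</p>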
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47073412">https://news.ycombinator.com/item?id=47073412</a></p>
<p>Points: 3</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 19 Feb 2026 13:18:03 +0000</pubDate><link>https://valohai.com/llm/</link><dc:creator>radicain</dc:creator><comments>https://news.ycombinator.com/item?id=47073412</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47073412</guid></item><item><title><![CDATA[The Vocabulary Priming Confound in LLM Evaluation [pdf]]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/Palmerschallon/Dharma_Code/blob/main/paper/vocab_priming_confound.pdf">https://github.com/Palmerschallon/Dharma_Code/blob/main/paper/vocab_priming_confound.pdf</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46955474">https://news.ycombinator.com/item?id=46955474</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 10 Feb 2026 04:43:27 +0000</pubDate><link>https://github.com/Palmerschallon/Dharma_Code/blob/main/paper/vocab_priming_confound.pdf</link><dc:creator>palmerschallon</dc:creator><comments>https://news.ycombinator.com/item?id=46955474</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46955474</guid></item><item><title><![CDATA[Show HN: Dokimos – LLM Evaluation Framework for Java]]></title><description><![CDATA[
<p>I tried building LLM applications in Java and wanted the evaluation experience I know from the Python ecosystem: datasets, experiments, built-in metrics, tracking results over time.<p>Dokimos brings that to Java. It's a framework for evaluating LLM outputs with:<p>- Built-in evaluators for both LLM-based and traditional metrics
- Dataset support (JSON, CSV, or programmatic)
- JUnit and CI/CD integration so evaluations run as parameterized tests alongside your existing test suite
- Experiment tracking with aggregated metrics and export to multiple formats
- Optional server for viewing results over time<p>It integrates with LangChain4j and Spring AI, but works with any LLM client on your local machine.<p>The goal is to make evaluating LLM applications feel like a natural part of Java development. Define your test cases, create or generate datasets, pick your evaluators, run in CI, catch regressions.<p>GitHub: <a href="https://github.com/dokimos-dev/dokimos" rel="nofollow">https://github.com/dokimos-dev/dokimos</a>
Docs: <a href="https://dokimos.dev/overview" rel="nofollow">https://dokimos.dev/overview</a></p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46486108">https://news.ycombinator.com/item?id=46486108</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Sun, 04 Jan 2026 08:30:39 +0000</pubDate><link>https://github.com/dokimos-dev/dokimos</link><dc:creator>fkapsahili</dc:creator><comments>https://news.ycombinator.com/item?id=46486108</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46486108</guid></item><item><title><![CDATA[Show HN: Dokimos – LLM evaluation framework for Java]]></title><description><![CDATA[
<p>I'm working on an open-source project, dokimos, because every LLM eval framework I found was Python- or TypeScript-only, but many companies will be building LLM apps and AI agents in Java.<p>Key features:
- JUnit 5 integration for test-driven evals
- Works with LangChain4j
- Framework-agnostic
- Supports custom evaluators and datasets<p>GitHub: <a href="https://github.com/dokimos-dev/dokimos" rel="nofollow">https://github.com/dokimos-dev/dokimos</a><p>I'd love contributions, or to team up with anyone who has Java experience and wants to work on this.</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46401077">https://news.ycombinator.com/item?id=46401077</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Sat, 27 Dec 2025 11:37:58 +0000</pubDate><link>https://github.com/dokimos-dev/dokimos</link><dc:creator>fkapsahili</dc:creator><comments>https://news.ycombinator.com/item?id=46401077</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46401077</guid></item><item><title><![CDATA[Building an LLM evaluation framework: best practices]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.datadoghq.com/blog/llm-evaluation-framework-best-practices/">https://www.datadoghq.com/blog/llm-evaluation-framework-best-practices/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46330472">https://news.ycombinator.com/item?id=46330472</a></p>
<p>Points: 1</p>
<p># Comments: 1</p>
]]></description><pubDate>Fri, 19 Dec 2025 20:25:55 +0000</pubDate><link>https://www.datadoghq.com/blog/llm-evaluation-framework-best-practices/</link><dc:creator>zenoprax</dc:creator><comments>https://news.ycombinator.com/item?id=46330472</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46330472</guid></item><item><title><![CDATA[Show HN: Spark-LLM-eval – Distributed LLM evaluation for Spark]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/bassrehab/spark-llm-eval">https://github.com/bassrehab/spark-llm-eval</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46288249">https://news.ycombinator.com/item?id=46288249</a></p>
<p>Points: 1</p>
<p># Comments: 1</p>
]]></description><pubDate>Tue, 16 Dec 2025 13:28:03 +0000</pubDate><link>https://github.com/bassrehab/spark-llm-eval</link><dc:creator>subhadipmitra</dc:creator><comments>https://news.ycombinator.com/item?id=46288249</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46288249</guid></item><item><title><![CDATA[Show HN: smallevals – Local LLM Evaluation Framework with Tiny 0.6B Models]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/mburaksayici/smallevals">https://github.com/mburaksayici/smallevals</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46150526">https://news.ycombinator.com/item?id=46150526</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 04 Dec 2025 17:48:59 +0000</pubDate><link>https://github.com/mburaksayici/smallevals</link><dc:creator>mburaksayici</dc:creator><comments>https://news.ycombinator.com/item?id=46150526</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46150526</guid></item><item><title><![CDATA[The LLM Evaluation Guidebook]]></title><description><![CDATA[
<p>Article URL: <a href="https://huggingface.co/spaces/OpenEvals/evaluation-guidebook">https://huggingface.co/spaces/OpenEvals/evaluation-guidebook</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46143116">https://news.ycombinator.com/item?id=46143116</a></p>
<p>Points: 3</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 04 Dec 2025 02:30:55 +0000</pubDate><link>https://huggingface.co/spaces/OpenEvals/evaluation-guidebook</link><dc:creator>aratahikaru5</dc:creator><comments>https://news.ycombinator.com/item?id=46143116</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46143116</guid></item><item><title><![CDATA[A Long-Tail Professional Forum-Based Benchmark for LLM Evaluation]]></title><description><![CDATA[
<p>Article URL: <a href="https://arxiv.org/abs/2511.06346">https://arxiv.org/abs/2511.06346</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46034384">https://news.ycombinator.com/item?id=46034384</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Mon, 24 Nov 2025 14:19:29 +0000</pubDate><link>https://arxiv.org/abs/2511.06346</link><dc:creator>wslh</dc:creator><comments>https://news.ycombinator.com/item?id=46034384</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46034384</guid></item><item><title><![CDATA[Understanding the 4 Main Approaches to LLM Evaluation (From Scratch)]]></title><description><![CDATA[
<p>Article URL: <a href="https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches">https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45600194">https://news.ycombinator.com/item?id=45600194</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 16 Oct 2025 00:40:09 +0000</pubDate><link>https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches</link><dc:creator>ibobev</dc:creator><comments>https://news.ycombinator.com/item?id=45600194</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45600194</guid></item><item><title><![CDATA[Understanding the 4 Main Approaches to LLM Evaluation (From Scratch)]]></title><description><![CDATA[
<p>Article URL: <a href="https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches">https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45544865">https://news.ycombinator.com/item?id=45544865</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 10 Oct 2025 23:09:46 +0000</pubDate><link>https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches</link><dc:creator>ibobev</dc:creator><comments>https://news.ycombinator.com/item?id=45544865</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45544865</guid></item><item><title><![CDATA[LLM Evaluation from Scratch: Multiple Choice, Verifiers, Leaderboards, LLM Judge]]></title><description><![CDATA[
<p>Article URL: <a href="https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches">https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45482523">https://news.ycombinator.com/item?id=45482523</a></p>
<p>Points: 4</p>
<p># Comments: 0</p>
]]></description><pubDate>Sun, 05 Oct 2025 15:55:26 +0000</pubDate><link>https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches</link><dc:creator>ModelForge</dc:creator><comments>https://news.ycombinator.com/item?id=45482523</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45482523</guid></item><item><title><![CDATA[LLM Evaluation via Rap Battles]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/vadim0x60/rapbench/blob/master/README.md">https://github.com/vadim0x60/rapbench/blob/master/README.md</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45140166">https://news.ycombinator.com/item?id=45140166</a></p>
<p>Points: 3</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 05 Sep 2025 16:08:20 +0000</pubDate><link>https://github.com/vadim0x60/rapbench/blob/master/README.md</link><dc:creator>vadimdotme</dc:creator><comments>https://news.ycombinator.com/item?id=45140166</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45140166</guid></item><item><title><![CDATA[LLM Evaluation: Practical Tips at Booking.com]]></title><description><![CDATA[
<p>Article URL: <a href="https://booking.ai/llm-evaluation-practical-tips-at-booking-com-1b038a0d6662">https://booking.ai/llm-evaluation-practical-tips-at-booking-com-1b038a0d6662</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45069847">https://news.ycombinator.com/item?id=45069847</a></p>
<p>Points: 4</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 29 Aug 2025 21:52:32 +0000</pubDate><link>https://booking.ai/llm-evaluation-practical-tips-at-booking-com-1b038a0d6662</link><dc:creator>amrrs</dc:creator><comments>https://news.ycombinator.com/item?id=45069847</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45069847</guid></item><item><title><![CDATA[Streamline LLM Evaluation with Stax]]></title><description><![CDATA[
<p>Article URL: <a href="https://developers.googleblog.com/en/streamline-llm-evaluation-with-stax/">https://developers.googleblog.com/en/streamline-llm-evaluation-with-stax/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45043503">https://news.ycombinator.com/item?id=45043503</a></p>
<p>Points: 5</p>
<p># Comments: 0</p>
]]></description><pubDate>Wed, 27 Aug 2025 18:52:26 +0000</pubDate><link>https://developers.googleblog.com/en/streamline-llm-evaluation-with-stax/</link><dc:creator>saikatsg</dc:creator><comments>https://news.ycombinator.com/item?id=45043503</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45043503</guid></item><item><title><![CDATA[Viteval – an LLM evaluation framework powered by Vitest]]></title><description><![CDATA[
<p>Article URL: <a href="https://viteval.dev/">https://viteval.dev/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44948900">https://news.ycombinator.com/item?id=44948900</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 19 Aug 2025 06:45:34 +0000</pubDate><link>https://viteval.dev/</link><dc:creator>Liriel</dc:creator><comments>https://news.ycombinator.com/item?id=44948900</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44948900</guid></item><item><title><![CDATA[Exploring LLM Evaluation by Using Games]]></title><description><![CDATA[
<p>Article URL: <a href="https://lmgame.org">https://lmgame.org</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44427817">https://news.ycombinator.com/item?id=44427817</a></p>
<p>Points: 3</p>
<p># Comments: 1</p>
]]></description><pubDate>Mon, 30 Jun 2025 20:59:20 +0000</pubDate><link>https://lmgame.org</link><dc:creator>Yuxuan_Zhang13</dc:creator><comments>https://news.ycombinator.com/item?id=44427817</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44427817</guid></item></channel></rss>