<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News - Newest: &#34;LLM evaluation&#34;</title><link>https://news.ycombinator.com/newest</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 14 May 2026 18:37:02 +0000</lastBuildDate><atom:link href="https://hnrss.org/newest?q=LLM+evaluation" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[LLM-eval-kit: Distributed LLM evaluation framework (v0.3.0)]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/benmeryem-tech/llm-eval-kit">https://github.com/benmeryem-tech/llm-eval-kit</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47977901">https://news.ycombinator.com/item?id=47977901</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 01 May 2026 17:59:47 +0000</pubDate><link>https://github.com/benmeryem-tech/llm-eval-kit</link><dc:creator>benmeryem_ai</dc:creator><comments>https://news.ycombinator.com/item?id=47977901</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47977901</guid></item><item><title><![CDATA[A Synthesis of LLM Evaluation]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.aroy.sh/posts/llm-agent-evals/">https://www.aroy.sh/posts/llm-agent-evals/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47417019">https://news.ycombinator.com/item?id=47417019</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 17 Mar 2026 19:23:06 +0000</pubDate><link>https://www.aroy.sh/posts/llm-agent-evals/</link><dc:creator>dpe82</dc:creator><comments>https://news.ycombinator.com/item?id=47417019</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47417019</guid></item><item><title><![CDATA[LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation">https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47346921">https://news.ycombinator.com/item?id=47346921</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 12 Mar 2026 05:40:19 +0000</pubDate><link>https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation</link><dc:creator>vismit2000</dc:creator><comments>https://news.ycombinator.com/item?id=47346921</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47346921</guid></item><item><title><![CDATA[Show HN: Valohai LLM – Track and compare LLM evaluation results in one dashboard]]></title><description><![CDATA[
<p>We built Valohai LLM for tracking and comparing LLM evaluation results. Whether your evals live in notebooks and spreadsheets or in an observability tool that wasn't built for comparison, this gives you a purpose-built eval comparison dashboard.<p>Run evals with a Python library (pip install valohai-llm); results stream in, and you can compare up to six configurations side by side. Group by any dimension (model, category, difficulty) to see where each model excels.<p>It doesn't do tracing or production observability; for now it's just eval tracking and comparison. You can also define the parameters you'd like to test and run a sweep across all of them.<p>Feedback welcome, especially from anyone who compares models and evaluates regularly!</p>
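<p>The grouped comparison described above can be sketched in plain Python. This is a minimal stand-in using only the standard library, not the actual valohai-llm API; the record fields and function name are illustrative assumptions:</p>

```python
from collections import defaultdict

# Each record is one eval result: which configuration produced it,
# which slice of the dataset it belongs to, and its score.
results = [
    {"model": "model-a", "category": "math", "score": 0.9},
    {"model": "model-a", "category": "code", "score": 0.6},
    {"model": "model-b", "category": "math", "score": 0.7},
    {"model": "model-b", "category": "code", "score": 0.8},
]

def group_mean(records, dimension):
    """Average score per (model, dimension-value) pair."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["model"], r[dimension])].append(r["score"])
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

# Grouping by "category" yields one averaged cell per model/category,
# i.e. the rows of a side-by-side comparison table.
table = group_mean(results, "category")
```

<p>Any other field in the records (difficulty, prompt version, temperature) can be passed as the dimension to re-slice the same results without re-running the evals.</p>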
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47073412">https://news.ycombinator.com/item?id=47073412</a></p>
<p>Points: 3</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 19 Feb 2026 13:18:03 +0000</pubDate><link>https://valohai.com/llm/</link><dc:creator>radicain</dc:creator><comments>https://news.ycombinator.com/item?id=47073412</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47073412</guid></item><item><title><![CDATA[The Vocabulary Priming Confound in LLM Evaluation [pdf]]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/Palmerschallon/Dharma_Code/blob/main/paper/vocab_priming_confound.pdf">https://github.com/Palmerschallon/Dharma_Code/blob/main/paper/vocab_priming_confound.pdf</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46955474">https://news.ycombinator.com/item?id=46955474</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 10 Feb 2026 04:43:27 +0000</pubDate><link>https://github.com/Palmerschallon/Dharma_Code/blob/main/paper/vocab_priming_confound.pdf</link><dc:creator>palmerschallon</dc:creator><comments>https://news.ycombinator.com/item?id=46955474</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46955474</guid></item><item><title><![CDATA[Show HN: Dokimos – LLM Evaluation Framework for Java]]></title><description><![CDATA[
<p>I tried building LLM applications in Java and wanted the evaluation experience I know from the Python ecosystem: datasets, experiments, built-in metrics, tracking results over time.<p>Dokimos brings that to Java. It's a framework for evaluating LLM outputs with:<p>- Built-in evaluators for both LLM-based and traditional metrics
- Dataset support (JSON, CSV, or programmatic)
- JUnit and CI/CD integration so evaluations run as parameterized tests alongside your existing test suite
- Experiment tracking with aggregated metrics and export to multiple formats
- Optional server for viewing results over time<p>It integrates with LangChain4j and Spring AI, but works with any LLM client on your local machine.<p>The goal is to make evaluating LLM applications feel like a natural part of Java development. Define your test cases, create or generate datasets, pick your evaluators, run in CI, catch regressions.<p>GitHub: <a href="https://github.com/dokimos-dev/dokimos" rel="nofollow">https://github.com/dokimos-dev/dokimos</a>
Docs: <a href="https://dokimos.dev/overview" rel="nofollow">https://dokimos.dev/overview</a></p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46486108">https://news.ycombinator.com/item?id=46486108</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Sun, 04 Jan 2026 08:30:39 +0000</pubDate><link>https://github.com/dokimos-dev/dokimos</link><dc:creator>fkapsahili</dc:creator><comments>https://news.ycombinator.com/item?id=46486108</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46486108</guid></item><item><title><![CDATA[Show HN: Dokimos – LLM evaluation framework for Java]]></title><description><![CDATA[
<p>I'm working on an open-source project, dokimos, because every LLM eval framework I found was Python- or TypeScript-only, but many companies will be building LLM apps and AI agents in Java.<p>Key features:
- JUnit 5 integration for test-driven evals
- Works with LangChain4j
- Framework-agnostic
- Supports custom evaluators and datasets<p>GitHub: <a href="https://github.com/dokimos-dev/dokimos" rel="nofollow">https://github.com/dokimos-dev/dokimos</a><p>I'd love contributions, or to team up with anyone who has Java experience and wants to work on this.</p>
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46401077">https://news.ycombinator.com/item?id=46401077</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Sat, 27 Dec 2025 11:37:58 +0000</pubDate><link>https://github.com/dokimos-dev/dokimos</link><dc:creator>fkapsahili</dc:creator><comments>https://news.ycombinator.com/item?id=46401077</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46401077</guid></item><item><title><![CDATA[Building an LLM evaluation framework: best practices]]></title><description><![CDATA[
<p>Article URL: <a href="https://www.datadoghq.com/blog/llm-evaluation-framework-best-practices/">https://www.datadoghq.com/blog/llm-evaluation-framework-best-practices/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46330472">https://news.ycombinator.com/item?id=46330472</a></p>
<p>Points: 1</p>
<p># Comments: 1</p>
]]></description><pubDate>Fri, 19 Dec 2025 20:25:55 +0000</pubDate><link>https://www.datadoghq.com/blog/llm-evaluation-framework-best-practices/</link><dc:creator>zenoprax</dc:creator><comments>https://news.ycombinator.com/item?id=46330472</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46330472</guid></item><item><title><![CDATA[Show HN: Spark-LLM-eval – Distributed LLM evaluation for Spark]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/bassrehab/spark-llm-eval">https://github.com/bassrehab/spark-llm-eval</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46288249">https://news.ycombinator.com/item?id=46288249</a></p>
<p>Points: 1</p>
<p># Comments: 1</p>
]]></description><pubDate>Tue, 16 Dec 2025 13:28:03 +0000</pubDate><link>https://github.com/bassrehab/spark-llm-eval</link><dc:creator>subhadipmitra</dc:creator><comments>https://news.ycombinator.com/item?id=46288249</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46288249</guid></item><item><title><![CDATA[Show HN: smallevals – Local LLM Evaluation Framework with Tiny 0.6B Models]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/mburaksayici/smallevals">https://github.com/mburaksayici/smallevals</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46150526">https://news.ycombinator.com/item?id=46150526</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 04 Dec 2025 17:48:59 +0000</pubDate><link>https://github.com/mburaksayici/smallevals</link><dc:creator>mburaksayici</dc:creator><comments>https://news.ycombinator.com/item?id=46150526</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46150526</guid></item><item><title><![CDATA[The LLM Evaluation Guidebook]]></title><description><![CDATA[
<p>Article URL: <a href="https://huggingface.co/spaces/OpenEvals/evaluation-guidebook">https://huggingface.co/spaces/OpenEvals/evaluation-guidebook</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46143116">https://news.ycombinator.com/item?id=46143116</a></p>
<p>Points: 3</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 04 Dec 2025 02:30:55 +0000</pubDate><link>https://huggingface.co/spaces/OpenEvals/evaluation-guidebook</link><dc:creator>aratahikaru5</dc:creator><comments>https://news.ycombinator.com/item?id=46143116</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46143116</guid></item><item><title><![CDATA[A Long-Tail Professional Forum-Based Benchmark for LLM Evaluation]]></title><description><![CDATA[
<p>Article URL: <a href="https://arxiv.org/abs/2511.06346">https://arxiv.org/abs/2511.06346</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=46034384">https://news.ycombinator.com/item?id=46034384</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Mon, 24 Nov 2025 14:19:29 +0000</pubDate><link>https://arxiv.org/abs/2511.06346</link><dc:creator>wslh</dc:creator><comments>https://news.ycombinator.com/item?id=46034384</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46034384</guid></item><item><title><![CDATA[Understanding the 4 Main Approaches to LLM Evaluation (From Scratch)]]></title><description><![CDATA[
<p>Article URL: <a href="https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches">https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45600194">https://news.ycombinator.com/item?id=45600194</a></p>
<p>Points: 1</p>
<p># Comments: 0</p>
]]></description><pubDate>Thu, 16 Oct 2025 00:40:09 +0000</pubDate><link>https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches</link><dc:creator>ibobev</dc:creator><comments>https://news.ycombinator.com/item?id=45600194</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45600194</guid></item><item><title><![CDATA[Understanding the 4 Main Approaches to LLM Evaluation (From Scratch)]]></title><description><![CDATA[
<p>Article URL: <a href="https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches">https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45544865">https://news.ycombinator.com/item?id=45544865</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 10 Oct 2025 23:09:46 +0000</pubDate><link>https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches</link><dc:creator>ibobev</dc:creator><comments>https://news.ycombinator.com/item?id=45544865</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45544865</guid></item><item><title><![CDATA[LLM Evaluation from Scratch: Multiple Choice, Verifiers, Leaderboards, LLM Judge]]></title><description><![CDATA[
<p>Article URL: <a href="https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches">https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45482523">https://news.ycombinator.com/item?id=45482523</a></p>
<p>Points: 4</p>
<p># Comments: 0</p>
]]></description><pubDate>Sun, 05 Oct 2025 15:55:26 +0000</pubDate><link>https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches</link><dc:creator>ModelForge</dc:creator><comments>https://news.ycombinator.com/item?id=45482523</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45482523</guid></item><item><title><![CDATA[LLM Evaluation via Rap Battles]]></title><description><![CDATA[
<p>Article URL: <a href="https://github.com/vadim0x60/rapbench/blob/master/README.md">https://github.com/vadim0x60/rapbench/blob/master/README.md</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45140166">https://news.ycombinator.com/item?id=45140166</a></p>
<p>Points: 3</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 05 Sep 2025 16:08:20 +0000</pubDate><link>https://github.com/vadim0x60/rapbench/blob/master/README.md</link><dc:creator>vadimdotme</dc:creator><comments>https://news.ycombinator.com/item?id=45140166</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45140166</guid></item><item><title><![CDATA[LLM Evaluation: Practical Tips at Booking.com]]></title><description><![CDATA[
<p>Article URL: <a href="https://booking.ai/llm-evaluation-practical-tips-at-booking-com-1b038a0d6662">https://booking.ai/llm-evaluation-practical-tips-at-booking-com-1b038a0d6662</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45069847">https://news.ycombinator.com/item?id=45069847</a></p>
<p>Points: 4</p>
<p># Comments: 0</p>
]]></description><pubDate>Fri, 29 Aug 2025 21:52:32 +0000</pubDate><link>https://booking.ai/llm-evaluation-practical-tips-at-booking-com-1b038a0d6662</link><dc:creator>amrrs</dc:creator><comments>https://news.ycombinator.com/item?id=45069847</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45069847</guid></item><item><title><![CDATA[Streamline LLM Evaluation with Stax]]></title><description><![CDATA[
<p>Article URL: <a href="https://developers.googleblog.com/en/streamline-llm-evaluation-with-stax/">https://developers.googleblog.com/en/streamline-llm-evaluation-with-stax/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=45043503">https://news.ycombinator.com/item?id=45043503</a></p>
<p>Points: 5</p>
<p># Comments: 0</p>
]]></description><pubDate>Wed, 27 Aug 2025 18:52:26 +0000</pubDate><link>https://developers.googleblog.com/en/streamline-llm-evaluation-with-stax/</link><dc:creator>saikatsg</dc:creator><comments>https://news.ycombinator.com/item?id=45043503</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45043503</guid></item><item><title><![CDATA[Viteval – an LLM evaluation framework powered by Vitest]]></title><description><![CDATA[
<p>Article URL: <a href="https://viteval.dev/">https://viteval.dev/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44948900">https://news.ycombinator.com/item?id=44948900</a></p>
<p>Points: 2</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 19 Aug 2025 06:45:34 +0000</pubDate><link>https://viteval.dev/</link><dc:creator>Liriel</dc:creator><comments>https://news.ycombinator.com/item?id=44948900</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44948900</guid></item><item><title><![CDATA[Exploring LLM Evaluation by Using Games]]></title><description><![CDATA[
<p>Article URL: <a href="https://lmgame.org">https://lmgame.org</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=44427817">https://news.ycombinator.com/item?id=44427817</a></p>
<p>Points: 3</p>
<p># Comments: 1</p>
]]></description><pubDate>Mon, 30 Jun 2025 20:59:20 +0000</pubDate><link>https://lmgame.org</link><dc:creator>Yuxuan_Zhang13</dc:creator><comments>https://news.ycombinator.com/item?id=44427817</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44427817</guid></item></channel></rss>