<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: gertlabs</title><link>https://news.ycombinator.com/user?id=gertlabs</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Wed, 22 Apr 2026 08:41:33 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=gertlabs" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by gertlabs in "Gbench Intelligence Benchmark"]]></title><description><![CDATA[
<p>We've been working on a way to address the obvious problems with existing benchmarks by creating a single comprehensive benchmark that measures things technical people care about, while also getting as close to an objective "core intelligence" measurement as possible.<p>Some demo games shown on /spectate give you an idea of how we test models and why this would be difficult to benchmax. I think our benchmark is by far the best relative measurement of artificial intelligence out there. Feedback is welcome and usually acted upon quickly.</p>
]]></description><pubDate>Wed, 22 Apr 2026 00:41:53 +0000</pubDate><link>https://news.ycombinator.com/item?id=47857095</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47857095</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47857095</guid></item><item><title><![CDATA[Gbench Intelligence Benchmark]]></title><description><![CDATA[
<p>Article URL: <a href="https://gertlabs.com/">https://gertlabs.com/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47857019">https://news.ycombinator.com/item?id=47857019</a></p>
<p>Points: 4</p>
<p># Comments: 1</p>
]]></description><pubDate>Wed, 22 Apr 2026 00:35:12 +0000</pubDate><link>https://gertlabs.com/</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47857019</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47857019</guid></item><item><title><![CDATA[New comment by gertlabs in "Kimi K2.6: Advancing open-source coding"]]></title><description><![CDATA[
<p>Update: Kimi K2.5 one-shot results are live. It wasn't a noteworthy release compared to K2.6: <a href="https://gertlabs.com/?mode=oneshot_coding" rel="nofollow">https://gertlabs.com/?mode=oneshot_coding</a></p>
]]></description><pubDate>Tue, 21 Apr 2026 16:35:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=47851128</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47851128</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47851128</guid></item><item><title><![CDATA[New comment by gertlabs in "Kimi K2.6: Advancing open-source coding"]]></title><description><![CDATA[
<p>Thanks -- that one is categorized under Trading/Financial, whereas betting is reserved for games like Pot Limit Omaha Hilo.<p>That's a good feature request -- including the tags on the spectatable demo games.</p>
]]></description><pubDate>Tue, 21 Apr 2026 16:33:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=47851096</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47851096</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47851096</guid></item><item><title><![CDATA[New comment by gertlabs in "Kimi K2.6: Advancing open-source coding"]]></title><description><![CDATA[
<p>Good question. We missed that release entirely. Our automated model checker only went live 2 months ago, so the model list was manually curated prior to that. I'm adding it now. It'll be live in ~12 hours.</p>
]]></description><pubDate>Tue, 21 Apr 2026 05:19:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=47844826</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47844826</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47844826</guid></item><item><title><![CDATA[New comment by gertlabs in "Kimi K2.6: Advancing open-source coding"]]></title><description><![CDATA[
<p>We will as soon as API access is widely available. Once a model goes live, we typically have one-shot reasoning benchmarks up in ~8 hours and comprehensive agentic/combined benchmarks up after 24-48 hours. We're working on building relationships with each lab to have the results before launch.</p>
]]></description><pubDate>Tue, 21 Apr 2026 05:14:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=47844789</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47844789</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47844789</guid></item><item><title><![CDATA[New comment by gertlabs in "Kimi K2.6: Advancing open-source coding"]]></title><description><![CDATA[
<p>We recently added cost (last week), so data is sparse. Check back in a few weeks and it will be represented somewhere on the homepage, probably in the Efficiency Chart at the bottom. We also plan to show model performance deviation over time after we collect more data.<p>I'm interested to hear about any other data representations you'd like to see, too. The goal is to convey the most important information as densely as possible, without too much clutter.</p>
]]></description><pubDate>Tue, 21 Apr 2026 03:54:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=47844333</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47844333</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47844333</guid></item><item><title><![CDATA[New comment by gertlabs in "Kimi vendor verifier – verify accuracy of inference providers"]]></title><description><![CDATA[
<p>I did not know about this! We've put a lot of effort into probing providers and their offerings and auto-selecting the best options. I wonder how well their exacto option works.<p>Going to test it out, thanks!</p>
]]></description><pubDate>Tue, 21 Apr 2026 01:20:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=47843418</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47843418</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47843418</guid></item><item><title><![CDATA[New comment by gertlabs in "Kimi K2.6: Advancing open-source coding"]]></title><description><![CDATA[
<p>It's interesting; I can only speculate as to the underlying reason. When given enough time, models outperform in Rust/C++ on longer agentic tasks and actually perform worst in Python, even for tasks that aren't judged on code speed. <a href="https://gertlabs.com/?mode=agentic_coding" rel="nofollow">https://gertlabs.com/?mode=agentic_coding</a></p>
]]></description><pubDate>Tue, 21 Apr 2026 00:19:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=47842990</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47842990</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47842990</guid></item><item><title><![CDATA[New comment by gertlabs in "Kimi K2.6: Advancing open-source coding"]]></title><description><![CDATA[
<p>Early benchmarks show tremendous improvement over Kimi K2 Thinking, which didn't perform well on our benchmarks (and we do use the best available quantization).<p>Kimi K2.6 is currently the top open weights model in one-shot coding reasoning, a little better than GLM 5.1, and still a strong contender against SOTA models from ~3 months ago (comparable to Gemini 3.1 Pro Preview).<p>Agentic tests are still running; check back tomorrow. Open weights models typically struggle with longer contexts in agentic workflows, but GLM 5.1 handled them very well, so I'm curious how Kimi ends up. Both the old Kimi and the new model are on the slower side, which probably makes them less usable for agentic coding work regardless. The old Kimi K2 was severely benchmaxxed and was only really interesting for generating more variation and temperature, not for solving hard problems. The new one is a much stronger generalist.<p>Overall, the field of open weights models is looking <i>fantastic</i>. A new near-frontier release every week, it seems.<p>Comprehensive, difficult-to-game benchmarks at <a href="https://gertlabs.com/?mode=oneshot_coding" rel="nofollow">https://gertlabs.com/?mode=oneshot_coding</a></p>
]]></description><pubDate>Mon, 20 Apr 2026 22:55:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=47842140</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47842140</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47842140</guid></item><item><title><![CDATA[New comment by gertlabs in "Kimi vendor verifier – verify accuracy of inference providers"]]></title><description><![CDATA[
<p>This is a real issue in our benchmarks. Beware of OpenRouter providers that don't specify quantizations or that use lower ones than you might be expecting. OpenRouter does provide configuration options for this, though filtering by quantization often limits your provider options significantly. That being said, even with the best providers, Kimi-K2-thinking was underwhelming and slow on our benchmarks, albeit interesting and useful for temperature/variation.<p>Kimi K2.6, however, is the new open source leader so far. Agentic evaluations are still in progress, but one-shot coding reasoning benchmarks are ready at <a href="https://gertlabs.com/?mode=oneshot_coding" rel="nofollow">https://gertlabs.com/?mode=oneshot_coding</a></p>
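<p>If you call OpenRouter directly, you can pin acceptable quantizations per request. A minimal sketch, assuming the provider-routing fields ("quantizations", "allow_fallbacks") as documented at the time of writing -- double-check their current docs:</p>
<pre><code>
# Sketch: restrict routing to higher-precision endpoints so you aren't
# silently served a heavily quantized variant. Field names follow
# OpenRouter's provider-routing docs; verify before relying on them.
import os
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "moonshotai/kimi-k2-thinking",
        "messages": [{"role": "user", "content": "Hello"}],
        "provider": {
            # Only accept endpoints serving these quantizations.
            "quantizations": ["fp8", "bf16", "fp16"],
            # Error out rather than fall back to a non-matching provider.
            "allow_fallbacks": False,
        },
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
</code></pre>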
]]></description><pubDate>Mon, 20 Apr 2026 22:54:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=47842129</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47842129</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47842129</guid></item><item><title><![CDATA[New comment by gertlabs in "Claude Opus 4.7"]]></title><description><![CDATA[
<p>We calculate percentiles based on successful submissions only, and then apply success rate as a separate measurement, which is incorporated into our relative rankings.<p>So we do penalize evals where the player failed the game, just not in the percentile measurement (success rate captures playing incorrectly, failing to compile, runtime errors, and other non-infrastructure issues that can be blamed on the model). The design decision there is that percentile tells you how good the model's ideas are (when executed correctly), separately from how often it got something working correctly, but I can see how that's not great UX, at least as presented now.<p>The actual score itself is a combination of percentiles and success rates with some weighting for different categories, nothing fancy.<p>I added a methodology page to the roadmap -- thanks for pointing that out. We've converged on a benchmark methodology that should scale for a very long time, so it's time to document it better.</p>
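<p>For intuition, here's a minimal sketch of that combination in Python. The multiplicative blend, field names, and weights are illustrative, not our production values:</p>
<pre><code>
# Illustrative sketch: percentile ("how good are the ideas") is computed
# over successful submissions only; success rate is a separate reliability
# term; categories are blended with fixed weights.
def category_score(evals):
    successes = [e for e in evals if e["succeeded"]]
    if not successes:
        return 0.0
    # Mean percentile among runs that produced a valid, working submission.
    skill = sum(e["percentile"] for e in successes) / len(successes)
    # Fraction of runs without wrong play, compile errors, crashes, etc.
    reliability = len(successes) / len(evals)
    return skill * reliability  # one plausible combination, not necessarily ours

def overall_score(evals_by_category, weights):
    # weights: e.g. {"oneshot_coding": 0.4, "agentic_coding": 0.4, "decision": 0.2}
    return sum(weights[cat] * category_score(evals)
               for cat, evals in evals_by_category.items())
</code></pre>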
]]></description><pubDate>Fri, 17 Apr 2026 01:13:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=47801513</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47801513</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47801513</guid></item><item><title><![CDATA[New comment by gertlabs in "Claude Opus 4.7"]]></title><description><![CDATA[
<p>We only have some basic time filtering (<a href="https://gertlabs.com/?days=30" rel="nofollow">https://gertlabs.com/?days=30</a>), but most of our samples are from the last 2 months. This is a visualization we plan to add when we've collected more historical data.<p>But we did heavily resample Claude Opus 4.6 during the height of the degraded performance fiasco, and my takeaway is that API-based eval performance was... about the same. Claude Opus 4.6 was just never significantly better than 4.5.<p>We don't really know, though, whether you're getting a different model when authenticated via OAuth/subscription vs. calling the API and paying usage prices. I definitely noticed performance issues recently, too, so I suspect it had more to do with subscription-only degradation and/or hastily shipped harness changes.</p>
]]></description><pubDate>Thu, 16 Apr 2026 22:27:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=47800332</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47800332</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47800332</guid></item><item><title><![CDATA[New comment by gertlabs in "Claude Opus 4.7"]]></title><description><![CDATA[
<p>Early benchmark results on our private complex reasoning suite: <a href="https://gertlabs.com/?mode=agentic_coding" rel="nofollow">https://gertlabs.com/?mode=agentic_coding</a><p>Opus 4.7 is more strategic, more intelligent, and has a higher intelligence floor than 4.6 or 4.5. It's roughly tied with GPT 5.4 as the frontier model for one-shot coding reasoning, and in agentic sessions with tools, it IS the best, as advertised (slightly edging out Opus 4.5, not a typo).<p>We're still running more evals, and it will take a few days to get enough decision making (non-coding) simulations to finalize leaderboard positions, but I don't expect much movement on the coding sections of the leaderboard at this point.<p>Even Anthropic's own model card shows context handling regressions -- we're still working on adding a context-specific visualization and benchmark to the suite to give you the objective numbers there.</p>
]]></description><pubDate>Thu, 16 Apr 2026 20:31:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=47799123</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47799123</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47799123</guid></item><item><title><![CDATA[New comment by gertlabs in "The M×N problem of tool calling and open-source models"]]></title><description><![CDATA[
<p>In our benchmarks we exclusively use a custom harness for measuring tool capability. It has the common tools any harness would have -- a thin wrapper around shell commands, basic file editors, etc. -- but an important part of agentic intelligence is adapting to new tools. Frontier models are already quite adaptable, especially Anthropic models, and they improve with each release. I think a standardized format will become less and less important over time.<p>Benchmarks at <a href="https://gertlabs.com" rel="nofollow">https://gertlabs.com</a></p>
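<p>To give a concrete sense of how thin these wrappers are, here's a sketch of what the shell tool could look like -- the schema style and names are illustrative, not our exact harness:</p>
<pre><code>
# Illustrative sketch of a "thin wrapper around shell commands" tool,
# declared in the JSON-schema style most function-calling APIs accept.
import subprocess

SHELL_TOOL = {
    "name": "shell",
    "description": "Run a shell command in the sandbox and return its output.",
    "parameters": {
        "type": "object",
        "properties": {
            "command": {"type": "string", "description": "Command to execute."},
            "timeout_s": {"type": "integer", "default": 60},
        },
        "required": ["command"],
    },
}

def run_shell(command: str, timeout_s: int = 60) -> str:
    # Capture stdout and stderr together so the model sees compiler
    # errors, stack traces, etc., along with the exit code.
    proc = subprocess.run(command, shell=True, capture_output=True,
                          text=True, timeout=timeout_s)
    return f"exit={proc.returncode}\n{proc.stdout}{proc.stderr}"
</code></pre>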
]]></description><pubDate>Tue, 14 Apr 2026 22:41:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=47772399</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47772399</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47772399</guid></item><item><title><![CDATA[Gemma 4 and the Economics of Selling AI]]></title><description><![CDATA[
<p>Article URL: <a href="https://gertlabs.com/blog/gemma-4-economics">https://gertlabs.com/blog/gemma-4-economics</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47768202">https://news.ycombinator.com/item?id=47768202</a></p>
<p>Points: 6</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 14 Apr 2026 16:59:25 +0000</pubDate><link>https://gertlabs.com/blog/gemma-4-economics</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47768202</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47768202</guid></item><item><title><![CDATA[New comment by gertlabs in "I ran Gemma 4 as a local model in Codex CLI"]]></title><description><![CDATA[
<p>We add samples every week, so I'm curious whether the numbers will move.<p>Google did a similar re-release during the Gemini 3.1 Pro Preview rollout, shipping a custom-tools version with its own slug, which performs MUCH better on custom harnesses (mostly because the original release could not figure out tool call formatting at all).</p>
]]></description><pubDate>Tue, 14 Apr 2026 06:57:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=47762177</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47762177</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47762177</guid></item><item><title><![CDATA[New comment by gertlabs in "I ran Gemma 4 as a local model in Codex CLI"]]></title><description><![CDATA[
<p>In one-shot coding, surprisingly, yes, by a decent amount. And it isn't a sample-size issue. In agentic, no: <a href="https://gertlabs.com/?agentic=agentic" rel="nofollow">https://gertlabs.com/?agentic=agentic</a><p>My early takeaway is that Gemma 26B-A4B is the best-tuned of the bunch, but being small and with few active params, it's severely constrained by context (large inputs and tasks with large required outputs tank Gemma 26B's performance). We're working on a clean visualization for this; the data is there.<p>It's not uncommon for a sub-release of a model to show improvements across the board on its model card but have mixed real-world performance compared to its predecessor (sometimes even being worse on average).</p>
]]></description><pubDate>Mon, 13 Apr 2026 19:10:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=47756571</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47756571</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47756571</guid></item><item><title><![CDATA[New comment by gertlabs in "I ran Gemma 4 as a local model in Codex CLI"]]></title><description><![CDATA[
<p>Gemma 4 26B really is an outlier in its weight class.<p>In our little-known, difficult-to-game benchmarks, it scored about as well as GPT 5.2 and Gemini 3 Pro Preview on one-shot coding problems. It had me re-reviewing our entire benchmarking methodology.<p>But it struggled in the other two sections of our benchmark: agentic coding and non-coding decision making. Tool use, iterative refinement, managing large contexts, and reasoning outside of coding brought the scores back down to reality. It actually performed worse when it had to use tools and a custom harness to write code for an eval vs. getting the chance to one-shot it. No doubt it's been overfit on common harnesses and agentic benchmarks. But the main problem is likely context scaling on small models.<p>Still, an incredible model, and incredible speed on an M-series MacBook. Benchmarks at <a href="https://gertlabs.com" rel="nofollow">https://gertlabs.com</a></p>
]]></description><pubDate>Mon, 13 Apr 2026 16:27:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=47754453</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47754453</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47754453</guid></item><item><title><![CDATA[New comment by gertlabs in "High-Level Rust: Getting 80% of the Benefits with 20% of the Pain"]]></title><description><![CDATA[
<p>I partially agree, but C++ is the second-best agentic language (of the 6 we tested)! LLMs are pretty good at reading machine output. My pet theory is that it has more to do with the training data in lower-level languages being of a more interesting algorithmic variety, on average.</p>
]]></description><pubDate>Mon, 13 Apr 2026 00:35:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=47746119</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47746119</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47746119</guid></item></channel></rss>