<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: gertlabs</title><link>https://news.ycombinator.com/user?id=gertlabs</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Wed, 22 Apr 2026 08:41:33 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=gertlabs" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by gertlabs in "Gbench Intelligence Benchmark"]]></title><description><![CDATA[
<p>We've been working on a way to address the obvious problems with existing benchmarks by creating a single comprehensive benchmark that measures things technical people care about, while also getting as close to an objective "core intelligence" measurement as possible.<p>Some demo games shown on /spectate give you an idea of how we test models and why this would be difficult to benchmax. I think our benchmark is by far the best relative measurement of artificial intelligence out there. Feedback is welcome and usually acted upon quickly.</p>
]]></description><pubDate>Wed, 22 Apr 2026 00:41:53 +0000</pubDate><link>https://news.ycombinator.com/item?id=47857095</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47857095</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47857095</guid></item><item><title><![CDATA[Gbench Intelligence Benchmark]]></title><description><![CDATA[
<p>Article URL: <a href="https://gertlabs.com/">https://gertlabs.com/</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47857019">https://news.ycombinator.com/item?id=47857019</a></p>
<p>Points: 4</p>
<p># Comments: 1</p>
]]></description><pubDate>Wed, 22 Apr 2026 00:35:12 +0000</pubDate><link>https://gertlabs.com/</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47857019</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47857019</guid></item><item><title><![CDATA[New comment by gertlabs in "Kimi K2.6: Advancing open-source coding"]]></title><description><![CDATA[
<p>Update: Kimi K2.5 one-shot results are live. It wasn't a noteworthy release compared to K2.6: <a href="https://gertlabs.com/?mode=oneshot_coding" rel="nofollow">https://gertlabs.com/?mode=oneshot_coding</a></p>
]]></description><pubDate>Tue, 21 Apr 2026 16:35:22 +0000</pubDate><link>https://news.ycombinator.com/item?id=47851128</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47851128</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47851128</guid></item><item><title><![CDATA[New comment by gertlabs in "Kimi K2.6: Advancing open-source coding"]]></title><description><![CDATA[
<p>Thanks -- that one is categorized under Trading/Financial, whereas betting is reserved for games like Pot Limit Omaha Hilo.<p>That's a good feature request -- including the tags on the spectatable demo games.</p>
]]></description><pubDate>Tue, 21 Apr 2026 16:33:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=47851096</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47851096</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47851096</guid></item><item><title><![CDATA[New comment by gertlabs in "Kimi K2.6: Advancing open-source coding"]]></title><description><![CDATA[
<p>Good question. We missed that release entirely. Our automated model checker only went live 2 months ago, so the model list was manually curated prior to that. I'm adding it now. It'll be live in ~12 hours.</p>
]]></description><pubDate>Tue, 21 Apr 2026 05:19:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=47844826</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47844826</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47844826</guid></item><item><title><![CDATA[New comment by gertlabs in "Kimi K2.6: Advancing open-source coding"]]></title><description><![CDATA[
<p>We will as soon as API access is widely available. Once a model goes live, we typically have one-shot reasoning benchmarks up in ~8 hours and comprehensive agentic/combined benchmarks up after 24-48 hours. We're working on building relationships with each lab to have the results before launch.</p>
]]></description><pubDate>Tue, 21 Apr 2026 05:14:36 +0000</pubDate><link>https://news.ycombinator.com/item?id=47844789</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47844789</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47844789</guid></item><item><title><![CDATA[New comment by gertlabs in "Kimi K2.6: Advancing open-source coding"]]></title><description><![CDATA[
<p>We recently added cost (last week), so data is sparse. Check back in a few weeks and it will be represented somewhere on the homepage, probably in the Efficiency Chart at the bottom. We also plan to show model performance deviation over time after we collect more data.<p>I'm interested to hear about any other data representations you'd like to see, too. The goal is to convey the most important information as densely as possible, without too much clutter.</p>
]]></description><pubDate>Tue, 21 Apr 2026 03:54:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=47844333</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47844333</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47844333</guid></item><item><title><![CDATA[New comment by gertlabs in "Kimi vendor verifier – verify accuracy of inference providers"]]></title><description><![CDATA[
<p>I did not know about this! We've put a lot of effort into probing providers and their offerings and auto-selecting the best options. I wonder how well their exacto option works.<p>Going to test it out, thanks!</p>
]]></description><pubDate>Tue, 21 Apr 2026 01:20:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=47843418</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47843418</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47843418</guid></item><item><title><![CDATA[New comment by gertlabs in "Kimi K2.6: Advancing open-source coding"]]></title><description><![CDATA[
<p>It's interesting; I can only speculate as to the underlying reason. When given enough time, models outperform in Rust/C++ on longer agentic tasks and actually perform worst in Python, even for tasks that aren't judged on code speed. <a href="https://gertlabs.com/?mode=agentic_coding" rel="nofollow">https://gertlabs.com/?mode=agentic_coding</a></p>
]]></description><pubDate>Tue, 21 Apr 2026 00:19:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=47842990</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47842990</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47842990</guid></item><item><title><![CDATA[New comment by gertlabs in "Kimi K2.6: Advancing open-source coding"]]></title><description><![CDATA[
<p>Early benchmarks show tremendous improvement over Kimi K2 Thinking, which didn't perform well on our benchmarks (and we do use the best available quantization).<p>Kimi K2.6 is currently the top open weights model in one-shot coding reasoning, a little better than GLM 5.1, and still a strong contender against SOTA models from ~3 months ago (comparable to Gemini 3.1 Pro Preview).<p>Agentic tests are still running; check back tomorrow. Open weights models typically struggle with longer contexts in agentic workflows, but GLM 5.1 handled them very well, so I'm curious how Kimi ends up. Both the old Kimi and the new model are on the slower side, which probably makes them less usable for agentic coding work regardless. The old Kimi K2 was severely benchmaxxed and was only really interesting for generating more variation and temperature, not for solving hard problems. The new one is a much stronger generalist.<p>Overall, the field of open weights models is looking <i>fantastic</i>. A new near-frontier release every week, it seems.<p>Comprehensive, difficult-to-game benchmarks at <a href="https://gertlabs.com/?mode=oneshot_coding" rel="nofollow">https://gertlabs.com/?mode=oneshot_coding</a></p>
]]></description><pubDate>Mon, 20 Apr 2026 22:55:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=47842140</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47842140</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47842140</guid></item><item><title><![CDATA[New comment by gertlabs in "Kimi vendor verifier – verify accuracy of inference providers"]]></title><description><![CDATA[
<p>This is a real issue in our benchmarks. Beware of OpenRouter providers that don't specify quantizations or that use lower ones than you might be expecting. OpenRouter does provide configuration options for this, though filtering by quantization often limits your provider options significantly. That being said, even with the best providers, Kimi-K2-thinking was underwhelming and slow on our benchmarks, albeit interesting and useful for temperature/variation.<p>Kimi K2.6, however, is the new open source leader so far. Agentic evaluations are still in progress, but one-shot coding reasoning benchmarks are ready at <a href="https://gertlabs.com/?mode=oneshot_coding" rel="nofollow">https://gertlabs.com/?mode=oneshot_coding</a></p>
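<p>If you call OpenRouter directly, you can pin acceptable quantizations per request. A minimal sketch, assuming the provider-routing fields ("quantizations", "allow_fallbacks") as documented at the time of writing -- double-check their current docs:</p>
<pre><code>
# Sketch: restrict routing to higher-precision endpoints so you aren't
# silently served a heavily quantized variant. Field names follow
# OpenRouter's provider-routing docs; verify before relying on them.
import os
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "moonshotai/kimi-k2-thinking",
        "messages": [{"role": "user", "content": "Hello"}],
        "provider": {
            # Only accept endpoints serving these quantizations.
            "quantizations": ["fp8", "bf16", "fp16"],
            # Error out rather than fall back to a non-matching provider.
            "allow_fallbacks": False,
        },
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
</code></pre>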
]]></description><pubDate>Mon, 20 Apr 2026 22:54:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=47842129</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47842129</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47842129</guid></item><item><title><![CDATA[New comment by gertlabs in "Claude Opus 4.7"]]></title><description><![CDATA[
<p>We calculate percentiles based on successful submissions only, and then apply success rate as a separate measurement, which is incorporated into our relative rankings.<p>So we do penalize evals where the player failed the game, just not in the percentile measurement (success rate captures playing incorrectly, failing to compile, runtime errors, and other non-infrastructure issues that can be blamed on the model). The design decision there is that percentile tells you how good the model's ideas are (when executed correctly), separately from how often it got something working correctly, but I can see how that's not great UX, at least as presented now.<p>The actual score itself is a combination of percentiles and success rates with some weighting for different categories, nothing fancy.<p>I added a methodology page to the roadmap -- thanks for pointing that out. We've converged on a benchmark methodology that should scale for a very long time, so it's time to document it better.</p>
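<p>For intuition, here's a minimal sketch of that combination in Python. The multiplicative blend, field names, and weights are illustrative, not our production values:</p>
<pre><code>
# Illustrative sketch: percentile ("how good are the ideas") is computed
# over successful submissions only; success rate is a separate reliability
# term; categories are blended with fixed weights.
def category_score(evals):
    successes = [e for e in evals if e["succeeded"]]
    if not successes:
        return 0.0
    # Mean percentile among runs that produced a valid, working submission.
    skill = sum(e["percentile"] for e in successes) / len(successes)
    # Fraction of runs without wrong play, compile errors, crashes, etc.
    reliability = len(successes) / len(evals)
    return skill * reliability  # one plausible combination, not necessarily ours

def overall_score(evals_by_category, weights):
    # weights: e.g. {"oneshot_coding": 0.4, "agentic_coding": 0.4, "decision": 0.2}
    return sum(weights[cat] * category_score(evals)
               for cat, evals in evals_by_category.items())
</code></pre>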
]]></description><pubDate>Fri, 17 Apr 2026 01:13:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=47801513</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47801513</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47801513</guid></item><item><title><![CDATA[New comment by gertlabs in "Claude Opus 4.7"]]></title><description><![CDATA[
<p>We only have some basic time filtering (<a href="https://gertlabs.com/?days=30" rel="nofollow">https://gertlabs.com/?days=30</a>), but most of our samples are from the last 2 months. This is a visualization we plan to add when we've collected more historical data.<p>But we did heavily resample Claude Opus 4.6 during the height of the degraded performance fiasco, and my takeaway is that API-based eval performance was... about the same. Claude Opus 4.6 was just never significantly better than 4.5.<p>We don't really know, though, whether you're getting a different model when authenticated via OAuth/subscription vs. calling the API and paying usage prices. I definitely noticed performance issues recently, too, so I suspect it had more to do with subscription-only degradation and/or hastily shipped harness changes.</p>
]]></description><pubDate>Thu, 16 Apr 2026 22:27:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=47800332</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47800332</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47800332</guid></item><item><title><![CDATA[New comment by gertlabs in "Claude Opus 4.7"]]></title><description><![CDATA[
<p>Early benchmark results on our private complex reasoning suite: <a href="https://gertlabs.com/?mode=agentic_coding" rel="nofollow">https://gertlabs.com/?mode=agentic_coding</a><p>Opus 4.7 is more strategic, more intelligent, and has a higher intelligence floor than 4.6 or 4.5. It's roughly tied with GPT 5.4 as the frontier model for one-shot coding reasoning, and in agentic sessions with tools, it IS the best, as advertised (slightly edging out Opus 4.5, not a typo).<p>We're still running more evals, and it will take a few days to get enough decision making (non-coding) simulations to finalize leaderboard positions, but I don't expect much movement on the coding sections of the leaderboard at this point.<p>Even Anthropic's own model card shows context handling regressions -- we're still working on adding a context-specific visualization and benchmark to the suite to give you the objective numbers there.</p>
]]></description><pubDate>Thu, 16 Apr 2026 20:31:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=47799123</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47799123</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47799123</guid></item><item><title><![CDATA[New comment by gertlabs in "The M×N problem of tool calling and open-source models"]]></title><description><![CDATA[
<p>In our benchmarks we exclusively use a custom harness for measuring tool capability. It has the common tools any harness would have -- a thin wrapper around shell commands, basic file editors, etc. -- but an important part of agentic intelligence is adapting to new tools. Frontier models are already quite adaptable, especially Anthropic models, and they improve with each release. I think a standardized format will become less and less important over time.<p>Benchmarks at <a href="https://gertlabs.com" rel="nofollow">https://gertlabs.com</a></p>
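<p>To give a concrete sense of how thin these wrappers are, here's a sketch of what the shell tool could look like -- the schema style and names are illustrative, not our exact harness:</p>
<pre><code>
# Illustrative sketch of a "thin wrapper around shell commands" tool,
# declared in the JSON-schema style most function-calling APIs accept.
import subprocess

SHELL_TOOL = {
    "name": "shell",
    "description": "Run a shell command in the sandbox and return its output.",
    "parameters": {
        "type": "object",
        "properties": {
            "command": {"type": "string", "description": "Command to execute."},
            "timeout_s": {"type": "integer", "default": 60},
        },
        "required": ["command"],
    },
}

def run_shell(command: str, timeout_s: int = 60) -> str:
    # Capture stdout and stderr together so the model sees compiler
    # errors, stack traces, etc., along with the exit code.
    proc = subprocess.run(command, shell=True, capture_output=True,
                          text=True, timeout=timeout_s)
    return f"exit={proc.returncode}\n{proc.stdout}{proc.stderr}"
</code></pre>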
]]></description><pubDate>Tue, 14 Apr 2026 22:41:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=47772399</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47772399</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47772399</guid></item><item><title><![CDATA[Gemma 4 and the Economics of Selling AI]]></title><description><![CDATA[
<p>Article URL: <a href="https://gertlabs.com/blog/gemma-4-economics">https://gertlabs.com/blog/gemma-4-economics</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=47768202">https://news.ycombinator.com/item?id=47768202</a></p>
<p>Points: 6</p>
<p># Comments: 0</p>
]]></description><pubDate>Tue, 14 Apr 2026 16:59:25 +0000</pubDate><link>https://gertlabs.com/blog/gemma-4-economics</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47768202</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47768202</guid></item><item><title><![CDATA[New comment by gertlabs in "I ran Gemma 4 as a local model in Codex CLI"]]></title><description><![CDATA[
<p>We add samples every week, so I'm curious whether the numbers will move.<p>Google did a similar re-release during the Gemini 3.1 Pro Preview rollout, shipping a custom-tools version with its own slug, which performs MUCH better on custom harnesses (mostly because the original release could not figure out tool call formatting at all).</p>
]]></description><pubDate>Tue, 14 Apr 2026 06:57:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=47762177</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47762177</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47762177</guid></item><item><title><![CDATA[New comment by gertlabs in "I ran Gemma 4 as a local model in Codex CLI"]]></title><description><![CDATA[
<p>In one-shot coding, surprisingly, yes, by a decent amount. And it isn't a sample-size issue. In agentic, no: <a href="https://gertlabs.com/?agentic=agentic" rel="nofollow">https://gertlabs.com/?agentic=agentic</a><p>My early takeaway is that Gemma 26B-A4B is the best-tuned of the bunch, but being small and with few active params, it's severely constrained by context (large inputs and tasks with large required outputs tank Gemma 26B's performance). We're working on a clean visualization for this; the data is there.<p>It's not uncommon for a sub-release of a model to show improvements across the board on its model card but have mixed real-world performance compared to its predecessor (sometimes even being worse on average).</p>
]]></description><pubDate>Mon, 13 Apr 2026 19:10:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=47756571</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47756571</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47756571</guid></item><item><title><![CDATA[New comment by gertlabs in "I ran Gemma 4 as a local model in Codex CLI"]]></title><description><![CDATA[
<p>Gemma 4 26B really is an outlier in its weight class.<p>In our little-known, difficult-to-game benchmarks, it scored about as well as GPT 5.2 and Gemini 3 Pro Preview on one-shot coding problems. It had me re-reviewing our entire benchmarking methodology.<p>But it struggled in the other two sections of our benchmark: agentic coding and non-coding decision making. Tool use, iterative refinement, managing large contexts, and reasoning outside of coding brought the scores back down to reality. It actually performed worse when it had to use tools and a custom harness to write code for an eval vs. getting the chance to one-shot it. No doubt it's been overfit on common harnesses and agentic benchmarks. But the main problem is likely context scaling on small models.<p>Still, an incredible model, and incredible speed on an M-series MacBook. Benchmarks at <a href="https://gertlabs.com" rel="nofollow">https://gertlabs.com</a></p>
]]></description><pubDate>Mon, 13 Apr 2026 16:27:40 +0000</pubDate><link>https://news.ycombinator.com/item?id=47754453</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47754453</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47754453</guid></item><item><title><![CDATA[New comment by gertlabs in "High-Level Rust: Getting 80% of the Benefits with 20% of the Pain"]]></title><description><![CDATA[
<p>I partially agree, but C++ is the second-best agentic language (of the 6 we tested)! LLMs are pretty good at reading machine output. My pet theory is that it has more to do with the training data in lower-level languages being of a more interesting algorithmic variety, on average.</p>
]]></description><pubDate>Mon, 13 Apr 2026 00:35:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=47746119</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=47746119</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47746119</guid></item></channel></rss>