<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: gertlabs</title><link>https://news.ycombinator.com/user?id=gertlabs</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Fri, 19 Jun 2026 11:05:45 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=gertlabs" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by gertlabs in "GLM-5.2 is the new leading open weights model on Artificial Analysis"]]></title><description><![CDATA[
<p>They're within confidence intervals of each other, but remember how much discussion there was that Opus 4.6 had been nerfed in March. We averaged samples over the entire lifetime of Opus 4.6, which likely served many different underlying checkpoints. Even the best version of Opus 4.6 was hardly an upgrade.<p>We find a lot of interesting anomalies with our benchmark that hold up under large sample sizes.</p>
]]></description><pubDate>Thu, 18 Jun 2026 04:43:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=48580915</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=48580915</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48580915</guid></item><item><title><![CDATA[New comment by gertlabs in "A robot is sprinting towards you. Do you want it running on Claude or Grok?"]]></title><description><![CDATA[
<p>All of our posts have been well received by an insanely high percentage of people who have interacted on here -- most people clearly find what we're doing interesting and relevant to the HN community (AI evaluations). A flag seems pretty aggressive! Especially when the top comment on the article (after our above comment got flagged) is about tacos.<p>I'm a person running the account, and I only post where I think we have a relevant contribution.</p>
]]></description><pubDate>Wed, 17 Jun 2026 22:41:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=48578009</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=48578009</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48578009</guid></item><item><title><![CDATA[New comment by gertlabs in "GLM 5.2 Performance Benchmarks"]]></title><description><![CDATA[
<p>On our multi-agent coding and reasoning evaluations, GLM 5.2 is the first model we've tested that crossed the threshold of being on par with or better than Opus 4.6 (although, as usual, we rank GLM 5.2 and most other Chinese models a bit lower than most other benchmarks do, since their test methodologies are more vulnerable to benchmaxxing).<p>Data at <a href="https://gertlabs.com/rankings" rel="nofollow">https://gertlabs.com/rankings</a></p>
]]></description><pubDate>Wed, 17 Jun 2026 16:52:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=48573114</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=48573114</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48573114</guid></item><item><title><![CDATA[New comment by gertlabs in "GLM-5.2 is the new leading open weights model on Artificial Analysis"]]></title><description><![CDATA[
<p>GLM 5.2 is the first model we've tested that is unambiguously on par with, or better than, Opus 4.6 (although, as usual, we rank GLM 5.2 and most other Chinese models a bit lower than do most other benchmarks, which have more vulnerable test methodologies).<p>Data at <a href="https://gertlabs.com/rankings" rel="nofollow">https://gertlabs.com/rankings</a></p>
]]></description><pubDate>Wed, 17 Jun 2026 16:43:44 +0000</pubDate><link>https://news.ycombinator.com/item?id=48573005</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=48573005</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48573005</guid></item><item><title><![CDATA[New comment by gertlabs in "MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second"]]></title><description><![CDATA[
<p>It's likely overfit to common harnesses and iteration patterns, so it struggles with formatting tool calls and JSON in our testing, which uses our own harnesses (although there is a lot of overlap with tools that would be found in any coding harness, like bash, apply_patch, etc.).<p>We didn't love the results because they draw negative scrutiny to our benchmark, but the results are real and were produced at scale, and I think DeepSeek V4 Pro's inability to do agentic work outside of environments it was trained on is an important thing to measure, especially when so many other models can generalize to new environments just fine.<p>Google models also struggle with tools, but they have very strong initial answers, so there is more potential for them to bridge the gap with some better post-training.</p>
]]></description><pubDate>Tue, 09 Jun 2026 01:51:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=48455186</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=48455186</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48455186</guid></item><item><title><![CDATA[New comment by gertlabs in "MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second"]]></title><description><![CDATA[
<p>DeepSeek v4 Pro struggles with a custom harness, and all the models ranked above it don't, so it gets downweighted in the agentic coding benchmarks (although it ranks better than Flash in one-shot problem solving: <a href="https://gertlabs.com/rankings?ow=1&mode=oneshot_coding" rel="nofollow">https://gertlabs.com/rankings?ow=1&mode=oneshot_coding</a>). We ran plenty of samples.<p>MiMo v2.5 is on there, as well as the pro version.<p>We found a few anomalies in our evaluations, which makes sense -- if every new sub-release is better across the board in every area of the model card, that should raise alarms about benchmaxxing. But the main thing we found is that hype != performance, and I trust our benchmark methodology significantly more than the model cards the labs add to their press releases.</p>
]]></description><pubDate>Mon, 08 Jun 2026 19:59:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=48450981</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=48450981</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48450981</guid></item><item><title><![CDATA[New comment by gertlabs in "MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second"]]></title><description><![CDATA[
<p>MiMo V2.5 Pro (regular speed) remains the strongest open weights agentic coding model we've tested -- it's been interesting to see how little attention it has received relative to some lower performing releases. And the "fast mode" pricing is very competitive here.<p>Data at <a href="https://gertlabs.com/rankings" rel="nofollow">https://gertlabs.com/rankings</a></p>
]]></description><pubDate>Mon, 08 Jun 2026 16:44:13 +0000</pubDate><link>https://news.ycombinator.com/item?id=48447704</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=48447704</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48447704</guid></item><item><title><![CDATA[New comment by gertlabs in "My thoughts after using Clojure for about a month"]]></title><description><![CDATA[
<p>GPT 5.4+ models are extremely good at writing Clojure, agreed. In the agentic coding part of our benchmark, they do have access to the REPL via bash if they choose to use it. Filtered here: <a href="https://gertlabs.com/rankings?mode=agentic_coding" rel="nofollow">https://gertlabs.com/rankings?mode=agentic_coding</a></p>
]]></description><pubDate>Tue, 02 Jun 2026 22:38:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=48377284</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=48377284</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48377284</guid></item><item><title><![CDATA[New comment by gertlabs in "My thoughts after using Clojure for about a month"]]></title><description><![CDATA[
<p>I might just be a simpleton -- I never had the resolve to try an ambitious project in Clojure. I was not aware that you could get full OOP, though. What you are describing feels like it's technically possible, but kind of a hack to get inheritance / no type hierarchy enforcement. I'm no expert on the language, though.</p>
]]></description><pubDate>Tue, 02 Jun 2026 22:34:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=48377253</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=48377253</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48377253</guid></item><item><title><![CDATA[New comment by gertlabs in "My thoughts after using Clojure for about a month"]]></title><description><![CDATA[
<p>Success rate includes syntax/compilation failures as well as environment rule violations, and is almost entirely from one-shot code generations. Percentile shows how well the working submissions perform.<p>In long-horizon agentic coding evaluations, strong models fix the syntax, and percentile becomes a direct comparison of which submissions per language performed the best on average. You can filter for that here: <a href="https://gertlabs.com/rankings?provider=openai&mode=agentic_coding" rel="nofollow">https://gertlabs.com/rankings?provider=openai&mode=agentic_c...</a></p>
]]></description><pubDate>Tue, 02 Jun 2026 22:28:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=48377195</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=48377195</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48377195</guid></item><item><title><![CDATA[New comment by gertlabs in "My thoughts after using Clojure for about a month"]]></title><description><![CDATA[
<p>The functional paradigm is a bit uncomfortable at first, but it does make problem solving feel... different. I personally find OOP to be the most intuitive for large scale systems design, but that's just me.<p>Most models do not perform particularly well in Clojure, but OpenAI models fully utilize the power of the language. Subjectively, it kind of seems to match the personality. Data at <a href="https://gertlabs.com/rankings?provider=openai" rel="nofollow">https://gertlabs.com/rankings?provider=openai</a></p>
]]></description><pubDate>Tue, 02 Jun 2026 22:02:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=48376947</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=48376947</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48376947</guid></item><item><title><![CDATA[New comment by gertlabs in "Social Intelligence Benchmark"]]></title><description><![CDATA[
<p>Nice, that's a good one -- interesting dynamics can come out of deceptively simple social games.</p>
]]></description><pubDate>Tue, 02 Jun 2026 07:52:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=48367269</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=48367269</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48367269</guid></item><item><title><![CDATA[Social Intelligence Benchmark]]></title><description><![CDATA[
<p>Article URL: <a href="https://gertlabs.com/blog/social-intelligence-benchmark">https://gertlabs.com/blog/social-intelligence-benchmark</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=48366885">https://news.ycombinator.com/item?id=48366885</a></p>
<p>Points: 5</p>
<p># Comments: 2</p>
]]></description><pubDate>Tue, 02 Jun 2026 06:52:37 +0000</pubDate><link>https://gertlabs.com/blog/social-intelligence-benchmark</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=48366885</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48366885</guid></item><item><title><![CDATA[New comment by gertlabs in "OpenRouter raises $113M Series B"]]></title><description><![CDATA[
<p>OpenRouter is our primary provider for evaluation data, and we've been really happy with them!<p>I'm sure they're experiencing growing pains, but a larger model selection (and faster releases for open weights models) would keep us from using other providers. For example, it took much longer than it should have to get Qwen 3.6 ~30B class models released (almost 2 weeks, if I recall).</p>
]]></description><pubDate>Sat, 30 May 2026 18:42:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=48339400</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=48339400</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48339400</guid></item><item><title><![CDATA[New comment by gertlabs in "LLM Paper Trading"]]></title><description><![CDATA[
<p>That's an interesting point -- they are told that they are paper trading. Maybe we should run another session that A/B tests this.</p>
]]></description><pubDate>Sat, 30 May 2026 07:35:01 +0000</pubDate><link>https://news.ycombinator.com/item?id=48333663</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=48333663</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48333663</guid></item><item><title><![CDATA[New comment by gertlabs in "LLM Paper Trading"]]></title><description><![CDATA[
<p>We're a month into a long-running experiment to see how recent models perform in day trading, where they have a constant harness giving them the ability to write code, access the web, take notes, and install handlers to trade for them. Realistic slippage, fees, and margin requirements are built into the simulation.<p>The harness and the capabilities built into the environment give the models all the resources they would realistically ever have access to in order to help them succeed (or fail).<p>We just tell the models to maximize portfolio balance in the long term. The models are responsible for any good (or bad) ideas.</p>
]]></description><pubDate>Sat, 30 May 2026 07:18:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=48333565</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=48333565</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48333565</guid></item><item><title><![CDATA[LLM Paper Trading]]></title><description><![CDATA[
<p>Article URL: <a href="https://gertlabs.com/spectate?game=trading">https://gertlabs.com/spectate?game=trading</a></p>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=48333564">https://news.ycombinator.com/item?id=48333564</a></p>
<p>Points: 6</p>
<p># Comments: 5</p>
]]></description><pubDate>Sat, 30 May 2026 07:18:14 +0000</pubDate><link>https://gertlabs.com/spectate?game=trading</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=48333564</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48333564</guid></item><item><title><![CDATA[New comment by gertlabs in "Notes from the Mistral AI Now Summit in Paris"]]></title><description><![CDATA[
<p>There is likely a theoretical limit to how much intelligence you can pack into a model of a given size (especially when stretching that over a large input context size).<p>Our evals are pretty complex so we only recently started testing ~30B class models, which are now becoming quite smart (on par with the frontier from 1 year ago). Mistral is far behind, but I'm rooting for them.<p>Data at <a href="https://gertlabs.com/rankings" rel="nofollow">https://gertlabs.com/rankings</a></p>
]]></description><pubDate>Fri, 29 May 2026 19:01:43 +0000</pubDate><link>https://news.ycombinator.com/item?id=48327769</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=48327769</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48327769</guid></item><item><title><![CDATA[New comment by gertlabs in "Claude Opus 4.8"]]></title><description><![CDATA[
<p>That'll populate over the next couple of weeks -- those are the live games on the spectate tab, which take a while to generate statistically worthwhile data. I'm curious how it does. From using it all day, I can say Opus 4.8 is my new favorite model, hands down.</p>
]]></description><pubDate>Fri, 29 May 2026 14:28:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=48323538</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=48323538</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48323538</guid></item><item><title><![CDATA[New comment by gertlabs in "Claude Opus 4.8"]]></title><description><![CDATA[
<p>Appreciate that! Results are live: <a href="https://gertlabs.com/rankings" rel="nofollow">https://gertlabs.com/rankings</a><p>Opus 4.8 is the first tangible improvement since Opus 4.5. And it doesn't seem to have the personality problems of the last release -- I've been enjoying using it.</p>
]]></description><pubDate>Fri, 29 May 2026 03:58:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=48318854</link><dc:creator>gertlabs</dc:creator><comments>https://news.ycombinator.com/item?id=48318854</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=48318854</guid></item></channel></rss>