<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: XCSme</title><link>https://news.ycombinator.com/user?id=XCSme</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sun, 12 Apr 2026 18:57:15 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=XCSme" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by XCSme in "GLM-5.1: Towards Long-Horizon Tasks"]]></title><description><![CDATA[
<p>If it's relevant to the discussion, I hope not.<p>I've spent probably over 100 hours working on this benchmarking/site platform, and all tests are manually written. For me (and many others who reached out to me), the tests are not useless either. I use the site myself regularly when choosing and comparing new models. I honestly believe it is providing value to the conversation.<p>Let me know if you know of a better platform you can use to compare models; I built this one because I didn't find any with good enough UX.</p>
]]></description><pubDate>Wed, 08 Apr 2026 08:05:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=47686919</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47686919</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47686919</guid></item><item><title><![CDATA[New comment by XCSme in "GLM-5.1: Towards Long-Horizon Tasks"]]></title><description><![CDATA[
<p>General intelligence (not coding) comparison: <a href="https://aibenchy.com/compare/z-ai-glm-5-medium/z-ai-glm-5-1-medium/moonshotai-kimi-k2-5-medium/qwen-qwen3-6-plus-preview-medium/" rel="nofollow">https://aibenchy.com/compare/z-ai-glm-5-medium/z-ai-glm-5-1-...</a></p>
]]></description><pubDate>Wed, 08 Apr 2026 00:12:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=47683008</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47683008</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47683008</guid></item><item><title><![CDATA[New comment by XCSme in "GLM-5.1: Towards Long-Horizon Tasks"]]></title><description><![CDATA[
<p>The (none) version especially shows considerable degradation.</p>
]]></description><pubDate>Wed, 08 Apr 2026 00:11:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=47682994</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47682994</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47682994</guid></item><item><title><![CDATA[New comment by XCSme in "GLM-5.1: Towards Long-Horizon Tasks"]]></title><description><![CDATA[
<p>GLM 5.1 does worse than GLM 5 in my tests[0] (with both medium reasoning and no reasoning).<p>I think the model is now tuned more towards agentic use/coding than general intelligence.<p>[0]: <a href="https://aibenchy.com/compare/z-ai-glm-5-medium/z-ai-glm-5-1-medium/z-ai-glm-5-none/z-ai-glm-5-1-none/" rel="nofollow">https://aibenchy.com/compare/z-ai-glm-5-medium/z-ai-glm-5-1-...</a></p>
]]></description><pubDate>Wed, 08 Apr 2026 00:10:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=47682987</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47682987</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47682987</guid></item><item><title><![CDATA[New comment by XCSme in "Gemma 4 on iPhone"]]></title><description><![CDATA[
<p>Gemma 4 is great: <a href="https://aibenchy.com/compare/google-gemma-4-31b-it-medium/google-gemma-4-26b-a4b-it-medium/google-gemini-3-pro-preview-medium/" rel="nofollow">https://aibenchy.com/compare/google-gemma-4-31b-it-medium/go...</a><p>I assume it is the 26B A4B one, if it runs locally?</p>
]]></description><pubDate>Sun, 05 Apr 2026 21:58:57 +0000</pubDate><link>https://news.ycombinator.com/item?id=47654323</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47654323</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47654323</guid></item><item><title><![CDATA[New comment by XCSme in "The CMS is dead, long live the CMS"]]></title><description><![CDATA[
<p>I tried using Astro for <a href="https://aibenchy.com" rel="nofollow">https://aibenchy.com</a>. Initially it went great, but then I ran into static-website limitations (such as dynamically generating all comparison pages, which would have meant generating N^4 pages, where N is the number of tested models).<p>I ended up switching to plain PHP, and it worked great. It is still mostly "static", but I can dynamically include the same content on multiple pages without having to duplicate/build it every time.</p>
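<p>To make the page-count problem concrete, here is a minimal sketch (in Python, not the site's actual PHP, and the function name is hypothetical) of how many pre-built pages a static site would need if every comparison page covers up to 4 of N tested models, ignoring order:</p>

```python
from math import comb

def comparison_pages(n_models: int, max_compared: int = 4) -> int:
    """Count distinct comparison pages: every unordered choice of
    2 to max_compared models gets its own page."""
    return sum(comb(n_models, k) for k in range(2, max_compared + 1))

# With 50 tested models, a fully pre-rendered site would need
# C(50,2) + C(50,3) + C(50,4) = 251,125 pages.
print(comparison_pages(50))  # → 251125
```

<p>The dominant C(N,4) term grows like N^4/24, which is why pre-rendering every combination becomes impractical and a dynamic include per request is the simpler design.</p>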
]]></description><pubDate>Sun, 05 Apr 2026 21:53:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=47654270</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47654270</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47654270</guid></item><item><title><![CDATA[New comment by XCSme in "Google releases Gemma 4 open models"]]></title><description><![CDATA[
<p>I don't have coding tests yet, but will add them soon.</p>
]]></description><pubDate>Sat, 04 Apr 2026 21:08:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=47643399</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47643399</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47643399</guid></item><item><title><![CDATA[New comment by XCSme in "Google releases Gemma 4 open models"]]></title><description><![CDATA[
<p>Good question! I might add them, but there were multiple reasons:<p>1. Most variants on HIGH/XHIGH provide only marginal improvements in accuracy, but at drastically increased latency and cost. One striking example is Gemini 3.1 Flash Lite, which on High used 1.5M reasoning tokens, and its cost was 5x that of running 5.3-Codex: <a href="https://aibenchy.com/compare/google-gemini-3-1-flash-lite-preview-high/google-gemini-3-1-flash-lite-preview-medium/openai-gpt-5-3-codex-medium/" rel="nofollow">https://aibenchy.com/compare/google-gemini-3-1-flash-lite-pr...</a><p>2. On medium it seems like most models use a similar amount of reasoning tokens, so this should be a fairer comparison.<p>3. Most models in the wild are used on medium (chat apps, default coding apps, tools, etc.).<p>4. Running models on HIGH/XHIGH can lead to huge costs for me maintaining the test suite. I might add more models on high, if I can do it in a sustainable way.<p>5. Running models on HIGH would make the test suites take much longer to run, so the results wouldn't be published as fast.<p>6. Some models even show degradation when used on HIGH, as they tend to overthink/doubt themselves more. This seems to be a trend especially for new models, which were trained to actually say "wait, but" quite a lot...<p>Overall, I am happy with how the current leaderboard/comparisons work. I might test some models on high, but for me, a better indication of the true intelligence of a model/AGI is how well it does with "none"/no reasoning, rather than how well it does with high.</p>
]]></description><pubDate>Sat, 04 Apr 2026 17:15:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=47641076</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47641076</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47641076</guid></item><item><title><![CDATA[New comment by XCSme in "Google releases Gemma 4 open models"]]></title><description><![CDATA[
<p>It does quite well on my limited/not-so-scientific private tests (note the tests don't include coding tests): <a href="https://aibenchy.com/compare/google-gemma-4-31b-it-medium/google-gemini-3-pro-preview-medium/z-ai-glm-5-turbo-medium/" rel="nofollow">https://aibenchy.com/compare/google-gemma-4-31b-it-medium/go...</a></p>
]]></description><pubDate>Thu, 02 Apr 2026 22:29:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=47621025</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47621025</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47621025</guid></item><item><title><![CDATA[New comment by XCSme in "Qwen3.6-Plus: Towards real world agents"]]></title><description><![CDATA[
<p>3.6 Plus seems to be simply a refined/more consistent 3.5 Plus: <a href="https://aibenchy.com/compare/qwen-qwen3-5-plus-02-15-medium/qwen-qwen3-6-plus-medium/" rel="nofollow">https://aibenchy.com/compare/qwen-qwen3-5-plus-02-15-medium/...</a></p>
]]></description><pubDate>Thu, 02 Apr 2026 22:29:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=47621022</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47621022</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47621022</guid></item><item><title><![CDATA[New comment by XCSme in "Google releases Gemma 4 open models"]]></title><description><![CDATA[
<p>Good work, it's quite close to Gemini 3 Pro in my tests, but 10x cheaper:<p><a href="https://aibenchy.com/compare/google-gemma-4-31b-it-medium/google-gemini-3-flash-preview-medium/google-gemini-3-pro-preview-medium/google-gemini-3-1-pro-preview-medium/" rel="nofollow">https://aibenchy.com/compare/google-gemma-4-31b-it-medium/go...</a></p>
]]></description><pubDate>Thu, 02 Apr 2026 22:23:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=47620971</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47620971</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47620971</guid></item><item><title><![CDATA[New comment by XCSme in "$500 GPU outperforms Claude Sonnet on coding benchmarks"]]></title><description><![CDATA[
<p>It's 8.3 vs 8.1, I wouldn't call that significantly better.<p>I think GLM got a bit in front because, on some tests that both got wrong, GLM did sometimes (inconsistently) respond with the correct answer.<p>That being said, yes, in this case gpt-5.4 would probably edge in front as more tests are added, especially if coding tests were added (there are none yet).</p>
]]></description><pubDate>Sat, 28 Mar 2026 01:47:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=47550712</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47550712</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47550712</guid></item><item><title><![CDATA[New comment by XCSme in "$500 GPU outperforms Claude Sonnet on coding benchmarks"]]></title><description><![CDATA[
<p>The questions do ask specifically to respond with the answer only, with an example format given in many cases.<p>Note that all reasoning models are tested with "medium" reasoning.<p>The benchmarks are questions/data-processing tasks that an average user will likely ask, not coding questions (I haven't added any coding tests yet).<p>Gemini models also tend to be very consistent: asking the same question will likely give the same result.<p>The two models you mention scored the same; the only difference is that Gemini was better at domain-specific questions (i.e. when you ask something quite technical/niche).</p>
]]></description><pubDate>Sat, 28 Mar 2026 01:42:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=47550672</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47550672</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47550672</guid></item><item><title><![CDATA[New comment by XCSme in "$500 GPU outperforms Claude Sonnet on coding benchmarks"]]></title><description><![CDATA[
<p>Why not? I described this in more detail in other comments.<p>Even when using structured output, sometimes you want to define how the data should be displayed or formatted, especially for cases like chat bots, article writing, tool usage, calling external APIs, parsing documents, etc.<p>Most models get this right. Also, this is just one failure mode of Claude.</p>
]]></description><pubDate>Fri, 27 Mar 2026 08:34:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=47540252</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47540252</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47540252</guid></item><item><title><![CDATA[New comment by XCSme in "$500 GPU outperforms Claude Sonnet on coding benchmarks"]]></title><description><![CDATA[
<p>The main reason is that Claude models tend to ignore instructions. There is a failure example on the Methodology page.</p>
]]></description><pubDate>Fri, 27 Mar 2026 08:28:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=47540218</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47540218</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47540218</guid></item><item><title><![CDATA[New comment by XCSme in "$500 GPU outperforms Claude Sonnet on coding benchmarks"]]></title><description><![CDATA[
<p>Oh, I didn't think about this, that's a good idea. I also feel that model performance generally changes over time (usually it gets worse).<p>The problem with doing this is cost. Constantly testing a lot of models on a large dataset can get really costly.</p>
]]></description><pubDate>Fri, 27 Mar 2026 08:27:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=47540213</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47540213</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47540213</guid></item><item><title><![CDATA[New comment by XCSme in "$500 GPU outperforms Claude Sonnet on coding benchmarks"]]></title><description><![CDATA[
<p>I used Qwen 3.5 Plus in production; it was really good at instruction following and tool calling.</p>
]]></description><pubDate>Fri, 27 Mar 2026 04:01:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=47539004</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47539004</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47539004</guid></item><item><title><![CDATA[New comment by XCSme in "$500 GPU outperforms Claude Sonnet on coding benchmarks"]]></title><description><![CDATA[
<p>Yup, they do quite poorly on random non-coding tasks:<p><a href="https://aibenchy.com/compare/minimax-minimax-m2-7-medium/moonshotai-kimi-k2-5-medium/z-ai-glm-5-medium/google-gemini-3-1-flash-lite-preview-medium/" rel="nofollow">https://aibenchy.com/compare/minimax-minimax-m2-7-medium/moo...</a></p>
]]></description><pubDate>Fri, 27 Mar 2026 01:38:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=47538122</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47538122</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47538122</guid></item><item><title><![CDATA[New comment by XCSme in "Show HN: Email.md – Markdown to responsive, email-safe HTML"]]></title><description><![CDATA[
<p>But I have to send the same sort of information (albeit shorter) via email on a regular basis.<p>A lot of alerts, reporting, quotes, code snippets, short documentation or step-by-step instructions, etc.<p>I don't just send emails to say "Hey, let's meet at 5". You know the memes with "this could have been an email"; that's usually the case here.<p>Just to be clear, most of those rich emails are the automatic/transactional ones.</p>
]]></description><pubDate>Wed, 25 Mar 2026 11:29:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=47515927</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47515927</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47515927</guid></item><item><title><![CDATA[New comment by XCSme in "Show HN: Email.md – Markdown to responsive, email-safe HTML"]]></title><description><![CDATA[
<p>Why isn't this website plain text then?</p>
]]></description><pubDate>Wed, 25 Mar 2026 11:03:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=47515735</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47515735</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47515735</guid></item></channel></rss>