<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: XCSme</title><link>https://news.ycombinator.com/user?id=XCSme</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sun, 12 Apr 2026 18:57:15 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=XCSme" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by XCSme in "GLM-5.1: Towards Long-Horizon Tasks"]]></title><description><![CDATA[
<p>If it's relevant to the discussion, I hope not.<p>I've spent probably over 100 hours working on this benchmarking/site platform, and all tests are manually written. For me (and many others who reached out to me), the tests are not useless either. I use the site myself regularly when choosing and comparing new models. I honestly believe it is providing value to the conversation.<p>Let me know if you know of a better platform you can use to compare models; I built this one because I didn't find any with good enough UX.</p>
]]></description><pubDate>Wed, 08 Apr 2026 08:05:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=47686919</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47686919</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47686919</guid></item><item><title><![CDATA[New comment by XCSme in "GLM-5.1: Towards Long-Horizon Tasks"]]></title><description><![CDATA[
<p>General intelligence (not coding) comparison: <a href="https://aibenchy.com/compare/z-ai-glm-5-medium/z-ai-glm-5-1-medium/moonshotai-kimi-k2-5-medium/qwen-qwen3-6-plus-preview-medium/" rel="nofollow">https://aibenchy.com/compare/z-ai-glm-5-medium/z-ai-glm-5-1-...</a></p>
]]></description><pubDate>Wed, 08 Apr 2026 00:12:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=47683008</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47683008</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47683008</guid></item><item><title><![CDATA[New comment by XCSme in "GLM-5.1: Towards Long-Horizon Tasks"]]></title><description><![CDATA[
<p>The (none) version especially shows considerable degradation.</p>
]]></description><pubDate>Wed, 08 Apr 2026 00:11:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=47682994</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47682994</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47682994</guid></item><item><title><![CDATA[New comment by XCSme in "GLM-5.1: Towards Long-Horizon Tasks"]]></title><description><![CDATA[
<p>GLM 5.1 does worse than GLM 5 in my tests[0] (with both medium reasoning and no reasoning).<p>I think the model is now tuned more towards agentic use/coding than general intelligence.<p>[0]: <a href="https://aibenchy.com/compare/z-ai-glm-5-medium/z-ai-glm-5-1-medium/z-ai-glm-5-none/z-ai-glm-5-1-none/" rel="nofollow">https://aibenchy.com/compare/z-ai-glm-5-medium/z-ai-glm-5-1-...</a></p>
]]></description><pubDate>Wed, 08 Apr 2026 00:10:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=47682987</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47682987</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47682987</guid></item><item><title><![CDATA[New comment by XCSme in "Gemma 4 on iPhone"]]></title><description><![CDATA[
<p>Gemma 4 is great: <a href="https://aibenchy.com/compare/google-gemma-4-31b-it-medium/google-gemma-4-26b-a4b-it-medium/google-gemini-3-pro-preview-medium/" rel="nofollow">https://aibenchy.com/compare/google-gemma-4-31b-it-medium/go...</a><p>I assume it is the 26B A4B one, if it runs locally?</p>
]]></description><pubDate>Sun, 05 Apr 2026 21:58:57 +0000</pubDate><link>https://news.ycombinator.com/item?id=47654323</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47654323</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47654323</guid></item><item><title><![CDATA[New comment by XCSme in "The CMS is dead, long live the CMS"]]></title><description><![CDATA[
<p>I tried using Astro for <a href="https://aibenchy.com" rel="nofollow">https://aibenchy.com</a>. Initially it went great, but then I ran into static-website limitations (such as dynamically generating all comparison pages, which would have meant generating N^4 pages, where N is the number of tested models).<p>I ended up switching to plain PHP, and it worked great. It is still mostly "static", but I can dynamically include the same content on multiple pages without having to duplicate/build it every time.</p>
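<p>To make the page-count problem concrete, here is a minimal sketch (in Python, not the site's actual PHP, and the function name is hypothetical) of how many pre-built pages a static site would need if every comparison page covers up to 4 of N tested models, ignoring order:</p>

```python
from math import comb

def comparison_pages(n_models: int, max_compared: int = 4) -> int:
    """Count distinct comparison pages: every unordered choice of
    2 to max_compared models gets its own page."""
    return sum(comb(n_models, k) for k in range(2, max_compared + 1))

# With 50 tested models, a fully pre-rendered site would need
# C(50,2) + C(50,3) + C(50,4) = 251,125 pages.
print(comparison_pages(50))  # → 251125
```

<p>The dominant C(N,4) term grows like N^4/24, which is why pre-rendering every combination becomes impractical and a dynamic include per request is the simpler design.</p>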
]]></description><pubDate>Sun, 05 Apr 2026 21:53:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=47654270</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47654270</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47654270</guid></item><item><title><![CDATA[New comment by XCSme in "Google releases Gemma 4 open models"]]></title><description><![CDATA[
<p>I don't have coding tests yet, but will add them soon.</p>
]]></description><pubDate>Sat, 04 Apr 2026 21:08:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=47643399</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47643399</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47643399</guid></item><item><title><![CDATA[New comment by XCSme in "Google releases Gemma 4 open models"]]></title><description><![CDATA[
<p>Good question! I might add them, but there were multiple reasons:<p>1. Most variants on HIGH/XHIGH provide only marginal improvements in accuracy, but at drastically increased latency and cost. One striking example is Gemini 3.1 Flash Lite, which on High used 1.5M reasoning tokens, and its cost was 5x that of running 5.3-Codex: <a href="https://aibenchy.com/compare/google-gemini-3-1-flash-lite-preview-high/google-gemini-3-1-flash-lite-preview-medium/openai-gpt-5-3-codex-medium/" rel="nofollow">https://aibenchy.com/compare/google-gemini-3-1-flash-lite-pr...</a><p>2. On medium it seems like most models use a similar amount of reasoning tokens, so this should be a fairer comparison.<p>3. Most models in the wild are used on medium (chat apps, default coding apps, tools, etc.).<p>4. Running models on HIGH/XHIGH can lead to huge costs for me maintaining the test suite. I might add more models on high, if I can do it in a sustainable way.<p>5. Running models on HIGH would make the test suites take much longer to run, so the results wouldn't be published as fast.<p>6. Some models even show degradation when used on HIGH, as they tend to overthink/doubt themselves more. This seems to be a trend especially for new models, which were trained to actually say "wait, but" quite a lot...<p>Overall, I am happy with how the current leaderboard/comparisons work. I might test some models on high, but for me, a better indication of the true intelligence of a model/AGI is how well it does with "none"/no reasoning, rather than how well it does with high.</p>
]]></description><pubDate>Sat, 04 Apr 2026 17:15:09 +0000</pubDate><link>https://news.ycombinator.com/item?id=47641076</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47641076</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47641076</guid></item><item><title><![CDATA[New comment by XCSme in "Google releases Gemma 4 open models"]]></title><description><![CDATA[
<p>It does quite well on my limited/not-so-scientific private tests (note the tests don't include coding tests): <a href="https://aibenchy.com/compare/google-gemma-4-31b-it-medium/google-gemini-3-pro-preview-medium/z-ai-glm-5-turbo-medium/" rel="nofollow">https://aibenchy.com/compare/google-gemma-4-31b-it-medium/go...</a></p>
]]></description><pubDate>Thu, 02 Apr 2026 22:29:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=47621025</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47621025</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47621025</guid></item><item><title><![CDATA[New comment by XCSme in "Qwen3.6-Plus: Towards real world agents"]]></title><description><![CDATA[
<p>3.6 Plus seems to be simply a refined/more consistent 3.5 Plus: <a href="https://aibenchy.com/compare/qwen-qwen3-5-plus-02-15-medium/qwen-qwen3-6-plus-medium/" rel="nofollow">https://aibenchy.com/compare/qwen-qwen3-5-plus-02-15-medium/...</a></p>
]]></description><pubDate>Thu, 02 Apr 2026 22:29:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=47621022</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47621022</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47621022</guid></item><item><title><![CDATA[New comment by XCSme in "Google releases Gemma 4 open models"]]></title><description><![CDATA[
<p>Good work, it's quite close to Gemini 3 Pro in my tests, but 10x cheaper:<p><a href="https://aibenchy.com/compare/google-gemma-4-31b-it-medium/google-gemini-3-flash-preview-medium/google-gemini-3-pro-preview-medium/google-gemini-3-1-pro-preview-medium/" rel="nofollow">https://aibenchy.com/compare/google-gemma-4-31b-it-medium/go...</a></p>
]]></description><pubDate>Thu, 02 Apr 2026 22:23:55 +0000</pubDate><link>https://news.ycombinator.com/item?id=47620971</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47620971</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47620971</guid></item><item><title><![CDATA[New comment by XCSme in "$500 GPU outperforms Claude Sonnet on coding benchmarks"]]></title><description><![CDATA[
<p>It's 8.3 vs 8.1, I wouldn't call that significantly better.<p>I think GLM got a bit in front because, on some tests that both got wrong, GLM did sometimes (inconsistently) respond with the correct answer.<p>That being said, yes, in this case gpt-5.4 would probably edge in front as more tests are added, especially if coding tests were added (there are none yet).</p>
]]></description><pubDate>Sat, 28 Mar 2026 01:47:20 +0000</pubDate><link>https://news.ycombinator.com/item?id=47550712</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47550712</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47550712</guid></item><item><title><![CDATA[New comment by XCSme in "$500 GPU outperforms Claude Sonnet on coding benchmarks"]]></title><description><![CDATA[
<p>The questions do ask specifically to respond with the answer only, with an example format given in many cases.<p>Note that all reasoning models are tested with "medium" reasoning.<p>The benchmarks are questions/data-processing tasks that an average user will likely ask, not coding questions (I haven't added any coding tests yet).<p>Gemini models also tend to be very consistent: asking the same question will likely give the same result.<p>The two models you mention scored the same; the only difference is that Gemini was better at domain-specific questions (i.e. when you ask something quite technical/niche).</p>
]]></description><pubDate>Sat, 28 Mar 2026 01:42:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=47550672</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47550672</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47550672</guid></item><item><title><![CDATA[New comment by XCSme in "$500 GPU outperforms Claude Sonnet on coding benchmarks"]]></title><description><![CDATA[
<p>Why not? I described this in more detail in other comments.<p>Even when using structured output, sometimes you want to define how the data should be displayed or formatted, especially for cases like chat bots, article writing, tool usage, calling external APIs, parsing documents, etc.<p>Most models get this right. Also, this is just one failure mode of Claude.</p>
]]></description><pubDate>Fri, 27 Mar 2026 08:34:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=47540252</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47540252</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47540252</guid></item><item><title><![CDATA[New comment by XCSme in "$500 GPU outperforms Claude Sonnet on coding benchmarks"]]></title><description><![CDATA[
<p>The main reason is that Claude models tend to ignore instructions. There is a failure example on the Methodology page.</p>
]]></description><pubDate>Fri, 27 Mar 2026 08:28:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=47540218</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47540218</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47540218</guid></item><item><title><![CDATA[New comment by XCSme in "$500 GPU outperforms Claude Sonnet on coding benchmarks"]]></title><description><![CDATA[
<p>Oh, I didn't think about this, that's a good idea. I also feel that model performance generally changes over time (usually it gets worse).<p>The problem with doing this is cost. Constantly testing a lot of models on a large dataset can get really costly.</p>
]]></description><pubDate>Fri, 27 Mar 2026 08:27:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=47540213</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47540213</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47540213</guid></item><item><title><![CDATA[New comment by XCSme in "$500 GPU outperforms Claude Sonnet on coding benchmarks"]]></title><description><![CDATA[
<p>I used Qwen 3.5 Plus in production; it was really good at instruction following and tool calling.</p>
]]></description><pubDate>Fri, 27 Mar 2026 04:01:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=47539004</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47539004</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47539004</guid></item><item><title><![CDATA[New comment by XCSme in "$500 GPU outperforms Claude Sonnet on coding benchmarks"]]></title><description><![CDATA[
<p>Yup, they do quite poorly on random non-coding tasks:<p><a href="https://aibenchy.com/compare/minimax-minimax-m2-7-medium/moonshotai-kimi-k2-5-medium/z-ai-glm-5-medium/google-gemini-3-1-flash-lite-preview-medium/" rel="nofollow">https://aibenchy.com/compare/minimax-minimax-m2-7-medium/moo...</a></p>
]]></description><pubDate>Fri, 27 Mar 2026 01:38:47 +0000</pubDate><link>https://news.ycombinator.com/item?id=47538122</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47538122</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47538122</guid></item><item><title><![CDATA[New comment by XCSme in "Show HN: Email.md – Markdown to responsive, email-safe HTML"]]></title><description><![CDATA[
<p>But I have to send the same sort of information (albeit shorter) via email on a regular basis.<p>A lot of alerts, reporting, quotes, code snippets, short documentation or step-by-step instructions, etc.<p>I don't just send emails to say "Hey, let's meet at 5". You know the memes with "this could have been an email"; that's usually the case here.<p>Just to be clear, most of those rich emails are the automatic/transactional ones.</p>
]]></description><pubDate>Wed, 25 Mar 2026 11:29:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=47515927</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47515927</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47515927</guid></item><item><title><![CDATA[New comment by XCSme in "Show HN: Email.md – Markdown to responsive, email-safe HTML"]]></title><description><![CDATA[
<p>Why isn't this website plain text then?</p>
]]></description><pubDate>Wed, 25 Mar 2026 11:03:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=47515735</link><dc:creator>XCSme</dc:creator><comments>https://news.ycombinator.com/item?id=47515735</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47515735</guid></item></channel></rss>