<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: yorwba</title><link>https://news.ycombinator.com/user?id=yorwba</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Fri, 17 Apr 2026 12:49:55 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=yorwba" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by yorwba in "中文 Literacy Speedrun II: Character Cyclotron"]]></title><description><![CDATA[
<p>> A guy on a forum had hired a calligrapher to write three thousand characters in ballpoint pen<p>A shame that this amazing resource is not linked.</p>
]]></description><pubDate>Fri, 17 Apr 2026 11:31:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=47804773</link><dc:creator>yorwba</dc:creator><comments>https://news.ycombinator.com/item?id=47804773</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47804773</guid></item><item><title><![CDATA[New comment by yorwba in "Ask HN: What are the machine requirements for a LLM like Llama-3.1-8B?"]]></title><description><![CDATA[
<p><a href="https://www.canirun.ai" rel="nofollow">https://www.canirun.ai</a> can help you find a model that'll run on your hardware.</p>
]]></description><pubDate>Fri, 17 Apr 2026 08:10:30 +0000</pubDate><link>https://news.ycombinator.com/item?id=47803602</link><dc:creator>yorwba</dc:creator><comments>https://news.ycombinator.com/item?id=47803602</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47803602</guid></item><item><title><![CDATA[New comment by yorwba in "Android CLI: Build Android apps 3x faster using any agent"]]></title><description><![CDATA[
<p>I'm pretty sure this will just call itself in a loop. You need to use the absolute path to the wrapped binary to distinguish it from the wrapper.</p>
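For illustration, here's a minimal sketch of that resolution step in Python (the binary name "adb" and all paths are hypothetical, not taken from the tool being discussed):

```python
import os

def find_real_binary(name, wrapper_dir, path=None):
    """Resolve the wrapped binary by scanning PATH while skipping the
    wrapper's own directory, so the wrapper never exec's itself."""
    path = os.environ.get("PATH", "") if path is None else path
    for d in path.split(os.pathsep):
        if os.path.abspath(d) == os.path.abspath(wrapper_dir):
            continue  # this entry is the wrapper itself: skip to avoid the self-call loop
        candidate = os.path.join(d, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return os.path.abspath(candidate)
    return None
```

The wrapper would then exec the returned absolute path directly instead of re-invoking the bare name through another PATH lookup.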
]]></description><pubDate>Fri, 17 Apr 2026 07:18:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=47803307</link><dc:creator>yorwba</dc:creator><comments>https://news.ycombinator.com/item?id=47803307</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47803307</guid></item><item><title><![CDATA[New comment by yorwba in "Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7"]]></title><description><![CDATA[
<p>If all models are trained on the benchmark data, you cannot extrapolate the benchmark scores to performance on unseen data, but the ranking of different models still tells you something. A model that solves 95/98 benchmark problems may turn out much worse than that in real life, but probably not much worse than the one that only solved 11/98 despite training on the benchmark problems.<p>This doesn't hold if some models trained on the benchmark and some didn't, but you can fix this by deliberately fine-tuning all models for the benchmark before comparing them. For more in-depth discussion of this, see <a href="https://mlbenchmarks.org/11-evaluating-language-models.html#ranking-agreement" rel="nofollow">https://mlbenchmarks.org/11-evaluating-language-models.html#...</a></p>
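The point that rankings can survive uniform contamination can be sketched with a toy simulation (all numbers here are hypothetical, chosen only to illustrate the argument):

```python
import random

random.seed(0)

def benchmark_score(true_skill, memorized_frac=0.5, n=98):
    """Toy model: a fixed fraction of the n benchmark problems was seen in
    training and is solved by recall; the rest are solved with probability
    equal to the model's true skill. Contamination inflates every model's
    score, but the gap between models still tracks the skill gap."""
    solved = 0
    for _ in range(n):
        if random.random() < memorized_frac:
            solved += 1  # contaminated problem: solved by memorization
        elif random.random() < true_skill:
            solved += 1  # genuinely solved
    return solved

strong = benchmark_score(true_skill=0.8)
weak = benchmark_score(true_skill=0.1)
print(strong, weak)  # both inflated well above true skill, yet strong > weak
```

Both scores overstate real-world performance, but the ranking is preserved because the memorization boost applies to both models alike.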
]]></description><pubDate>Thu, 16 Apr 2026 21:55:45 +0000</pubDate><link>https://news.ycombinator.com/item?id=47800022</link><dc:creator>yorwba</dc:creator><comments>https://news.ycombinator.com/item?id=47800022</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47800022</guid></item><item><title><![CDATA[New comment by yorwba in "Germany suspends military approval for long stays abroad for men under 45"]]></title><description><![CDATA[
<p>The previous major war was in Afghanistan with 150,000 German soldiers: <a href="https://www.dw.com/en/germany-honors-soldiers-who-fought-in-afghanistan-mission/a-59492974" rel="nofollow">https://www.dw.com/en/germany-honors-soldiers-who-fought-in-...</a></p>
]]></description><pubDate>Thu, 16 Apr 2026 09:03:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=47790486</link><dc:creator>yorwba</dc:creator><comments>https://news.ycombinator.com/item?id=47790486</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47790486</guid></item><item><title><![CDATA[New comment by yorwba in "Cybersecurity looks like proof of work now"]]></title><description><![CDATA[
<p>I think you took away the wrong lesson from that podcast:<p><i>I think there is work to be done on scaffolding the models better. This exponential right now reminds me of the exponential from CPU speeds going up until let’s say 2000 or something where you had these game developers who would develop really impressive games on the current thing of hardware and they do it by writing like really detailed intricate x86 instruction sequences for like just exactly whatever this, like, you know, whatever 486 can do, knowing full well that in 2 years, you know, the pen team is gonna be able to do this much faster and they didn’t need to do it. But like you need to do it now because you wanna sell your game today and like, yeah, you can’t just like wait and like have everyone be able to do this. And so I do think that there definitely is value in squeezing out all of the last little juice that you can from the current model.</i><p>Everything you can do today will <i>eventually</i> be obsoleted by some future technology, but if you need better results <i>today</i>, you actually have to do the work. If you drop everything and wait for the singularity, you're just going to unnecessarily cap your potential in the meantime.</p>
]]></description><pubDate>Thu, 16 Apr 2026 08:29:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=47790252</link><dc:creator>yorwba</dc:creator><comments>https://news.ycombinator.com/item?id=47790252</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47790252</guid></item><item><title><![CDATA[New comment by yorwba in "Germany suspends military approval for long stays abroad for men under 45"]]></title><description><![CDATA[
<p>The suspension doesn't change whether you get drafted or not, it just reduces peacetime bureaucracy at the expense of making a future draft more chaotic if it does happen.</p>
]]></description><pubDate>Thu, 16 Apr 2026 07:55:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=47790008</link><dc:creator>yorwba</dc:creator><comments>https://news.ycombinator.com/item?id=47790008</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47790008</guid></item><item><title><![CDATA[New comment by yorwba in "Ask HN: Who is using OpenClaw?"]]></title><description><![CDATA[
<p>It's massive in the sense of people hyping it on social media and grifters trying to profit from it. Pure FOMO, not dissimilar from the earlier "earning a side income using ChatGPT" hype. I doubt there are many people using it successfully for any purpose other than producing social media content promoting courses that promise to teach people how to use OpenClaw, for a fee of course.</p>
]]></description><pubDate>Wed, 15 Apr 2026 22:04:39 +0000</pubDate><link>https://news.ycombinator.com/item?id=47785937</link><dc:creator>yorwba</dc:creator><comments>https://news.ycombinator.com/item?id=47785937</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47785937</guid></item><item><title><![CDATA[New comment by yorwba in "The M×N problem of tool calling and open-source models"]]></title><description><![CDATA[
<p>Yes, typically the tags used for tool calls get their own special tokens, e.g. <a href="https://huggingface.co/google/gemma-4-E4B-it/blob/main/tokenizer_config.json" rel="nofollow">https://huggingface.co/google/gemma-4-E4B-it/blob/main/token...</a></p>
]]></description><pubDate>Tue, 14 Apr 2026 14:56:15 +0000</pubDate><link>https://news.ycombinator.com/item?id=47766481</link><dc:creator>yorwba</dc:creator><comments>https://news.ycombinator.com/item?id=47766481</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47766481</guid></item><item><title><![CDATA[New comment by yorwba in "The M×N problem of tool calling and open-source models"]]></title><description><![CDATA[
<p>Each text token already represents the activation of certain neurons. There is nothing "more direct." And you cannot fully separate data and metadata if you want them to influence the output. At best you can clearly distinguish them and hope that this is enough for the model to learn to treat them differently.</p>
]]></description><pubDate>Tue, 14 Apr 2026 13:33:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=47765448</link><dc:creator>yorwba</dc:creator><comments>https://news.ycombinator.com/item?id=47765448</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47765448</guid></item><item><title><![CDATA[New comment by yorwba in "Alibaba's Qwen family captures over 50% of global open-source model downloads"]]></title><description><![CDATA[
<p>The report doesn't even count downloads from ModelScope.cn, the Chinese HuggingFace competitor.</p>
]]></description><pubDate>Tue, 14 Apr 2026 08:53:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=47763009</link><dc:creator>yorwba</dc:creator><comments>https://news.ycombinator.com/item?id=47763009</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47763009</guid></item><item><title><![CDATA[New comment by yorwba in "N-Day-Bench – Can LLMs find real vulnerabilities in real codebases?"]]></title><description><![CDATA[
<p>Yeah, the LLM judge is a bit too gullible. GLM 5.1 here <a href="https://ndaybench.winfunc.com/traces/trace_585887808ff443cca3b5773edb4286f8">https://ndaybench.winfunc.com/traces/trace_585887808ff443cca...</a> claims that onnx/checker.cc doesn't reject hardlinks, even though it does (and the model output even quotes the lines that perform the check). The actual patch <a href="https://github.com/onnx/onnx/commit/4755f8053928dce18a61db8fec71b69c74f786cb" rel="nofollow">https://github.com/onnx/onnx/commit/4755f8053928dce18a61db8f...</a> instead adds a check using std::filesystem::weakly_canonical to catch path traversal through symlinks. It also adds a Python function that does the same (?) checks when saving files. Honestly, even that patch seems LLM-generated to me, the way it duplicates code in a bunch of places instead of channeling all file accesses through a single hardened function.<p>Anyway, GLM 5.1 gets a score of 93 for its incorrect report.</p>
]]></description><pubDate>Tue, 14 Apr 2026 08:19:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=47762801</link><dc:creator>yorwba</dc:creator><comments>https://news.ycombinator.com/item?id=47762801</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47762801</guid></item><item><title><![CDATA[New comment by yorwba in "All elementary functions from a single binary operator"]]></title><description><![CDATA[
<p>Yes, metamath uses a large collection of specialized but reusable building blocks, so it doesn't blow up exponentially. However, if you want to "just do gradient descent" on general trees built from a single universal primitive, you now have to rediscover all those building blocks on the fly. And while the final result may have a compact representation as a DAG by merging common subexpressions, you also need to be able to represent potential alternative solutions, and that's where the exponential blowup comes in.<p>Or you could accept that there's already a large collection of known useful special functions, and work with shallower trees of those instead, e.g. <a href="https://arxiv.org/abs/1905.11481" rel="nofollow">https://arxiv.org/abs/1905.11481</a></p>
]]></description><pubDate>Mon, 13 Apr 2026 12:56:06 +0000</pubDate><link>https://news.ycombinator.com/item?id=47751296</link><dc:creator>yorwba</dc:creator><comments>https://news.ycombinator.com/item?id=47751296</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47751296</guid></item><item><title><![CDATA[New comment by yorwba in "Opus 4.6 hallucinates twice as more today than when it released"]]></title><description><![CDATA[
<p>Yeah, of those 6 tasks, only "halluc-doc-http-handler" isn't within 1% of the previous result. 86.6% is 13/15 rounded down, so if they sampled 15 attempts for that task, the probability of getting 100% when the true success rate was 13/15 would be (13/15)^15 > 0.11, which is not all that unlikely.</p>
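A quick sanity check of that arithmetic:

```python
# If the true per-attempt success rate is 13/15 (86.6% rounded down),
# the probability that all 15 independently sampled attempts pass anyway:
p_all_pass = (13 / 15) ** 15
print(round(p_all_pass, 4))  # 0.1169, comfortably above 0.11
```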
]]></description><pubDate>Mon, 13 Apr 2026 10:27:34 +0000</pubDate><link>https://news.ycombinator.com/item?id=47750092</link><dc:creator>yorwba</dc:creator><comments>https://news.ycombinator.com/item?id=47750092</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47750092</guid></item><item><title><![CDATA[New comment by yorwba in "Apple's accidental moat: How the "AI Loser" may end up winning"]]></title><description><![CDATA[
<p>I think you underestimate the amount of knowledge needed to deal with the complexities of language in general as opposed to specific applications. We had algorithms to do complex mathematical reasoning before we had LLMs, the drawback being that they require input in restricted formal languages. Removing that restriction is what LLMs brought to the table.<p>Once the difficult problem of figuring out what the input is supposed to mean was somewhat solved, bolting on reasoning was <i>easy</i> in comparison. It basically fell out with just a bit of prompting, "let's think step by step."<p>If you want to remove that knowledge to shrink the model, we're back to contorting our input into a restricted language to get the output we want, i.e. programming.</p>
]]></description><pubDate>Mon, 13 Apr 2026 08:19:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=47749251</link><dc:creator>yorwba</dc:creator><comments>https://news.ycombinator.com/item?id=47749251</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47749251</guid></item><item><title><![CDATA[New comment by yorwba in "Silicon Valley is quietly running on Chinese open source models"]]></title><description><![CDATA[
<p>OpenRouter has a list of providers: <a href="https://openrouter.ai/minimax/minimax-m2.5" rel="nofollow">https://openrouter.ai/minimax/minimax-m2.5</a></p>
]]></description><pubDate>Sun, 12 Apr 2026 08:44:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=47737409</link><dc:creator>yorwba</dc:creator><comments>https://news.ycombinator.com/item?id=47737409</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47737409</guid></item><item><title><![CDATA[New comment by yorwba in "Silicon Valley is quietly running on Chinese open source models"]]></title><description><![CDATA[
<p>There are US-based companies offering inference for MiniMax models charging slightly less than what MiniMax charges. MiniMax themselves claim to be using data centers in the US. US companies training their own closed-weight models charge so much more because they can. They're monopoly providers for their own models, so they can ask for whatever amount people are willing to pay.</p>
]]></description><pubDate>Sun, 12 Apr 2026 08:18:21 +0000</pubDate><link>https://news.ycombinator.com/item?id=47737255</link><dc:creator>yorwba</dc:creator><comments>https://news.ycombinator.com/item?id=47737255</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47737255</guid></item><item><title><![CDATA[New comment by yorwba in "Small models also found the vulnerabilities that Mythos found"]]></title><description><![CDATA[
<p>When people criticize Aisle's methodology, they aren't "defending Mythos," they're bashing Aisle for their disingenuous claims.</p>
]]></description><pubDate>Sun, 12 Apr 2026 08:03:00 +0000</pubDate><link>https://news.ycombinator.com/item?id=47737155</link><dc:creator>yorwba</dc:creator><comments>https://news.ycombinator.com/item?id=47737155</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47737155</guid></item><item><title><![CDATA[New comment by yorwba in "Show HN: Pardonned.com – A searchable database of US Pardons"]]></title><description><![CDATA[
<p>The first time was under the name Adriana Shayota: <a href="https://www.justice.gov/usao-sdca/pr/federal-jury-convicts-siblings-fraud-defendants-made-tens-millions-dollars-lying-0" rel="nofollow">https://www.justice.gov/usao-sdca/pr/federal-jury-convicts-s...</a></p>
]]></description><pubDate>Sat, 11 Apr 2026 18:56:11 +0000</pubDate><link>https://news.ycombinator.com/item?id=47733070</link><dc:creator>yorwba</dc:creator><comments>https://news.ycombinator.com/item?id=47733070</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47733070</guid></item><item><title><![CDATA[New comment by yorwba in "Small models also found the vulnerabilities that Mythos found"]]></title><description><![CDATA[
<p>We don't even need to hypothesize that much on the irrelevant nonsense, since they helpfully provide data with the detected vulnerability patched: <a href="https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jagged-frontier#patched-freebsd-sensitivity-vs-specificity-appendix-patched-freebsd" rel="nofollow">https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jag...</a> and half of the small models they touted as finding the vulnerability still found it in the patched code in 3/3 runs. A model that finds a vulnerability 100% of the time even when there is none is just as informative as a model that finds a vulnerability 0% of the time even when there is one. You could replace it with a rock that has "There's a vulnerability somewhere" engraved on it.<p>They're a company selling a system for detecting vulnerabilities reliant on models trained by others, so they're strongly incentivized to claim that the moat is in the system, not the model, and this post really puts a thumb on the scale. They set up a test that can hardly distinguish between models (just three runs, really??) unless some are completely broken or work perfectly, the test indeed suggests that some are completely broken, and then they try to spin it as a win anyway!<p>A high false-positive rate isn't necessarily an issue if you can produce a working PoC to demonstrate the true positives, though they kinda-sorta admit that you might need a stronger model for that (a.k.a. what they can't provide to their customers).<p>Overall I rate Aisle intellectually dishonest hypemongers talking their own book.</p>
]]></description><pubDate>Sat, 11 Apr 2026 18:11:14 +0000</pubDate><link>https://news.ycombinator.com/item?id=47732723</link><dc:creator>yorwba</dc:creator><comments>https://news.ycombinator.com/item?id=47732723</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47732723</guid></item></channel></rss>