Hacker News: porridgeraisin

New comment by porridgeraisin in "Introspective Diffusion Language Models"

porridgeraisin — Thu, 16 Apr 2026 09:27:06 +0000

Yeah, I think it's a super neat way to do MTP. Conceptually much more pleasing and simple than existing methods. Especially since this way scaling `k` as models get better will be easier. Wish it had been presented as such.

New comment by porridgeraisin in "Ask HN: Who is using OpenClaw?"

porridgeraisin — Thu, 16 Apr 2026 09:25:15 +0000

Didn't know about qmd.

I use a mix of markdown notes, an sqlite database, and my image store searchable by text. I use immich.

For now I do it manually by giving it skills for each data store I wanna access.

My use cases are all ad hoc; I am not a "pro" user by any means. So I don't mind some manual work.

New comment by porridgeraisin in "Moving a large-scale metrics pipeline from StatsD to OpenTelemetry / Prometheus"

porridgeraisin — Thu, 16 Apr 2026 09:20:10 +0000

Yeah, at previous work we used both as well. The transition from prom to vm was "ongoing" and from the time I joined to the time I left we did parallel writes to both. Never faced issues with either. If I remember correctly, we wrote from services to a kafka queue first, and then a consumer took that and pushed it to (both) the metrics endpoint(s).
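A minimal sketch of that fan-out, assuming an in-memory `queue.Queue` as a stand-in for the Kafka topic and plain callables as stand-ins for the two metrics endpoints. The names `drain` and `Sink` are made up for illustration, not taken from any real client library.

```python
import queue
from typing import Callable, Iterable

# A sink is anything that accepts one metric sample (hypothetical stand-in
# for "push to the prom endpoint" / "push to the vm endpoint").
Sink = Callable[[dict], None]

def drain(q: "queue.Queue[dict]", sinks: Iterable[Sink]) -> int:
    """Forward every queued sample to all sinks; return how many were forwarded."""
    sinks = list(sinks)
    forwarded = 0
    while True:
        try:
            sample = q.get_nowait()
        except queue.Empty:
            return forwarded
        for push in sinks:  # parallel write: same sample goes to every backend
            push(sample)
        forwarded += 1
```

Services only ever talk to the queue, so adding or dropping a backend is a change to the consumer's sink list rather than to every service.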

New comment by porridgeraisin in "Ask HN: Who is using OpenClaw?"

porridgeraisin — Wed, 15 Apr 2026 22:21:02 +0000

Yep. I had posted a comment earlier detailing my usecases. But I too replaced that with my own system that does those same things.

It's way too bloated; it felt like operating the Windows start menu search.

But they have some ideas you might have missed so far. So it's useful to try it out, see which combination of features you use in particular, and then just set those up for yourself with Claude Code or whatever as the LLM harness. Telegram integration is dead easy.

New comment by porridgeraisin in "CRISPR takes important step toward silencing Down syndrome’s extra chromosome"

porridgeraisin — Wed, 15 Apr 2026 17:37:43 +0000

Sure, but given the choice to not have Down syndrome, I'm sure they will choose it. Were they given the choice? Not as a hypothetical, but in front of their eyes.

New comment by porridgeraisin in "Study: Back-to-basics approach can match or outperform AI in language analysis"

porridgeraisin — Wed, 15 Apr 2026 17:36:30 +0000

Tbh, the accuracy of translation, while much better than prior methods, is not that great yet. For Tamil at least.

Microsoft Takes over Norway Stargate Data Center from OpenAI

porridgeraisin — Wed, 15 Apr 2026 13:56:57 +0000

Article URL: https://www.bloomberg.com/news/articles/2026-04-14/microsoft-takes-over-norway-openai-data-center-capacity

Comments URL: https://news.ycombinator.com/item?id=47779021

Points: 3

# Comments: 0

New comment by porridgeraisin in "Introspective Diffusion Language Models"

porridgeraisin — Tue, 14 Apr 2026 19:09:43 +0000

Eh. There is nothing diffusion about this. Nothing to do with denoising. This setup is still purely causal, making it quite a dishonest framing IMO. There is no more introspection here than what happens in MTP + SD setups.

Let me explain what is going on here. This is basically a form of multi-token prediction during training, plus speculative decoding at inference. See my earlier post[1] to understand what that is. TL;DR: in multi-token prediction you train separate LM heads to predict the next token, the next-to-next token, and so on, up to a chosen kth-next token. Training multiple LM heads is expensive and can be unnecessary, so what people typically do is have a common base for all the k heads, explained further in [1]. These guys do another variant.

Here is what they do mechanically, given a sequence p consisting of five tokens, PE([p1, p2, p3, p4, p5]), where PE(.) adds relative position info to each token.

1. Create an augmented sequence PE([p1 MASK MASK MASK MASK]). Do a training pass on that, with the ground truth sequence p1..5. Here it is trained, for example, to predict p3 given p1+pos=-2 MASK+pos=-1 MASK+pos=0, loosely notating.

2. Then separately[2], train it as usual on PE([p1 p2 p3 p4 p5]).

Step (1) teaches it to do multi-token prediction, essentially the single LM head will (very very loosely speaking) condition on the position `k` of the special MASK token and "route" it to the "implicit" k'th LM head.

Step (2) teaches it to be a usual LLM and predict the next token. No MASK tokens involved.

So far, you have trained a multi-token predictor.

Now during inference

You use this for speculative decoding. You generate 5 tokens ahead at once with MASK tokens. And then you run that sequence through the LLM again. This has the same benefits as usual speculative decoding, namely that you can do matrix-matrix multiplication as opposed to matrix-vector. The former is more memory-bandwidth efficient due to higher arithmetic intensity.
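To put rough numbers on the arithmetic intensity point, here is a back-of-the-envelope sketch. It assumes a single square d x d weight matrix, fp16 storage (2 bytes per element), and that weights, inputs and outputs each cross memory once. `arithmetic_intensity` is a made-up helper and ignores caches, attention and everything else in a real serving stack.

```python
def arithmetic_intensity(d: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for a (d x d) weight times a (d x n) activation block.

    n = 1 is the matrix-vector case (one token per step);
    n > 1 is the matrix-matrix case (several positions checked in one pass).
    """
    flops = 2 * d * d * n                               # one multiply + one add per weight per column
    bytes_moved = bytes_per_elem * (d * d + 2 * d * n)  # weights + inputs + outputs
    return flops / bytes_moved

one_token = arithmetic_intensity(4096, 1)    # about 1 FLOP per byte
five_tokens = arithmetic_intensity(4096, 5)  # about 5 FLOPs per byte
```

The weight matrix dominates the bytes moved, so processing 5 positions per pass does roughly 5x the useful work for almost the same memory traffic.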

Here is an example.

Take query = ["what", "is", "2+2"] and prompt = PE([...query, MASK*5]). You run output = LLM(prompt). Say output is ["what", "is", "2+2", "it", "is", "4"]. Note that the NN is trained to predict the kth-next token when faced with positionally encoded MASK tokens, so you get all 5 in one go. To be precise, it learns to predict "4" given ["what", "is", "2+2", MASK, MASK]. Since it does not need the "it" and "is" explicitly, you can do it in parallel with generating the "it" and the "is". Likewise, "is" is predicted given ["what", "is", "2+2", MASK], which also doesn't depend on the explicit "it" being there, and thus can be done in parallel with generating "it", which is just normal next-token generation given the query. You then use this as a draft in your speculative decoding setup.
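A toy sketch of the draft-then-verify loop in the example above. Everything here is an assumption for illustration: the `<MASK>` string, the helper names, and `toy_next_token`, which stands in for the LLM by reading from a fixed reference continuation. It uses the simple greedy acceptance rule (keep draft tokens while they match what the model would have produced); a real setup would do the verification in one batched forward pass rather than a Python loop.

```python
from typing import Callable, List

MASK = "<MASK>"  # hypothetical placeholder token

def build_draft_prompt(query: List[str], k: int) -> List[str]:
    """Append k MASK slots so a single pass can propose several tokens ahead."""
    return query + [MASK] * k

def accept_draft(context: List[str], draft: List[str],
                 next_token: Callable[[List[str]], str]) -> List[str]:
    """Greedy verification: keep draft tokens until the first disagreement.

    On a mismatch, the verifier's own token is kept and the rest of the
    draft is thrown away, so each round still yields at least one token.
    """
    accepted: List[str] = []
    for tok in draft:
        expected = next_token(context + accepted)
        if tok != expected:
            accepted.append(expected)
            break
        accepted.append(tok)
    return accepted

def toy_next_token(seq: List[str]) -> str:
    """Stand-in for normal next-token decoding (assumption, not a real model)."""
    reference = ["what", "is", "2+2", "it", "is", "4"]
    return reference[len(seq)] if len(seq) < len(reference) else "<EOS>"

query = ["what", "is", "2+2"]
good = accept_draft(query, ["it", "is", "4"], toy_next_token)   # whole draft accepted
bad = accept_draft(query, ["it", "was", "4"], toy_next_token)   # cut at the mismatch
```

When the draft is right, three tokens come out of one verification round; when it is wrong at the second token, you still keep "it" plus the corrected "is".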

Their claim is that using a multi-token predictor this way as a draft model works really well. To be clear, this is still causal. The reason diffusion models have hype is that they are capable of global refinement; this is not. In the same thread as [1], I explain how increasing the number of MASK tokens, i.e. increasing `k`, i.e. the number of tokens you predict at once in your multi-token prediction setup, quickly leads to poor quality. This paper agrees with that. They try out k=2,3,4,8 and see a drop in quality at 8 itself. So finally, this is 4-token prediction with self-speculative decoding (sans LayerSkip or such), removing seemingly no existing limitation of such setups. It is definitely an interesting way to train MTP though.

[1] https://news.ycombinator.com/item?id=45221692

[2] Note that it is computationally a single forward pass. Attention masks help you fuse steps 1 and 2 into a single operation. However, you still have 2 separate loss values.

New comment by porridgeraisin in "Missouri town fires half its city council over data center deal"

porridgeraisin — Mon, 13 Apr 2026 18:13:39 +0000

I guess their point is that of all possible industrial use cases, data centers are the least obnoxious one. I live in one of the countries that actually manufactures things, unlike the US, and I find it hard to argue with that. Any noise pollution caused by data centers is far, far less than that of most industrial setups. It's the same with every other resource: water, electricity, effect on local shared infrastructure like roads and commerce, etc. Other industries are an order of magnitude worse.

Given that you _have_ to have some industrial setup unless you want to import everything (tokens, in this case), datacenters are far and away the best choice.

I'll add a qualifier to the above, modifying it to say that of all industrial setups generating at least X dollars of economic value, datacenters are far and away the best in terms of impact on the neighbourhood.

The jobs argument also falls apart when you consider that it's essentially 100 jobs in return for just an office building's worth of space. If you want a thousand-job plant, just build that as well in the next town over; it will take way, way more space and other resources though. The reason that didn't happen even before this datacenter boom is that most manufacturing setups are fairly infeasible in rich countries like the US. I can't imagine the response to a textile plant or a steel plant if this is the response to datacenters.

I agree however, that if you colocate a gigantic power plant, then you get the worst of both worlds. Fewer jobs and the hindrance of a big power plant near residential areas. Grid expansion being slow in developed areas like most of the US is not surprising though.

But this is pretty much the best case scenario. Tolerating the power plant until the grid expands is the way to go I suppose.

New comment by porridgeraisin in "Microsoft isn't removing Copilot from Windows 11, it's just renaming it"

porridgeraisin — Mon, 13 Apr 2026 16:38:03 +0000

The Copilot executable and the Edge executable are actually the same! The binary looks at argv[0] to decide which to show you. You can rename mscopilot.exe to msedge.exe and it still opens Edge, and vice versa.
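For anyone unfamiliar with the trick, here is a minimal sketch of argv[0] dispatch, the same idea multi-call binaries like BusyBox use. This is not Microsoft's code: `pick_mode`, the mode names and the fallback to "edge" for unknown names are all assumptions for illustration.

```python
import ntpath  # Windows path rules, regardless of the OS running this sketch

def pick_mode(argv0: str) -> str:
    """Choose a personality from the name the program was launched under."""
    stem, _ext = ntpath.splitext(ntpath.basename(argv0))
    modes = {"msedge": "edge", "mscopilot": "copilot"}
    # Unknown names fall back to "edge" here (assumption, purely illustrative).
    return modes.get(stem.lower(), "edge")
```

A real program would call `pick_mode(sys.argv[0])` at startup, which is why renaming the file is enough to change what opens.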

New comment by porridgeraisin in "Pro Max 5x quota exhausted in 1.5 hours despite moderate usage"

porridgeraisin — Mon, 13 Apr 2026 14:34:34 +0000

I wanted this as well. Even asked about it at an OpenAI talk. Basically a way to get the KV cache to the client (they can encrypt it if they care about me reverse-engineering it, make a compressed latent if they don't wanna egress 20GB, whatever, I'm fine with a black box) so that I can load it later and avoid these cache misses.

I think the primary reason they cannot do this is that they change the memory and communication layouts in their serving stack rather aggressively. And naturally, keeping the KV cache portable across all such layouts is a very difficult task. So you'd have to version the cache down to a specific deployment, and invalidate it the moment anything even small changes. So giving the user a handle to the cache sort of prevents you from making large changes to memory layout, which I suppose is not that enticing. Also, client-side KV caches are only meaningful with today's 1M contexts. A few years back it wasn't necessary, since just recomputing would be better for everybody.
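A sketch of what "version the cache down to a specific deployment" could look like, to show why it is so brittle. This is purely hypothetical: the layout fields (`tp`, `page_size`, `dtype`), `CacheHandle` and the helper names are invented, and no provider exposes anything like this as far as I know.

```python
import hashlib
import json
from dataclasses import dataclass

def layout_fingerprint(layout: dict) -> str:
    """Hash every layout knob, so any change at all yields a different fingerprint."""
    blob = json.dumps(layout, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

@dataclass(frozen=True)
class CacheHandle:
    session_id: str
    fingerprint: str  # ties the exported cache to one exact deployment layout

def export_handle(session_id: str, layout: dict) -> CacheHandle:
    return CacheHandle(session_id, layout_fingerprint(layout))

def can_reuse(handle: CacheHandle, current_layout: dict) -> bool:
    """The exported cache is only loadable if the serving layout is unchanged."""
    return handle.fingerprint == layout_fingerprint(current_layout)
```

Under this scheme, changing even one knob (say the page size) makes every previously exported handle unusable, which is exactly the constraint a fast-moving serving team would not want.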

To be clear, I don't mean they send it along with every request. Rather, they keep their current TTL cache, and then when I'm at the end of a session, I request it in one shot and close the session. And it doesn't have to come to the literal client; they can egress it to a storage service that we pay for, whatever. But ya, the compat problem makes it all a non-starter.

New comment by porridgeraisin in "Many African families spend fortunes burying their dead"

porridgeraisin — Fri, 10 Apr 2026 08:15:43 +0000

Funerals are huge in India too. They run for 13 days in some communities. To be clear, the actual cremation happens immediately, but the funeral ceremonies continue for 13 days after that.

Most of the expenses are days of one-meal-a-day for guests, and the general extra expenses of having a lot of relatives over at your house. The ceremonies themselves are fairly cheap - it's mostly prayers.

However there is no insurance and so on, since the aforementioned expenses scale with usual QoL.

New comment by porridgeraisin in "Six (and a half) intuitions for KL divergence"

porridgeraisin — Fri, 10 Apr 2026 06:27:05 +0000

> So minimising the cross entropy over theta is the same as maximising KL(P,Q)

Minimising*
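For reference, the standard identity behind this, with P the data distribution and Q_θ the model. The quoted KL(P,Q) is written KL(P ‖ Q_θ) here.

```latex
\begin{align*}
H(P, Q_\theta) &= -\sum_x P(x)\,\log Q_\theta(x) \\
               &= -\sum_x P(x)\,\log P(x) \;+\; \sum_x P(x)\,\log\frac{P(x)}{Q_\theta(x)} \\
               &= H(P) \;+\; \mathrm{KL}(P \,\|\, Q_\theta)
\end{align*}
```

H(P) does not depend on θ, so the θ that minimises the cross entropy is the same θ that minimises the KL divergence.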

New comment by porridgeraisin in "Issue: Claude Code is unusable for complex engineering tasks with Feb updates"

porridgeraisin — Tue, 07 Apr 2026 05:52:37 +0000

IMO, it's an expectations vs reality thing.

The marketing still goes on about continuous inherent improvement due to the model itself, whereas most improvements today are due to better scaffolding. The key now is to build tooling around these LLMs to make them reliably productive - whatever level that may be at.

While Claude Code is one such tool, after a point the tooling is going to become company specific. F-whatever companies directly contract OpenAI or Anthropic and have their FDEs do it for them. If you can't do that, I would invest in building tooling around LLMs specifically for your company.

Note that LLMs are approximate retrieval machines. You still need a planner* and a verifier around it. Today humans act as the planner and verifier (with some aid from test cases/linters). Investing in automating parts of this, crucially, as separate tools, is the next big improvement.

* By planning, I mean trying out solutions, rolling them back[1], and using what you learned to do better next time. The solution search process. Context management also falls under this.

[1] and no, LLMs going "wait no..." doesn't count.

New comment by porridgeraisin in "Got kicked out of uni and had the cops called for a social media website I made"

porridgeraisin — Tue, 07 Apr 2026 05:39:32 +0000

The law is pretty much redundant here. Even if he was connection-less, reserved, etc, the same problem would have happened.

See https://news.ycombinator.com/item?id=47668009

New comment by porridgeraisin in "Got kicked out of uni and had the cops called for a social media website I made"

porridgeraisin — Tue, 07 Apr 2026 05:38:28 +0000

The admin behaviour is expected in the Indian context. See my other comment.

https://news.ycombinator.com/item?id=47668009

New comment by porridgeraisin in "Got kicked out of uni and had the cops called for a social media website I made"

porridgeraisin — Mon, 06 Apr 2026 22:13:21 +0000

Quite the personality this kid, I must say.

The admin behaviour is expected in an Indian context, provided you behave the way this guy did. I am not saying it's good to snatch the guy's phone, but it's expected.

Let me explain the core issue here.

The issue is that if the platform ever devolves into something that can be construed as cyberbullying, then the admin is suddenly in trouble.

In the Indian context, elite public colleges like IITD have some students from quite poor, non-urban backgrounds. These colleges are cheap, have a strict entrance exam (JEE), and have no money requirement, so you have people from all financial strata. As such, the social dynamic is that the parents "entrust" the college with "taking care" of their kid, especially among the first-generation educated. In contrast, in private colleges with homogeneous, richer families, the social dynamic puts more responsibility on the student. The age of 18 is completely irrelevant in this dynamic.

The point is, the admin in this college is also somewhat of a caretaker of the students, and will face social liability for cyberbullying "happening under their nose". This is true even if it happens on Reddit, by the way (and the bully is in the same college). Essentially, if there is a way for the dean to intervene and he doesn't, he has failed in his job as a caretaker. That's the dynamic here. Obviously he has deniability if some random American bullies an IITD kid on, say, HN. But if an IITD kid bullies an IITD kid on any social platform, they will come down on it heavily.

Thus, the platform was never going to work and it's problematic before the law even comes into play. Talking about "tell me what rule I broke" without considering the above social dynamics is fairly immature. If they had done the same thing at say an Ashoka University (expensive private college) then they would have faced none of these issues by contrast. If I'm allowed a swipe at the author, this situation is entirely expected given their privileged background.

New comment by porridgeraisin in "Nvim-treesitter (13K+ Stars) is Archived"

porridgeraisin — Sun, 05 Apr 2026 05:30:36 +0000

This is why I built nvim from source, and git pull plugins into the pack directory. I think it's even a static binary. Whatever changes I need I git pull. After they added LSP I have not wished for anything else really, so I stopped pulling. I think I pulled LSP completion API in 0.11 era but that's it.

Hate it when people break backwards compatibility. For me it's sacrosanct, more important than absolutely anything else.

I only have a handful of plugins so the system works well. And I have a 500 line init.vim (and no other config).

Some ecosystems like golang share this principle, so I can freely update packages without worrying about breakages. But in other ecosystems (nvim, Python, etc.) I'm a lone warrior.

New comment by porridgeraisin in "Embarrassingly simple self-distillation improves code generation"

porridgeraisin — Sat, 04 Apr 2026 16:18:47 +0000

There's an obvious baseline which seems missing

If you sample from the base model with T=1.6, top_k=20, top_p=0.8, i.e. the decode settings used for the distillation's ground truth, does it match the SSD'd model + some decoding? Performance-wise.
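For anyone not used to these knobs, here is a small pure-Python sketch of what sampling with T, top_k and top_p means for a single step. The defaults mirror the settings above. The function names are made up, and the exact order and renormalisation of the top-k and top-p filters differ between libraries, so treat this as one plausible variant rather than what the paper's code does.

```python
import math
import random
from typing import Dict, List

def filtered_probs(logits: List[float], temperature: float = 1.6,
                   top_k: int = 20, top_p: float = 0.8) -> Dict[int, float]:
    """Return {token_id: prob} after temperature scaling, top-k and top-p filtering."""
    scaled = [l / temperature for l in logits]   # T > 1 flattens the distribution
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]     # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    # top-k: keep only the k most likely tokens
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # top-p: keep the smallest prefix whose cumulative mass reaches top_p
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}    # renormalise over the survivors

def sample(logits: List[float], rng: random.Random, **settings) -> int:
    """Draw one token id from the filtered distribution."""
    dist = filtered_probs(logits, **settings)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

The proposed baseline is then just running the untouched base model through `sample` with these settings and comparing it against the SSD'd model on the same benchmarks.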

Their sweep is missing this. And only covers "standard" decoding settings.

New comment by porridgeraisin in "Understanding young news audiences at a time of rapid change"

porridgeraisin — Fri, 03 Apr 2026 20:20:31 +0000

I do the same in my 20s :)

We actually had a newspaper at home and I used to sneak a peek, at least at the sports page. But that stopped during Covid and we never renewed it... No one in the house has missed it so far. And we don't really watch news channels (our TV is just a few streaming subscriptions). At most I see a few headlines in social media recommendations. And I don't use Twitter. Meaning no news. It's great, highly recommended.