<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: valine</title><link>https://news.ycombinator.com/user?id=valine</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 16 Apr 2026 20:08:20 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=valine" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by valine in "TurboQuant: Redefining AI efficiency with extreme compression"]]></title><description><![CDATA[
<p>Yup, exactly. In principle it helps with both inference speed, by reducing memory bandwidth usage, and the memory footprint of your KV cache.</p>
]]></description><pubDate>Wed, 25 Mar 2026 08:44:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=47514846</link><dc:creator>valine</dc:creator><comments>https://news.ycombinator.com/item?id=47514846</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47514846</guid></item><item><title><![CDATA[New comment by valine in "TurboQuant: Redefining AI efficiency with extreme compression"]]></title><description><![CDATA[
<p>So let’s start with a really simple decoder transformer with a single layer and a single attention head, trained to predict the next token in a sequence of text. To predict the next token you need a few things: a query for the very last token in the sequence, and a key and a value for every prior token. You take your query and compute a dot product with every prior key (two large vectors in, scalar attention score out). That scalar attention score goes through softmax and becomes the weight you use to compute a weighted average of your values; the new value goes through the MLP, and the MLP output is projected into the logits from which you sample your next token (that’s the general idea, at least; I skipped a few steps).<p>The last query in the sequence will be new for every token you predict, but the set of prior keys and values stays the same, i.e., keys and values are reusable. The key-value cache gets bigger with each new token you add to the sequence, and that’s where compression comes in. You have to store the keys and values in VRAM, and you’d like to keep the size down by not storing the raw, uncompressed tensors. To make this work well, your compression needs two things: it needs to be fast, so that you can compress and decompress on the fly, and it needs to play well with softmax attention. Prior attempts at compression usually fall short on one or the other: either decompression is too slow and your tokens/s takes a hit, or you lose important precision and the model’s output quality suffers. The claim in the paper is that they’ve made progress on both.</p>
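The decode step described above can be sketched in a few lines of NumPy. This is a toy illustration with made-up dimensions and random tensors, not the paper's method; it just shows what lives in the KV cache and where the dot products happen:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

d = 8  # head dimension (illustrative)
rng = np.random.default_rng(0)

# Cached keys/values for 5 prior tokens -- this is what sits in VRAM
# and is the target of compression.
K_cache = rng.standard_normal((5, d))
V_cache = rng.standard_normal((5, d))

# Query for the newest token; recomputed on every decode step.
q = rng.standard_normal(d)

# Dot product of the query with every prior key -> scalar scores.
scores = K_cache @ q / np.sqrt(d)
weights = softmax(scores)        # weights sum to 1

# Weighted average of the cached values -> the vector fed to the MLP.
out = weights @ V_cache
```

Each new token appends one row to `K_cache` and `V_cache`, which is why the cache grows linearly with sequence length.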
]]></description><pubDate>Wed, 25 Mar 2026 08:27:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=47514717</link><dc:creator>valine</dc:creator><comments>https://news.ycombinator.com/item?id=47514717</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=47514717</guid></item><item><title><![CDATA[New comment by valine in "Justice Dept. launches criminal investigation of Minnesota governor"]]></title><description><![CDATA[
<p>The allegations of fraud made by the people invading my home and terrorizing my friends? I’ll take those with a grain of salt.<p>And for the record, I’m not afraid of ICE, never said I was. ICE is racially profiling people, arresting them without cause, and deporting them without due process. I happen to be white, so that doesn’t apply in my case. It’s also coincidentally the same reason I feel safe posting a comment like this online. Free speech is being chilled in communities that ICE targets, and I feel a responsibility to relay what I’m hearing.<p>ICE also just gunned down a US citizen in the street, and that should scare everyone.</p>
]]></description><pubDate>Sat, 17 Jan 2026 04:01:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=46655107</link><dc:creator>valine</dc:creator><comments>https://news.ycombinator.com/item?id=46655107</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46655107</guid></item><item><title><![CDATA[New comment by valine in "Justice Dept. launches criminal investigation of Minnesota governor"]]></title><description><![CDATA[
<p>I lived my whole life in the Twin Cities and have a lot of friends, US citizens, who are too scared to go out to eat right now because of the ICE raids. If that wasn’t the point, it is certainly the effect. I applaud Walz and Frey, and I will be ranking Frey first next time he’s up for reelection. Something tells me, though, that he will be on to bigger things than mayor of Minneapolis.</p>
]]></description><pubDate>Sat, 17 Jan 2026 03:31:28 +0000</pubDate><link>https://news.ycombinator.com/item?id=46654972</link><dc:creator>valine</dc:creator><comments>https://news.ycombinator.com/item?id=46654972</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46654972</guid></item><item><title><![CDATA[New comment by valine in "Horses: AI progress is steady. Human equivalence is sudden"]]></title><description><![CDATA[
<p>> About the second point, I've been under the impression that because LLMs are trained on average code, they infer that the bugs and architectural flaws are desirable<p>This is really only true of base models that haven’t undergone post-training. The big difference between ChatGPT and GPT-3 was OpenAI’s instruct fine-tuning. Out of the box, language models behave the way you describe: ask them a question and half the time they generate a list of questions instead of an answer. The primary goal of post-training is to coerce the model into a state in which it’s more likely to output things as if it were a helpful assistant. The simplest version is text at the start of your context window like: “the following code was written by a meticulous senior engineer”. After a prompt like that, the most likely next tokens will never be the model’s imitation of sloppy code. Instruct fine-tuning does the same thing, but as a permanent modification to the weights of the model.</p>
]]></description><pubDate>Tue, 09 Dec 2025 16:44:33 +0000</pubDate><link>https://news.ycombinator.com/item?id=46207118</link><dc:creator>valine</dc:creator><comments>https://news.ycombinator.com/item?id=46207118</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46207118</guid></item><item><title><![CDATA[New comment by valine in "Horses: AI progress is steady. Human equivalence is sudden"]]></title><description><![CDATA[
<p>Humans don’t learn to write messy complex code. Messy, complex code is the default, writing clean code takes skill.<p>You’re assuming the LLM produces extra complexity because it’s mimicking human code. I think it’s more likely that LLMs output complex code because it requires less thought and planning, and LLMs are still bad at planning.</p>
]]></description><pubDate>Tue, 09 Dec 2025 07:10:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=46202116</link><dc:creator>valine</dc:creator><comments>https://news.ycombinator.com/item?id=46202116</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46202116</guid></item><item><title><![CDATA[New comment by valine in "Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text?"]]></title><description><![CDATA[
<p>Image generation and image input are two totally different things. This is about feeding text into LLMs as images; it has nothing to do with image generation.</p>
]]></description><pubDate>Thu, 23 Oct 2025 17:04:18 +0000</pubDate><link>https://news.ycombinator.com/item?id=45684237</link><dc:creator>valine</dc:creator><comments>https://news.ycombinator.com/item?id=45684237</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45684237</guid></item><item><title><![CDATA[New comment by valine in "Show HN: I created a small 2D game about an ant"]]></title><description><![CDATA[
<p>Here's my version; it took about 5 minutes to create inside the ChatGPT web interface.
<a href="https://valine.github.io/vibe-coded-ant-game/" rel="nofollow">https://valine.github.io/vibe-coded-ant-game/</a><p>I don't know if this game was vibe coded, but it certainly could have been. The most notable thing about this post is probably that vibe-coded games are good enough now to fool HN.</p>
]]></description><pubDate>Fri, 19 Sep 2025 05:03:17 +0000</pubDate><link>https://news.ycombinator.com/item?id=45298129</link><dc:creator>valine</dc:creator><comments>https://news.ycombinator.com/item?id=45298129</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45298129</guid></item><item><title><![CDATA[New comment by valine in "Show HN: I created a small 2D game about an ant"]]></title><description><![CDATA[
<p>Both Claude and GPT-5 can one-shot this type of game. The score counter looks exactly like the type of thing Claude spits out.</p>
]]></description><pubDate>Fri, 19 Sep 2025 04:34:42 +0000</pubDate><link>https://news.ycombinator.com/item?id=45297984</link><dc:creator>valine</dc:creator><comments>https://news.ycombinator.com/item?id=45297984</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45297984</guid></item><item><title><![CDATA[New comment by valine in "iPhone Air"]]></title><description><![CDATA[
<p>Same. There are really only two features I care about in a phone: a high refresh rate and low weight. At 165 grams the iPhone Air is by far the lightest 120Hz phone Apple has ever made. Second place is the iPhone 15 Pro at 187 grams. Getting ready to ditch my 15 Pro.</p>
]]></description><pubDate>Tue, 09 Sep 2025 21:09:57 +0000</pubDate><link>https://news.ycombinator.com/item?id=45189189</link><dc:creator>valine</dc:creator><comments>https://news.ycombinator.com/item?id=45189189</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45189189</guid></item><item><title><![CDATA[New comment by valine in "From tokens to thoughts: How LLMs and humans trade compression for meaning"]]></title><description><![CDATA[
<p>Embedding models are not always trained with the rest of the model. That’s the whole idea behind VLLMs: first-layer embeddings are so interchangeable you can literally feed in the output of other models using linear projection layers.<p>And like the other commenter said, you can absolutely feed single tokens through the model. Regardless, though, your point doesn’t make any sense. How about priming the model with “You’re a helpful assistant”, just like everyone else does?</p>
]]></description><pubDate>Thu, 05 Jun 2025 18:00:02 +0000</pubDate><link>https://news.ycombinator.com/item?id=44194068</link><dc:creator>valine</dc:creator><comments>https://news.ycombinator.com/item?id=44194068</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44194068</guid></item><item><title><![CDATA[New comment by valine in "From tokens to thoughts: How LLMs and humans trade compression for meaning"]]></title><description><![CDATA[
<p>>> For each LLM, we extract static, token-level embeddings from its input embedding layer (the ‘E’ matrix). This choice aligns our analysis with the context-free nature of stimuli typical in human categorization experiments, ensuring a comparable representational basis.<p>They're analyzing input embedding matrices, not LLMs. I'm not sure how the authors justify making claims about the inner workings of LLMs when they haven't actually computed a forward pass. The E matrix is not an LLM; it's a lookup table.<p>Just to highlight the ridiculousness of this research: no attention was computed! Not a single dot product between keys and queries. All of their conclusions are drawn from the output of an embedding lookup table.<p>The figure showing their alignment score correlated with model size is particularly egregious. Model size is meaningless when you never activate any model parameters. If BERT is outperforming Qwen and Gemma, something is wrong with your methodology.</p>
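For concreteness, the "extraction" the quoted passage describes amounts to row indexing into a matrix. A toy sketch with invented sizes (the real `E` matrix is just a learned table of shape vocab_size × hidden):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 1000, 64                  # toy sizes
E = rng.standard_normal((vocab_size, hidden))  # stand-in for the input embedding matrix

token_ids = np.array([17, 403, 99])
# "Extracting static token-level embeddings" is row lookup:
static_embeddings = E[token_ids]
# No queries, no keys, no dot products -- none of the model's
# transformer layers are ever executed.
```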
]]></description><pubDate>Thu, 05 Jun 2025 10:21:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=44190202</link><dc:creator>valine</dc:creator><comments>https://news.ycombinator.com/item?id=44190202</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44190202</guid></item><item><title><![CDATA[New comment by valine in "Outcome-Based Reinforcement Learning to Predict the Future"]]></title><description><![CDATA[
<p>So instead of next-token prediction it’s next-event prediction. At some point this just loops around and we’re back to teaching models to predict the next token in the sequence.</p>
]]></description><pubDate>Tue, 27 May 2025 18:26:58 +0000</pubDate><link>https://news.ycombinator.com/item?id=44109425</link><dc:creator>valine</dc:creator><comments>https://news.ycombinator.com/item?id=44109425</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44109425</guid></item><item><title><![CDATA[New comment by valine in "Beyond Semantics: Unreasonable Effectiveness of Reasonless Intermediate Tokens"]]></title><description><![CDATA[
<p>You’re thinking about this like the final layer of the model is all that exists. It’s highly likely reasoning is happening at a lower layer, in a different latent space that can’t natively be projected into logits.</p>
]]></description><pubDate>Sat, 24 May 2025 16:53:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=44082250</link><dc:creator>valine</dc:creator><comments>https://news.ycombinator.com/item?id=44082250</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44082250</guid></item><item><title><![CDATA[New comment by valine in "Beyond Semantics: Unreasonable Effectiveness of Reasonless Intermediate Tokens"]]></title><description><![CDATA[
<p>My personal theory is that it’s an emergent property of many attention heads working together. If each attention head is a bird, reasoning would be the movement of the flock.</p>
]]></description><pubDate>Sat, 24 May 2025 16:47:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=44082206</link><dc:creator>valine</dc:creator><comments>https://news.ycombinator.com/item?id=44082206</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44082206</guid></item><item><title><![CDATA[New comment by valine in "Beyond Semantics: Unreasonable Effectiveness of Reasonless Intermediate Tokens"]]></title><description><![CDATA[
<p>That’s true, yeah. The model can do that because calculating latents is independent of next-token prediction. You do a forward pass for each token in your sequence without the final projection to logits.</p>
]]></description><pubDate>Sat, 24 May 2025 16:43:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=44082177</link><dc:creator>valine</dc:creator><comments>https://news.ycombinator.com/item?id=44082177</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44082177</guid></item><item><title><![CDATA[New comment by valine in "Beyond Semantics: Unreasonable Effectiveness of Reasonless Intermediate Tokens"]]></title><description><![CDATA[
<p>Attention computes a weighted average of all previous latents. So yes, it’s a new token as input to the forward pass, but after it feeds through an attention head it contains a little bit of every previous latent.</p>
]]></description><pubDate>Fri, 23 May 2025 22:29:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=44077178</link><dc:creator>valine</dc:creator><comments>https://news.ycombinator.com/item?id=44077178</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44077178</guid></item><item><title><![CDATA[New comment by valine in "Beyond Semantics: Unreasonable Effectiveness of Reasonless Intermediate Tokens"]]></title><description><![CDATA[
<p>The dimensionality, I suppose, depends on the vocab size and your hidden dimension size, but that’s not really relevant. It’s a single linear projection to go from latents to logits.<p>Reasoning is definitely not happening in the linear projection to logits, if that’s what you mean.</p>
]]></description><pubDate>Fri, 23 May 2025 19:16:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=44075704</link><dc:creator>valine</dc:creator><comments>https://news.ycombinator.com/item?id=44075704</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44075704</guid></item><item><title><![CDATA[New comment by valine in "Beyond Semantics: Unreasonable Effectiveness of Reasonless Intermediate Tokens"]]></title><description><![CDATA[
<p>The lower dimensional logits are discarded, the original high dimensional latents are not.<p>But yeah, the LLM doesn’t even know the sampler exists. I used the last layer as an example, but it’s likely that reasoning traces exist in the latent space of every layer not just the final one, with the most complex reasoning concentrated in the middle layers.</p>
]]></description><pubDate>Fri, 23 May 2025 18:46:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=44075384</link><dc:creator>valine</dc:creator><comments>https://news.ycombinator.com/item?id=44075384</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44075384</guid></item><item><title><![CDATA[New comment by valine in "Beyond Semantics: Unreasonable Effectiveness of Reasonless Intermediate Tokens"]]></title><description><![CDATA[
<p>I think it’s helpful to remember that language models are not producing tokens; they are producing a distribution over possible next tokens. Just because your sampler picks a sequence of tokens that contains incorrect reasoning doesn’t mean a useful reasoning trace isn’t also contained within the latent space.<p>It’s a misconception that transformers reason in token space. Tokens don’t attend to other tokens; high-dimensional latents attend to other high-dimensional latents. The final layer of a decoder-only transformer has full access to all previous latents, the same latents you can project into a distribution over next tokens.</p>
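The latent-versus-token distinction can be sketched like this, with toy sizes and random weights; it only illustrates the final unembedding projection and sampling, not any real model:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D logit vector.
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
hidden, vocab = 64, 1000                        # toy sizes

latent = rng.standard_normal(hidden)            # final-layer latent for one position
W_unembed = rng.standard_normal((hidden, vocab))  # stand-in unembedding matrix

logits = latent @ W_unembed
probs = softmax(logits)          # the full distribution over next tokens

# The sampler collapses the distribution to a single token id, but the
# high-dimensional latent that produced it is what later positions attend to.
next_token = rng.choice(vocab, p=probs)
```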
]]></description><pubDate>Fri, 23 May 2025 17:09:04 +0000</pubDate><link>https://news.ycombinator.com/item?id=44074572</link><dc:creator>valine</dc:creator><comments>https://news.ycombinator.com/item?id=44074572</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=44074572</guid></item></channel></rss>