<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: techbruv</title><link>https://news.ycombinator.com/user?id=techbruv</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 23 Apr 2026 05:56:22 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=techbruv" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by techbruv in "How uv got so fast"]]></title><description><![CDATA[
<p>At a previous job, I recall that updating a dependency via Poetry would take on the order of 5-30 minutes. God forbid something failed to resolve after 30 minutes and you had to wait another 30 to see whether your change fixed the problem. It was not an enjoyable experience.<p>uv has been a delight to use</p>
]]></description><pubDate>Fri, 26 Dec 2025 21:30:25 +0000</pubDate><link>https://news.ycombinator.com/item?id=46396489</link><dc:creator>techbruv</dc:creator><comments>https://news.ycombinator.com/item?id=46396489</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=46396489</guid></item><item><title><![CDATA[New comment by techbruv in "Claude’s memory architecture is the opposite of ChatGPT’s"]]></title><description><![CDATA[
<p>I don’t understand the argument “AI is just XYZ mechanism, therefore it cannot be intelligent”.<p>Does the mechanism really disqualify it from intelligence if, behaviorally, you cannot distinguish it from “real” intelligence?<p>I’m not saying that LLMs have certainly surpassed the “indistinguishable from real intelligence” threshold, but saying there’s not even a little bit of intelligence in a system that can solve more complex math problems than I can seems like a stretch.</p>
]]></description><pubDate>Thu, 11 Sep 2025 20:32:03 +0000</pubDate><link>https://news.ycombinator.com/item?id=45215835</link><dc:creator>techbruv</dc:creator><comments>https://news.ycombinator.com/item?id=45215835</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=45215835</guid></item><item><title><![CDATA[New comment by techbruv in "Better and Faster Large Language Models via Multi-Token Prediction"]]></title><description><![CDATA[
<p>> So it will not get worse in performance but only faster<p>A bit confused by this statement. Speculative decoding does not decrease the performance of the model in terms of "accuracy" or "quality" of output: mathematically, the distribution being sampled from is identical to the one you would get with regular autoregressive decoding. Any variability between autoregressive and speculative decoding comes purely from sampling randomness.<p>Unless you meant performance as in "speed", in which case it's possible that speculative decoding could degrade speed (though on most inputs, and with a good choice of draft model, it shouldn't).</p>
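For concreteness, here's a toy NumPy sketch (my own, not any production implementation) of the accept/reject rule that makes speculative decoding exact for a single drafted token; the distributions p and q are made up for illustration:

```python
import numpy as np

def sample(dist, rng):
    return rng.choice(len(dist), p=dist)

def speculative_step(target_p, draft_p, rng):
    """One accept/reject step of speculative decoding for a single drafted token.
    target_p, draft_p: next-token distributions from the large and draft models.
    The returned token's marginal distribution equals target_p exactly."""
    x = sample(draft_p, rng)                      # draft model proposes token x
    if rng.random() < min(1.0, target_p[x] / draft_p[x]):
        return x                                  # accept: keep the cheap token
    # reject: resample from the residual max(0, p - q), renormalized
    residual = np.maximum(target_p - draft_p, 0.0)
    residual /= residual.sum()
    return sample(residual, rng)

# Empirical check: accept/reject reproduces the target distribution.
rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])   # hypothetical "large model" distribution
q = np.array([0.3, 0.3, 0.4])   # hypothetical "draft model" distribution
counts = np.zeros(3)
for _ in range(100_000):
    counts[speculative_step(p, q, rng)] += 1
print(counts / counts.sum())
```

The residual resampling on rejection is exactly what cancels the draft model's bias, which is why the output distribution matches plain autoregressive sampling.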
]]></description><pubDate>Wed, 01 May 2024 13:42:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=40223112</link><dc:creator>techbruv</dc:creator><comments>https://news.ycombinator.com/item?id=40223112</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=40223112</guid></item><item><title><![CDATA[New comment by techbruv in "What is a transformer model? (2022)"]]></title><description><![CDATA[
<p>Some other good resources:<p>[0]: The original paper: <a href="https://arxiv.org/abs/1706.03762" rel="nofollow noreferrer">https://arxiv.org/abs/1706.03762</a><p>[1]: Full walkthrough for building a GPT from Scratch: <a href="https://www.youtube.com/watch?v=kCc8FmEb1nY">https://www.youtube.com/watch?v=kCc8FmEb1nY</a><p>[2]: A simple inference only implementation in just NumPy, that's only 60 lines: <a href="https://jaykmody.com/blog/gpt-from-scratch/" rel="nofollow noreferrer">https://jaykmody.com/blog/gpt-from-scratch/</a><p>[3]: Some great visualizations and high-level explanations: <a href="http://jalammar.github.io/illustrated-transformer/" rel="nofollow noreferrer">http://jalammar.github.io/illustrated-transformer/</a><p>[4]: An implementation that is presented side-by-side with the original paper: <a href="https://nlp.seas.harvard.edu/2018/04/03/attention.html" rel="nofollow noreferrer">https://nlp.seas.harvard.edu/2018/04/03/attention.html</a></p>
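In the spirit of [2], here is a minimal single-head causal self-attention in plain NumPy (random weights, purely illustrative; shapes and names are my own, not from any of the linked implementations):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal attention: softmax(Q K^T / sqrt(d_k)) V,
    masked so position i can only attend to positions <= i."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)   # block attention to future tokens
    return softmax(scores) @ v

rng = np.random.default_rng(0)
seq, d = 4, 8
x = rng.standard_normal((seq, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```

Note the causal mask means position 0 can only attend to itself, so its output is just its own value projection.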
]]></description><pubDate>Fri, 23 Jun 2023 18:55:57 +0000</pubDate><link>https://news.ycombinator.com/item?id=36450844</link><dc:creator>techbruv</dc:creator><comments>https://news.ycombinator.com/item?id=36450844</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=36450844</guid></item><item><title><![CDATA[New comment by techbruv in "PaLM 2 Technical Report [pdf]"]]></title><description><![CDATA[
<p>The idea that GPT-4 is 1 trillion parameters has been refuted by Sam Altman himself on the Lex Fridman podcast (THIS IS WRONG, SEE CORRECTION BELOW).<p>These days, the largest models that have been trained compute-optimally (in terms of model size w.r.t. tokens) typically hover around 50B parameters (likely PaLM 2-L's size, and LLaMA maxes out at 70B). We simply do not have enough pre-training data to optimally train a 1T parameter model. For GPT-4 to be 1 trillion parameters, OpenAI would have needed to:<p>1) somehow unlock 20x the amount of data (1T tokens -> 20T tokens)
2) somehow engineer an inference engine for a 1T GPT model that is significantly faster than anything anyone else has built
3) somehow be able to eat the cost of hosting a 1T parameter model<p>The probability that all three of the above have happened seems incredibly low.<p>CORRECTION: The refutation on the Lex Fridman podcast was of the claim that GPT-4 is 100T parameters, not 1T (and it wasn't a direct refutation, they were just joking about it). The three points above still stand, however.</p>
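For a sense of scale, a back-of-envelope Chinchilla-style calculation (compute-optimal tokens ~ 20x parameters, training compute ~ 6*N*D FLOPs; the numbers are illustrative heuristics, not anything OpenAI has confirmed):

```python
# Rough compute-optimal scaling heuristics (Chinchilla-style rules of thumb).
def compute_optimal_tokens(n_params, tokens_per_param=20):
    """Tokens needed to train a dense model of n_params compute-optimally."""
    return n_params * tokens_per_param

def training_flops(n_params, n_tokens):
    """Standard dense-transformer approximation: C ~ 6 * N * D FLOPs."""
    return 6 * n_params * n_tokens

for n in (50e9, 175e9, 1e12):
    d = compute_optimal_tokens(n)
    print(f"{n/1e9:>6.0f}B params -> {d/1e12:>5.1f}T tokens, "
          f"{training_flops(n, d):.2e} FLOPs")
```

Under these heuristics a 50B model wants about 1T tokens, while a 1T model wants about 20T tokens, which is the 20x data gap in point 1.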
]]></description><pubDate>Wed, 10 May 2023 19:51:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=35892585</link><dc:creator>techbruv</dc:creator><comments>https://news.ycombinator.com/item?id=35892585</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=35892585</guid></item><item><title><![CDATA[New comment by techbruv in "PaLM 2 Technical Report [pdf]"]]></title><description><![CDATA[
<p>> "We then train several models from 400M to 15B on the same pre-training mixture for up to 1 × 10^22 FLOPs."<p>It seems that for the last year or so these models have been getting smaller. I would be surprised if GPT-4 had more parameters than GPT-3 (i.e. 175B).<p>Edit: Those numbers are just for their scaling-laws study. They don't explicitly state the size of PaLM 2-L, but they do say "The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute." So it is likely in the range of 10B - 100B.</p>
]]></description><pubDate>Wed, 10 May 2023 19:23:51 +0000</pubDate><link>https://news.ycombinator.com/item?id=35892214</link><dc:creator>techbruv</dc:creator><comments>https://news.ycombinator.com/item?id=35892214</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=35892214</guid></item><item><title><![CDATA[New comment by techbruv in "What is ChatGPT doing and why does it work?"]]></title><description><![CDATA[
<p>ChatGPT and other LLMs for that matter are most definitely not using beam search or greedy sampling.<p>Greedy sampling is prone to repetition and just in general gives pretty subpar results that make no sense.<p>While beam search is better than greedy sampling, it's too expensive (beam search with a beam width of 4 is 4x more expensive) and performs worse than other methods.<p>In practice, you probably just wanna sample from the distribution directly after applying something like top-p: <a href="https://arxiv.org/pdf/1904.09751.pdf" rel="nofollow">https://arxiv.org/pdf/1904.09751.pdf</a></p>
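Here's a short NumPy sketch of nucleus (top-p) sampling from that paper (my own toy version, not ChatGPT's actual decoder; the logits are made up for illustration):

```python
import numpy as np

def top_p_sample(logits, p=0.9, temperature=1.0, rng=None):
    """Nucleus (top-p) sampling, as in Holtzman et al. 2019: keep the smallest
    set of tokens whose cumulative probability reaches p, then renormalize."""
    rng = rng if rng is not None else np.random.default_rng()
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # tokens by probability, descending
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1       # smallest nucleus with mass >= p
    keep = order[:cutoff]
    nucleus = probs[keep] / probs[keep].sum()  # renormalize over the nucleus
    return keep[rng.choice(len(keep), p=nucleus)]

rng = np.random.default_rng(0)
logits = np.array([3.0, 2.0, 1.0, -5.0, -5.0])
# With p=0.9, only the top two tokens make it into the nucleus, so the
# low-probability tail can never be sampled.
samples = {int(top_p_sample(logits, p=0.9, rng=rng)) for _ in range(1000)}
print(samples)
```

Cutting the tail like this is exactly what suppresses the degenerate repetition you get from greedy decoding, while still sampling rather than always taking the argmax.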
]]></description><pubDate>Tue, 14 Feb 2023 23:08:38 +0000</pubDate><link>https://news.ycombinator.com/item?id=34797548</link><dc:creator>techbruv</dc:creator><comments>https://news.ycombinator.com/item?id=34797548</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=34797548</guid></item></channel></rss>