Hacker News: halflings

Sampling from LLMs: Art and Science

halflings — Tue, 07 Apr 2026 09:59:24 +0000

Article URL: https://kachkach.com/blog/llm-sampling

Comments URL: https://news.ycombinator.com/item?id=47672828

Points: 1

# Comments: 0

New comment by halflings in "Men are ditching TV for YouTube as AI usage and social media fatigue grow"

halflings — Thu, 02 Apr 2026 11:18:54 +0000

> Youtube charges $10 per month and doesn't produce a single video

It is different from Netflix (which pays upfront for production costs), but there's of course a revenue share, and the bulk of the revenue for creators actually comes from sponsorships (which YT doesn't take a share of).

New comment by halflings in "Generating one token at a time is a blessing in disguise"

halflings — Sun, 29 Mar 2026 20:49:55 +0000

LLMs generate their output one token at a time. The first thought when you learn this is that this is a huge performance bottleneck, as we are used to highly parallelized systems.

However, a large part of what makes LLMs feel so magical comes from this bottleneck.
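To make "one token at a time" concrete, here is a minimal sketch of an autoregressive sampling loop. The vocabulary, `toy_logits`, and all parameter names are hypothetical stand-ins (a real model would replace `toy_logits` with a forward pass); the point is only that each step conditions on everything generated so far.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "<eos>"]  # hypothetical toy vocabulary


def toy_logits(tokens):
    """Hypothetical stand-in for a model forward pass: one score per vocab entry."""
    # The end-of-sequence score grows with length so generation eventually stops.
    return [1.0, 0.5, 0.5, 0.3 * len(tokens)]


def sample_next(logits, temperature, rng):
    """Apply a temperature-scaled softmax, then draw one token index."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(prompt, max_new_tokens=10, temperature=1.0, seed=0):
    """Generate tokens one by one; each step sees all previous tokens."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        idx = sample_next(toy_logits(tokens), temperature, rng)
        if VOCAB[idx] == "<eos>":
            break
        tokens.append(VOCAB[idx])
    return tokens


print(generate(["the"]))
```

Because the loop is sequential, the caller can inspect, stream, or stop the output after any step, which is the flip side of the bottleneck.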

Generating one token at a time is a blessing in disguise

halflings — Sun, 29 Mar 2026 20:49:55 +0000

Article URL: https://kachkach.com/blog/generating-one-token-at-a-time-is-a-blessing-in-disguise

Comments URL: https://news.ycombinator.com/item?id=47567160

Points: 3

# Comments: 1

New comment by halflings in "The Codex App"

halflings — Mon, 02 Feb 2026 20:50:38 +0000

The main thing I noticed in the video is that they have heavily sped up all the code generation sections... seems to be on 5x speed or more. (because people got used to how fast and good Sonnet, and especially Gemini 3.0 Flash, are)

New comment by halflings in "The Codex App"

halflings — Mon, 02 Feb 2026 20:48:04 +0000

Deploying from Antigravity is as easy as, say, connecting the Firebase MCP [1] and asking it "deploy my app to firebase".

[1] https://firebase.google.com/docs/ai-assistance/mcp-server

New comment by halflings in "Python 3.15’s interpreter for Windows x86-64 should hopefully be 15% faster"

halflings — Thu, 25 Dec 2025 18:39:02 +0000

+1, reading through the post, the PR updating the documentation... thanks for being transparent, but also don't be so hard on yourself!

That was a very niche error that you promptly corrected, so no need to be so apologetic about it! And thanks for all the hard work making Python faster!

New comment by halflings in "Getting a Gemini API key is an exercise in frustration"

halflings — Thu, 11 Dec 2025 10:14:23 +0000

"The models perform differently when called via the API vs in the Gemini UI."

This shouldn't be surprising: the model != the product. The same way GPT4o behaves differently from the ChatGPT product when it is using GPT4o.

New comment by halflings in "US vs. Google amicus curiae brief of Y Combinator in support of plaintiffs [pdf]"

halflings — Sun, 11 May 2025 12:01:57 +0000

I would also add that search has already moved elsewhere.

Fewer and fewer people are using search engines to shop, e.g. Amazon makes >$57B a year from search ads, but also look at Temu and Shein, which are mostly glorified product search platforms.

No one is searching for "funny videos" when you can just open Instagram and Tiktok.

The only truly unique thing that search engines can do is queries that are not directly commercial (e.g. education, information seeking, etc.), and competition there is insanely intense (w/ ChatGPT, Perplexity, etc).

New comment by halflings in "Gemma 3 QAT Models: Bringing AI to Consumer GPUs"

halflings — Sun, 20 Apr 2025 13:28:23 +0000

That's what the chart says, yes. 14.1GB VRAM usage for the 27B model.
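As a rough sanity check on that figure, here is a back-of-the-envelope sketch of weight memory alone. It assumes 4-bit weights (an assumption based on these being quantized models, not something stated above) and ignores everything else the chart's number may include, such as activations or the KV cache.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumed precisions for illustration only.
int4_gb = weight_memory_gb(27e9, 4)    # 4-bit quantized weights
bf16_gb = weight_memory_gb(27e9, 16)   # 16-bit weights, for comparison

print(f"27B @ 4-bit:  {int4_gb:.1f} GB")
print(f"27B @ 16-bit: {bf16_gb:.1f} GB")
```

Under these assumptions the 4-bit weights come to about 13.5 GB, which is in the same ballpark as the 14.1GB on the chart.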

New comment by halflings in "Genie 2: A large-scale foundation world model"

halflings — Thu, 05 Dec 2024 10:23:55 +0000

> I cannot be the first person to think about such possibilities

Differentiable Rendering [1] is the closest thing to what you are describing. And yes, people have been working on this for the same reason you outline, it is more data/compute efficient and hence should generalize better.

[1] https://blog.qarnot.com/article/an-overview-of-differentiabl...

But also:

> While cool, this also seems utterly wasteful. Video games offer known "analytical" solutions for the interactions that the model provides as a "statistical approximation", so to say.

A bit of the same debate as people calling LLMs a "blurry JPEG of the web" and hence useless.

Yes, this is a statistical approximation to an analytical problem... but that's a very reductive framing of what is going on. Finding the symbolic/analytical solution here would require constraining the problem greatly: not all things on the screen have a differentiable representation; for example, complex simulations might involve some kind of custom internal loop/simulation.

You waste compute to get a solution that can just be trained on billions of unlabeled (synthetic) examples, and then generalize to previously unseen prompts/environments.

New comment by halflings in "Open source AI is the path forward"

halflings — Tue, 23 Jul 2024 20:50:40 +0000

Training code is only useful to people in academia, and the closest thing to "code you can modify" is open weights.

People are framing this as if it was an open-source hierarchy, with "actual" open-source requiring all training code to be shared. This is not obvious to me, as I'm not asking people that share open-source libraries to also share the tools they used to develop them. I'm also not asking them to share all the design documents/architecture discussion behind this software. It's sufficient that I can take the end result and reshape it in any way I desire.

This is coming from an LLM practitioner that finetunes models for a living; and this constant debate about open-source vs open-weights seems like a huge distraction vs the impact open-sourcing something like Llama has... this is truly a Linux-like moment. (at a much smaller scale of course, for now at least)

New comment by halflings in "Suicide is on the rise for young Americans, with no clear answers"

halflings — Fri, 12 Apr 2024 13:25:05 +0000

> The world is teetering on the edge of world war

The world probably has never been as peaceful as in the last 50 years or so.

Same goes for access to drinkable water, food, decent shelter, gender equality, freedoms, technology, etc.

But I suppose your comment is a good illustration of the problem at hand (that so many people deeply believe that things are fucked).

New comment by halflings in "Groq CEO: 'We No Longer Sell Hardware'"

halflings — Tue, 09 Apr 2024 21:13:36 +0000

Thanks for putting this together! Will give it a watch now

New comment by halflings in "Groq CEO: 'We No Longer Sell Hardware'"

halflings — Tue, 09 Apr 2024 21:09:51 +0000

The # of chips is not the most important metric.

Most important, even ignoring latency, is throughput (tokens) per $$$. And according to their own benchmark [1] (famous last words :)), they're quite cost efficient.

[1] https://www.semianalysis.com/p/groq-inference-tokenomics-spe...
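To illustrate why chip count alone says little, here is a small sketch of the "tokens per dollar" metric. The function name and every number below are made up for illustration; none of them come from the linked benchmark.

```python
def tokens_per_dollar(tokens_per_second: float, dollars_per_hour: float) -> float:
    """Tokens generated for each dollar of serving cost."""
    if dollars_per_hour <= 0:
        raise ValueError("dollars_per_hour must be positive")
    return tokens_per_second * 3600 / dollars_per_hour


# Hypothetical systems: A uses many more chips (higher hourly cost) than B.
system_a = tokens_per_dollar(tokens_per_second=20_000, dollars_per_hour=40.0)
system_b = tokens_per_dollar(tokens_per_second=3_000, dollars_per_hour=10.0)

print(f"A: {system_a:,.0f} tokens/$")
print(f"B: {system_b:,.0f} tokens/$")
```

With these invented numbers, the system that costs four times as much per hour still comes out ahead, because its throughput is more than four times higher.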

New comment by halflings in "Groq CEO: 'We No Longer Sell Hardware'"

halflings — Sun, 07 Apr 2024 23:51:16 +0000

No HBM because they use tons of fast SRAM instead. Isn't that the main driver for performance here?

(the way I understood it => it's still cost effective at scale due to the throughput increase this brings)

New comment by halflings in "Google's First Tensor Processing Unit: Architecture"

halflings — Tue, 26 Mar 2024 03:50:30 +0000

Agree re:hallucinations/safety issues, that was likely one of the main blockers.

And here's the sad part: they had this back in 2019... see this paper released in Jan 2020: https://blog.research.google/2020/01/towards-conversational-...

New comment by halflings in "Google's First Tensor Processing Unit: Architecture"

halflings — Tue, 26 Mar 2024 03:48:15 +0000

This (innovator's dilemma / too afraid of disrupting your own ads business model) is the most common explanation folks are giving for this, but it seems to be some sort of post-rationalization of why such a large company full of competent researchers/engineers would drop the ball this hard.

My read (having seen some of this on the inside) is that it was a mix of being too worried about safety issues (OMG, the chatbot occasionally says something offensive!) and being too complacent (too comfortable with incremental changes in Search, no appetite for launching an entirely new type of product / doing something really out there). There are many ways to monetize a chatbot; OpenAI, for example, is raking in billions in subscription fees.

New comment by halflings in "Pyenv – lets you easily switch between multiple versions of Python"

halflings — Mon, 25 Mar 2024 09:48:42 +0000

uv has been really awesome as a replacement for pip: https://github.com/astral-sh/uv

So fast it finally made virtual environments usable for me. But it's not (yet) a full replacement for conda, i.e. it won't install things other than Python packages.

New comment by halflings in "Show HN: Matrix Multiplication with Half the Multiplications"

halflings — Fri, 15 Mar 2024 11:38:49 +0000

This looks pretty cool! What's the catch? e.g. why isn't this already implemented in accelerators? Is it really just a forgotten algorithm, or does this have implications for the cost of building the accelerator, or something else?