Hacker News: anonymoushn

New comment by anonymoushn in "Show HN: Actual Claude Tokenizer"

anonymoushn — Mon, 04 May 2026 22:53:57 +0000

I couldn't reproduce this behavior with Sonnet 4, and Sonnet 3.7 has been deprecated since I messed with this stuff. You can try tokenizing the string " "

I think the correct tokenization of the string will not have any tokens that contain mixed punctuation and letters, but the result of this approach does contain such claimed tokens.

New comment by anonymoushn in "Show HN: Actual Claude Tokenizer"

anonymoushn — Mon, 04 May 2026 20:03:30 +0000

That's " 'd ".strip(), an English contraction suffix. It's 1 token, but using this echo approach you will be served the apostrophe and the subsequent letter for the first time in different steps.

New comment by anonymoushn in "Show HN: Actual Claude Tokenizer"

anonymoushn — Tue, 21 Apr 2026 09:40:56 +0000

You can't reliably obtain correct token boundaries with this method. For example, "'d" is 1 token, but the API will return "d" stuck to the next token. Weirdly this seems to be specific to the letter "d". Similar stuff happens around "<". As for all-caps words, some words are in the vocab in all caps, such as MERCHANTABILITY.

New comment by anonymoushn in "Claude Token Counter, now with model comparisons"

anonymoushn — Mon, 20 Apr 2026 05:28:59 +0000

their old tokenizer performed some space collapsing that allowed them to use the same token id for a word with and without the leading space (in cases where the context usually implies a space and one is not present, a "no space" symbol is used).
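
A toy sketch of the space-collapsing idea, not the actual tokenizer: the marker name `<nospace>`, the piece-splitting regex, and the rule "a space is implied before every piece except the first" are all assumptions made for illustration.

```python
import re

NO_SPACE = "<nospace>"  # hypothetical marker name, chosen for this sketch
# A piece is a run of ASCII letters or any single other character,
# optionally preceded by one space that gets folded into the piece.
PIECE = re.compile(r"( ?)([A-Za-z]+|[^A-Za-z])")

def encode(text):
    out = []
    for m in PIECE.finditer(text):
        has_space, piece = m.group(1) == " ", m.group(2)
        if not out:
            # Start of text: no space is implied, so a real one is kept explicitly.
            if has_space:
                out += [" ", NO_SPACE]
        elif not has_space:
            # A space was implied here but is absent, so say so.
            out.append(NO_SPACE)
        out.append(piece)
    return out

def decode(tokens):
    text, suppress = "", True  # nothing is implied before the first piece
    for tok in tokens:
        if tok == NO_SPACE:
            suppress = True
            continue
        text += tok if suppress else " " + tok
        suppress = False
    return text

print(encode("a cat"))   # "cat" appears as the same piece ...
print(encode("(cat)"))   # ... here too, with explicit no-space markers around it
```

In this sketch "cat" is one vocabulary entry whether or not a space precedes it, and the cost moves to the extra marker in contexts where the implied space is missing.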

New comment by anonymoushn in "[dead]"

anonymoushn — Mon, 20 Apr 2026 04:16:15 +0000

Is this the wrong URL? This seems to be a blog post from October 2025 called "Introducing: Local Browser AI"

New comment by anonymoushn in "Issue: Claude Code is unusable for complex engineering tasks with Feb updates"

anonymoushn — Mon, 06 Apr 2026 22:31:14 +0000

How do you guys decide which settings should be configurable via environment variables but not settings files and which settings should be configurable via settings files but not environment variables?

New comment by anonymoushn in "Issue: Claude Code is unusable for complex engineering tasks with Feb updates"

anonymoushn — Mon, 06 Apr 2026 22:23:46 +0000

> One of our product principles is to avoid changing settings on users' behalf

Ideally there wouldn't be silent changes that greatly reduce the utility of the user's session files until they set a newly introduced flag.

I happen to think this is just true in general, but another reason it might be true is that the experience the user has is identical to the experience they would have had if you first introduced the setting, defaulting it to the existing behavior, and then subsequently changed it on users' behalf.

New comment by anonymoushn in "A Faster Alternative to Jq"

anonymoushn — Wed, 01 Apr 2026 15:18:39 +0000

Oh, can you post some benchmarks? I didn't know that parser throughput per core would change with the amount of data like that.

New comment by anonymoushn in "The Claude Code Source Leak: fake tools, frustration regexes, undercover mode"

anonymoushn — Tue, 31 Mar 2026 18:37:10 +0000

why

New comment by anonymoushn in "A Faster Alternative to Jq"

anonymoushn — Fri, 27 Mar 2026 12:44:30 +0000

are those tools known for their fast json parsers?

New comment by anonymoushn in "Sub-Millisecond RAG on Apple Silicon. No Server. No API. One File"

anonymoushn — Tue, 17 Feb 2026 19:54:39 +0000

ideally users could be banned for posting LLM outputs as if they were authored by humans https://www.pangram.com/history/49335ddf-118d-43e4-9340-a58a...

New comment by anonymoushn in "LLM Structured Outputs Handbook"

anonymoushn — Sat, 17 Jan 2026 06:41:57 +0000

Hello, the part about canonical filtering in https://openreview.net/pdf?id=DFybOGeGDS doesn't seem to try to account for pretokenization. For example, if you receive " 天天中彩票APP" in o200k, it means there has to be a lowercase letter within the span of letters, and while tokens like (4 spaces) may be pairwise compatible with tokens like "123" according to the BPE merge rules, the pretokenizer would split the span of spaces to give (3 spaces), " ", "123" instead. Are you aware of any work that does actual canonical generation for models with this kind of pretokenization regex?
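
A minimal sketch of the whitespace case, using a simplified ASCII-only stand-in for an o200k-style pretokenization regex. The pattern below is an assumption written for illustration, not the real o200k pattern; it only keeps the pieces that matter here: digit runs of at most three with no leading space, and the `\s+(?!\S)` alternative that leaves the last space of a run for whatever follows.

```python
import re

# Simplified, ASCII-only approximation (assumption, not the real pattern).
PRETOKENIZE = re.compile(
    r" ?[A-Za-z]+"        # letters, optionally absorbing one leading space
    r"|\d{1,3}"           # digits in groups of up to three, no leading space
    r"| ?[^\sA-Za-z\d]+"  # punctuation, optionally absorbing one leading space
    r"|\s+(?!\S)"         # whitespace run, stopping one short of a non-space
    r"|\s+"               # whatever whitespace is left
)

def pretokenize(text):
    return PRETOKENIZE.findall(text)

# Four spaces followed by digits: the run of spaces is split before BPE ever
# sees it, so a 4-space token cannot appear in the canonical encoding here.
print(pretokenize("    123"))
# Before letters, the last space is absorbed by the word instead.
print(pretokenize("   abc"))
```

A pairwise check on BPE merge rules alone would accept (4 spaces) followed by "123", while a filter that also models the pretokenizer would reject it.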

New comment by anonymoushn in "Ask HN: Cursor (LLM) Costs"

anonymoushn — Tue, 13 Jan 2026 13:09:45 +0000

use claude code if you want to use opus

New comment by anonymoushn in "Show HN: Create LLM-optimized random identifiers"

anonymoushn — Mon, 12 Jan 2026 16:24:50 +0000

what does "logprobs look off" mean

New comment by anonymoushn in "I/O is no longer the bottleneck? (2022)"

anonymoushn — Tue, 06 Jan 2026 12:39:22 +0000

Hello, a couple years ago I participated in a contest to count word frequencies and generate a sorted histogram. There's a cool post about it featuring a video discussing the tricks used by some participants. https://easyperf.net/blog/2022/05/28/Performance-analysis-an...

Some other participants said that they measured 0 difference in runtime between pshufb+eq and eqx3+orx2, but I think your problem has more classes of whitespace, and for the histogram problem, considerations about how to hash all the words in a chunk of the input dominate considerations about how to obtain the bitmasks of word-start or word-end positions.
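
For readers unfamiliar with the two formulations, here is a scalar Python emulation of what each one computes. It is only a sketch of the logic, not SIMD code and not anyone's contest entry; the three whitespace classes (space, tab, newline) and the table filler values are assumptions picked for illustration.

```python
WHITESPACE = (0x20, 0x09, 0x0A)  # assumed classes: space, tab, newline

def build_table():
    # Entry i holds the whitespace byte whose low nibble is i, if any.
    # Other entries get a filler whose low nibble differs from i, so the
    # filler can never equal an input byte that selects it.
    table = [0x02 if i == 1 else 0x01 for i in range(16)]
    for ws in WHITESPACE:
        table[ws & 0x0F] = ws
    return table

TABLE = build_table()

def pshufb(table, indices):
    # Emulates the byte shuffle: index by low nibble, and produce 0 for
    # any lane whose index byte has its high bit set.
    return [0 if i & 0x80 else table[i & 0x0F] for i in indices]

def mask_shuffle_eq(chunk):
    # pshufb + eq: one table lookup, one compare against the input itself.
    looked_up = pshufb(TABLE, chunk)
    return sum(1 << n for n, (b, t) in enumerate(zip(chunk, looked_up)) if b == t)

def mask_eq_or(chunk):
    # eq x3 + or x2: three compares against constants, combined with two ORs.
    return sum(
        1 << n
        for n, b in enumerate(chunk)
        if (b == 0x20) | (b == 0x09) | (b == 0x0A)
    )

chunk = b"hi there\n"
print(bin(mask_shuffle_eq(chunk)), bin(mask_eq_or(chunk)))
```

With more whitespace classes the eq+or version needs one more compare and one more OR per class, while the shuffle version keeps the same two operations as long as the classes have distinct low nibbles.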

New comment by anonymoushn in "Show HN: Steganography in natural language using LLM logit-rank steering"

anonymoushn — Sat, 03 Jan 2026 16:32:11 +0000

requires fully deterministic inference, which turns out to be unusual, but for this sort of thing it's probably fine if you do really slow inference on cpu. cool idea.

New comment by anonymoushn in "Ask HN: Startup launch destroyed by Bolt.new's AI. 10M tokens gone, no response"

anonymoushn — Sun, 21 Dec 2025 04:50:40 +0000

please write your own posts from now on

New comment by anonymoushn in "From text to token: How tokenization pipelines work"

anonymoushn — Tue, 16 Dec 2025 22:29:49 +0000

i love stemming, i love searching for "anime" and getting "animal"

New comment by anonymoushn in "SSE sucks for transporting LLM tokens"

anonymoushn — Sat, 13 Dec 2025 19:05:33 +0000

so sad to hear that about Streaming SIMD Extensions

New comment by anonymoushn in "Leaving Intel"

anonymoushn — Sat, 06 Dec 2025 03:46:01 +0000

This is true economically, but in reality, if you have much larger cost savings than that for sale, then these companies mostly say "we would be happy to buy that for $0 while we pay you a million a year to move to the United States"