Hacker News: bluecoconut

New comment by bluecoconut in "Show HN: I made Google Trends for Hacker News by indexing 18 years of comments"

bluecoconut — Thu, 25 Jun 2026 16:08:42 +0000

Very cool!

One subtle consistency bug made it hard for me to interpret things when I was clicking around: the small thumbnail plot and the full plot often (always?) seem to use different colors.

The blue / orange gets assigned to the opposite labels in the A vs. B when you click, which made it confusing to understand.

New comment by bluecoconut in "Show HN: Got sick of ads, so I made my own logic puzzle site"

bluecoconut — Mon, 22 Jun 2026 16:09:14 +0000

For those who like these types of puzzles, I made a benchmark called Pencil Puzzle Bench.

It tests AI models' ability to solve puzzles like these. https://ppbench.com/

Can play the puzzles and compare your timing and accuracy to many AI models on the leaderboards

Show HN: Pencil Puzzle Bench – LLM Benchmark for Multi-Step Verifiable Reasoning

bluecoconut — Tue, 03 Mar 2026 16:45:18 +0000

I've been working on applying LLMs to long-context, verifiable problems over the past year, and today I'm releasing a benchmark of 62,000 pencil puzzles across 94 types (sudoku, nonori, slitherlink, etc.). The benchmark also supports intermediate checks / rule-break detection for all varieties at any step.

I tested 51 models against a subset (300 puzzles) in two modes: single-shot (output the full solution) and agentic (iterate with verifier feedback).
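As a rough illustration of the agentic mode (iterate with verifier feedback), here is a minimal sketch on a toy Latin-square puzzle. Everything in it is hypothetical and is not the benchmark's code: `rule_breaks` stands in for a step-level verifier, `agentic_solve` for the turn loop, and `make_toy_policy` for the model, which here ignores the feedback text and simply tries values in order.

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]  # 0 marks an empty cell


def rule_breaks(grid: Grid) -> List[str]:
    """Toy verifier: report duplicate values in any row or column."""
    n = len(grid)
    problems = []
    for i in range(n):
        row = [v for v in grid[i] if v]
        col = [grid[r][i] for r in range(n) if grid[r][i]]
        if len(row) != len(set(row)):
            problems.append(f"duplicate in row {i}")
        if len(col) != len(set(col)):
            problems.append(f"duplicate in column {i}")
    return problems


def agentic_solve(grid: Grid, propose: Callable, max_turns: int = 200) -> Tuple[bool, int]:
    """One proposed move per turn; a move that breaks a rule is reverted."""
    feedback: List[str] = []
    for turn in range(1, max_turns + 1):
        if all(all(row) for row in grid):
            return True, turn - 1
        r, c, v = propose(grid, feedback)
        grid[r][c] = v
        feedback = rule_breaks(grid)
        if feedback:
            grid[r][c] = 0  # undo the rule-breaking move
    return all(all(row) for row in grid), max_turns


def make_toy_policy() -> Callable:
    """Stand-in for a model: first empty cell, lowest value not yet tried there."""
    tried = set()

    def propose(grid: Grid, feedback: List[str]) -> Tuple[int, int, int]:
        n = len(grid)
        r, c = next((r, c) for r in range(n) for c in range(n) if grid[r][c] == 0)
        for v in range(1, n + 1):
            if (r, c, v) not in tried:
                tried.add((r, c, v))
                return r, c, v
        raise RuntimeError("toy policy ran out of values (it does not backtrack)")

    return propose


grid = [[0] * 4 for _ in range(4)]
solved, turns = agentic_solve(grid, make_toy_policy())
print(solved, turns, grid)
```

Single-shot mode would correspond to producing the whole grid at once and running the verifier a single time at the end.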

Some results:

- Best model (GPT 5.2@xhigh) solves 56%. (~ half the puzzles are unsolved by any model)

- Agentic solves average 29 turns. The longest attempt took ~1,200 turns over 14 hours.

- Cost per success varies wildly (cheapest: $0.00033 — Grok 4.1 Fast Reasoning, most expensive: $238.16 — Claude Sonnet 4.6 (1M context))

- Reasoning depth (e.g. @medium, @high, @xhigh) dramatically improves capability (up to repeated infrastructure failure for @xhigh)

- Stark difference between US closed models (3 at >33%) and Chinese open models (top: 6%)

I made the website to show off the dataset: you can play every puzzle, and even replay every AI agent solve step by step (fun to watch how it gets to solutions).

Also here's the paper: https://arxiv.org/abs/2603.02119

I didn't test human ability to solve, but it seems these puzzles are pretty difficult. I'd be curious how HN audience fares on the puzzles.

Comments URL: https://news.ycombinator.com/item?id=47235084

Points: 5

# Comments: 0

New comment by bluecoconut in "Cosmologically Unique IDs"

bluecoconut — Wed, 18 Feb 2026 19:29:39 +0000

Fun read.

One upside of the deterministic schemes is they include provenance/lineage. Can literally "trace up" the path the history back to the original ID giver.

Kinda has me curious about how much information is required to represent any arbitrary provenance tree/graph on a network of N-nodes/objects (entirely via the self-described ID)?

(Thinking in the comment: I guess in the worst case, a linear chain, if you assume the full provenance should be recoverable from the ID, that scales as O(N x id_size), so it's quite bad. But assuming the "best case" (any node is expected to be log(N) steps from the root, i.e. a depth of log(N)), it feels like global_id_size = log(N) x local_id_size is roughly the optimal limit? So effectively the size of the global_id grows as log(N)^2? Would that mean that, starting from the 399-bit number, a lower limit for a global_id_size with lineage would be something like (400 bit)^2 = 160,000 bits ~= 20 kB, because it carries the ordered local-id provenance information rather than relying on locally shared knowledge?)
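The arithmetic behind that estimate can be checked with a short sketch. It assumes, as the comment does, that an ID carries one local ID per ancestor and that both the depth and the local ID size are about log2(N); `global_id_bits` is a name made up for illustration.

```python
def global_id_bits(local_id_bits: int, depth: int) -> int:
    """Bits needed if an ID carries one local ID per ancestor on its path to the root."""
    return local_id_bits * depth


local_bits = 400             # the 399-bit ID from the article, rounded up
depth_balanced = local_bits  # assumption: depth ~ log2(N) and local_id_bits ~ log2(N)

bits = global_id_bits(local_bits, depth_balanced)
kilobytes = bits / 8 / 1000

print(bits, "bits =", kilobytes, "kB")
```

Under the worst-case linear chain, `depth` would instead be N, which is the O(N x id_size) scaling mentioned above.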

New comment by bluecoconut in "Attention at Constant Cost per Token via Symmetry-Aware Taylor Approximation"

bluecoconut — Wed, 04 Feb 2026 15:22:00 +0000

I almost feel like this goes opposite to what attention is good at. This would be good at approximating all the places where attention is low / not sharp. Where attention/the exponential is key is when it selects out / needle-in-haystack / winner-takes-all focus (the word "attention" itself sorta implies this), and this is where the Taylor expansion would fail to represent the values well. This just... softens attention's ability to attend?

(I'm imagining that if in the context there are ~4-8 "similar" attention-targets that should be sharp, and regular attention learns to select the correct one, this Taylor approximation version would wash out any difference and they'd all loosely be attended to, and it'd fail to isolate the correct signal)
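A toy version of that worry, as a sketch only: it replaces exp with a plain second-order Taylor polynomial, which is not the paper's symmetry-aware construction, and the scores (one target at 10, seven near-misses at 5) are made up for illustration.

```python
import math
from typing import List


def normalize(weights: List[float]) -> List[float]:
    total = sum(weights)
    return [w / total for w in weights]


def softmax_weights(scores: List[float]) -> List[float]:
    """Standard softmax attention weights."""
    return normalize([math.exp(s) for s in scores])


def taylor2_weights(scores: List[float]) -> List[float]:
    """Weights with exp(s) replaced by 1 + s + s^2/2 (positive for every real s)."""
    return normalize([1 + s + s * s / 2 for s in scores])


# One "correct" target and seven similar distractors (hypothetical logits).
scores = [10.0] + [5.0] * 7

sharp = softmax_weights(scores)[0]   # about 0.95
washed = taylor2_weights(scores)[0]  # about 0.32

print(f"softmax weight on target: {sharp:.3f}")
print(f"taylor-2 weight on target: {washed:.3f}")
```

With these numbers the exponential puts nearly all the mass on the target, while the truncated polynomial spreads most of it over the distractors. Higher-order terms narrow the gap, so this only shows the direction of the effect.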

Really wish this had some downstream tests -- apply it to a pretrained model and see how performance degrades, train a fresh one, etc. The tests are worth doing, but I somehow don't feel that hopeful this is the unlock required for sub-quadratic attention. It's possible that a freshly trained model with this learns to attend without the sharp attention signals, but that seems a bit dubious to me.

But also, maybe this combined with some other selective (sparse attention) trick, means that the hybrid model gets the "fuzzy long tail" of attention well represented as well as the sharpness well represented, and all together it could actually be a part of the larger solution.

New comment by bluecoconut in "Google Titans architecture, helping AI have long-term memory"

bluecoconut — Sun, 07 Dec 2025 16:53:56 +0000

Bytedance is publishing pretty aggressively.

Recently, my favorite from them was lumine: https://arxiv.org/abs/2511.08892

Here's their official page: https://seed.bytedance.com/en/research

New comment by bluecoconut in "DeepSeek OCR"

bluecoconut — Mon, 20 Oct 2025 14:38:30 +0000

A previous paper from DeepSeek mentioned Anna’s Archive.

> We cleaned 860K English and 180K Chinese e-books from Anna’s Archive (Anna’s Archive, 2024) alongside millions of K-12 education exam questions. https://arxiv.org/abs/2403.05525 DeepSeek-VL paper

New comment by bluecoconut in "Building your own CLI coding agent with Pydantic-AI"

bluecoconut — Thu, 28 Aug 2025 19:55:00 +0000

After maintaining my own agents library for a while, I’ve switched over to pydantic ai recently. I have some minor nits, but overall it's been working great for me. I’ve especially liked combining it with langfuse.

On coding agents, I wonder if there are any good / efficient ways to measure how well different implementations work on coding tasks? SWE-bench seems good, but expensive to run. Effectively I’m curious about things like: given tool definition X vs Y (e.g. diff vs full file edit), prompt for tool X vs Y (how it’s described, whether it uses examples), model choice (e.g. MCP with Claude, but python-exec inline with GPT-5), sub-agents, todo lists, etc., how much does each ablation matter? And measure not just success, but cost to success too (efficiency).

Overall, it seems like in the phase space of options, everything “kinda works” but I’m very curious if there are any major lifts, big gotchas, etc.

I ask because it feels like the Claude Code CLI always does a little bit better, subjectively for me, but I haven’t seen an LLM-arena-style ranking or a clear A vs. B comparison or measure.

New comment by bluecoconut in "Yamanot.es: A music box of train station melodies from the JR Yamanote Line"

bluecoconut — Wed, 27 Aug 2025 22:25:57 +0000

The first time I got off at Komagome and heard its tune, I mistakenly thought it was some Halloween special because it was late October at the time, and the song felt so distinct and unique.

New comment by bluecoconut in "Yamanot.es: A music box of train station melodies from the JR Yamanote Line"

bluecoconut — Wed, 27 Aug 2025 22:20:39 +0000

Interestingly, this one seems to be from before 高輪ゲートウェイ (Takanawa Gateway) station, which opened in 2020, but the numbering shows the gap (JY 25 -> JY 27). That led me to look it up, and it turns out that they introduced the numbering in 2016, and that it already came pre-planned with the gap ready [1].

[1] https://www.jreast.co.jp/press/2016/20160402.pdf

New comment by bluecoconut in "Open models by OpenAI"

bluecoconut — Tue, 05 Aug 2025 22:15:46 +0000

Not getting around it, just benefiting from parallel compute / huge flops of GPUs. Fundamentally, it's just that prefill compute is itself highly parallel and HBM is just that much faster than LPDDR. Effectively H100s and B100s can chew through the prefill in under a second at ~50k token lengths, so the TTFT (Time to First Token) can feel amazingly fast.

New comment by bluecoconut in "Open models by OpenAI"

bluecoconut — Tue, 05 Aug 2025 22:11:27 +0000

I was able to get gpt-oss:20b wired up to claude code locally via a thin proxy and ollama.

It's fun that it works, but the prefill time makes it feel unusable. (2-3 minutes per tool-use / completion). Means a ~10-20 tool-use interaction could take 30-60 minutes.

(This was editing a single server.py file that was ~1000 lines; the tool definitions + Claude context came to around 30k input tokens, and after the file read, the input was around ~50k tokens. Definitely could be optimized. Also I'm not sure if ollama supports a kv-cache between invocations of /v1/completions, which could help)

New comment by bluecoconut in "AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms"

bluecoconut — Thu, 15 May 2025 06:18:31 +0000

I've been working on something very similar as a tool for my own AI research -- though I don't have the success they claim. Mine often plateaus on the optimization metric. I think there's secret sauce in the meta-prompting and meta-heuristic comments from the paper that are quite vague, but it makes sense -- it changes the dynamics of the search space and helps the LLM get out of ruts. I'm now going to try to integrate some ideas based off of my interpretation of their work to see how it goes.

If it goes well, I could open source it.

What are the things you would want to optimize with such a framework? (So far I've been focusing on optimizing ML training and architecture search itself). Hearing other ideas would help motivate me to open source if there's real demand for something like this.

New comment by bluecoconut in "Whisky is no longer actively maintained"

bluecoconut — Wed, 09 Apr 2025 12:48:28 +0000

I’ve been using Whisky to play Elden Ring on my M4 MBP and it’s been great! I love that the Game Porting Toolkit and Wine all work so well. I did have to pin Steam to an older version to keep it working recently. I guess I’ll move over to CrossOver soon.

New comment by bluecoconut in "ForeverVM: Run AI-generated code in stateful sandboxes that run forever"

bluecoconut — Wed, 26 Feb 2025 20:37:49 +0000

I tried to do this myself about ~1.5 years ago, but ran into issues with capturing state for sockets and open files (which started to show up when using some data science packages, jupyter widgets, etc.)

What are some of the edge cases where ForeverVM works and doesn't work? I don't see anything in the documentation about installing new packages, do you pre-bake what is available, and how can you see what libraries are available?

I do like that it seems the ForeverVM REPL also captures the state of the local drive (eg. can open a file, write to it, and then read from it).

For context on what I've tried: I used CRIU[1] to make dumps of the process state and would then reload them. It worked for basic things, but ran into the issues stated above, and I abandoned the project. (I was trying to create a stack / undo context for REPLs that LLMs could use, since they often put themselves into bad states, and reverting to previous states seemed useful.) If I remember correctly, I also ran into issues because capturing the various outputs (ipython capture_output concepts) proved to be difficult outside of a jupyter environment, and jupyter environments themselves were even harder to snapshot. In the end I settled for ephemeral but still real-server jupyter kernels where I managed locals() and globals() as a cache via a wrapper, and would re-execute commands in order to rebuild state after the server restarts / crashes. This also allowed me to pip install new packages, so it proved more useful than simply building my image/environment statically. But I did lose the "serialization" property of the machine state, which was something I wanted.
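The "re-execute commands to rebuild state" fallback can be sketched in a few lines. This is a hypothetical minimal version, not the wrapper described above: `ReplaySession` is an invented name, it uses plain `exec` rather than a jupyter kernel, and it ignores side effects outside the namespace (files, sockets, installed packages).

```python
from typing import Any, Dict, List, Optional


class ReplaySession:
    """Toy REPL state that is rebuilt by re-executing its command history."""

    def __init__(self) -> None:
        self.history: List[str] = []
        self.namespace: Dict[str, Any] = {}

    def run(self, source: str) -> None:
        """Execute a command; record it only if it did not raise."""
        exec(source, self.namespace)
        self.history.append(source)

    def rebuild(self, upto: Optional[int] = None) -> None:
        """Start from an empty namespace and replay the first `upto` commands."""
        commands = self.history[:upto]
        self.namespace = {}
        self.history = []
        for source in commands:
            self.run(source)

    def undo(self) -> None:
        """Revert the last command by replaying everything before it."""
        self.rebuild(upto=len(self.history) - 1)


session = ReplaySession()
session.run("x = 1")
session.run("x += 41")
after_two = session.namespace["x"]

session.undo()                      # back to the state after "x = 1"
after_undo = session.namespace["x"]

session.rebuild()                   # what a restart after a crash would do
after_restart = session.namespace["x"]
print(after_two, after_undo, after_restart)
```

Replay cost grows with the length of the history and non-deterministic commands may not reproduce the same state, which is part of why a true snapshot (the lost "serialization" property) is attractive.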

That said, even though I personally abandoned the project, I still hold onto the dream of a full Tree/Graph of VMs (where each edge is code that is executed), and each VM state can be analyzed (files, memory, etc.). Love what ForeverVM is doing and the early promise here.

[1] https://criu.org/Main_Page

New comment by bluecoconut in "Why HNSW is not the answer and disk-based alternatives might be more practical"

bluecoconut — Mon, 23 Dec 2024 19:28:14 +0000

I don’t quite understand this - by 30k pages, is this the number of entries in your index? Did you mean 30M?

At the <100k scale I just do the full inner product computation directly, and I don’t mess with vector stores or added complexity. No ANN algo needed; they’ll all be slower than actual exact kNN re-ranking. (10k × 768 × 4 bytes ≈ 30MB; a scan over it and a sort is on the order of ~100us or faster.) Frankly, I’ve even sent the index to the client at the 5k scale and done that client-side in JS.
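For reference, exact kNN at this scale is only a few lines. The sketch below is pure Python on a small random index so it runs anywhere; the sizes, the seed and the `exact_top_k` name are illustrative assumptions, and a real setup would more likely use a single matrix multiply in NumPy or similar. It does not attempt to reproduce the ~100us figure.

```python
import heapq
import math
import random
from typing import List, Tuple


def unit(vec: List[float]) -> List[float]:
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def exact_top_k(query: List[float], vectors: List[List[float]], k: int) -> List[Tuple[float, int]]:
    """Score every vector by inner product and keep the k best (score, index) pairs."""
    scored = (
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    return heapq.nlargest(k, scored)


random.seed(0)
dim, n = 16, 1000  # kept small for pure Python; the comment discusses 10k x 768
vectors = [unit([random.gauss(0, 1) for _ in range(dim)]) for _ in range(n)]

query = vectors[42]
top = exact_top_k(query, vectors, k=5)
print(top)

# Size of the index discussed in the comment: 10k float32 vectors of 768 dims.
index_bytes = 10_000 * 768 * 4
print(index_bytes / 1e6, "MB")
```

The same function can serve as the final re-ranking step over candidates returned by an ANN index.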

Often, I find I use an ANN algo / index to get my nearest 10k, then I do the final re-ranking with more expensive algorithms/compute in that reduced space.

The original HNSW paper was testing/benchmarking at the 5M-15M scales. That’s where it shines compared to alternatives.

When pushing to the 1B scale (I have an instance at 200M) the memory consumption does become a frustration (100GB of ram usage). Needing to vertically scale nodes that use the index. But it’s still very fast and good. I wouldn’t call it “dangerous” just “expensive”.

Interestingly though, I found that usearch package worked great and let me split and offload indexes into separate files on disk, greatly lowered ram usage and latency is still quite good on average, but has some spikes (eg. sometimes when doing nearest 10k though can be ~1-3 seconds on the 200M dataset)

New comment by bluecoconut in "OpenAI O3 breakthrough high score on ARC-AGI-PUB"

bluecoconut — Fri, 20 Dec 2024 20:10:36 +0000

By my estimates, for this single benchmark, this is comparable cost to training a ~70B model from scratch today. Literally from 0 to a GPT-3 scale model for the compute they ran on 100 ARC tasks.

I double checked with some flop estimates (P100 for 12 hours = Kaggle limit, they claim ~100-1000x for O3-low, and x172 for O3-high) so roughly on the order of 10^22-10^23 flops.

Put another way, using the H100 market price of ~$2 per chip-hour: $350k buys ~175k GPU-hours, or ~10^24 FLOPs in total.

So, huge margin, but 10^22 - 10^24 flop is the band I think we can estimate.
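The dollar-based upper estimate can be reproduced with a few lines. The $350k spend and $2 price are the comment's figures; the H100 throughput of roughly 1e15 FLOP/s (dense BF16 peak, at 100% utilization) is an assumption added here, so the result is an upper bound rather than a measurement.

```python
H100_PEAK_FLOPS = 1e15     # assumption: ~1 PFLOP/s dense BF16 peak, fully utilized
PRICE_PER_GPU_HOUR = 2.0   # USD, market price quoted in the comment
TOTAL_SPEND = 350_000.0    # USD, total benchmark cost used in the comment

gpu_hours = TOTAL_SPEND / PRICE_PER_GPU_HOUR
total_flops = gpu_hours * 3600 * H100_PEAK_FLOPS

print(f"{gpu_hours:,.0f} GPU-hours")
print(f"{total_flops:.1e} FLOPs")
```

At realistic utilization the total would be lower, which is consistent with treating ~10^24 as the top of the band.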

These are the scale of numbers that show up in the chinchilla optimal paper, haha. Truly GPT-3 scale models.

New comment by bluecoconut in "OpenAI O3 breakthrough high score on ARC-AGI-PUB"

bluecoconut — Fri, 20 Dec 2024 19:33:12 +0000

3400 came from counting pixels on the plot.

Also, it's $20 per task for o3-low according to the table for the semi-private set, which x172 is 3440, also coming in close to the 3400 number.

New comment by bluecoconut in "OpenAI O3 breakthrough high score on ARC-AGI-PUB"

bluecoconut — Fri, 20 Dec 2024 19:31:10 +0000

They use some poor language.

"High Efficiency" is O3 Low; "Low Efficiency" is O3 High.

They left the "Low efficiency" (O3 High) values as `-` but you can infer them from the plot at the top.

Note the $20 and $17 per task aligns with the X-axis of the O3-low

New comment by bluecoconut in "OpenAI O3 breakthrough high score on ARC-AGI-PUB"

bluecoconut — Fri, 20 Dec 2024 19:29:10 +0000

Some other important quotes: "Average human off the street: 70-80%. STEM college grad: >95%. Panel of 10 random humans: 99-100%" -@fchollet on X

So, considering that the $3400/task system isn't able to compete with a STEM college grad yet, we still have some room (but it is shrinking; I expect even more compute will be thrown at this and we'll see these barriers broken in the coming years).

Also, some other back of envelope calculations:

The gap in cost is roughly 10^3 between O3 High and avg. Mechanical Turkers (humans). Closing it via pure GPU cost improvement (~doubling every 2-2.5 years) takes ~10 doublings (2^10 ≈ 10^3), which puts us at 20-25 years.
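The 20-25 year figure follows from counting doublings, as this small check shows (the 10^3 gap and the 2-2.5 year doubling time are the comment's own numbers):

```python
import math

cost_gap = 1e3                   # O3 High vs. average human cost, per the comment
doublings = math.log2(cost_gap)  # just under 10 doublings

years_fast = doublings * 2.0     # doubling every 2 years
years_slow = doublings * 2.5     # doubling every 2.5 years

print(f"{doublings:.2f} doublings -> {years_fast:.1f} to {years_slow:.1f} years")
```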

The question is now, can we close this "to human" gap (10^3) quickly with algorithms, or are we stuck waiting for the 20-25 years for GPU improvements. (I think it feels obvious: this is new technology, things are moving fast, the chance for algorithmic innovation here is high!)

I also personally think that we need to adjust our efficiency priors, and start looking not at "humans" as the bar to beat, but at theoretical computable limits (which show much larger gaps, ~10^9-10^15, for modest problems). Though, it may simply be the case that tool/code use + AGI at near human cost covers a lot of that gap.