Hacker News: zackangelo

New comment by zackangelo in "Kimi K2.7-Code: open-source coding model with better token efficiency"

zackangelo — Fri, 12 Jun 2026 17:51:47 +0000

I don't believe safetensors has a native int4 dtype, so they packed 4 int4s into a bf16 in this checkpoint.

New comment by zackangelo in "The real cost of owning a home"

zackangelo — Tue, 26 May 2026 16:58:24 +0000

If you're in SF and weighing this decision, it's easy to get tilted in the buy direction because the rental stock is so horrific. Landlords have very little incentive to update properties or provide basic amenities that people take for granted in other major cities (good luck getting a washer/dryer).

New comment by zackangelo in "Qwen3.7-Max: The Agent Frontier"

zackangelo — Wed, 20 May 2026 15:09:32 +0000

With the 3.5 release, the Plus model was just a rebrand of the open weight 397B. But I suspect that will change going forward. They haven’t released the weights for 3.6 but they did make it available through a few US providers.

New comment by zackangelo in "I’ve joined Anthropic"

zackangelo — Tue, 19 May 2026 17:29:14 +0000

absolutely not, take Kimi K2.6 for a spin

How do agents see your website?

zackangelo — Wed, 13 May 2026 16:11:19 +0000

Article URL: https://what-do-agents-see.runtype.app/

Comments URL: https://news.ycombinator.com/item?id=48123838

Points: 4

# Comments: 0

New comment by zackangelo in "Mistral Medium 3.5"

zackangelo — Wed, 29 Apr 2026 17:56:11 +0000

Isn't Kimi K2.6 natively INT4?

New comment by zackangelo in "HashiCorp co-founder says GitHub 'no longer a place for serious work'"

zackangelo — Wed, 29 Apr 2026 12:54:39 +0000

I don’t think this is true across Blizzard. Overwatch is the best it’s ever been.

New comment by zackangelo in "Parallel agents in Zed"

zackangelo — Wed, 22 Apr 2026 23:35:02 +0000

I give them a try about twice a year. I write a lot of Rust which should be squarely in their wheelhouse.

This last time I was pleasantly surprised to find they mostly fixed their SSH remote editing support. But then it started truncating rustc inline error messages and I couldn’t figure out how to view the whole thing easily. When you’re just trying to get something done little bits like this can add up quickly. Punted back to Cursor for now.

New comment by zackangelo in "Qwen3.6-35B-A3B: Agentic coding power, now open to all"

zackangelo — Thu, 16 Apr 2026 15:00:42 +0000

They are but the IDE needs to be integrated with them.

Qwen specifically calls out FIM (“fill in the middle”) support on the model card and you can see it getting confused and posting the control tokens in the example here.

New comment by zackangelo in "Qwen3.6-35B-A3B: Agentic coding power, now open to all"

zackangelo — Thu, 16 Apr 2026 14:57:47 +0000

17b per token. So when you’re generating a single stream of text (“decoding”) 17b parameters are active.

If you’re decoding multiple streams, it will be 17b per stream (some tokens will use the same expert, so there is some overlap).

When the model is ingesting the prompt (“prefilling”) it’s looking at many tokens at once, so the number of active parameters will be larger.

New comment by zackangelo in "GPU memory snapshots: sub-second startup (2025)"

zackangelo — Sat, 10 Jan 2026 23:56:27 +0000

This uses Nvidia’s CUDA snapshot API under the hood, but you have to pair it with a host side snapshot as well. Modal uses gVisor for this, which is notoriously high overhead.

Does anyone know of a more efficient alternative if you’re running a trusted container?

New comment by zackangelo in "macOS 26.2 enables fast AI clusters with RDMA over Thunderbolt"

zackangelo — Fri, 12 Dec 2025 23:26:39 +0000

You’re right I misunderstood.

I’m not sure if it would be of much utility because this would presumably be for tensor parallel workloads. In that case you want the ranks in your cluster to be uniform or else everything will be forced to run at the speed of the slowest rank.

You could run pipeline parallel but not sure it’d be that much better than what we already have.

New comment by zackangelo in "macOS 26.2 enables fast AI clusters with RDMA over Thunderbolt"

zackangelo — Fri, 12 Dec 2025 23:05:09 +0000

Sparks are built for this and actually have Connect-X 7 NICs built in! You just need to get the SFPs for them. This means you can natively cluster them at 200Gbps.

New comment by zackangelo in "macOS 26.2 enables fast AI clusters with RDMA over Thunderbolt"

zackangelo — Fri, 12 Dec 2025 23:00:10 +0000

No you use tensor parallelism in both cases.

The way it typically works in an attention block is: smaller portions of the Q, K and V linear layers are assigned to each node and are processed independently. Attention, rope norm etc is run on the node-specific output of that. Then, when the output linear layer is applied an "all reduce" is computed which combines the output of all the nodes.

EDIT: just realized it wasn't clear -- this means that each node ends up holding a portion of the KV cache specific to its KV tensor shards. This can change based on the specific style of attention (e.g., in GQA where there are fewer KV heads than ranks you end up having to do some replication etc)

New comment by zackangelo in "Kimi K2 Thinking, a SOTA open-source trillion-parameter reasoning model"

zackangelo — Fri, 07 Nov 2025 03:41:43 +0000

What 1T parameter base model have you seen from any of those labs?

New comment by zackangelo in "NVIDIA DGX Spark In-Depth Review: A New Standard for Local AI Inference"

zackangelo — Wed, 15 Oct 2025 03:39:58 +0000

Wouldn't you be able to test nccl if you had 2 of these?

New comment by zackangelo in "Launch HN: LlamaFarm (YC W22) – Open-source framework for distributed AI"

zackangelo — Wed, 08 Oct 2025 16:48:47 +0000

Just a bit of feedback:

> Instead of one brittle giant, we orchestrate a Mixture of Experts…

“mixture of experts” is a specific term of art that describes an architectural detail of a type of transformer model. It’s definitely not using smaller specialized models for individual tasks. Experts in an MoE model are actually routed to on a per token basis, not on a per task or per generation basis.

I know it’s tempting to co-opt this term because it would fit nicely for what you’re trying to do but it just adds confusion.

New comment by zackangelo in "Apps SDK"

zackangelo — Mon, 06 Oct 2025 20:36:35 +0000

Because it depends on how much better “best” is. If it’s only incrementally better than open source models that have other advantages, why would you bother?

OpenAI’s moat will only come from the products they built on top. Theoretically their products will be better because they’ll be more vertically integrated with the underlying models. It’s not unlike Apple’s playbook with regard to hardwares and software integration.

New comment by zackangelo in "From multi-head to latent attention: The evolution of attention mechanisms"

zackangelo — Sat, 30 Aug 2025 17:23:33 +0000

Not quite a frontier model but definitely built by a frontier lab: Grok 2 was recently open sourced and I believe it uses a fairly standard MHA architecture with MoE.

New comment by zackangelo in "Mosh Mobile Shell"

zackangelo — Thu, 28 Aug 2025 17:08:14 +0000

I feel a bit silly for not noticing this before. Over the last year or so I've often wondered when ssh added protocol-level support for session resume. I'd open my laptop on a new network and everything would be ready to go. But of course, it's nothing to do with ssh, it's just that I started using tailscale.