Hacker News: sanchitmonga22

New comment by sanchitmonga22 in "Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon"

sanchitmonga22 — Wed, 11 Mar 2026 15:06:22 +0000

Please check out our main repo: https://github.com/RunanywhereAI/runanywhere-sdks/

We are running anywhere, hence the name RunAnywhere. MetalRT is the fastest inference engine we have built for Apple Silicon, and we'll be covering other edge devices as well. All of edge is about to hit warp speed!

New comment by sanchitmonga22 in "Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon"

sanchitmonga22 — Wed, 11 Mar 2026 15:04:10 +0000

Yes, that's the plan. MetalRT will ship as part of the RunAnywhere SDK so other developers can integrate it into their own apps. We're working on making that available. If you want to be in the early access group, drop me a line at founder@runanywhere.ai or open an issue on the RCLI repo. Happy to look at your project.

New comment by sanchitmonga22 in "Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon"

sanchitmonga22 — Wed, 11 Mar 2026 15:03:42 +0000

That's a fair read. Tool calling reliability with sub-4B models is genuinely the hardest unsolved problem in on-device AI right now.

The inference engine (MetalRT) is production-grade, the pipeline architecture is solid, but the models at this size are still the weak link for complex tool routing. Larger model support (where tool calling is much more reliable) is next on the roadmap. Please stay tuned!

New comment by sanchitmonga22 in "Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon"

sanchitmonga22 — Wed, 11 Mar 2026 15:02:47 +0000

That tracks with what we've seen too. For agent workflows with reliable tool calling, you really do need the larger models. Larger model support is a priority for us. Thanks for the data point.

New comment by sanchitmonga22 in "Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon"

sanchitmonga22 — Wed, 11 Mar 2026 15:02:13 +0000

Fair criticism. Our benchmarks are on small models because MetalRT was built for the voice pipeline use case, where decode latency on 0.6B-4B models is the bottleneck.

You're right that the bigger opportunity on Apple Silicon is large models that don't fit on consumer GPUs. Expanding MetalRT to 7B, 14B, 32B+ is on the roadmap. MetalRT's architectural advantages should matter even more at that scale, where everything becomes memory-bandwidth-bound.

We'll publish benchmarks on larger models as we add support. If you have a specific model/size you'd want to see first, that helps us prioritize.

New comment by sanchitmonga22 in "Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon"

sanchitmonga22 — Wed, 11 Mar 2026 15:00:53 +0000

Good correction, thanks. You're right that NAX and ANE are distinct; I shouldn't have conflated them. NAX's ability to accelerate LLM prefill is exactly the kind of capability that could complement MetalRT's decode-focused pipeline. Appreciate the clarification on the Metal 4 / Tahoe requirement too.

New comment by sanchitmonga22 in "Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon"

sanchitmonga22 — Wed, 11 Mar 2026 04:56:12 +0000

Yes, mobile is our primary offering and it is on the roadmap. The same Metal GPU pipeline that powers MetalRT on macOS maps directly to iOS (same Apple Silicon, same Metal API).

New comment by sanchitmonga22 in "Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon"

sanchitmonga22 — Wed, 11 Mar 2026 03:27:12 +0000

Agreed for a lot of use cases. RCLI supports text-only mode (--no-speak flag or just type in the TUI instead of using push-to-talk). TTS makes sense for hands-free / eyes-free scenarios, but we don't force it.

New comment by sanchitmonga22 in "Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon"

sanchitmonga22 — Wed, 11 Mar 2026 03:25:34 +0000

We use AI tools in our workflow, same as a lot of teams at this point. The pipeline architecture, Metal integration, and engine design are ours. The code is MIT and open for anyone to read and judge the quality directly.

New comment by sanchitmonga22 in "Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon"

sanchitmonga22 — Wed, 11 Mar 2026 03:23:56 +0000

RCLI includes local RAG out of the box. You can ingest PDFs, DOCX, and plain text, then query by voice or text:

  rcli rag ingest ~/Documents/notes
  rcli ask --rag ~/Library/RCLI/index "summarize the project plan"

It uses hybrid retrieval (vector + BM25 with Reciprocal Rank Fusion) and runs at ~4ms over 5K+ chunks. Embeddings are computed locally with Snowflake Arctic, so nothing leaves your machine.
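For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal sketch of how a vector ranking and a BM25 ranking can be merged. This is not RCLI's code: the function name, the chunk IDs, and the constant k=60 (a commonly used default) are assumptions chosen for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of IDs; each ID scores sum(1 / (k + rank)) over the lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


vector_hits = ["c3", "c1", "c7"]  # hypothetical ranking from vector search
bm25_hits = ["c1", "c9", "c3"]    # hypothetical ranking from BM25
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Chunks that rank well in both lists (here c1 and c3) rise to the top, which is the point of fusing by rank rather than by raw scores that live on different scales.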

New comment by sanchitmonga22 in "Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon"

sanchitmonga22 — Wed, 11 Mar 2026 03:22:29 +0000

Fair point. The install script shouldn't silently install Homebrew without explicit consent. We'll update it to detect when Homebrew is missing and prompt the user before installing anything beyond RCLI itself.
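The consent check described above could look roughly like the sketch below. It is written in Python only for illustration; the real install script is not shown in this thread, and the function name and prompt wording are hypothetical.

```python
import shutil


def should_install_homebrew(brew_path, ask):
    """Return True only when Homebrew is missing AND the user explicitly agrees.

    brew_path: result of looking up the `brew` executable (None if not found).
    ask: a callable like input() that shows a prompt and returns the reply.
    """
    if brew_path is not None:
        return False  # Homebrew already present, nothing to install
    reply = ask("Homebrew was not found. Install it now? [y/N] ")
    return reply.strip().lower() in ("y", "yes")


# Typical interactive use (not executed here because it would wait for input):
#   should_install_homebrew(shutil.which("brew"), input)
```

Defaulting to "no" on an empty reply means nothing beyond RCLI itself gets installed unless the user opts in.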

In the meantime, if you already have Homebrew, you can install directly:

  brew tap RunanywhereAI/rcli https://github.com/RunanywhereAI/RCLI.git
  brew install rcli
  rcli setup

Or build from source if you prefer not to use either method: https://github.com/RunanywhereAI/RCLI#build-from-source

New comment by sanchitmonga22 in "Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon"

sanchitmonga22 — Wed, 11 Mar 2026 02:08:34 +0000

Cool, just checked out dlgo. Looks like you're targeting Go bindings for on-device inference? Different approach but same conviction that this should run locally. Happy to compare notes if you want to chat about Metal optimization or pipeline architecture.

New comment by sanchitmonga22 in "Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon"

sanchitmonga22 — Wed, 11 Mar 2026 02:07:25 +0000

Apple has the silicon, the frameworks (MLX, CoreML), and the models. The gap is putting it all together into a fast, unified on-device pipeline. That's what we're focused on, and honestly, we think Apple will eventually ship something similar natively. Until then, we're trying to show what's possible today on their hardware.

New comment by sanchitmonga22 in "Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon"

sanchitmonga22 — Wed, 11 Mar 2026 02:06:02 +0000

Absolutely, we'd welcome a Portfile contribution. Happy to review and merge. If halostatue wants to co-maintain, even better.

Feel free to open a PR or issue on the RCLI repo and we'll coordinate.

New comment by sanchitmonga22 in "Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon"

sanchitmonga22 — Wed, 11 Mar 2026 02:04:56 +0000

Understood, you want dictation, not a chatbot. That's a valid and different use case.

RCLI is Apple Silicon only today because MetalRT is built on Metal. For Linux, the closest thing to what you're describing would be building a virtual input device on top of Whisper or Parakeet (which RCLI supports as STT backends). Parakeet TDT 0.6B has ~1.9% WER, which is very close to production dictation quality.

The missing piece on Linux isn't the model, it's the integration: a daemon that captures mic audio, runs STT with hidden latency (streaming partial results), and injects text as keyboard input. sherpa-onnx (https://github.com/k2-fsa/sherpa-onnx) supports Linux and has streaming STT, so it might be the best starting point for what you're after.

We're focused on Apple Silicon for now but broader platform support is on the roadmap.

New comment by sanchitmonga22 in "Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon"

sanchitmonga22 — Wed, 11 Mar 2026 02:03:45 +0000

This is a great idea. A virtual audio device that sits in the path of any audio stream and provides live transcription would be huge for video conferencing, lectures, and podcasts.

MetalRT's STT numbers make this feasible: 70 seconds of audio transcribed in 101ms means you could process audio chunks in real-time with massive headroom. The latency would be imperceptible.
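To make the headroom concrete, the arithmetic can be sketched as follows. The 70 s and 101 ms figures come from the comment; the 1-second chunk size is a hypothetical choice, and the per-chunk estimate assumes cost scales linearly with audio length, ignoring any fixed per-call overhead.

```python
def real_time_factor(audio_seconds, processing_seconds):
    """How many seconds of audio are transcribed per second of compute."""
    return audio_seconds / processing_seconds


rtf = real_time_factor(70.0, 0.101)  # roughly 693x faster than real time

chunk_seconds = 1.0                            # hypothetical streaming chunk size
chunk_cost_ms = chunk_seconds / rtf * 1000.0   # roughly 1.4 ms of compute per chunk
print(f"{rtf:.0f}x real time, about {chunk_cost_ms:.2f} ms per {chunk_seconds:.0f}s chunk")
```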

We haven't built this yet but it's a compelling use case. CoreAudio supports virtual audio devices (aggregate devices) that could pipe audio through the pipeline. If anyone in this thread has experience building macOS audio HAL plugins and wants to collaborate, we're very open to contributions. RCLI is MIT.

New comment by sanchitmonga22 in "Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon"

sanchitmonga22 — Wed, 11 Mar 2026 02:02:29 +0000

This is exactly the problem we're trying to solve. The models themselves have gotten surprisingly capable at small sizes (Qwen3.5 4B with 262K context, LFM2 1.2B for fast tool calling), but the inference infrastructure hasn't kept up.

When people say "local AI is too slow," they usually mean the engine is too slow, not the model. A 4B model at 186 tok/s (MetalRT on M4 Max) feels genuinely responsive for interactive chat. The same model at 87 tok/s (llama.cpp) feels sluggish. Same weights, same quality, 2x the speed: that's a usability cliff.
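As a rough illustration of what those throughput numbers mean per token and per reply, here is a small calculation. The two tok/s figures are the ones quoted above; the 200-token reply length is a hypothetical example, and the estimate covers decode only (no prefill or time to first token).

```python
def ms_per_token(tokens_per_second):
    """Average decode latency for one token, in milliseconds."""
    return 1000.0 / tokens_per_second


def reply_seconds(num_tokens, tokens_per_second):
    """Decode time for a whole reply, in seconds."""
    return num_tokens / tokens_per_second


metalrt_tps = 186.0    # quoted for MetalRT on M4 Max
llamacpp_tps = 87.0    # quoted for llama.cpp, same model
reply_tokens = 200     # hypothetical reply length

speedup = metalrt_tps / llamacpp_tps
print(f"speedup: {speedup:.2f}x")
print(f"per token: {ms_per_token(metalrt_tps):.1f} ms vs {ms_per_token(llamacpp_tps):.1f} ms")
print(f"{reply_tokens}-token reply: "
      f"{reply_seconds(reply_tokens, metalrt_tps):.2f} s vs "
      f"{reply_seconds(reply_tokens, llamacpp_tps):.2f} s")
```

That works out to about 5.4 ms versus 11.5 ms per token, or roughly one second versus more than two for a 200-token reply.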

We think the gap between cloud and on-device inference is an infrastructure problem, not a model problem. That's what we're working on.

New comment by sanchitmonga22 in "Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon"

sanchitmonga22 — Wed, 11 Mar 2026 02:01:13 +0000

The default TTS voice (Piper) is a lightweight model optimized for speed over quality. It's fast but yeah, it doesn't sound great.

If you install Kokoro TTS (rcli models > TTS section), the voice quality is dramatically better. It's a neural TTS model with 28 different voices. MetalRT synthesizes Kokoro at 178ms for short responses, so you don't pay a speed penalty for the upgrade.

We should probably make Kokoro the default or at least make the upgrade path more obvious in the first-run experience. Fair feedback.

New comment by sanchitmonga22 in "Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon"

sanchitmonga22 — Wed, 11 Mar 2026 02:00:01 +0000

Fair criticism. The action executed on the LLM side but didn't translate to the correct macOS action: the model hallucinated success instead of routing to the open_url tool.

This is a known limitation with small LLMs (0.6B-1.2B) doing tool calling. They sometimes confuse "I know what you want" with "I did it." Upgrading to a larger model improves tool-calling accuracy significantly.

We're also working on verification: having the pipeline confirm the action actually succeeded before reporting back. That's a fair expectation and we should meet it.
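One way to picture that verification step is a wrapper that only reports success after an independent check passes. This is a sketch under assumptions, not RCLI's implementation: ActionResult, run_verified, and the in-memory "opened" list standing in for real macOS state are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ActionResult:
    ok: bool
    message: str


def run_verified(name: str, action: Callable[[], None], verify: Callable[[], bool]) -> ActionResult:
    """Run an action, then confirm its effect before reporting success."""
    try:
        action()
    except Exception as exc:
        return ActionResult(False, f"{name} failed: {exc}")
    if not verify():
        return ActionResult(False, f"{name} ran but could not be confirmed")
    return ActionResult(True, f"{name} succeeded")


opened = []  # stand-in for observable system state (hypothetical)

# An action whose effect can be confirmed afterwards.
confirmed = run_verified(
    "open_url",
    action=lambda: opened.append("https://google.com"),
    verify=lambda: "https://google.com" in opened,
)

# An action that silently does nothing: the check catches it.
unconfirmed = run_verified(
    "open_url",
    action=lambda: None,
    verify=lambda: "https://example.com" in opened,
)
print(confirmed.message)
print(unconfirmed.message)
```

The key design choice is that the success message comes from the verify callback, not from the model's own claim that it acted.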

New comment by sanchitmonga22 in "Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon"

sanchitmonga22 — Wed, 11 Mar 2026 01:59:03 +0000

Thanks for trying it and for filing the bug. We're looking into the Homebrew install issue.

On unsloth quants: agreed, they're consistently better bit-for-bit. Adding broader quantization format support (including unsloth's approach) is on the roadmap. Right now MetalRT works with MLX 4-bit files and GGUF Q4_K_M, and we want to expand that.

On the grounding issue ("navigate to google.com" not actually navigating): you're right, that's a gap. The "open_url" action exists but the LLM doesn't always route to it correctly, especially with compound commands. Small models (0.6B-1.2B) have limited tool-calling accuracy; upgrading to Qwen3.5 4B via rcli upgrade-llm helps significantly. We're also improving the action routing prompts.

Appreciate the detailed feedback. This is exactly what we need.