Hacker News: danlenton

New comment by danlenton in "Multi-Stream LLMs: new paper on parallelizing/separating prompts, thinking, I/O"

danlenton — Fri, 22 May 2026 00:35:45 +0000

I think the main benefit is improved speed and parallelism. Very similar to https://thinkingmachines.ai/blog/interaction-models/

New comment by danlenton in "Show HN: Route your prompts to the best LLM"

danlenton — Wed, 29 May 2024 00:39:51 +0000

We just initialize a random latent vector for each model, and then jointly train each of these unique latent vectors :)
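
A minimal sketch of that idea, not the actual implementation: each model gets its own randomly initialized latent vector, and all of them are trained together by gradient descent. The shared prompt projection, the dot-product scoring head, the cross-entropy loss and the toy data are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_latents(prompts, model_ids, targets, n_models,
                  dim=4, lr=0.5, epochs=1000, seed=0):
    """Jointly train one latent vector per model plus a shared prompt projection."""
    rng = np.random.default_rng(seed)
    # One random latent vector per model (the only per-model parameters).
    latents = rng.normal(scale=0.1, size=(n_models, dim))
    # Hypothetical shared projection from prompt features into the latent space.
    proj = rng.normal(scale=0.1, size=(prompts.shape[1], dim))
    n = len(targets)
    for _ in range(epochs):
        h = prompts @ proj                          # (N, dim) projected prompts
        z = np.sum(h * latents[model_ids], axis=1)  # dot product per (prompt, model) pair
        err = sigmoid(z) - targets                  # gradient of cross-entropy w.r.t. z
        grad_lat = np.zeros_like(latents)
        np.add.at(grad_lat, model_ids, err[:, None] * h)  # accumulate per model
        grad_proj = prompts.T @ (err[:, None] * latents[model_ids])
        latents -= lr * grad_lat / n
        proj -= lr * grad_proj / n
    return latents, proj

def predict(prompts, model_ids, latents, proj):
    """Predicted quality in (0, 1) for each (prompt, model) pair."""
    return sigmoid(np.sum((prompts @ proj) * latents[model_ids], axis=1))

# Toy data: two prompt types (easy, hard) and two models.
# Model 0 only does well on easy prompts; model 1 does well on both.
prompts = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
model_ids = np.array([0, 0, 1, 1])
targets = np.array([0.9, 0.1, 0.9, 0.9])

latents, proj = train_latents(prompts, model_ids, targets, n_models=2)
preds = predict(prompts, model_ids, latents, proj)
```

After training, the prediction for the hard prompt should be higher for model 1 than for model 0, which is all a router needs in order to rank the candidates.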

New comment by danlenton in "Show HN: Route your prompts to the best LLM"

danlenton — Fri, 24 May 2024 10:32:28 +0000

Currently the motivation is mainly speed. For the really easy ones like "hey, how's it going?" or "sorry I didn't hear you, can you repeat?" you can easily send to Llama3 etc. Ofc you could do some clever caching or something, but training a custom router directly on the task to optimize the resultant performance metric doesn't require any manual engineering.

Still, I agree that routing in isolation is not thaaat useful in many LLM domains. I think the usefulness will increase when applying to multi-step agentic systems, and when combining with other optimizations such as end-to-end learning of the intermediate prompts (DSPy etc.)

Thanks again for diving deep, super helpful!

New comment by danlenton in "Show HN: Route your prompts to the best LLM"

danlenton — Thu, 23 May 2024 17:09:50 +0000

Interesting, do you have any hunch as to why this is? We've seen in more verticalized apps where the underlying model is hidden from the user (sales call agent, autopilot tool, support agent etc.) that trying to reach high quality on hard prompts and high speed on the remaining prompts makes routing an appealing option.

New comment by danlenton in "Show HN: Route your prompts to the best LLM"

danlenton — Thu, 23 May 2024 17:02:24 +0000

no down side

New comment by danlenton in "Show HN: Route your prompts to the best LLM"

danlenton — Thu, 23 May 2024 16:59:11 +0000

One use case is optimizing agentic systems, where a custom router [https://youtu.be/9JYqNbIEac0] is trained end-to-end on the final task (rather than GPT4-as-a-judge). Both the intermediate prompts and the models used can then be learned from data (similar to DSPy), whilst ensuring the final task performance remains high. This is not supported with v0, but it's on the roadmap. Thoughts?

New comment by danlenton in "Show HN: Route your prompts to the best LLM"

danlenton — Thu, 23 May 2024 16:54:31 +0000

Thanks for sharing, will get this fixed now!

New comment by danlenton in "Show HN: Route your prompts to the best LLM"

danlenton — Thu, 23 May 2024 16:50:04 +0000

If you do test it out, feel free to ping me with any questions!

New comment by danlenton in "Show HN: Route your prompts to the best LLM"

danlenton — Thu, 23 May 2024 16:48:38 +0000

Makes sense, however I would clarify that we don't need to make the final decision. If you're using the neural scoring function as an API, then you can just get predictions about how each model will likely perform on your prompt, and then use these predictions however you want (if at all). Likewise, the benchmarking platform [https://youtu.be/PO4r6ek8U6M] can be used to just assess different models on your prompts without needing to do any routing. Nonetheless, this perspective is very helpful.
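
To make the "use these predictions however you want" part concrete, here is a rough client-side sketch. It assumes you already have a predicted quality score per model for one prompt; the `ModelPrediction` type, the model names, the scores and the costs are all hypothetical and not taken from any real API response.

```python
from dataclasses import dataclass

@dataclass
class ModelPrediction:
    model: str           # hypothetical model name
    quality: float       # predicted quality in [0, 1] (assumed scale)
    cost_per_1k: float   # assumed cost per 1k tokens, in USD

def pick_model(predictions, min_quality):
    """Cheapest model predicted to clear the quality bar.

    Falls back to the highest predicted quality if nothing clears it.
    """
    good_enough = [p for p in predictions if p.quality >= min_quality]
    if not good_enough:
        return max(predictions, key=lambda p: p.quality)
    return min(good_enough, key=lambda p: p.cost_per_1k)

# Made-up predictions for a single prompt.
predictions = [
    ModelPrediction("model-a", quality=0.92, cost_per_1k=0.0300),
    ModelPrediction("model-b", quality=0.85, cost_per_1k=0.0020),
    ModelPrediction("model-c", quality=0.60, cost_per_1k=0.0005),
]

choice = pick_model(predictions, min_quality=0.8)
fallback = pick_model(predictions, min_quality=0.99)
```

The same scores could just as easily feed a dashboard or an offline comparison, with no routing decision made at all.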

New comment by danlenton in "Show HN: Route your prompts to the best LLM"

danlenton — Thu, 23 May 2024 15:00:48 +0000

Thanks for weighing in. I'm sure for your setup right now, our router in its current form would not be useful for you. This is the very first version, and the scope is therefore relatively limited.

On our roadmap, we plan to support:

- an API which returns the neural scores directly, enabling model selection and model-specific prompts to all be handled on the client side

- automatic learning of intermediate prompts for agentic multi-step systems, taking a similar view as DSPy, where all intermediate LLM calls and prompts are treated as latent variables in an optimizable end-to-end agentic system.

With these additions, the subtleties of the model + prompt relationship would be better respected.

I also believe that LLMs will become more robust to prompt subtleties over time. Also, some tasks are less sensitive to these minor subtleties you refer to.

For example, if you have a sales call agent, you might want to optimize UX for easy dialogue prompts (so the person on the other end isn't left waiting), and take longer thinking about harder prompts requiring the full context of the call.

This is just an example, but my point is that not all LLM applications are the same. Some might be super sensitive to prompt subtleties, others might not be.

Thoughts?

New comment by danlenton in "Show HN: Route your prompts to the best LLM"

danlenton — Thu, 23 May 2024 14:46:42 +0000

duly noted!

New comment by danlenton in "Show HN: Route your prompts to the best LLM"

danlenton — Thu, 23 May 2024 14:43:31 +0000

However, as janekm says, we can't charge just based on cost savings. We would need the router points to be sufficiently compelling wrt quality, speed and cost (including our own margins) that users still sometimes opt for these router points. Suffice it to say, if any router configs do start to take margins, then this will be clearly reflected in the overall router cost plotted on the scatter graph. UX will not be affected.

New comment by danlenton in "Show HN: Route your prompts to the best LLM"

danlenton — Thu, 23 May 2024 14:41:01 +0000

Yeah that's a great point, something we'll keep in mind as we work out the final business model. Thanks!

New comment by danlenton in "Show HN: Route your prompts to the best LLM"

danlenton — Thu, 23 May 2024 12:23:58 +0000

Thanks! Ipsos is also a great analogous example, I hadn't thought of that.

New comment by danlenton in "Show HN: Route your prompts to the best LLM"

danlenton — Thu, 23 May 2024 12:20:07 +0000

Makes sense, thanks a lot for the feedback. We're pretty confident that future versions of our router will provide sufficient value that we can take margins here, so we don't expect to need to start charging for single sign-on (SSO) alone. The SSO benefits are only the beginning in my mind; our main value will come from custom benchmarks across all models + providers and from optimizing LLM applications, including agentic workflows. I do very much see your point though. Thankfully, we're very fortunate to have several years of runway, so we don't plan on disappearing anytime too soon!

New comment by danlenton in "Show HN: Route your prompts to the best LLM"

danlenton — Thu, 23 May 2024 11:20:34 +0000

It's on the roadmap! Hopefully will be added next week

New comment by danlenton in "Show HN: Route your prompts to the best LLM"

danlenton — Thu, 23 May 2024 10:40:30 +0000

Yes, the benchmarks are ongoing. We continually plot the speed and cost across time in our runtime benchmarks [https://unify.ai/benchmarks], and we use this live data when plotting the quality scatter graphs [https://console.unify.ai/dashboard]. The router configurations are "self-improving" in the sense that any given router config will quickly wrap the latest models and providers under the hood. Using a router config is a way of riding the wave of models and providers, whilst just specifying your priorities for quality, speed and cost. We will have some case studies which better explain this soon!

New comment by danlenton in "Show HN: Route your prompts to the best LLM"

danlenton — Thu, 23 May 2024 10:34:20 +0000

Currently, we simply use GPT4-as-a-judge, with a general system prompt we've written which is task agnostic. This is then used to train the neural scoring function, which predicts quality ahead-of-time. However, it's on our roadmap to make the judging more flexible, potentially with task-specific judge prompts and in-context examples, and perhaps also using a jury [https://arxiv.org/pdf/2404.18796].

New comment by danlenton in "Show HN: Route your prompts to the best LLM"

danlenton — Thu, 23 May 2024 10:30:59 +0000

Sure! Basically traditional MoE has several linear layers, and the network learns to route down those paths, based on the training loss (similar to how CNNs learn through max-pooling, which is also non-differentiable). However, MoEs have been shown to specialize on tokens, not high-level semantics. This was eloquently explained by Fuzhao Xue, author of OpenMoE, in one of our reading groups: https://www.youtube.com/watch?v=k3QOpJA0A0Q&t=1547s

In contrast, our router sits at a higher level of the stack, sending prompts to different models and providers based on quality on the prompt distribution, speed and cost. Happy to clarify further if helpful!
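
As a rough picture of routing at this level, the sketch below scores each candidate endpoint with a weighted trade-off between predicted quality, latency and cost, then picks the best one. The linear utility, the weights, the endpoint names and every number are assumptions for illustration, not how the actual router is built.

```python
def route(candidates, w_quality, w_speed, w_cost):
    """Return the name of the candidate with the highest weighted utility.

    Higher predicted quality is rewarded; latency and cost are penalized.
    """
    def utility(c):
        return (w_quality * c["quality"]
                - w_speed * c["latency_s"]
                - w_cost * c["cost_usd"])
    return max(candidates, key=utility)["name"]

# Made-up candidates for one prompt: a large slow model and a small fast one.
candidates = [
    {"name": "large-model@provider-x", "quality": 0.9, "latency_s": 2.0, "cost_usd": 0.030},
    {"name": "small-model@provider-y", "quality": 0.7, "latency_s": 0.5, "cost_usd": 0.002},
]

quality_first = route(candidates, w_quality=1.0, w_speed=0.0, w_cost=0.0)
speed_aware = route(candidates, w_quality=1.0, w_speed=0.5, w_cost=0.0)
```

With these numbers, caring only about quality picks the large model, while adding a latency penalty flips the choice to the small one.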

New comment by danlenton in "Show HN: Route your prompts to the best LLM"

danlenton — Thu, 23 May 2024 10:27:06 +0000

So the neural scoring introduces ~20ms latency, but this only impacts time-to-first-token (not inter-token-latency). When using our public endpoints there is an additional ~150ms latency, but you can deploy the router on-prem in your own cloud, so then it would only be the inference latency. Generally the improvements in ITL outweigh the small addition to TTFT.
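
The trade-off can be worked through with some simple arithmetic. In the sketch below, the ~20ms scoring overhead and ~150ms public-endpoint overhead come from the comment above; the baseline TTFT and both ITL values are hypothetical numbers chosen only to show the calculation.

```python
import math

def total_latency_ms(ttft_ms, itl_ms, n_tokens):
    """Time to receive n_tokens: first token, then one ITL per further token."""
    return ttft_ms + itl_ms * max(n_tokens - 1, 0)

def break_even_tokens(overhead_ms, itl_saving_ms):
    """Response length at which the added TTFT is paid back by the faster ITL."""
    return 1 + math.ceil(overhead_ms / itl_saving_ms)

overhead_ms = 20 + 150       # scoring + public endpoint, from the comment above
base_ttft_ms = 300           # assumed TTFT when calling a provider directly
base_itl_ms = 30             # assumed ITL without routing
routed_itl_ms = 20           # assumed ITL after routing to a faster provider

n_tokens = 101
direct = total_latency_ms(base_ttft_ms, base_itl_ms, n_tokens)
routed = total_latency_ms(base_ttft_ms + overhead_ms, routed_itl_ms, n_tokens)
crossover = break_even_tokens(overhead_ms, base_itl_ms - routed_itl_ms)
```

Under these assumed numbers the routed call comes out ahead once a response is longer than `crossover` tokens; with an on-prem deployment the overhead drops to roughly the 20ms scoring step, so the crossover arrives much sooner.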