<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: russ</title><link>https://news.ycombinator.com/user?id=russ</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sat, 02 May 2026 10:28:14 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=russ" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[Show HN: Open-source turn detection model for voice AI]]></title><description><![CDATA[
<p>Hey HN, it’s Russ - cofounder of LiveKit, an open source stack for building realtime AI applications.<p>We’re sharing our first homegrown AI model for turn detection. Here’s a live demo: <a href="https://cerebras.vercel.app/" rel="nofollow">https://cerebras.vercel.app/</a><p>Voice AI has come a long way in the last year. We now have end-to-end systems that can generate a response to user input in 300-500ms — human-level speeds!<p>As latency drops, a common problem surfaces: the LLM responds too quickly. Any time there’s a short pause in a user’s speech, it ends up interrupting them. This is largely due to how voice AI applications perform “turn detection” — that is, figuring out when the user has finished speaking and when the model can run inference and respond.<p>Pretty much everyone uses a signal processing technique called voice activity detection (VAD). In a nutshell, it figures out when the audio signal switches from speech to silence and then triggers an end of turn once a configurable amount of silence has transpired.<p>One obvious delta between VAD and how humans do turn detection is that we also consider the content of speech (i.e. what someone says). These past few months, we’ve been working on an open weights, content-aware turn detection model for voice AI applications. It was fine-tuned from SmolLM v2 on text, runs on CPU (currently takes 50ms for inference), and uses speech transcriptions as input to predict when a user has completed a thought (also called an “utterance”). Since it was trained on text, it notably works well for pipeline-based architectures (i.e. STT ⇒ LLM ⇒ TTS).<p>We use this model together with VAD to make better predictions about whether a user is done speaking. Here are some demos:<p>- Podcast interview: <a href="https://youtu.be/EYDrSSEP0h0" rel="nofollow">https://youtu.be/EYDrSSEP0h0</a><p>- Ordering food: <a href="https://youtu.be/fcr8Y-3c4E0" rel="nofollow">https://youtu.be/fcr8Y-3c4E0</a><p>- Providing shipping address: <a href="https://youtu.be/2pQWvd6xozw" rel="nofollow">https://youtu.be/2pQWvd6xozw</a><p>- Customer support: <a href="https://youtu.be/YoSRg3ORKtQ" rel="nofollow">https://youtu.be/YoSRg3ORKtQ</a><p>In our testing we’ve found:<p>- 85% reduction in unintentional interruptions<p>- 3% false positives (where the user is done speaking, but the model thinks they aren’t)<p>In practice, we still have work to do. We currently delay inference if the model predicts a < 15% chance the user is done speaking. This threshold misses a bunch of middle-of-the-pack probabilities.<p>Next steps are improving the model accuracy, tuning performance, and expanding to support more languages (it only supports English right now). Separately, we’re starting to explore an audio-based model that considers not just what someone says but how they say it, which can be used with natively multimodal models like GPT-4o that directly process and generate audio.<p>Code here: <a href="https://github.com/livekit/agents/tree/main/livekit-plugins/livekit-plugins-turn-detector">https://github.com/livekit/agents/tree/main/livekit-plugins/...</a><p>Let us know what you think!</p>
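<p>To make the mechanics concrete, here’s a rough sketch of how a VAD silence timer and a text-based end-of-utterance probability could be combined. It’s illustrative only - predict_eou(), the thresholds, and the silence windows are hypothetical stand-ins, not the plugin’s actual API:</p>
<pre><code>import time

# Illustrative sketch: combine VAD silence with a text-based end-of-utterance
# (EOU) probability. All names and numbers here are hypothetical stand-ins.
EOU_THRESHOLD = 0.15       # below this, assume the user is mid-thought
BASE_SILENCE_S = 0.5       # normal silence window before ending the turn
EXTENDED_SILENCE_S = 3.0   # longer window when the model thinks they aren't done


def predict_eou(transcript: str) -> float:
    """Hypothetical call into a text-based turn-detection model.

    Returns the probability that `transcript` is a completed utterance.
    """
    raise NotImplementedError


def end_of_turn(transcript: str, silence_started_at: float) -> bool:
    """Decide whether the user's turn is over, given the transcript so far
    and the time at which VAD last detected the start of silence."""
    silence_s = time.monotonic() - silence_started_at
    p_done = predict_eou(transcript)

    if p_done < EOU_THRESHOLD:
        # Model thinks the user is probably mid-thought: require a much
        # longer pause before committing to a response.
        return silence_s >= EXTENDED_SILENCE_S
    # Otherwise fall back to the usual VAD silence window.
    return silence_s >= BASE_SILENCE_S
</code></pre>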
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=42497868">https://news.ycombinator.com/item?id=42497868</a></p>
<p>Points: 8</p>
<p># Comments: 1</p>
]]></description><pubDate>Mon, 23 Dec 2024 21:46:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=42497868</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=42497868</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42497868</guid></item><item><title><![CDATA[New comment by russ in "CodeMic: A new way to talk about code"]]></title><description><![CDATA[
<p>Yeah, that would be really handy too.</p>
]]></description><pubDate>Mon, 23 Dec 2024 16:47:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=42495722</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=42495722</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42495722</guid></item><item><title><![CDATA[New comment by russ in "CodeMic: A new way to talk about code"]]></title><description><![CDATA[
<p>This is very cool! I’ve wanted something like CodeMic for a long time.<p>Back when I was at Twitter, we used Review Board for code reviews (this was in 2009, before GH was a thing for most companies). It was tough to thoughtfully review large branches, especially for parts of the codebase that I wasn’t familiar with. I remember thinking, if I could somehow record the development process for a PR I was reviewing, it would be easier to understand what the submitter was trying to accomplish and how they went about doing so. I found myself reviewing code style more than functionality, architecture, or design.<p>I watched most of the intro video, but didn’t go deeper on the site. Does CM integrate easily into the code review/PR process? I suppose I could just attach a link in any PR description?<p>Great work!</p>
]]></description><pubDate>Sun, 22 Dec 2024 21:43:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=42489386</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=42489386</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42489386</guid></item><item><title><![CDATA[New comment by russ in "Codenames App, or my biggest project so far"]]></title><description><![CDATA[
<p>You got it! Hope you have some fun with it. :)</p>
]]></description><pubDate>Mon, 16 Dec 2024 02:48:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=42427574</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=42427574</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42427574</guid></item><item><title><![CDATA[New comment by russ in "Codenames App, or my biggest project so far"]]></title><description><![CDATA[
<p>Haven’t played Codenames in a long while, but made this 8 years ago to play with family and friends on TVs. Right in time for the holidays!<p>demo: <a href="https://dsa.github.io" rel="nofollow">https://dsa.github.io</a><p>code: <a href="https://github.com/dsa/dsa.github.io">https://github.com/dsa/dsa.github.io</a></p>
]]></description><pubDate>Mon, 09 Dec 2024 00:14:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=42361933</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=42361933</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42361933</guid></item><item><title><![CDATA[New comment by russ in "Show HN: OnAir – create link, receive calls"]]></title><description><![CDATA[
<p>Haha yup, I’m that Russ. Really appreciate your kind words. <3<p>I’ll shoot you an email and let’s chat!</p>
]]></description><pubDate>Fri, 15 Nov 2024 21:19:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=42151219</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=42151219</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42151219</guid></item><item><title><![CDATA[New comment by russ in "Show HN: OnAir – create link, receive calls"]]></title><description><![CDATA[
<p>This is super cool. One neat idea: when I'm in offline mode, I can clone my voice,  provide some context data/sources, and have my AI clone answer calls for me. It can give me a summary of conversations it had each day and allow me to follow up.</p>
]]></description><pubDate>Fri, 15 Nov 2024 18:02:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=42149291</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=42149291</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42149291</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>Heh, actually I'm pretty sure I've come across your X profile before. :) You're definitely in a small minority of folks with a deep(er) understanding of WebRTC.</p>
]]></description><pubDate>Sat, 05 Oct 2024 22:19:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=41753341</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41753341</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41753341</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>You really don’t need to know about WebRTC at all when you use LiveKit. That’s largely thanks to the SDKs abstracting away all the complexity. Having good SDKs that work across every platform with consistent APIs is more valuable than the SFU imo. There are other options for SFUs and folks like Signal have rolled their own. Try to get WebRTC running on Apple Vision Pro or tvOS and let me know if that’s no big deal.</p>
]]></description><pubDate>Sat, 05 Oct 2024 19:57:01 +0000</pubDate><link>https://news.ycombinator.com/item?id=41752467</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41752467</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41752467</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>Doesn’t sound right. I’d love to dig into this some more. Would you mind shooting me a DM on X? @dsa</p>
]]></description><pubDate>Sat, 05 Oct 2024 19:51:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=41752432</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41752432</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41752432</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>Field CTO — hi @Sean-Der :wave:<p>Fractional CTO sounds like a disaster lol</p>
]]></description><pubDate>Sat, 05 Oct 2024 17:50:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=41751599</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41751599</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41751599</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>Which components feel ad hoc?<p>In most real applications, the agent has additional logic (function calling, RAG, etc) beyond simply relaying a stream to the model server. In those cases, you want it to be a separate service/component that can be independently scaled.</p>
]]></description><pubDate>Sat, 05 Oct 2024 14:59:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=41750340</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41750340</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41750340</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>There’s Ultravox as well (from one of the creators of WebRTC):
<a href="https://github.com/fixie-ai/ultravox">https://github.com/fixie-ai/ultravox</a><p>Their model builds a speech-to-speech layer into Llama. Last I checked they have the audio-in part working and they’re working on the audio-out piece.</p>
]]></description><pubDate>Sat, 05 Oct 2024 14:51:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=41750286</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41750286</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41750286</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>There’s a ton of complexity under the “relatively simple use case” when you get to a global, 200M+ user scale.</p>
]]></description><pubDate>Sat, 05 Oct 2024 02:47:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=41747366</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41747366</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41747366</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>50% human speaking at $0.06/minute of tokens<p>50% AI speaking at $0.24/minute of tokens<p>we (LiveKit Cloud) charge ~$0.0005/minute for each participant (in this case there would be 2)<p>So blended is $0.151/minute</p>
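<p>A quick sketch of that arithmetic, using the same figures as above:</p>
<pre><code># Blended per-minute cost, using the figures quoted above.
human_rate = 0.06      # $/min of tokens while the human is speaking
ai_rate = 0.24         # $/min of tokens while the AI is speaking
livekit_rate = 0.0005  # $/min per participant on LiveKit Cloud
participants = 2       # the user and the agent

blended = 0.5 * human_rate + 0.5 * ai_rate + participants * livekit_rate
print(f"${blended:.3f}/minute")  # -> $0.151/minute
</code></pre>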
]]></description><pubDate>Sat, 05 Oct 2024 01:14:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=41746982</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41746982</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41746982</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>We had our playground (<a href="https://playground.livekit.io" rel="nofollow">https://playground.livekit.io</a>) up for a few days using our key. Def racked up a $$$$ bill!</p>
]]></description><pubDate>Fri, 04 Oct 2024 23:32:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=41746455</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41746455</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41746455</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>I had no idea! <3 Thank you for sharing this, made my weekend.</p>
]]></description><pubDate>Fri, 04 Oct 2024 23:24:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=41746424</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41746424</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41746424</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>It's using the same model/engine. I don't have knowledge of the internals, but there's a different subsystem/set of dedicated resources for API traffic versus first-party apps.<p>One thing to note: there is no separate TTS phase here; it happens internally within GPT-4o, in both the Realtime API and Advanced Voice.</p>
]]></description><pubDate>Fri, 04 Oct 2024 23:23:53 +0000</pubDate><link>https://news.ycombinator.com/item?id=41746421</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41746421</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41746421</guid></item><item><title><![CDATA[Show HN: Open source framework OpenAI uses for Advanced Voice]]></title><description><![CDATA[
<p>Hey HN, we've been working with OpenAI for the past few months on the new Realtime API.<p>The goal is to give everyone access to the same stack that underpins Advanced Voice in the ChatGPT app.<p>Under the hood it works like this:
- A user's speech is captured by a LiveKit client SDK in the ChatGPT app
- Their speech is streamed using WebRTC to OpenAI’s voice agent
- The agent relays the speech prompt over websocket to GPT-4o
- GPT-4o runs inference and streams speech packets (over websocket) back to the agent
- The agent relays generated speech using WebRTC back to the user’s device<p>The Realtime API that OpenAI launched is the websocket interface to GPT-4o. This backend framework covers the voice agent portion. Besides having additional logic like function calling, the agent fundamentally proxies WebRTC to websocket.<p>The reason for this is that websocket isn’t the best choice for client-server communication. The vast majority of packet loss occurs between the server and the client device, and websocket doesn’t provide programmatic control or intervention in lossy network environments like WiFi or cellular. Packet loss leads to higher latency and choppy or garbled audio.</p>
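<p>A minimal sketch of that relay pattern, assuming hypothetical WebRTC helpers (read_user_audio and play_to_user are stand-ins, not LiveKit APIs) and the websockets Python package for the model-server side:</p>
<pre><code>import asyncio
import websockets  # third-party: pip install websockets


async def read_user_audio():
    """Hypothetical async iterator yielding the user's audio frames from WebRTC."""
    raise NotImplementedError
    yield b""  # marks this as a (stub) async generator


async def play_to_user(frame: bytes) -> None:
    """Hypothetical publish of a generated audio frame back over WebRTC."""
    raise NotImplementedError


async def relay(model_ws_url: str) -> None:
    # The agent sits between the user's WebRTC connection and the model
    # server's websocket, forwarding audio in both directions.
    async with websockets.connect(model_ws_url) as ws:
        async def uplink():
            # User speech: WebRTC -> websocket (to the model server)
            async for frame in read_user_audio():
                await ws.send(frame)

        async def downlink():
            # Generated speech: websocket -> WebRTC (back to the user)
            async for frame in ws:
                await play_to_user(frame)

        await asyncio.gather(uplink(), downlink())
</code></pre>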
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=41743327">https://news.ycombinator.com/item?id=41743327</a></p>
<p>Points: 266</p>
<p># Comments: 61</p>
]]></description><pubDate>Fri, 04 Oct 2024 17:01:04 +0000</pubDate><link>https://github.com/livekit/agents</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41743327</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41743327</guid></item><item><title><![CDATA[New comment by russ in "Cerebras Inference: AI at Instant Speed"]]></title><description><![CDATA[
<p>It’s insanely fast.<p>Here’s an AI voice assistant I built that uses it:<p><a href="https://cerebras.vercel.app" rel="nofollow">https://cerebras.vercel.app</a></p>
]]></description><pubDate>Tue, 27 Aug 2024 18:56:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=41371389</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41371389</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41371389</guid></item></channel></rss>