<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: russ</title><link>https://news.ycombinator.com/user?id=russ</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Sat, 02 May 2026 10:28:14 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=russ" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[Show HN: Open-source turn detection model for voice AI]]></title><description><![CDATA[
<p>Hey HN, it’s Russ - cofounder of LiveKit, an open source stack for building realtime AI applications.<p>We’re sharing our first homegrown AI model for turn detection. Here’s a live demo: <a href="https://cerebras.vercel.app/" rel="nofollow">https://cerebras.vercel.app/</a><p>Voice AI has come a long way in the last year. We now have end-to-end systems that can generate a response to user input in 300-500ms — human-level speeds!<p>As latency drops, a common problem surfaces: the LLM responds too quickly. Any time there’s a short pause in a user’s speech, it ends up interrupting them. This is largely due to how voice AI applications perform “turn detection” — that is, figuring out when the user has finished speaking and when the model can run inference and respond.<p>Pretty much everyone uses a signal processing technique called voice activity detection (VAD). In a nutshell, it figures out when the audio signal switches from speech to silence and then triggers an end of turn once a configurable amount of silence has transpired.<p>One obvious delta between VAD and how humans do turn detection is that we also consider the content of speech (i.e. what someone says). These past few months, we’ve been working on an open weights, content-aware turn detection model for voice AI applications. It was fine-tuned from SmolLM v2 on text, runs on CPU (currently takes 50ms for inference), and uses speech transcriptions as input to predict when a user has completed a thought (also called an “utterance”). Since it was trained on text, it notably works well for pipeline-based architectures (i.e. STT ⇒ LLM ⇒ TTS).<p>We use this model together with VAD to make better predictions about whether a user is done speaking. Here are some demos:<p>- Podcast interview: <a href="https://youtu.be/EYDrSSEP0h0" rel="nofollow">https://youtu.be/EYDrSSEP0h0</a><p>- Ordering food: <a href="https://youtu.be/fcr8Y-3c4E0" rel="nofollow">https://youtu.be/fcr8Y-3c4E0</a><p>- Providing shipping address: <a href="https://youtu.be/2pQWvd6xozw" rel="nofollow">https://youtu.be/2pQWvd6xozw</a><p>- Customer support: <a href="https://youtu.be/YoSRg3ORKtQ" rel="nofollow">https://youtu.be/YoSRg3ORKtQ</a><p>In our testing we’ve found:<p>- 85% reduction in unintentional interruptions<p>- 3% false positives (where the user is done speaking, but the model thinks they aren’t)<p>In practice, we still have work to do. We currently delay inference if the model predicts a < 15% chance the user is done speaking. This threshold misses a bunch of middle-of-the-pack probabilities.<p>Next steps are improving the model accuracy, tuning performance, and expanding to support more languages (it only supports English right now). Separately, we’re starting to explore an audio-based model that considers not just what someone says but how they say it, which can be used with natively multimodal models like GPT-4o that directly process and generate audio.<p>Code here: <a href="https://github.com/livekit/agents/tree/main/livekit-plugins/livekit-plugins-turn-detector">https://github.com/livekit/agents/tree/main/livekit-plugins/...</a><p>Let us know what you think!</p>
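<p>To make the mechanics concrete, here’s a rough sketch of how a VAD silence timer and a text-based end-of-utterance probability could be combined. It’s illustrative only - predict_eou(), the thresholds, and the silence windows are hypothetical stand-ins, not the plugin’s actual API:</p>
<pre><code>import time

# Illustrative sketch: combine VAD silence with a text-based end-of-utterance
# (EOU) probability. All names and numbers here are hypothetical stand-ins.
EOU_THRESHOLD = 0.15       # below this, assume the user is mid-thought
BASE_SILENCE_S = 0.5       # normal silence window before ending the turn
EXTENDED_SILENCE_S = 3.0   # longer window when the model thinks they aren't done


def predict_eou(transcript: str) -> float:
    """Hypothetical call into a text-based turn-detection model.

    Returns the probability that `transcript` is a completed utterance.
    """
    raise NotImplementedError


def end_of_turn(transcript: str, silence_started_at: float) -> bool:
    """Decide whether the user's turn is over, given the transcript so far
    and the time at which VAD last detected the start of silence."""
    silence_s = time.monotonic() - silence_started_at
    p_done = predict_eou(transcript)

    if p_done < EOU_THRESHOLD:
        # Model thinks the user is probably mid-thought: require a much
        # longer pause before committing to a response.
        return silence_s >= EXTENDED_SILENCE_S
    # Otherwise fall back to the usual VAD silence window.
    return silence_s >= BASE_SILENCE_S
</code></pre>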
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=42497868">https://news.ycombinator.com/item?id=42497868</a></p>
<p>Points: 8</p>
<p># Comments: 1</p>
]]></description><pubDate>Mon, 23 Dec 2024 21:46:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=42497868</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=42497868</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42497868</guid></item><item><title><![CDATA[New comment by russ in "CodeMic: A new way to talk about code"]]></title><description><![CDATA[
<p>Yeah, that would be really handy too.</p>
]]></description><pubDate>Mon, 23 Dec 2024 16:47:35 +0000</pubDate><link>https://news.ycombinator.com/item?id=42495722</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=42495722</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42495722</guid></item><item><title><![CDATA[New comment by russ in "CodeMic: A new way to talk about code"]]></title><description><![CDATA[
<p>This is very cool! I’ve wanted something like CodeMic for a long time.<p>Back when I was at Twitter, we used Review Board for code reviews (this was in 2009, before GH was a thing for most companies). It was tough to thoughtfully review large branches, especially for parts of the codebase that I wasn’t familiar with. I remember thinking, if I could somehow record the development process for a PR I was reviewing, it would be easier to understand what the submitter was trying to accomplish and how they went about doing so. I found myself reviewing code style more than functionality, architecture, or design.<p>I watched most of the intro video, but didn’t go deeper on the site. Does CM integrate easily into the code review/PR process? I suppose I could just attach a link in any PR description?<p>Great work!</p>
]]></description><pubDate>Sun, 22 Dec 2024 21:43:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=42489386</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=42489386</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42489386</guid></item><item><title><![CDATA[New comment by russ in "Codenames App, or my biggest project so far"]]></title><description><![CDATA[
<p>You got it! Hope you have some fun with it. :)</p>
]]></description><pubDate>Mon, 16 Dec 2024 02:48:49 +0000</pubDate><link>https://news.ycombinator.com/item?id=42427574</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=42427574</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42427574</guid></item><item><title><![CDATA[New comment by russ in "Codenames App, or my biggest project so far"]]></title><description><![CDATA[
<p>Haven’t played Codenames in a long while, but made this 8 years ago to play with family and friends on TVs. Right in time for the holidays!<p>demo: <a href="https://dsa.github.io" rel="nofollow">https://dsa.github.io</a><p>code: <a href="https://github.com/dsa/dsa.github.io">https://github.com/dsa/dsa.github.io</a></p>
]]></description><pubDate>Mon, 09 Dec 2024 00:14:52 +0000</pubDate><link>https://news.ycombinator.com/item?id=42361933</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=42361933</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42361933</guid></item><item><title><![CDATA[New comment by russ in "Show HN: OnAir – create link, receive calls"]]></title><description><![CDATA[
<p>Haha yup, I’m that Russ. Really appreciate your kind words. <3<p>I’ll shoot you an email and let’s chat!</p>
]]></description><pubDate>Fri, 15 Nov 2024 21:19:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=42151219</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=42151219</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42151219</guid></item><item><title><![CDATA[New comment by russ in "Show HN: OnAir – create link, receive calls"]]></title><description><![CDATA[
<p>This is super cool. One neat idea: when I'm in offline mode, I can clone my voice,  provide some context data/sources, and have my AI clone answer calls for me. It can give me a summary of conversations it had each day and allow me to follow up.</p>
]]></description><pubDate>Fri, 15 Nov 2024 18:02:24 +0000</pubDate><link>https://news.ycombinator.com/item?id=42149291</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=42149291</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=42149291</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>Heh, actually I'm pretty sure I've come across your X profile before. :) You're definitely in a small minority of folks with a deep(er) understanding of WebRTC.</p>
]]></description><pubDate>Sat, 05 Oct 2024 22:19:56 +0000</pubDate><link>https://news.ycombinator.com/item?id=41753341</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41753341</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41753341</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>You really don’t need to know about WebRTC at all when you use LiveKit. That’s largely thanks to the SDKs abstracting away all the complexity. Having good SDKs that work across every platform with consistent APIs is more valuable than the SFU imo. There are other options for SFUs and folks like Signal have rolled their own. Try to get WebRTC running on Apple Vision Pro or tvOS and let me know if that’s no big deal.</p>
]]></description><pubDate>Sat, 05 Oct 2024 19:57:01 +0000</pubDate><link>https://news.ycombinator.com/item?id=41752467</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41752467</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41752467</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>Doesn’t sound right. I’d love to dig into this some more. Would you mind shooting me a DM on X? @dsa</p>
]]></description><pubDate>Sat, 05 Oct 2024 19:51:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=41752432</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41752432</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41752432</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>Field CTO — hi @Sean-Der :wave:<p>Fractional CTO sounds like a disaster lol</p>
]]></description><pubDate>Sat, 05 Oct 2024 17:50:41 +0000</pubDate><link>https://news.ycombinator.com/item?id=41751599</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41751599</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41751599</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>Which components feel ad hoc?<p>In most real applications, the agent has additional logic (function calling, RAG, etc) beyond simply relaying a stream to the model server. In those cases, you want it to be a separate service/component that can be independently scaled.</p>
]]></description><pubDate>Sat, 05 Oct 2024 14:59:32 +0000</pubDate><link>https://news.ycombinator.com/item?id=41750340</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41750340</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41750340</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>There’s Ultravox as well (from one of the creators of WebRTC):
<a href="https://github.com/fixie-ai/ultravox">https://github.com/fixie-ai/ultravox</a><p>Their model builds a speech-to-speech layer into Llama. Last I checked they have the audio-in part working and they’re working on the audio-out piece.</p>
]]></description><pubDate>Sat, 05 Oct 2024 14:51:50 +0000</pubDate><link>https://news.ycombinator.com/item?id=41750286</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41750286</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41750286</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>There’s a ton of complexity under the “relatively simple use case” when you get to a global, 200M+ user scale.</p>
]]></description><pubDate>Sat, 05 Oct 2024 02:47:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=41747366</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41747366</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41747366</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>50% human speaking at $0.06/minute of tokens<p>50% AI speaking at $0.24/minute of tokens<p>we (LiveKit Cloud) charge ~$0.0005/minute for each participant (in this case there would be 2)<p>So blended is $0.151/minute</p>
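<p>A quick sketch of that arithmetic, using the same figures as above:</p>
<pre><code># Blended per-minute cost, using the figures quoted above.
human_rate = 0.06      # $/min of tokens while the human is speaking
ai_rate = 0.24         # $/min of tokens while the AI is speaking
livekit_rate = 0.0005  # $/min per participant on LiveKit Cloud
participants = 2       # the user and the agent

blended = 0.5 * human_rate + 0.5 * ai_rate + participants * livekit_rate
print(f"${blended:.3f}/minute")  # -> $0.151/minute
</code></pre>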
]]></description><pubDate>Sat, 05 Oct 2024 01:14:23 +0000</pubDate><link>https://news.ycombinator.com/item?id=41746982</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41746982</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41746982</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>We had our playground (<a href="https://playground.livekit.io" rel="nofollow">https://playground.livekit.io</a>) up for a few days using our key. Def racked up a $$$$ bill!</p>
]]></description><pubDate>Fri, 04 Oct 2024 23:32:19 +0000</pubDate><link>https://news.ycombinator.com/item?id=41746455</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41746455</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41746455</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>I had no idea! <3 Thank you for sharing this, made my weekend.</p>
]]></description><pubDate>Fri, 04 Oct 2024 23:24:31 +0000</pubDate><link>https://news.ycombinator.com/item?id=41746424</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41746424</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41746424</guid></item><item><title><![CDATA[New comment by russ in "Show HN: Open source framework OpenAI uses for Advanced Voice"]]></title><description><![CDATA[
<p>It's using the same model/engine. I don't have knowledge of the internals, but there's a different subsystem/set of dedicated resources for API traffic versus first-party apps.<p>One thing to note: there is no separate TTS phase here; it happens internally within GPT-4o, in both the Realtime API and Advanced Voice.</p>
]]></description><pubDate>Fri, 04 Oct 2024 23:23:53 +0000</pubDate><link>https://news.ycombinator.com/item?id=41746421</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41746421</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41746421</guid></item><item><title><![CDATA[Show HN: Open source framework OpenAI uses for Advanced Voice]]></title><description><![CDATA[
<p>Hey HN, we've been working with OpenAI for the past few months on the new Realtime API.<p>The goal is to give everyone access to the same stack that underpins Advanced Voice in the ChatGPT app.<p>Under the hood it works like this:
- A user's speech is captured by a LiveKit client SDK in the ChatGPT app
- Their speech is streamed using WebRTC to OpenAI’s voice agent
- The agent relays the speech prompt over websocket to GPT-4o
- GPT-4o runs inference and streams speech packets (over websocket) back to the agent
- The agent relays generated speech using WebRTC back to the user’s device<p>The Realtime API that OpenAI launched is the websocket interface to GPT-4o. This backend framework covers the voice agent portion. Besides having additional logic like function calling, the agent fundamentally proxies WebRTC to websocket.<p>The reason for this is that websocket isn’t the best choice for client-server communication. The vast majority of packet loss occurs between the server and the client device, and websocket doesn’t provide programmatic control or intervention in lossy network environments like WiFi or cellular. Packet loss leads to higher latency and choppy or garbled audio.</p>
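<p>A minimal sketch of that relay pattern, assuming hypothetical WebRTC helpers (read_user_audio and play_to_user are stand-ins, not LiveKit APIs) and the websockets Python package for the model-server side:</p>
<pre><code>import asyncio
import websockets  # third-party: pip install websockets


async def read_user_audio():
    """Hypothetical async iterator yielding the user's audio frames from WebRTC."""
    raise NotImplementedError
    yield b""  # marks this as a (stub) async generator


async def play_to_user(frame: bytes) -> None:
    """Hypothetical publish of a generated audio frame back over WebRTC."""
    raise NotImplementedError


async def relay(model_ws_url: str) -> None:
    # The agent sits between the user's WebRTC connection and the model
    # server's websocket, forwarding audio in both directions.
    async with websockets.connect(model_ws_url) as ws:
        async def uplink():
            # User speech: WebRTC -> websocket (to the model server)
            async for frame in read_user_audio():
                await ws.send(frame)

        async def downlink():
            # Generated speech: websocket -> WebRTC (back to the user)
            async for frame in ws:
                await play_to_user(frame)

        await asyncio.gather(uplink(), downlink())
</code></pre>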
<hr>
<p>Comments URL: <a href="https://news.ycombinator.com/item?id=41743327">https://news.ycombinator.com/item?id=41743327</a></p>
<p>Points: 266</p>
<p># Comments: 61</p>
]]></description><pubDate>Fri, 04 Oct 2024 17:01:04 +0000</pubDate><link>https://github.com/livekit/agents</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41743327</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41743327</guid></item><item><title><![CDATA[New comment by russ in "Cerebras Inference: AI at Instant Speed"]]></title><description><![CDATA[
<p>It’s insanely fast.<p>Here’s an AI voice assistant I built that uses it:<p><a href="https://cerebras.vercel.app" rel="nofollow">https://cerebras.vercel.app</a></p>
]]></description><pubDate>Tue, 27 Aug 2024 18:56:16 +0000</pubDate><link>https://news.ycombinator.com/item?id=41371389</link><dc:creator>russ</dc:creator><comments>https://news.ycombinator.com/item?id=41371389</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=41371389</guid></item></channel></rss>