Conversational Module

We want to build a simplified conversational pipeline that lets users talk to and give orders to Stompy on device. We also believe that, in the long term, this could become a standalone product.

High level observations:

  1. ASR is 'solved' for non-noisy, single-user speech; a multi-speaker, noisy environment with intelligent VAD is a hard problem
  2. LLMs are 'solved', with 3B models being smart enough to handle orders and hold a laid-back conversation
  3. TTS is 'solved' for non-conversational use cases (reading news, books) with large models; a fast, 'intelligent' conversational voice is an open problem
  4. Building a demo is simple; creating a setup that is robust to noise and edge cases is not

Pipeline parts

ASR

  1. Latency - most systems, properly set up, can get below 100 ms, but knowing when to stop listening is a VAD problem
  2. VAD - Voice Activity Detection (fundamental to any conversational application), typically benchmarked against https://github.com/snakers4/silero-vad; most systems are very simple and super stupid (ML à la 2015), see the sketch after this list
  3. Multispeaker - most models are clueless in multi-speaker environments
  4. Barge-in - the option to interrupt the system; any noisy environment completely disrupts it, and none of the existing ASR systems are trained to handle this properly (ASR does not understand the context)
  5. Noise - in-house data is fundamental for fine-tuning a big model to the specific environment (20 hours are often enough to ground it in the use case)
  6. Keyword detection - models that focus only on spotting one word (Alexa, Siri)
  7. Best conversational production systems - only a few: Google, RIVA, Deepgram
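As a point of reference, here is a minimal sketch of running Silero VAD on a single recording via torch.hub. The file name and sampling rate are placeholders, and it assumes torch and torchaudio are installed; this is offline segmentation, not the streaming case discussed above.

```python
import torch

# Load the Silero VAD model and helper functions from torch.hub
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, _, read_audio, _, _ = utils

# Read a mono 16 kHz recording (placeholder file name)
wav = read_audio('utterance.wav', sampling_rate=16000)

# Returns a list of {'start': ..., 'end': ...} sample indices for speech regions
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech_timestamps)
```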

LLM

  1. Vision support is starting to be available off the shelf

TTS

  1. Quality - conversational vs. reading ability; most models don't really understand the context
  2. Latency - most open-source models are either fast (with poor quality) or slow (with good quality)
  3. Streaming option - only now are the first truly streaming-oriented models appearing

Current Setup

Building on the Jetson goes through a tight Docker Compose setup due to multiple conflicting requirements. Dusty created a fantastic repository of many e2e setups at https://www.jetson-ai-lab.com/tutorial-intro.html. However, ASR and TTS there are typically based on RIVA, which is not the setup you would want to keep/support. The build process for each model requires a lot of small changes to the setup. Currently, having an e2e setup requires composing different images into a pipeline loop.

0. Audio
The Linux interplay between ALSA, PulseAudio, and sudo vs. user sessions is ironically the most demanding part. See https://0pointer.de/blog/projects/guide-to-sound-apis.html or https://www.reddit.com/r/archlinux/comments/ae67oa/lets_talk_about_how_the_linuxarch_sound/. The most stable approach is to rely on PyAudio to handle dmix issues and frequent changes in hardware availability.
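A minimal sketch of probing and opening a capture device with PyAudio follows. The buffer size and stream parameters are arbitrary choices for illustration; the point is that device enumeration has to be re-run whenever hardware availability changes.

```python
import pyaudio

pa = pyaudio.PyAudio()

# Enumerate input-capable devices; under ALSA/PulseAudio this list can change at runtime
for idx in range(pa.get_device_count()):
    info = pa.get_device_info_by_index(idx)
    if info.get('maxInputChannels', 0) > 0:
        print(idx, info['name'], int(info['defaultSampleRate']))

# Open a mono 16 kHz capture stream and read one buffer
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
                 input=True, frames_per_buffer=1024)
chunk = stream.read(1024, exception_on_overflow=False)

stream.stop_stream()
stream.close()
pa.terminate()
```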

1. ASR

  • Whisper is good but nowhere near a production-ready ASR system (see the offline sketch below)
  • RIVA is the best option (production-ready, reliable), but it requires per-GPU licensing, which is not a good model long term
  • Any other options are either bad or not production ready
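For comparison, transcribing a single clean utterance with the open-source openai-whisper package looks roughly like this. The model size and file name are placeholders, and this is offline batch transcription rather than the streaming, VAD-driven setup the pipeline needs.

```python
import whisper  # pip install openai-whisper

# "small" is an assumption; smaller checkpoints fit the Jetson better
model = whisper.load_model("small")

# Offline transcription of one recorded utterance (placeholder file name)
result = model.transcribe("utterance.wav", language="en")
print(result["text"])
```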

2. LLM

  • Rely on ollama and NanoLLM, which support the latest SLMs (below 7B); see the sketch below
  • Fast and well behaved under the onnxruntime library
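A minimal sketch of calling a small model through the ollama Python client. The model name is a placeholder for whichever sub-7B SLM has been pulled, and it assumes an ollama server is running locally.

```python
import ollama  # pip install ollama; assumes `ollama serve` is running

# 'phi3' is a placeholder for whatever small (<7B) model has been pulled
response = ollama.chat(
    model='phi3',
    messages=[
        {'role': 'system', 'content': 'You are Stompy, a helpful humanoid robot.'},
        {'role': 'user', 'content': 'Pick up the red cup from the table.'},
    ],
)
print(response['message']['content'])
```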

3. TTS

  • FastPitch (RIVA) - fast but old-style and poor quality
  • StyleTTS2 - really good quality; requires a big clean-up to make it TensorRT-ready, currently 200 ms on an A100
  • xTTS - OK quality but slow even with TensorRT (dropped)
  • PiperTTS (old VITS model) - fast but old-style and poor quality; a minimal usage sketch is below
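For the fast-but-basic end of that list, here is a minimal sketch of driving the piper CLI from Python. The flags follow Piper's documented command-line usage, but the voice model and output paths are placeholders.

```python
import subprocess

TEXT = "Picking up the red cup now."
# Placeholder path to a downloaded Piper voice model
VOICE = "en_US-lessac-medium.onnx"

# Piper reads text on stdin and writes a WAV file to --output_file
subprocess.run(
    ["piper", "--model", VOICE, "--output_file", "reply.wav"],
    input=TEXT.encode("utf-8"),
    check=True,
)
```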

Long-term bets

  1. E2E