We want to build a simplified conversational pipeline that allows us to talk to Stompy and give it orders on-device. We also believe that, long term, this could be a standalone product.
High-level observations:
- ASR is 'solved' for clean, single-user speech; multi-speaker, noisy environments with intelligent VAD remain a hard problem
- LLMs are 'solved': 3B models are smart enough to handle orders and laid-back conversation
- TTS is 'solved' for non-conversational use-cases (reading news, books) with large models; fast, 'intelligent' conversational voice is still an open problem
- Building a demo is simple; building a setup that is robust to noise and edge cases is not
Pipeline parts
ASR
- Latency - most systems in a proper setup can get below 100 ms, but knowing when to stop listening is a VAD problem
- VAD - Voice Activity Detection (fundamental to any conversational application), typically benchmarked against https://github.com/snakers4/silero-vad; most existing systems are very simple and pretty dumb (ML à la 2015) - see the sketch after this list
- Multispeaker - most models are clueless in environments with many speakers
- Barge-in - the option to interrupt the system; in any noisy environment it completely disrupts the pipeline, and none of the existing ASR systems are trained to deal with it properly (ASR doesn't understand the context)
- Noise - in-house data is fundamental for fine-tuning a big model to the specific environment (20 hours are often enough to ground it in the use-case)
- Keyword detection - models that focus on spotting only one wake word (Alexa, Siri)
- Best conversational production systems - only a few: Google, RIVA, Deepgram
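As a quick reference, the silero-vad baseline mentioned above can be exercised in a few lines of PyTorch. A minimal sketch, assuming a 16 kHz mono clip at a placeholder path:

```python
import torch

# Load the Silero VAD model and its helper utilities via torch.hub.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, VADIterator, _ = utils

# Offline: find speech segments in a pre-recorded clip (placeholder path).
wav = read_audio("sample.wav", sampling_rate=16000)
print(get_speech_timestamps(wav, model, sampling_rate=16000))

# Streaming: feed fixed-size chunks and get start/end events as they happen.
vad = VADIterator(model, sampling_rate=16000)
for i in range(0, len(wav), 512):        # 512 samples = 32 ms at 16 kHz
    chunk = wav[i:i + 512]
    if len(chunk) < 512:
        break
    event = vad(chunk, return_seconds=True)
    if event:
        print(event)                     # {'start': t} or {'end': t}
```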
LLM
- Vision support is starting to become available off-the-shelf
TTS
- Quality - conversational vs. reading ability; most models don't really understand the context
- Latency - most open-source models are either fast (with dated quality) or high quality but slow
- Streaming option - only now are the first truly streaming-oriented models appearing (see the playback sketch after this list)
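To make the streaming point concrete, the sketch below starts playback as soon as the first audio chunk arrives rather than waiting for the full utterance. `synthesize_stream()` is a hypothetical generator standing in for whichever streaming TTS backend we pick:

```python
import pyaudio

def synthesize_stream(text: str):
    """Hypothetical placeholder: yields 16-bit PCM chunks as the TTS model produces them."""
    raise NotImplementedError

def speak(text: str, rate: int = 22050) -> None:
    # Open the output device once and write chunks as they arrive, so the
    # perceived latency is time-to-first-chunk, not time-to-full-utterance.
    pa = pyaudio.PyAudio()
    out = pa.open(format=pyaudio.paInt16, channels=1, rate=rate, output=True)
    try:
        for chunk in synthesize_stream(text):
            out.write(chunk)
    finally:
        out.stop_stream()
        out.close()
        pa.terminate()
```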
Current Setup
Building on the Jetson goes through a tight Docker Compose setup due to multiple conflicting requirements. Dusty has created a fantastic repository of many e2e setups at https://www.jetson-ai-lab.com/tutorial-intro.html. However, ASR and TTS there are typically based on RIVA, which is not the setup you would want to keep/support. The building process for each model requires a lot of small changes to the setup. Currently, getting an e2e setup means composing different images into a pipeline loop.
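The loop itself is conceptually simple. A minimal sketch of its shape, where `transcribe`, `respond`, and `speak` are hypothetical wrappers around whichever containers end up serving ASR, the LLM, and TTS:

```python
from typing import Iterable, Optional

def transcribe(audio_chunk: bytes) -> Optional[str]:
    """Hypothetical ASR wrapper: returns a transcript once VAD decides the user stopped."""
    raise NotImplementedError

def respond(transcript: str) -> str:
    """Hypothetical LLM wrapper: turns a transcript into a reply or command acknowledgement."""
    raise NotImplementedError

def speak(reply: str) -> None:
    """Hypothetical TTS wrapper: synthesizes and plays the reply."""
    raise NotImplementedError

def pipeline_loop(mic_chunks: Iterable[bytes]) -> None:
    # One conversational turn per utterance: listen until end-of-speech, think, answer, repeat.
    for chunk in mic_chunks:
        transcript = transcribe(chunk)
        if transcript:                   # VAD flagged the end of an utterance
            speak(respond(transcript))
```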
0. Audio
The Linux interaction between ALSA, PulseAudio, and sudo vs. user is ironically the most demanding part. See https://0pointer.de/blog/projects/guide-to-sound-apis.html or https://www.reddit.com/r/archlinux/comments/ae67oa/lets_talk_about_how_the_linuxarch_sound/.
The most stable approach is to rely on PyAudio to handle dmix issues and frequent changes in hardware availability; a minimal capture loop is sketched below.
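A minimal sketch of such a capture loop, assuming 16 kHz mono input and treating device drop-outs as something to retry rather than crash on (chunk size and retry delay are placeholders):

```python
import time
import pyaudio

RATE, CHUNK = 16000, 1024  # 16 kHz mono, ~64 ms per read

def capture_forever():
    pa = pyaudio.PyAudio()
    while True:
        try:
            stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                             input=True, frames_per_buffer=CHUNK)
            while True:
                # Don't raise on overflow: USB mics on the Jetson occasionally hiccup.
                yield stream.read(CHUNK, exception_on_overflow=False)
        except OSError:
            # Device vanished (unplugged, ALSA/dmix contention): wait and reopen.
            time.sleep(1.0)
```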
1. ASR
- Whisper is good but nothing close to a production-ready ASR system (baseline sketch below)
- RIVA is the best option (production-ready, reliable), but it requires per-GPU licensing, which is not a good model long-term
- Any other option is either bad or not production-ready
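For offline evaluation Whisper is still the easiest baseline to stand up. A minimal sketch using the openai-whisper package; model size and file path are placeholders:

```python
import whisper

# "base" is a small checkpoint that should fit on the Jetson GPU;
# larger checkpoints trade latency for accuracy.
model = whisper.load_model("base")

# Whole-file transcription: fine for evaluation, nowhere near a streaming production ASR.
result = model.transcribe("sample.wav", language="en")
print(result["text"])
```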
2. LLM
- Rely on ollama and NanoLLM, which support the latest SLMs (below 7B); see the API sketch below
- Fast and well-behaved under the onnxruntime library
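ollama exposes a local HTTP API, so wiring up the LLM stage is only a few lines. A sketch assuming the default port and a small model that has already been pulled (the model name is a placeholder):

```python
import requests

def respond(transcript: str, model: str = "phi3") -> str:
    # Non-streaming call to the local ollama server; setting "stream": True instead
    # would let TTS start on the first tokens rather than the full reply.
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": transcript, "stream": False},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()["response"]

print(respond("Stompy, wave your right arm."))
```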
3. TTS
- FastPitch (RIVA) - fast but old-style and poor quality
- StyleTTS2 - really good quality; requires a big clean-up to make it TensorRT-ready, currently 200 ms on an A100
- xTTS - OK quality but slow even with TensorRT (dropped)
- PiperTTS (old VITS model) - fast but old-style and poor quality (a simple latency harness is sketched below)
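When comparing these engines on the same hardware, a small harness like the one below keeps the numbers comparable. `synthesize()` is a hypothetical per-engine adapter, and this measurement may not match how the figures above were obtained:

```python
import time

def synthesize(text: str) -> bytes:
    """Hypothetical adapter: returns the full PCM waveform for one engine (StyleTTS2, Piper, ...)."""
    raise NotImplementedError

def bench(text: str, runs: int = 10) -> float:
    # Average wall-clock synthesis time per utterance, after one warm-up call
    # (model load, CUDA initialization) so we measure steady-state latency.
    synthesize(text)
    start = time.perf_counter()
    for _ in range(runs):
        synthesize(text)
    return (time.perf_counter() - start) / runs

print(f"{bench('Hello, I am Stompy.') * 1000:.0f} ms per utterance")
```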
Long-term bets
- E2E (end-to-end speech-to-speech models)