Hi! This is an interesting open-source speech language model.
I wanted to share a related project that might be useful for comparison or collaboration:
SenseVoice
SenseVoice is a multi-task speech foundation model that handles:
- ASR (50+ languages)
- Speech emotion recognition (happy, sad, angry, neutral, etc.)
- Audio event detection (laughter, applause, music, etc.)
All of this runs in a single model with very low latency (about 70 ms to process 10 s of audio).
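To put that latency figure in perspective, here is a small sketch that converts it into a real-time factor (RTF). The helper name `real_time_factor` is my own for illustration and is not part of SenseVoice, and the inputs are just the approximate numbers quoted above, not a new measurement.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Approximate figures quoted above: ~70 ms of compute for 10 s of audio.
rtf = real_time_factor(0.070, 10.0)
speedup = 1.0 / rtf
print(f"RTF = {rtf:.3f}, about {speedup:.0f}x faster than real time")
```

Under those assumed numbers the RTF comes out around 0.007, so actual results will depend on hardware and batch settings.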
Comparison areas
| | Tada | SenseVoice |
|---|---|---|
| Architecture | Speech LM | Multi-task encoder |
| Tasks | Speech understanding | ASR + emotion + events |
| Latency | - | ~70ms/10s |
| Languages | - | 50+ |
Also relevant
Paper: https://arxiv.org/abs/2407.04051