feat: Add SenseVoice/FunASR as STT extension — 5x faster than Whisper, emotion detection #2175

@LauraGPT

Feature request

TEN Framework is built for conversational voice AI agents, where ASR latency directly affects conversation quality. SenseVoice (8.3K+ stars) could significantly reduce the STT bottleneck.

Why SenseVoice for TEN

  1. 5x faster than Whisper — lower latency for more natural voice conversations
  2. Non-autoregressive — single forward pass, predictable and consistent latency
  3. Built-in emotion detection — agents can adapt responses based on user's emotional state
  4. Audio event detection — detect laughter, music, etc. for richer context
  5. 50+ languages with auto-detection — multilingual agent support out of the box
  6. Built-in VAD (FSMN-VAD) — handles silence detection efficiently

Integration

OpenAI-compatible API

pip install funasr
funasr-server --device cuda
# Serves at http://localhost:8000/v1/audio/transcriptions
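
A minimal client sketch for the endpoint above, using only the standard library. It assumes the server accepts the same multipart fields as OpenAI's transcription API (`model` and `file`) and returns JSON with a `text` key; the model name `SenseVoiceSmall` and the helper names are placeholders, not confirmed values.

```python
import json
import os
import urllib.request
import uuid


def build_transcription_request(url, filename, audio_bytes, model="SenseVoiceSmall"):
    """Build a multipart/form-data POST shaped like OpenAI's transcription API.

    The model name is a placeholder; use whatever the server expects.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url,
        data=head + audio_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(url, path):
    """Send an audio file and return the transcript (assumes a JSON 'text' field)."""
    with open(path, "rb") as f:
        req = build_transcription_request(url, os.path.basename(path), f.read())
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text"]
```

Usage would look like `transcribe("http://localhost:8000/v1/audio/transcriptions", "utterance.wav")`, where the file name is illustrative.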

Python API (for direct integration)

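from funasr import AutoModel

# Load SenseVoice-Small with FSMN-VAD to segment long audio
model = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad")
# audio_chunk: a path to an audio file or an array of audio samples
result = model.generate(input=audio_chunk)
# Returns a list of results; result[0]["text"] holds the transcript with language, emotion, and audio-event tags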
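
To show how an agent might consume the language, emotion, and event information, here is a small parsing sketch. It assumes the returned text carries leading tags of the form `<|en|><|HAPPY|><|Speech|>` in that order (language, emotion, event); `parse_rich_text` is a hypothetical helper, and the tag order should be checked against real model output.

```python
import re

# Matches SenseVoice-style tags such as <|en|> or <|HAPPY|> (assumed format).
TAG = re.compile(r"<\|([^|]*)\|>")


def parse_rich_text(raw):
    """Split tagged output into language, emotion, event, and plain text.

    Assumes the first three tags are language, emotion, and audio event.
    Missing tags are returned as None.
    """
    tags = TAG.findall(raw)
    text = TAG.sub("", raw).strip()
    language, emotion, event = (tags + [None] * 3)[:3]
    return {"language": language, "emotion": emotion, "event": event, "text": text}
```

FunASR also ships a `rich_transcription_postprocess` helper in `funasr.utils.postprocess_utils` for turning the tagged string into display text, which may be preferable when only the transcript is needed.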

Streaming (for real-time agents)

FunASR also supports streaming for low-latency scenarios, both over WebSocket and directly through the Python API:

from funasr import AutoModel
# chunk_size=[0, 10, 5] corresponds to 600 ms chunks in FunASR's streaming example
model = AutoModel(model="paraformer-zh-streaming", chunk_size=[0, 10, 5])
# Processes audio chunks in real time
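
A sketch of how an extension might slice incoming audio into chunks for the streaming model. It assumes 16 kHz mono samples and 960 samples (60 ms) per chunk unit, as in FunASR's streaming example; `iter_chunks` is a hypothetical helper, not part of FunASR.

```python
def iter_chunks(samples, chunk_size=(0, 10, 5), samples_per_unit=960):
    """Yield (chunk, is_final) pairs sized for a streaming recognizer.

    chunk_size[1] * samples_per_unit gives the stride: 10 * 960 = 9600
    samples, i.e. 600 ms at an assumed 16 kHz sample rate.
    """
    stride = chunk_size[1] * samples_per_unit
    total = max(1, -(-len(samples) // stride))  # ceiling division
    for i in range(total):
        chunk = samples[i * stride:(i + 1) * stride]
        yield chunk, i == total - 1
```

Each pair would then be passed to something like `model.generate(input=chunk, cache=cache, is_final=is_final)`, with `cache` being a dict reused across calls; the exact keyword arguments should be confirmed against the FunASR documentation.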
