Related work: FunASR SenseVoice (multi-task speech model with emotion recognition) #36

Hi! This is an interesting open-source speech language model.

I wanted to share a related project that might be useful for comparison or collaboration:

## SenseVoice

[SenseVoice](https://github.com/FunAudioLLM/SenseVoice) is a multi-task speech foundation model that handles:
- **ASR** (50+ languages)
- **Speech emotion recognition** (happy, sad, angry, neutral, etc.)
- **Audio event detection** (laughter, applause, music, etc.)

All three tasks run in a single model with very low inference latency (~70 ms to process 10 s of audio).
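Because one model emits the transcript together with language, emotion and event labels, downstream code usually needs to split these apart. The sketch below assumes that the raw (not post-processed) output prefixes each segment with `<|...|>` tags, such as `<|en|><|NEUTRAL|><|Speech|><|withitn|>`, in the order language, emotion, event, ITN flag. Verify this against the model version you run. `parse_sensevoice_output` and `Segment` are hypothetical names and not part of FunASR:

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Matches one "<|label|>" tag; the label is captured without the delimiters.
TAG_RE = re.compile(r"<\|([^|]*)\|>")


@dataclass
class Segment:
    # Field order follows the ASSUMED tag layout: language, emotion, event, ITN flag.
    language: str
    emotion: str
    event: str
    itn: str
    text: str


def parse_sensevoice_output(raw: str) -> list[Segment]:
    """Hypothetical helper: split tagged SenseVoice-style output into segments.

    Each segment is assumed to start with at least four consecutive tags,
    followed by plain text that runs until the next tag (or end of string).
    """
    raw = raw.strip()
    segments: list[Segment] = []
    pos = 0
    while pos < len(raw):
        # Consume every consecutive tag at the current position.
        tags = []
        while (m := TAG_RE.match(raw, pos)) is not None:
            tags.append(m.group(1))
            pos = m.end()
        if len(tags) < 4:
            raise ValueError(f"expected >= 4 leading tags at offset {pos}, got {tags}")
        # The segment text extends up to the next tag, if any.
        nxt = TAG_RE.search(raw, pos)
        end = nxt.start() if nxt else len(raw)
        segments.append(Segment(*tags[:4], text=raw[pos:end].strip()))
        pos = end
    return segments


if __name__ == "__main__":
    example = (
        "<|en|><|HAPPY|><|Speech|><|withitn|>That was a great talk!"
        "<|en|><|NEUTRAL|><|Applause|><|woitn|>"
    )
    for seg in parse_sensevoice_output(example):
        print(seg)
```

In practice, `raw` would be the `text` field of a FunASR `AutoModel(...).generate(...)` result. The SenseVoice README also points to `rich_transcription_postprocess` for a display-ready string, so a parser like this is only worth having if you want the tags as separate fields.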

## Comparison areas

| Aspect | Tada | SenseVoice |
|--|--|--|
| Architecture | Speech LM | Multi-task encoder |
| Tasks | Speech understanding | ASR + emotion + events |
| Latency | - | ~70 ms per 10 s of audio |
| Languages | - | 50+ |
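To put the latency figure on a common scale, it can be expressed as a real-time factor (RTF). This assumes the ~70 ms is end-to-end inference time for a 10 s clip; the hardware behind that number is not stated here:

```math
\mathrm{RTF} = \frac{t_{\text{processing}}}{t_{\text{audio}}} \approx \frac{0.07\ \text{s}}{10\ \text{s}} = 0.007,
\qquad \frac{1}{\mathrm{RTF}} \approx 143
```

That is roughly 140 times faster than real time. Measuring Tada's RTF on the same hardware would make the Latency row directly comparable.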

## Also relevant

- **Fun-ASR-Nano** — Encoder + LLM speech model (audio encoder + Qwen2.5-0.5B): https://github.com/FunAudioLLM/Fun-ASR
- **FunASR** — Full ASR toolkit: https://github.com/modelscope/FunASR (16K+ stars)

Paper: https://arxiv.org/abs/2407.04051
