feat(doubao): Support Doubao Speech Large Model 2.0 (Seed-TTS) by chainsaid · Pull Request #41 · moeru-ai/unspeech

chainsaid · 2026-03-04T11:23:20Z

Description

This PR introduces support for Doubao Speech Large Model 2.0 (Doubao-Seed-TTS 2.0), enabling high-quality speech synthesis with enhanced control over pitch, emotion, and voice characteristics.

Key Changes

New Backend Implementation:
- Added pkg/backend/doubao package.
- Implemented HandleSpeech to support the Volcengine TTS API v3 protocol.
- Implemented HandleVoices to list available Doubao 2.0 voices.
Protocol Upgrades (v3):
- Added support for new audio parameters:
  - pitch (Pitch control)
  - emotion (Emotion control)
  - resource_id (Required for Seed-TTS 2.0, default set to seed-tts-2.0 in request body when specified).
Voice Configuration:
- Added voices.json containing 20+ new voices specifically for Doubao 2.0 (e.g., zh_female_vv_uranus_bigtts, zh_male_m191_uranus_bigtts).
- Included support for various scenarios: General, Role-play, Video Dubbing, and Multi-language (English).
Backend Registration:
- Registered the doubao backend in pkg/backend/backend.go.
- Supported aliases: doubao, bytedance, volcengine-doubao, doubao-tts.
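The alias handling listed above could be sketched as follows. This is a minimal illustration, not the code in pkg/backend/backend.go: the `canonicalBackend` helper, the lookup map, and the case-insensitive matching are all assumptions; only the four alias strings come from the description.

```go
package main

import (
	"fmt"
	"strings"
)

// doubaoAliases lists the backend names the PR description says are accepted.
var doubaoAliases = map[string]struct{}{
	"doubao":            {},
	"bytedance":         {},
	"volcengine-doubao": {},
	"doubao-tts":        {},
}

// canonicalBackend is a hypothetical helper: it returns "doubao" for any
// accepted alias and reports false for names this sketch does not know.
// Trimming and lowercasing the input is an assumption of the sketch.
func canonicalBackend(name string) (string, bool) {
	_, ok := doubaoAliases[strings.ToLower(strings.TrimSpace(name))]
	if !ok {
		return "", false
	}
	return "doubao", true
}

func main() {
	for _, name := range []string{"bytedance", "doubao-tts", "openai"} {
		backend, ok := canonicalBackend(name)
		fmt.Printf("%q -> %q (known: %v)\n", name, backend, ok)
	}
}
```

Resolving every alias to one canonical name keeps the Speech and Voices handlers from each carrying their own copy of the alias list.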

Usage Example

curl -X POST http://localhost:5933/v1/audio/speech \
  -H "Authorization: Bearer <YOUR_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "doubao-2.0",
    "backend": "doubao",
    "input": "你好，我是豆包语音大模型2.0",
    "voice": "zh_female_vv_uranus_bigtts",
    "response_format": "mp3",
    "extra_body": {
      "app": {
        "appid": "<YOUR_APPID>",
        "cluster": "volcano_tts"
      },
      "audio": {
        "pitch": 1.0,
        "emotion": "happy",
        "resource_id": "seed-tts-2.0"
      }
    }
  }' --output output.mp3
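For callers working in Go, the same request could be assembled roughly like this. It is a sketch that mirrors the curl example field for field; the `buildSpeechRequest` name and its parameters are illustrative, and the request is built but not sent so the example runs without a server.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// buildSpeechRequest assembles the same request as the curl example.
// Endpoint, field names, and values are copied from that example; the
// function name and parameters are illustrative.
func buildSpeechRequest(baseURL, token, appID, input string) (*http.Request, error) {
	payload := map[string]any{
		"model":           "doubao-2.0",
		"backend":         "doubao",
		"input":           input,
		"voice":           "zh_female_vv_uranus_bigtts",
		"response_format": "mp3",
		"extra_body": map[string]any{
			"app": map[string]any{
				"appid":   appID,
				"cluster": "volcano_tts",
			},
			"audio": map[string]any{
				"pitch":       1.0,
				"emotion":     "happy",
				"resource_id": "seed-tts-2.0",
			},
		},
	}
	body, err := json.Marshal(payload)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/audio/speech", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := buildSpeechRequest("http://localhost:5933", "<YOUR_TOKEN>", "<YOUR_APPID>", "你好")
	if err != nil {
		panic(err)
	}
	// Sending is left out so the sketch runs offline:
	// resp, err := http.DefaultClient.Do(req)
	fmt.Println(req.Method, req.URL.String())
}
```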

Verification

Verified v1/audio/speech endpoint with Doubao 2.0 voice (zh_female_vv_uranus_bigtts).
Verified api/voices endpoint returns the updated Doubao 2.0 voice list.
Confirmed correct JSON payload structure for Volcengine API v3 (including resource_id placement).

Related Issues

Addresses request for Doubao Speech Large Model 2.0 integration.
Reference docs: https://www.volcengine.com/docs/6561/1329505?lang=zh (API v3) & https://www.volcengine.com/docs/6561/1257544?lang=zh (Voices)

gemini-code-assist · 2026-03-04T11:24:41Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the application's text-to-speech capabilities by integrating the Doubao Speech Large Model 2.0. It introduces a dedicated backend for this service, supporting its advanced API protocol and a wide array of new voices. This enhancement allows users to leverage Doubao's high-quality speech synthesis with fine-grained control over audio characteristics, enriching the overall audio generation experience.

Highlights

New Backend Implementation: A new doubao backend package was added, implementing HandleSpeech for the Volcengine TTS API v3 protocol and HandleVoices to list available Doubao 2.0 voices.
Protocol Upgrades: Support for new audio parameters such as pitch, emotion, and resource_id (required for Seed-TTS 2.0) was introduced, aligning with API v3.
Expanded Voice Configuration: Over 20 new Doubao 2.0 voices were added via voices.json, covering various scenarios like general use, role-play, video dubbing, and multi-language (English) support.
Backend Registration: The new doubao backend was registered in pkg/backend/backend.go, supporting aliases including bytedance, volcengine-doubao, and doubao-tts.

Changelog

pkg/backend/backend.go
- Imported the new doubao backend package.
- Added routing logic to the Speech and Voices handlers to direct requests to the doubao backend based on specified aliases.
pkg/backend/doubao/speech.go
- Implemented the HandleSpeech function to process speech synthesis requests using the Volcengine TTS API v3 protocol.
- Defined request and response structures for the Doubao API, including support for AppID, UserID, RequestID, VoiceType, Encoding, SpeedRatio, Pitch, Emotion, and ResourceID.
- Handled token extraction, default parameter assignment (e.g., cluster, operation, speedRatio), and error handling for upstream API calls.
- Decoded base64 audio data from the Doubao API response and returned it as an audio stream.
pkg/backend/doubao/voices.go
- Implemented the HandleVoices function to provide a list of available Doubao 2.0 voices.
- Embedded voices.json to load voice configurations.
- Mapped the internal voice structure to the generic types.Voice format, including supported formats (MP3, PCM, WAV) and languages.
pkg/backend/doubao/voices.json
- Added a JSON file containing detailed configurations for over 20 Doubao 2.0 voices, specifying their names, models, scenarios, languages, bitrates, and formats.
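The base64 decoding step mentioned for speech.go might look something like the sketch below. The `upstreamResponse` struct and its field names are assumptions made for illustration, not the structures defined in the PR, and the sample body in `main` is made up.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
)

// upstreamResponse is an assumed shape for the upstream reply; only the
// base64-encoded audio field matters for this sketch.
type upstreamResponse struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
	Data    string `json:"data"` // base64-encoded audio (assumed field name)
}

// decodeAudio parses the JSON body and returns the raw audio bytes.
func decodeAudio(body []byte) ([]byte, error) {
	var resp upstreamResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("parse upstream response: %w", err)
	}
	if resp.Data == "" {
		return nil, errors.New("upstream response contains no audio data: " + resp.Message)
	}
	audio, err := base64.StdEncoding.DecodeString(resp.Data)
	if err != nil {
		return nil, fmt.Errorf("decode base64 audio: %w", err)
	}
	return audio, nil
}

func main() {
	// Made-up response body; "aGVsbG8=" stands in for real audio data.
	body := []byte(`{"code":0,"message":"ok","data":"aGVsbG8="}`)
	audio, err := decodeAudio(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(audio), "bytes decoded")
}
```

Treating an empty data field as an error keeps the handler from returning a zero-length audio stream with a 200 status.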

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces support for Doubao Speech Large Model 2.0, which is a significant enhancement. The new backend implementation, protocol upgrades, and voice configurations are well-structured. However, I've identified a few areas for improvement related to API request consistency, response header accuracy, and data redundancy that should be addressed to ensure robustness and correctness.

gemini-code-assist · 2026-03-04T11:26:22Z

+			LoudnessRatio:    utils.GetByJSONPath[*float64](opts.ExtraBody, "{ .audio.loudness_ratio }"),
+			Pitch:            utils.GetByJSONPath[*float64](opts.ExtraBody, "{ .audio.pitch }"),
+			Emotion:          utils.GetByJSONPath[*string](opts.ExtraBody, "{ .audio.emotion }"),
+			ResourceID:       utils.GetByJSONPath[*string](opts.ExtraBody, "{ .audio.resource_id }"),


The PR description states that resource_id is required for Seed-TTS 2.0 and should default to seed-tts-2.0 if not specified. Currently, the code only passes ResourceID if it's present in opts.ExtraBody. If it's a mandatory field, it should be explicitly defaulted when missing to prevent potential API errors.

Suggested change

- ResourceID: utils.GetByJSONPath[*string](opts.ExtraBody, "{ .audio.resource_id }"),
+ ResourceID: lo.Ternary(utils.GetByJSONPath[*string](opts.ExtraBody, "{ .audio.resource_id }") != nil, utils.GetByJSONPath[*string](opts.ExtraBody, "{ .audio.resource_id }"), lo.ToPtr("seed-tts-2.0")),
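The suggestion above performs the JSON path lookup twice. One alternative, sketched here with a hypothetical generic helper named `orDefault` (not part of the PR or of any library it uses), is to look the value up once and fall back only when the result is nil.

```go
package main

import "fmt"

// orDefault is a hypothetical helper: it returns value when it is non-nil
// and a pointer to a copy of fallback otherwise.
func orDefault[T any](value *T, fallback T) *T {
	if value != nil {
		return value
	}
	return &fallback
}

func main() {
	var missing *string // stands in for a lookup that found nothing
	custom := "custom-resource"

	fmt.Println(*orDefault(missing, "seed-tts-2.0"))
	fmt.Println(*orDefault(&custom, "seed-tts-2.0"))
}
```

In the handler this would wrap the single `utils.GetByJSONPath[*string](...)` call, assuming that call returns nil when the key is absent, as the review comment implies.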

gemini-code-assist · 2026-03-04T11:26:22Z

+	}
+
+	// Set authorization header - Doubao uses Bearer;token format
+	req.Header.Set("Authorization", "Bearer;"+token)


The Authorization header is constructed using Bearer; + token. Standard Bearer token authentication typically uses a space (Bearer ) as a separator. The strings.TrimPrefix on line 80 also uses Bearer (with a space). This inconsistency might lead to authentication failures with the upstream Doubao API if it expects the standard format.

Suggested change

- req.Header.Set("Authorization", "Bearer;"+token)
+ req.Header.Set("Authorization", "Bearer "+token)
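The two header formats under discussion differ only in the separator. The sketch below makes that separator an explicit parameter so the inbound parsing and the outbound header cannot drift apart silently. The helper names are hypothetical, and which separator the upstream API actually accepts is not settled here; it should be confirmed against the Volcengine documentation linked in the description.

```go
package main

import (
	"fmt"
	"strings"
)

// Separators discussed in the review. Which one the upstream API accepts is
// not established by this sketch.
const (
	sepSpace     = " "
	sepSemicolon = ";"
)

// extractToken strips the standard "Bearer " prefix from an inbound header.
func extractToken(header string) string {
	return strings.TrimSpace(strings.TrimPrefix(header, "Bearer"+sepSpace))
}

// upstreamAuthHeader builds the outbound header with an explicit separator.
func upstreamAuthHeader(token, sep string) string {
	return "Bearer" + sep + token
}

func main() {
	token := extractToken("Bearer abc123")
	fmt.Println(upstreamAuthHeader(token, sepSpace))
	fmt.Println(upstreamAuthHeader(token, sepSemicolon))
}
```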

gemini-code-assist · 2026-03-04T11:26:22Z

+		return mo.Err[any](apierrors.NewErrInternal().WithDetail(err.Error()).WithError(err).WithCaller())
+	}
+
+	return mo.Ok[any](c.Blob(http.StatusOK, "audio/mp3", audioBytes))


The c.Blob function hardcodes the Content-Type header to audio/mp3. However, the Encoding field in the request payload (DoubaoSpeechRequestAudio) is dynamically set based on opts.ResponseFormat, allowing clients to request different audio formats (e.g., WAV, PCM). If a client requests a format other than MP3, the response header will incorrectly state audio/mp3, which can cause issues for the client trying to process the audio.

Suggested change

- return mo.Ok[any](c.Blob(http.StatusOK, "audio/mp3", audioBytes))
+ return mo.Ok[any](c.Blob(http.StatusOK, "audio/"+lo.FromPtr(reqBody.Audio.Encoding), audioBytes))
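Concatenating "audio/" with the encoding name fixes the mismatch, but the result is not always a registered media type (MP3, for example, is registered as audio/mpeg). An explicit lookup is one alternative. This is a sketch under assumptions: the `contentTypeFor` helper is hypothetical, the encoding names are taken from the formats mentioned in this PR (MP3, PCM, WAV), and the fallback type is a design choice of the sketch.

```go
package main

import "fmt"

// contentTypeFor maps an encoding name to a Content-Type value.
// The set of encodings and the fallback are assumptions of this sketch.
func contentTypeFor(encoding string) string {
	switch encoding {
	case "mp3":
		return "audio/mpeg" // registered media type for MP3
	case "wav":
		return "audio/wav" // widely used, though not the registered name
	case "pcm":
		return "audio/pcm" // not registered; same value the concatenation yields
	default:
		return "application/octet-stream" // generic fallback for unknown encodings
	}
}

func main() {
	for _, enc := range []string{"mp3", "wav", "pcm", "flac"} {
		fmt.Printf("%s -> %s\n", enc, contentTypeFor(enc))
	}
}
```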

gemini-code-assist · 2026-03-04T11:26:22Z

+	Bitrate         string   `json:"bitrate"`
+	Format          string   `json:"format"`


The Bitrate and Format fields in the VoicesResponseItem struct are populated from voices.json but are not utilized when constructing the types.VoiceFormat array. This makes these fields redundant and potentially misleading. If they don't serve a specific purpose in defining the voice's primary format, they could be removed for clarity and to avoid unnecessary data.

type VoicesResponseItem struct {
	Name            string   `json:"name"`
	PreviewAudioURL string   `json:"preview_audio_url"`
	Model           string   `json:"model"`
	Voice           string   `json:"voice"`
	Scenarios       []string `json:"scenarios"`
	Language        string   `json:"language"`
}
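Dropping the two fields from the struct does not require editing voices.json, because encoding/json ignores object keys that have no matching struct field by default. The sketch below demonstrates this with a made-up entry; the `voiceItem` and `parseVoices` names are illustrative, and the sample values are not copied from the PR's voices.json.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// voiceItem mirrors the trimmed struct from the suggestion above.
type voiceItem struct {
	Name            string   `json:"name"`
	PreviewAudioURL string   `json:"preview_audio_url"`
	Model           string   `json:"model"`
	Voice           string   `json:"voice"`
	Scenarios       []string `json:"scenarios"`
	Language        string   `json:"language"`
}

// parseVoices decodes a voices.json-style array. Keys without a matching
// struct field (here "bitrate" and "format") are ignored by encoding/json.
func parseVoices(data []byte) ([]voiceItem, error) {
	var voices []voiceItem
	if err := json.Unmarshal(data, &voices); err != nil {
		return nil, err
	}
	return voices, nil
}

func main() {
	// Made-up entry for illustration only.
	sample := []byte(`[{"name":"Example","voice":"zh_female_vv_uranus_bigtts","language":"zh","bitrate":"24k","format":"mp3"}]`)
	voices, err := parseVoices(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(voices[0].Voice, voices[0].Language)
}
```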

feat(doubao): support doubao-seed-tts-2.0

9422528

chainsaid wants to merge 1 commit into moeru-ai:main from chainsaid:main