feat(doubao): Support Doubao Speech Large Model 2.0 (Seed-TTS) #41

Open
chainsaid wants to merge 1 commit into moeru-ai:main from chainsaid:main

Conversation

@chainsaid

Description

This PR introduces support for Doubao Speech Large Model 2.0 (Doubao-Seed-TTS 2.0), enabling high-quality speech synthesis with enhanced control over pitch, emotion, and voice characteristics.

Key Changes

  1. New Backend Implementation:

    • Added pkg/backend/doubao package.
    • Implemented HandleSpeech to support the Volcengine TTS API v3 protocol.
    • Implemented HandleVoices to list available Doubao 2.0 voices.
  2. Protocol Upgrades (v3):

    • Added support for new audio parameters:
      • pitch (Pitch control)
      • emotion (Emotion control)
      • resource_id (Required for Seed-TTS 2.0, default set to seed-tts-2.0 in request body when specified).
  3. Voice Configuration:

    • Added voices.json containing 20+ new voices specifically for Doubao 2.0 (e.g., zh_female_vv_uranus_bigtts, zh_male_m191_uranus_bigtts).
    • Included support for various scenarios: General, Role-play, Video Dubbing, and Multi-language (English).
  4. Backend Registration:

    • Registered the doubao backend in pkg/backend/backend.go.
    • Supported aliases: doubao, bytedance, volcengine-doubao, doubao-tts.
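
The alias handling described above can be sketched as a simple set lookup. This is a minimal illustration, not the code from pkg/backend/backend.go: the alias names come from the list above, while the function name isDoubaoBackend and the case-insensitive, whitespace-trimmed matching are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// doubaoAliases holds the backend names the PR description lists as accepted.
var doubaoAliases = map[string]struct{}{
	"doubao":            {},
	"bytedance":         {},
	"volcengine-doubao": {},
	"doubao-tts":        {},
}

// isDoubaoBackend reports whether name should route to the doubao backend.
// Lowercasing and trimming the input are assumptions of this sketch.
func isDoubaoBackend(name string) bool {
	_, ok := doubaoAliases[strings.ToLower(strings.TrimSpace(name))]
	return ok
}

func main() {
	for _, name := range []string{"doubao", "bytedance", "openai"} {
		fmt.Printf("%s -> %v\n", name, isDoubaoBackend(name))
	}
}
```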

Usage Example

curl -X POST http://localhost:5933/v1/audio/speech \
  -H "Authorization: Bearer <YOUR_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "doubao-2.0",
    "backend": "doubao",
    "input": "你好,我是豆包语音大模型2.0",
    "voice": "zh_female_vv_uranus_bigtts",
    "response_format": "mp3",
    "extra_body": {
      "app": {
        "appid": "<YOUR_APPID>",
        "cluster": "volcano_tts"
      },
      "audio": {
        "pitch": 1.0,
        "emotion": "happy",
        "resource_id": "seed-tts-2.0"
      }
    }
  }' --output output.mp3
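
For callers that prefer Go over curl, the same request could be assembled roughly as follows. The endpoint path, headers, and JSON fields mirror the curl example; the function name buildSpeechRequest and its signature are hypothetical, and the sketch only builds the request without sending it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// buildSpeechRequest assembles the payload shown in the curl example.
// The helper itself is hypothetical; only the field names come from the PR.
func buildSpeechRequest(baseURL, token, appID, text, voice string) (*http.Request, error) {
	payload := map[string]any{
		"model":           "doubao-2.0",
		"backend":         "doubao",
		"input":           text,
		"voice":           voice,
		"response_format": "mp3",
		"extra_body": map[string]any{
			"app": map[string]any{"appid": appID, "cluster": "volcano_tts"},
			"audio": map[string]any{
				"pitch":       1.0,
				"emotion":     "happy",
				"resource_id": "seed-tts-2.0",
			},
		},
	}
	body, err := json.Marshal(payload)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/audio/speech", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := buildSpeechRequest("http://localhost:5933", "<YOUR_TOKEN>", "<YOUR_APPID>",
		"hello", "zh_female_vv_uranus_bigtts")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Passing the request to http.DefaultClient.Do and copying the response body to a file would complete the round trip.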

Verification

  • Verified v1/audio/speech endpoint with Doubao 2.0 voice (zh_female_vv_uranus_bigtts).
  • Verified api/voices endpoint returns the updated Doubao 2.0 voice list.
  • Confirmed correct JSON payload structure for Volcengine API v3 (including resource_id placement).

Related Issues

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the application's text-to-speech capabilities by integrating the Doubao Speech Large Model 2.0. It introduces a dedicated backend for this service, supporting its advanced API protocol and a wide array of new voices. This enhancement allows users to leverage Doubao's high-quality speech synthesis with fine-grained control over audio characteristics, enriching the overall audio generation experience.

Highlights

  • New Backend Implementation: A new doubao backend package was added, implementing HandleSpeech for the Volcengine TTS API v3 protocol and HandleVoices to list available Doubao 2.0 voices.
  • Protocol Upgrades: Support for new audio parameters such as pitch, emotion, and resource_id (required for Seed-TTS 2.0) was introduced, aligning with API v3.
  • Expanded Voice Configuration: Over 20 new Doubao 2.0 voices were added via voices.json, covering various scenarios like general use, role-play, video dubbing, and multi-language (English) support.
  • Backend Registration: The new doubao backend was registered in pkg/backend/backend.go, supporting aliases including bytedance, volcengine-doubao, and doubao-tts.
Changelog
  • pkg/backend/backend.go
    • Imported the new doubao backend package.
    • Added routing logic to the Speech and Voices handlers to direct requests to the doubao backend based on specified aliases.
  • pkg/backend/doubao/speech.go
    • Implemented the HandleSpeech function to process speech synthesis requests using the Volcengine TTS API v3 protocol.
    • Defined request and response structures for the Doubao API, including support for AppID, UserID, RequestID, VoiceType, Encoding, SpeedRatio, Pitch, Emotion, and ResourceID.
    • Handled token extraction, default parameter assignment (e.g., cluster, operation, speedRatio), and error handling for upstream API calls.
    • Decoded base64 audio data from the Doubao API response and returned it as an audio stream.
  • pkg/backend/doubao/voices.go
    • Implemented the HandleVoices function to provide a list of available Doubao 2.0 voices.
    • Embedded voices.json to load voice configurations.
    • Mapped the internal voice structure to the generic types.Voice format, including supported formats (MP3, PCM, WAV) and languages.
  • pkg/backend/doubao/voices.json
    • Added a JSON file containing detailed configurations for over 20 Doubao 2.0 voices, specifying their names, models, scenarios, languages, bitrates, and formats.
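
The base64 decoding step mentioned for speech.go might look like the sketch below. The response shape (code, message, data) is an assumption made for illustration and is not taken from the diff; the sample payload uses made-up values.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// upstreamResponse is a hypothetical shape for the upstream reply:
// a status code, a message, and base64-encoded audio in Data.
type upstreamResponse struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
	Data    string `json:"data"`
}

// decodeAudio parses the JSON reply and returns the raw audio bytes.
func decodeAudio(raw []byte) ([]byte, error) {
	var resp upstreamResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return nil, fmt.Errorf("parse upstream response: %w", err)
	}
	if resp.Data == "" {
		return nil, fmt.Errorf("upstream returned no audio (code %d: %s)", resp.Code, resp.Message)
	}
	audio, err := base64.StdEncoding.DecodeString(resp.Data)
	if err != nil {
		return nil, fmt.Errorf("decode audio: %w", err)
	}
	return audio, nil
}

func main() {
	// Illustrative payload only; "aGVsbG8=" is base64 for "hello".
	raw := []byte(`{"code":0,"message":"ok","data":"aGVsbG8="}`)
	audio, err := decodeAudio(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(audio), "bytes decoded")
}
```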

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for Doubao Speech Large Model 2.0, which is a significant enhancement. The new backend implementation, protocol upgrades, and voice configurations are well-structured. However, I've identified a few areas for improvement related to API request consistency, response header accuracy, and data redundancy that should be addressed to ensure robustness and correctness.

LoudnessRatio: utils.GetByJSONPath[*float64](opts.ExtraBody, "{ .audio.loudness_ratio }"),
Pitch: utils.GetByJSONPath[*float64](opts.ExtraBody, "{ .audio.pitch }"),
Emotion: utils.GetByJSONPath[*string](opts.ExtraBody, "{ .audio.emotion }"),
ResourceID: utils.GetByJSONPath[*string](opts.ExtraBody, "{ .audio.resource_id }"),

high

The PR description states that resource_id is required for Seed-TTS 2.0 and should default to seed-tts-2.0 if not specified. Currently, the code only passes ResourceID if it's present in opts.ExtraBody. If it's a mandatory field, it should be explicitly defaulted when missing to prevent potential API errors.

Suggested change
- ResourceID: utils.GetByJSONPath[*string](opts.ExtraBody, "{ .audio.resource_id }"),
+ ResourceID: lo.Ternary(utils.GetByJSONPath[*string](opts.ExtraBody, "{ .audio.resource_id }") != nil, utils.GetByJSONPath[*string](opts.ExtraBody, "{ .audio.resource_id }"), lo.ToPtr("seed-tts-2.0")),
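
The same defaulting can be written with a small helper that evaluates the lookup once. This is a standalone sketch: orDefault is a hypothetical name, and treating an empty string as missing is an assumption that goes beyond the suggestion above.

```go
package main

import "fmt"

// orDefault returns v when it points to a non-empty string,
// otherwise a pointer to fallback. Hypothetical helper, not from the PR.
func orDefault(v *string, fallback string) *string {
	if v != nil && *v != "" {
		return v
	}
	return &fallback
}

func main() {
	var missing *string
	custom := "custom-resource"

	fmt.Println(*orDefault(missing, "seed-tts-2.0"))
	fmt.Println(*orDefault(&custom, "seed-tts-2.0"))
}
```

In the handler, the result of the JSON path lookup would be passed as v, so the extra body is read only once.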

}

// Set authorization header - Doubao uses Bearer;token format
req.Header.Set("Authorization", "Bearer;"+token)

high

The Authorization header is constructed as "Bearer;" + token. Standard Bearer token authentication uses a space ("Bearer ") as the separator, and the strings.TrimPrefix call on line 80 also uses "Bearer " (with a space). This inconsistency could cause authentication failures with the upstream Doubao API if it expects the standard format.

Suggested change
- req.Header.Set("Authorization", "Bearer;"+token)
+ req.Header.Set("Authorization", "Bearer "+token)
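
Two separate hops are involved here: the incoming client header, which is stripped with "Bearer " (space), and the outgoing upstream header, which the code comment in the diff says uses "Bearer;token". The sketch below keeps the two apart and makes the upstream separator an explicit parameter, so whichever format the Volcengine documentation confirms can be selected in one place. Both function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// extractToken strips the client-facing "Bearer " prefix from an
// incoming Authorization header value.
func extractToken(header string) string {
	return strings.TrimSpace(strings.TrimPrefix(header, "Bearer "))
}

// upstreamAuth builds the outgoing header value. The separator is passed
// explicitly because the correct upstream format is unverified here.
func upstreamAuth(token, sep string) string {
	return "Bearer" + sep + token
}

func main() {
	token := extractToken("Bearer abc123")
	fmt.Println(upstreamAuth(token, ";")) // format used in the diff
	fmt.Println(upstreamAuth(token, " ")) // format proposed in the review
}
```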

return mo.Err[any](apierrors.NewErrInternal().WithDetail(err.Error()).WithError(err).WithCaller())
}

return mo.Ok[any](c.Blob(http.StatusOK, "audio/mp3", audioBytes))

high

The c.Blob function hardcodes the Content-Type header to audio/mp3. However, the Encoding field in the request payload (DoubaoSpeechRequestAudio) is dynamically set based on opts.ResponseFormat, allowing clients to request different audio formats (e.g., WAV, PCM). If a client requests a format other than MP3, the response header will incorrectly state audio/mp3, which can cause issues for the client trying to process the audio.

Suggested change
- return mo.Ok[any](c.Blob(http.StatusOK, "audio/mp3", audioBytes))
+ return mo.Ok[any](c.Blob(http.StatusOK, "audio/"+lo.FromPtr(reqBody.Audio.Encoding), audioBytes))
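
An alternative to string concatenation is an explicit lookup from encoding to MIME type, which also allows MP3 to be served as audio/mpeg, the registered media type. The sketch below covers only the three formats named in this PR (MP3, WAV, PCM); the audio/pcm value and the application/octet-stream fallback are assumptions, and contentTypeFor is a hypothetical name.

```go
package main

import "fmt"

// contentTypes maps the encodings mentioned in the PR to MIME types.
// "audio/pcm" is an assumption; raw PCM has no single agreed media type.
var contentTypes = map[string]string{
	"mp3": "audio/mpeg",
	"wav": "audio/wav",
	"pcm": "audio/pcm",
}

// contentTypeFor returns the MIME type for encoding, falling back to a
// generic binary type when the encoding is not recognized.
func contentTypeFor(encoding string) string {
	if ct, ok := contentTypes[encoding]; ok {
		return ct
	}
	return "application/octet-stream"
}

func main() {
	for _, enc := range []string{"mp3", "wav", "pcm", "flac"} {
		fmt.Printf("%s -> %s\n", enc, contentTypeFor(enc))
	}
}
```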

Comment on lines +26 to +27
Bitrate string `json:"bitrate"`
Format string `json:"format"`

medium

The Bitrate and Format fields in the VoicesResponseItem struct are populated from voices.json but are not utilized when constructing the types.VoiceFormat array. This makes these fields redundant and potentially misleading. If they don't serve a specific purpose in defining the voice's primary format, they could be removed for clarity and to avoid unnecessary data.

type VoicesResponseItem struct {
	Name            string   `json:"name"`
	PreviewAudioURL string   `json:"preview_audio_url"`
	Model           string   `json:"model"`
	Voice           string   `json:"voice"`
	Scenarios       []string `json:"scenarios"`
	Language        string   `json:"language"`
}
