feat(doubao): Support Doubao Speech Large Model 2.0 (Seed-TTS)#41
chainsaid wants to merge 1 commit into
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly expands the application's text-to-speech capabilities by integrating the Doubao Speech Large Model 2.0. It introduces a dedicated backend for this service, supporting its advanced API protocol and a wide array of new voices. This enhancement allows users to leverage Doubao's high-quality speech synthesis with fine-grained control over audio characteristics, enriching the overall audio generation experience.
Code Review
This pull request introduces support for Doubao Speech Large Model 2.0, which is a significant enhancement. The new backend implementation, protocol upgrades, and voice configurations are well-structured. However, I've identified a few areas for improvement related to API request consistency, response header accuracy, and data redundancy that should be addressed to ensure robustness and correctness.
	LoudnessRatio: utils.GetByJSONPath[*float64](opts.ExtraBody, "{ .audio.loudness_ratio }"),
	Pitch:         utils.GetByJSONPath[*float64](opts.ExtraBody, "{ .audio.pitch }"),
	Emotion:       utils.GetByJSONPath[*string](opts.ExtraBody, "{ .audio.emotion }"),
	ResourceID:    utils.GetByJSONPath[*string](opts.ExtraBody, "{ .audio.resource_id }"),
The PR description states that resource_id is required for Seed-TTS 2.0 and should default to seed-tts-2.0 if not specified. Currently, the code only passes ResourceID if it's present in opts.ExtraBody. If it's a mandatory field, it should be explicitly defaulted when missing to prevent potential API errors.
Suggested change:
-	ResourceID: utils.GetByJSONPath[*string](opts.ExtraBody, "{ .audio.resource_id }"),
+	ResourceID: lo.Ternary(utils.GetByJSONPath[*string](opts.ExtraBody, "{ .audio.resource_id }") != nil, utils.GetByJSONPath[*string](opts.ExtraBody, "{ .audio.resource_id }"), lo.ToPtr("seed-tts-2.0")),
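The suggestion above evaluates the lookup twice. A minimal sketch of the same defaulting with a single lookup follows. `lookupString` is a hypothetical stand-in for `utils.GetByJSONPath[*string]` (its real behavior is not shown in this PR), and treating an empty string as missing is an extra assumption.

```go
package main

import "fmt"

// defaultResourceID is the fallback value named in the PR description.
const defaultResourceID = "seed-tts-2.0"

// lookupString is a hypothetical stand-in for utils.GetByJSONPath[*string].
// It walks nested maps key by key and returns nil when the path is missing
// or the final value is not a string.
func lookupString(body map[string]any, path ...string) *string {
	var cur any = body
	for _, key := range path {
		m, ok := cur.(map[string]any)
		if !ok {
			return nil
		}
		cur, ok = m[key]
		if !ok {
			return nil
		}
	}
	s, ok := cur.(string)
	if !ok {
		return nil
	}
	return &s
}

// resourceIDOrDefault performs the lookup once and falls back to the default
// when the field is absent or empty (the empty-string rule is an assumption).
func resourceIDOrDefault(extraBody map[string]any) *string {
	if id := lookupString(extraBody, "audio", "resource_id"); id != nil && *id != "" {
		return id
	}
	fallback := defaultResourceID
	return &fallback
}

func main() {
	// "custom-id" is an illustrative value, not a real resource ID.
	withID := map[string]any{"audio": map[string]any{"resource_id": "custom-id"}}
	fmt.Println(*resourceIDOrDefault(withID))           // custom-id
	fmt.Println(*resourceIDOrDefault(map[string]any{})) // seed-tts-2.0
	fmt.Println(*resourceIDOrDefault(nil))              // seed-tts-2.0
}
```

Pulling the fallback into a small helper keeps the struct literal short and makes the default testable on its own.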
	}

	// Set authorization header - Doubao uses Bearer;token format
	req.Header.Set("Authorization", "Bearer;"+token)
The Authorization header is constructed using Bearer; + token. Standard Bearer token authentication typically uses a space (Bearer ) as a separator. The strings.TrimPrefix on line 80 also uses Bearer (with a space). This inconsistency might lead to authentication failures with the upstream Doubao API if it expects the standard format.
Suggested change:
-	req.Header.Set("Authorization", "Bearer;"+token)
+	req.Header.Set("Authorization", "Bearer "+token)
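Whichever separator the upstream API turns out to expect, keeping it in one place makes the choice easy to change and test. The sketch below assumes the inbound header uses the standard `Bearer ` prefix, as the `strings.TrimPrefix` call mentioned above does. The helper names are hypothetical, and the `;` default simply mirrors the code comment in this PR rather than any confirmed upstream requirement.

```go
package main

import (
	"fmt"
	"strings"
)

// upstreamAuthSeparator sits between the scheme and the token in the
// upstream header. ";" mirrors the code comment in this PR; " " is the
// RFC 6750 form suggested in the review. Which one the upstream API accepts
// is an assumption to verify against its documentation.
const upstreamAuthSeparator = ";"

// extractBearerToken strips the standard "Bearer " prefix from an inbound
// Authorization header value and trims surrounding whitespace.
func extractBearerToken(inbound string) string {
	return strings.TrimSpace(strings.TrimPrefix(inbound, "Bearer "))
}

// buildUpstreamAuth formats the outbound Authorization header value.
func buildUpstreamAuth(token string) string {
	return "Bearer" + upstreamAuthSeparator + token
}

func main() {
	token := extractBearerToken("Bearer abc123")
	fmt.Println(buildUpstreamAuth(token)) // Bearer;abc123
}
```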
		return mo.Err[any](apierrors.NewErrInternal().WithDetail(err.Error()).WithError(err).WithCaller())
	}

	return mo.Ok[any](c.Blob(http.StatusOK, "audio/mp3", audioBytes))
The c.Blob function hardcodes the Content-Type header to audio/mp3. However, the Encoding field in the request payload (DoubaoSpeechRequestAudio) is dynamically set based on opts.ResponseFormat, allowing clients to request different audio formats (e.g., WAV, PCM). If a client requests a format other than MP3, the response header will incorrectly state audio/mp3, which can cause issues for the client trying to process the audio.
Suggested change:
-	return mo.Ok[any](c.Blob(http.StatusOK, "audio/mp3", audioBytes))
+	return mo.Ok[any](c.Blob(http.StatusOK, "audio/"+lo.FromPtr(reqBody.Audio.Encoding), audioBytes))
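Concatenating `"audio/"` with the encoding name does not always yield a registered media type (MP3 is registered as `audio/mpeg`, for example). An explicit mapping is one alternative, sketched below. The encoding names `mp3`, `wav`, and `ogg_opus` are assumptions about what the request may carry, and so is treating an empty encoding as MP3. The function name is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// contentTypeForEncoding maps a requested audio encoding to a Content-Type.
// The set of encoding names is an assumption; anything unrecognized
// (including raw PCM, which has no single media type without a sample rate)
// falls back to a generic binary type instead of a guessed audio/* value.
func contentTypeForEncoding(encoding string) string {
	switch strings.ToLower(strings.TrimSpace(encoding)) {
	case "", "mp3": // assumed default when no format is requested
		return "audio/mpeg"
	case "wav":
		return "audio/wav"
	case "ogg_opus":
		return "audio/ogg"
	default:
		return "application/octet-stream"
	}
}

func main() {
	for _, enc := range []string{"mp3", "WAV", "ogg_opus", "pcm", ""} {
		fmt.Printf("%q -> %s\n", enc, contentTypeForEncoding(enc))
	}
}
```

The handler could then pass `contentTypeForEncoding(lo.FromPtr(reqBody.Audio.Encoding))` to `c.Blob`, assuming `Encoding` is a `*string` as the suggestion above implies.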
	Bitrate string `json:"bitrate"`
	Format  string `json:"format"`
The Bitrate and Format fields in the VoicesResponseItem struct are populated from voices.json but are not utilized when constructing the types.VoiceFormat array. This makes these fields redundant and potentially misleading. If they don't serve a specific purpose in defining the voice's primary format, they could be removed for clarity and to avoid unnecessary data.
type VoicesResponseItem struct {
	Name            string   `json:"name"`
	PreviewAudioURL string   `json:"preview_audio_url"`
	Model           string   `json:"model"`
	Voice           string   `json:"voice"`
	Scenarios       []string `json:"scenarios"`
	Language        string   `json:"language"`
}
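Removing the two fields from the struct should not require touching `voices.json`, because `encoding/json` ignores object keys that have no matching struct field by default. The sketch below illustrates this. It assumes `voices.json` holds a JSON array of entries, and the sample entry, including its `bitrate` value, is illustrative only. `parseVoices` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VoicesResponseItem is the trimmed struct from the suggestion above.
type VoicesResponseItem struct {
	Name            string   `json:"name"`
	PreviewAudioURL string   `json:"preview_audio_url"`
	Model           string   `json:"model"`
	Voice           string   `json:"voice"`
	Scenarios       []string `json:"scenarios"`
	Language        string   `json:"language"`
}

// parseVoices decodes a JSON array of voice entries. Keys without a matching
// struct field (such as "bitrate" or "format") are silently skipped.
func parseVoices(data []byte) ([]VoicesResponseItem, error) {
	var items []VoicesResponseItem
	if err := json.Unmarshal(data, &items); err != nil {
		return nil, err
	}
	return items, nil
}

func main() {
	// Hypothetical entry; the field values are illustrative only.
	data := []byte(`[{
		"name": "Example",
		"voice": "zh_female_vv_uranus_bigtts",
		"language": "zh",
		"bitrate": "128k",
		"format": "mp3"
	}]`)

	items, err := parseVoices(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(items), items[0].Voice) // 1 zh_female_vv_uranus_bigtts
}
```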
Description
This PR introduces support for Doubao Speech Large Model 2.0 (Doubao-Seed-TTS 2.0), enabling high-quality speech synthesis with enhanced control over pitch, emotion, and voice characteristics.
Key Changes
New Backend Implementation:
- pkg/backend/doubao package.
- HandleSpeech to support the Volcengine TTS API v3 protocol.
- HandleVoices to list available Doubao 2.0 voices.

Protocol Upgrades (v3):
- pitch (Pitch control)
- emotion (Emotion control)
- resource_id (Required for Seed-TTS 2.0, default set to seed-tts-2.0 in request body when specified).

Voice Configuration:
- voices.json containing 20+ new voices specifically for Doubao 2.0 (e.g., zh_female_vv_uranus_bigtts, zh_male_m191_uranus_bigtts).

Backend Registration:
- doubao backend in pkg/backend/backend.go.
- doubao, bytedance, volcengine-doubao, doubao-tts.

Usage Example
Verification
- v1/audio/speech endpoint with Doubao 2.0 voice (zh_female_vv_uranus_bigtts).
- api/voices endpoint returns the updated Doubao 2.0 voice list.
- resource_id placement).

Related Issues