v2.2.0: Shallow diffusion returns, melody encoder and ornament support, multi-node batched validation, minor improvements and bug fixes
Shallow diffusion returns (#128)
Shallow diffusion, first introduced in the original DiffSinger paper, is a mechanism that can improve quality and save inference time for diffusion models. Instead of starting the diffusion process from pure Gaussian noise as classic diffusion does, shallow diffusion adds shallow Gaussian noise to a low-quality result generated by a simple network (called the auxiliary decoder), skipping many unnecessary steps at the beginning. By combining shallow diffusion with sampling acceleration algorithms, we can get better results at the same inference speed as before, or achieve higher inference speed without quality degradation.
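The idea can be sketched as follows. This is a minimal, generic illustration, not DiffSinger's actual implementation: the linear beta schedule, the function names, and the stand-in auxiliary decoder output are all assumptions for this example.

```python
import numpy as np

T = 1000                        # total trained diffusion steps (assumed)
K_step = 400                    # trained shallow diffusion depth
betas = np.linspace(1e-4, 0.02, T)          # assumed linear noise schedule
alphas_cumprod = np.cumprod(1.0 - betas)

def q_sample(x0, t, rng):
    """Diffuse a clean sample x0 forward to noise level t."""
    a = alphas_cumprod[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * rng.standard_normal(x0.shape)

def shallow_start(x_aux, depth, rng):
    """Start denoising from the auxiliary decoder output x_aux at step
    `depth`, instead of from pure Gaussian noise at step T - 1.
    The requested depth is clipped by the trained K_step."""
    depth = min(depth, K_step)
    return q_sample(x_aux, depth - 1, rng), depth

rng = np.random.default_rng(0)
x_aux = np.zeros((80,))          # stand-in for a low-quality mel frame
x_t, depth = shallow_start(x_aux, 400, rng)
# The denoiser now only needs `depth` reverse steps instead of T.
```

Because the reverse process runs for `depth` steps rather than the full `T`, a smaller depth directly translates into fewer denoising iterations at inference time.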
Quick start in configuration file
use_shallow_diffusion: true
K_step: 400 # adjust according to your needs
K_step_infer: 400 # should be <= K_step
See other advanced settings and usage in BestPractices.md.
Inference and deployment
The diffusion depth (K_step_infer) can be adjusted at inference time with the --depth option of infer.py.
Acoustic models with shallow diffusion enabled will get an additional input called depth after exporting to ONNX format.
The above depth arguments are guaranteed to be safe, as they are clipped by the maximum trained number of diffusion steps (K_step).
Melody encoder and ornament support (#143)
The melody encoder directly calculates attention on the note sequence in addition to the linguistic features. With this new method of melody modeling, the pitch predictor becomes more sensitive to the pitch trend in the music scores, thus improving accuracy and stability on short slurs, long vibratos and out-of-range notes. In addition, this note-level encoder can also accept ornament tags as input, for example glides.
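Conceptually, the frame-level linguistic features attend over note-level embeddings so that each frame can gather melody context. The following is a minimal, generic cross-attention sketch of that idea; the shapes, names, and residual connection are assumptions for illustration and do not mirror DiffSinger's code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def note_attention(linguistic, notes):
    """linguistic: (T, d) frame-level features; notes: (N, d) note embeddings.
    Each frame attends over all notes and adds the melody context back."""
    d = linguistic.shape[-1]
    scores = linguistic @ notes.T / np.sqrt(d)   # (T, N) attention scores
    weights = softmax(scores, axis=-1)           # rows sum to 1
    return linguistic + weights @ notes          # residual melody context

rng = np.random.default_rng(1)
out = note_attention(rng.standard_normal((50, 16)),
                     rng.standard_normal((8, 16)))
```

The key point is that the note sequence is consumed directly, rather than being flattened into a frame-level base pitch curve first, so the score's structure (note boundaries, rests, ornament tags) stays visible to the pitch predictor.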
Melody encoder for pitch prediction
The results showed that the melody encoder is more suitable than the base pitch for carrying music score information, especially on expressive datasets. On TensorBoard, significant improvements on short slurs and long vibratos were also observed. In our internal tests, pitch predictors with the melody encoder also outperformed the old method on out-of-range notes, and remain sensitive even when the music scores go far beyond the normal range (e.g. over C7 for a male singer). [Demo]
Before using the melody encoder, we recommend labeling your phoneme timings and MIDI sequences accurately. To enable the melody encoder, simply introduce the following line in your configuration file:
use_melody_encoder: true
Pitch predictors with the melody encoder enabled will get an additional input called note_rest after exporting to ONNX format.
Natural glide support
The melody encoder currently supports glides, where the pitch smoothly rises at the beginning of a note or drops at its end. With enough properly labeled glide samples in the dataset, the pitch predictor can produce accurate and natural glides from simple glide flags, without the manually drawn pitch curves required before. [Demo]
To enable glide input, ensure that the melody encoder is enabled, and introduce the following line in your configuration file:
use_glide_embed: true
In your transcriptions.csv, you should add a new column called note_glide containing glide type names, where none means no glide and the other names refer to glide types defined under the glide_types configuration key. By default, there are two types of glide notes: up and down.
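As a purely hypothetical illustration (the column names other than note_glide and the exact value format are simplified assumptions and may not match your actual transcriptions.csv), a labeled entry could look like:

```csv
name,note_seq,note_glide
sample_001,C4 D4 rest,none up none
```

Each note in the sequence gets one glide tag, so the number of note_glide entries should match the number of notes.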
Glide labeling has already been supported by MakeDiffSinger and SlurCutter.
Pitch predictors with glide embedding will get an additional input called note_glide after exporting to ONNX format.
Multi-node batched validation and improved strategy selection (#148)
Validation during training can now run on all nodes and devices when DDP is enabled. Additionally, the validation batch size is no longer limited to 1. To configure this, override the following keys in your configuration file:
# adjust according to your needs
max_val_batch_frames: 10000
max_val_batch_size: 4
The PyTorch Lightning trainer strategy can now be configured more dynamically. Configuration example:
pl_trainer_strategy:
name: ddp
# keyword arguments of the strategy class can be configured below
process_group_backend: nccl
See more available strategies in the official documentation.
Besides, a new configuration key called nccl_p2p has been introduced to control the P2P option of NCCL in case it gets stuck.
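For instance, assuming the key takes a boolean value (an assumption for illustration), disabling NCCL P2P would look like:

```yaml
nccl_p2p: false  # work around DDP hangs caused by NCCL peer-to-peer transfers
```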
Other improvements and changes
- Manipulation of plots and audio samples on TensorBoard has been improved (#148)
- Binarizers now also print the data duration of each speaker respectively (#148)
- Harvest pitch extractor and F0 range configurations are supported (#149)
- Data augmentation is now enabled by default and the ONNX exporter no longer needs --expose_* options
- Formatting of configuration attributes in the configuration schema has been improved (#153)
- Documentation and links are updated (#156)
Major bug fixes
- The ONNX exporter of acoustic models now loads the state dict in strict mode to prevent exporting from incorrect checkpoints
- SciPy version is constrained to >= 1.10.0 to avoid interpolation raising ValueError in some cases
- Potential alignment issues of the parselmouth pitch extractor are fixed
Known issues
For performance concerns, the find_unused_parameters option of DDP strategy is disabled by default. However, the DDP strategy requires all the parameters to be included in the computing graph, otherwise it raises a RuntimeError.
In some cases, for example when you turn off train_aux_decoder or train_diffusion in shallow diffusion configurations, part of the model is expected to stay outside of the computing graph. If you are using DDP in such cases, you can enable the option manually to avoid the error:
pl_trainer_strategy:
name: ddp
find_unused_parameters: true # <- enable this option
Some changes may not be listed above. See full change log: v2.1.0...v2.2.0