The augmentation stage applies realistic audio transformations to synthetic TTS clips and aligns them within detection windows.
Source: src/livekit/wakeword/data/augment.py
CLI: livekit-wakeword augment <config>
Original TTS clips (clip_000000.wav)
│
▼ Round 0: reads originals
├──► Per-sample augmentations (EQ, distortion)
├──► RIR convolution
├──► Background mixing
├──► Alignment
└──► clip_000000_r0.wav
│
▼ Round 1: reads r0 output (stacks)
├──► Per-sample augmentations
├──► RIR convolution
├──► Background mixing
└──► clip_000000_r1.wav
│
▼ ... Round N reads r(N-1)
The AudioAugmentor class manages all audio augmentations.
AudioAugmentor(
background_paths: list[Path], # Directories with background noise .wav files
rir_paths: list[Path], # Directories with room impulse response .wav files
sample_rate: int = 16000
)All .wav files are collected recursively from the provided directories.
Applied via the audiomentations library to individual clips:
| Transform | Probability | Description |
|---|---|---|
SevenBandParametricEQ |
0.25 | 7-band parametric equalizer |
TanhDistortion |
0.25 | Tanh-based distortion |
apply_rir(audio, p=0.5) convolves audio with a randomly selected room impulse response using FFT convolution (scipy.signal.fftconvolve). The RIR is normalized by its maximum absolute value before convolution. Output is cropped to the original audio length.
mix_with_background(audio, snr_db_range=(5.0, 15.0)) mixes audio with a random background noise clip at a randomly selected SNR within the given range.
The background clip is looped (tiled) if shorter than the audio and randomly cropped to a starting position. The mixing formula scales the background based on:
scale = sqrt(audio_power / (background_power * 10^(snr_db / 10)))
output = audio + scale * background
Note: Background noise files serve double duty — they are used here as augmentation overlays and also generated as standalone background clips during the data generation step. Those background clips then pass through this same augmentation pipeline.
Positive and negative clips are aligned differently within the target window (default 2.0 seconds = 32,000 samples).
align_clip_to_end(audio, target_length, jitter_samples=3200)
Positive clips are placed at the end of the window with random jitter of up to 3200 samples (200ms at 16kHz). This simulates the real detection scenario where the wake word appears at the trailing edge of the audio buffer.
[ zero padding | wake word | jitter ]
◄── target_length ──►
Negative clips are centered within the target window. If longer than the target, they are center-cropped; if shorter, they are center-padded with zeros.
The augmentation pipeline runs config.augmentation.rounds passes over all six directories (positive train/test, negative train/test, background train/test). Each round writes to a separate file (clip_000000_r0.wav, clip_000000_r1.wav, etc.) — originals are never modified.
Rounds stack: round 0 reads the clean TTS originals, round 1 reads round 0's output, round 2 reads round 1's output, and so on. This produces progressively more degraded audio as augmentation effects compound across rounds. Old augmented files (_rN.wav) are cleaned up at the start of each run so re-running is idempotent.
For each WAV file in a directory:
- Read audio, convert to float32, take first channel if stereo
- Apply per-sample augmentations (EQ, distortion)
- Apply RIR convolution (50% probability)
- Mix with background noise
- Align to window — round 0 only (end-aligned for positives, center-padded for negatives)
- Write to
clip_NNNNNN_r{round}.wav(originals preserved)
After augmentation:
output/<model_name>/
├── positive_train/
│ ├── clip_000000.wav # Original TTS (preserved, not used for training)
│ ├── clip_000000_r0.wav # Round 0 augmented
│ ├── clip_000000_r1.wav # Round 1 (stacked on r0)
│ └── ...
├── positive_test/
├── negative_train/
├── negative_test/
├── background_train/
└── background_test/
Only _rN.wav files are fed to feature extraction — clean TTS originals are excluded from training since they don't match real microphone audio.
Feature extraction is a separate step — see Feature Extraction.