speech-vocoder
Speech Processing | speechbrain | rigorous codebase
Description
Neural Vocoder Generator Design
Research Question
Design an improved neural vocoder generator network. Given a mel-spectrogram, the generator synthesizes a high-quality raw audio waveform. The entire generator architecture (upsampling strategy, residual block design, activation functions, and conditioning method) is open for modification. The multi-scale discriminator and the GAN training loop are fixed.
Background
Neural vocoders convert mel-spectrograms to waveforms for text-to-speech and voice conversion. Key architectures:
- HiFi-GAN (Kong et al., 2020): Multi-Receptive Field Fusion (MRF) with transposed-convolution upsampling; widely treated as the reference point for the quality-speed trade-off
- MelGAN (Kumar et al., 2019): Lightweight generator using reflection padding and dilated residual stacks
- UnivNet (Jang et al., 2021): Location-Variable Convolution with mel conditioning at each upsampling stage
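The MRF idea listed for HiFi-GAN can be sketched in PyTorch as follows. This is a simplified illustration rather than the reference implementation: the class names `DilatedResBlock` and `MRF`, the kernel sizes `(3, 7, 11)`, and the dilations `(1, 3, 5)` are assumptions chosen for the example, and weight normalization is left out. Each branch sees the same input with a different kernel size, and the branch outputs are averaged.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DilatedResBlock(nn.Module):
    """Residual stack of dilated 1-D convolutions sharing one kernel size.

    Hypothetical helper for illustration; assumes an odd kernel_size.
    """

    def __init__(self, channels, kernel_size, dilations=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(
                channels,
                channels,
                kernel_size,
                dilation=d,
                # "Same" padding for odd kernels, so the time axis is preserved.
                padding=d * (kernel_size - 1) // 2,
            )
            for d in dilations
        ])

    def forward(self, x):
        for conv in self.convs:
            x = x + conv(F.leaky_relu(x, 0.1))  # residual connection
        return x


class MRF(nn.Module):
    """Averages parallel residual blocks with different receptive fields."""

    def __init__(self, channels, kernel_sizes=(3, 7, 11)):
        super().__init__()
        self.blocks = nn.ModuleList(
            [DilatedResBlock(channels, k) for k in kernel_sizes]
        )

    def forward(self, x):
        return sum(block(x) for block in self.blocks) / len(self.blocks)
```

Because every convolution uses length-preserving padding, an input of shape (B, C, T) comes out with the same shape, so a block like this can be placed after any upsampling stage.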
Task
Modify the VocoderGenerator class in speechbrain/custom_vocoder.py (lines 54-124). The class must implement:
- __init__(self, n_mels=80, upsample_rates=(8,8,2,2), d_model=512, dropout=0.1)
- forward(self, mel) -> waveform
Input: mel-spectrogram (B, 80, T_mel).
Output: audio waveform (B, 1, T_audio).
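To make the interface concrete, here is a minimal sketch of a generator that satisfies the stated signature and shapes. It is a placeholder under assumptions, not the task's baseline: the layer layout, kernel sizes, channel halving, and the `hop_length` attribute are illustrative choices, and it assumes the audio length is the mel length times the product of `upsample_rates` (8 × 8 × 2 × 2 = 256).

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class VocoderGenerator(nn.Module):
    """Illustrative skeleton: mel (B, n_mels, T_mel) -> waveform (B, 1, T_audio)."""

    def __init__(self, n_mels=80, upsample_rates=(8, 8, 2, 2), d_model=512, dropout=0.1):
        super().__init__()
        # Total upsampling factor; 256 for the default rates.
        self.hop_length = math.prod(upsample_rates)
        self.pre = nn.Conv1d(n_mels, d_model, kernel_size=7, padding=3)

        ups = []
        channels = d_model
        for r in upsample_rates:
            # kernel=2r, stride=r, padding=r//2 multiplies the length by
            # exactly r when r is even (as in the default rates).
            ups.append(
                nn.ConvTranspose1d(
                    channels, channels // 2,
                    kernel_size=2 * r, stride=r, padding=r // 2,
                )
            )
            channels //= 2
        self.ups = nn.ModuleList(ups)
        self.dropout = nn.Dropout(dropout)
        self.post = nn.Conv1d(channels, 1, kernel_size=7, padding=3)

    def forward(self, mel):
        x = self.pre(mel)
        for up in self.ups:
            x = self.dropout(up(F.leaky_relu(x, 0.1)))
        # tanh keeps samples in [-1, 1], the usual range for float audio.
        return torch.tanh(self.post(F.leaky_relu(x, 0.1)))


if __name__ == "__main__":
    gen = VocoderGenerator().eval()
    with torch.no_grad():
        wav = gen(torch.randn(2, 80, 10))
    print(wav.shape)  # (B, 1, 10 * 256)
```

Residual blocks and mel conditioning would go between the upsampling stages; this sketch leaves them out so that only the shape contract is visible.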
Evaluation
- Metrics: Mel spectrogram loss (lower is better), PESQ (higher is better)
- Environments: LJSpeech (single speaker), VCTK 5-speaker (multi-speaker English), AISHELL-3 5-speaker (multi-speaker Chinese)
- Baselines: HiFi-GAN V1, MelGAN, UnivNet
Code
custom_vocoder.py
```python
#!/usr/bin/env python3
"""Self-contained vocoder training script for speech-vocoder task.

The VocoderGenerator class (EDITABLE region) is the complete generator network:
- Mel-spectrogram → raw waveform conversion
- Upsampling strategy, residual blocks, activation functions

FIXED: Multi-scale discriminator, GAN training loop, evaluation (PESQ, MRSTFT).

Environment variables:
DATA_DIR — path to dataset root
OUTPUT_DIR — output directory
SEED — random seed (default: 42)
ENV — dataset label
```
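The script excerpt is cut off before any code that reads these variables, so the following is only a guess at how such a configuration might be loaded. The `RunConfig` dataclass, the `load_config` function, and the empty-string fallbacks are hypothetical; only the variable names and the `SEED` default of 42 come from the docstring.

```python
import os
from dataclasses import dataclass


@dataclass
class RunConfig:
    """Hypothetical container for the four documented environment variables."""
    data_dir: str
    output_dir: str
    seed: int
    env: str


def load_config(environ=os.environ):
    """Build a RunConfig from a mapping (defaults to the process environment)."""
    return RunConfig(
        data_dir=environ.get("DATA_DIR", ""),      # fallback is an assumption
        output_dir=environ.get("OUTPUT_DIR", ""),  # fallback is an assumption
        seed=int(environ.get("SEED", "42")),       # default 42 per the docstring
        env=environ.get("ENV", ""),                # fallback is an assumption
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise without touching the real environment, for example `load_config({"DATA_DIR": "/data/ljspeech", "ENV": "ljspeech"})`.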
Results
| Model | Type | Mel loss, LJSpeech ↓ | PESQ, LJSpeech ↑ | Mel loss, AISHELL-3 (5 speakers) ↓ | PESQ, AISHELL-3 (5 speakers) ↑ |
|---|---|---|---|---|---|
| hifigan | baseline | 0.076 | 1.958 | 0.013 | 2.061 |
| melgan | baseline | 0.078 | 1.707 | 0.016 | 1.530 |
| univnet | baseline | 0.075 | 1.847 | 0.017 | 1.318 |