# speech-enhancement

## Description

**Speech Enhancement Model Design**

### Research Question
Design an improved speech enhancement / denoising model. Given a noisy speech waveform, the model should output the corresponding clean speech. The entire model architecture — including the choice of time-domain vs frequency-domain processing, encoder-decoder design, mask estimation strategy, and sequence modeling approach — is open for modification.
### Background
Speech enhancement removes background noise, reverberation, or interfering speakers from an audio signal. Key paradigms include:
- Frequency-domain masking: Estimate a mask (IRM, cIRM, PSM) in STFT domain to filter noise
- Time-domain processing: Directly map noisy to clean waveform using learned encoder-decoder (Conv-TasNet)
- Complex-domain methods: Operate on complex STFT for joint magnitude and phase enhancement (DCCRN)
- Dual-path models: Handle long sequences by segmenting them and alternating local (intra-segment) and global (inter-segment) processing (DPT-Net)
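As a concrete illustration of the first paradigm, the sketch below applies an oracle ideal ratio mask (IRM) in the STFT domain. This is a toy NumPy implementation (the `stft` helper is hypothetical, not SpeechBrain's); a trained model would estimate the mask from the noisy input alone, whereas the oracle IRM needs the clean and noise signals separately.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Hann-windowed STFT: frame the signal, window, FFT each frame."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)        # (n_frames, n_fft // 2 + 1)

def ideal_ratio_mask(clean, noise, eps=1e-8):
    """Oracle IRM: per-bin ratio of clean magnitude to clean-plus-noise magnitude."""
    S = np.abs(stft(clean))
    N = np.abs(stft(noise))
    return S / (S + N + eps)                   # values in [0, 1)

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)          # 1 s of a 440 Hz tone at 16 kHz
noise = 0.3 * rng.standard_normal(16000)

mask = ideal_ratio_mask(clean, noise)          # oracle mask
enhanced_spec = mask * stft(clean + noise)     # masked noisy spectrogram
```

In practice the mask is predicted by a network from the noisy spectrogram only, and the enhanced waveform is resynthesized with an inverse STFT, typically reusing the noisy phase.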
### Task
Modify the `EnhancementModel` class in `speechbrain/custom_speech_enhancement.py` (lines 56-140). The class must implement:

- `__init__(self, n_fft=512, hop_length=256, d_model=256, n_layers=4, n_head=4, dropout=0.1)`
- `forward(self, noisy_waveform) -> enhanced_waveform`
Input: noisy 16 kHz waveform of shape `(B, T)`.
Output: enhanced waveform of shape `(B, T)` (same length as the input).
Training loss (SI-SNR + multi-resolution STFT), data loading, and evaluation are fixed.
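The same-length requirement is easy to get wrong with frame-based processing: with `hop_length=256` the frames rarely tile `T` exactly, so the analysis stage has to pad and the synthesis stage trim. Below is a minimal NumPy sketch of a length-preserving STFT/iSTFT pair — illustrative only; the actual model would do this in PyTorch (e.g. with `torch.stft`/`torch.istft`).

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Hann-windowed STFT; zero-pads so frames tile the whole signal."""
    pad = (-(len(x) - n_fft)) % hop
    x = np.concatenate([x, np.zeros(pad)])
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def istft(spec, n_fft=512, hop=256, length=None):
    """Overlap-add inverse with win**2 normalization; trims back to `length`."""
    win = np.hanning(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=-1) * win   # synthesis window
    out = np.zeros(n_fft + hop * (spec.shape[0] - 1))
    wsum = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop : i * hop + n_fft] += f
        wsum[i * hop : i * hop + n_fft] += win ** 2
    out = np.where(wsum > 1e-12, out / np.maximum(wsum, 1e-12), 0.0)
    return out if length is None else out[:length]

t = np.arange(16000)
x = np.sin(2 * np.pi * 440.0 * t / 16000.0)
y = istft(stft(x), length=len(x))   # identity round trip: pad, analyze, resynthesize, trim
```

The `win**2` normalization makes the round trip exact wherever the window coverage is nonzero, so a model can safely wrap any spectral processing between these two stages and still return exactly `T` samples.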
### Evaluation
- Metrics: SI-SNR improvement (higher is better), PESQ (higher is better), STOI (higher is better)
- Environments: VoiceBank-DEMAND (stationary noise), Noisy-VCTK-56spk (multi-speaker diverse noise), LibriMix 2-speaker (separation)
- Baselines: Conv-TasNet (time-domain), DCCRN (complex frequency-domain), DPT-Net (dual-path Transformer)
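SI-SNR improvement is the difference `si_snr(enhanced, clean) - si_snr(noisy, clean)`. A minimal NumPy sketch of the metric follows; the fixed evaluation code may differ in details such as the epsilon value and mean removal.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB: project est onto ref, then compare
    the energy of the projection against the residual."""
    est = est - est.mean()
    ref = ref - ref.mean()
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    residual = est - target
    return 10.0 * np.log10((np.dot(target, target) + eps)
                           / (np.dot(residual, residual) + eps))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noisy = clean + 0.1 * rng.standard_normal(16000)   # roughly 20 dB input SNR

snr_noisy = si_snr(noisy, clean)
# SI-SNR improvement = si_snr(enhanced, clean) - si_snr(noisy, clean)
```

Because the target projection absorbs any overall gain, rescaling the estimate does not change the score — which is why SI-SNR, unlike plain SNR, is robust to models that output the right waveform at the wrong amplitude.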
## Code
```python
#!/usr/bin/env python3
"""Self-contained speech enhancement training script.

The EnhancementModel class (EDITABLE region) is the complete enhancement model:
- STFT analysis/synthesis
- Mask estimation or direct mapping network
- Encoder-decoder architecture
- Any time-domain or frequency-domain approach

FIXED: Noise mixing, training loss (SI-SNR + multi-resolution STFT), data loading, evaluation.

Environment variables:
    DATA_DIR   — path to dataset root
    OUTPUT_DIR — output directory
    SEED       — random seed (default: 42)
"""
```
### Additional context files (read-only)

- `speechbrain/speechbrain/nnet/losses.py`
## Results
| Model | Type | SI-SNR (dB), VoiceBank-DEMAND ↑ | PESQ, VoiceBank-DEMAND ↑ | STOI, VoiceBank-DEMAND ↑ | SI-SNR (dB), Noisy-VCTK-56spk ↑ | PESQ, Noisy-VCTK-56spk ↑ | STOI, Noisy-VCTK-56spk ↑ |
|---|---|---|---|---|---|---|---|
| conv_tasnet | baseline | 19.536 | 2.683 | 0.943 | 18.845 | 2.736 | 0.943 |
| dccrn | baseline | 18.665 | 2.614 | 0.939 | 18.679 | 2.710 | 0.941 |
| dptnet | baseline | 19.040 | 2.482 | 0.936 | 18.992 | 2.445 | 0.937 |