speech-enhancement

Speech Processing · speechbrain · rigorous codebase

Description

Speech Enhancement Model Design

Research Question

Design an improved speech enhancement / denoising model. Given a noisy speech waveform, the model should output the corresponding clean speech. The entire model architecture — including the choice of time-domain vs frequency-domain processing, encoder-decoder design, mask estimation strategy, and sequence modeling approach — is open for modification.

Background

Speech enhancement removes background noise, reverberation, or interfering speakers from an audio signal. Key paradigms include:

  • Frequency-domain masking: Estimate a mask (ideal ratio mask IRM, complex ideal ratio mask cIRM, or phase-sensitive mask PSM) in the STFT domain to filter noise
  • Time-domain processing: Directly map noisy to clean waveform using learned encoder-decoder (Conv-TasNet)
  • Complex-domain methods: Operate on complex STFT for joint magnitude and phase enhancement (DCCRN)
  • Dual-path models: Handle long sequences by segmenting and processing locally + globally (DPT-Net)
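As a concrete illustration of the masking paradigm, the sketch below builds an ideal ratio mask from toy magnitude spectrograms with NumPy. This is for intuition only: a real enhancement model estimates the mask from the noisy input alone, and the assumption that magnitudes add is only an approximation of the true complex-domain mixing.

```python
import numpy as np

rng = np.random.default_rng(0)
S = np.abs(rng.normal(size=(257, 100)))        # clean-speech magnitudes (freq x frames)
N = 0.5 * np.abs(rng.normal(size=(257, 100)))  # noise magnitudes

noisy = S + N             # mixture magnitude (additivity is an approximation)
irm = S / (S + N + 1e-8)  # ideal ratio mask, values in [0, 1]
enhanced = irm * noisy    # masked estimate of the clean magnitudes
```

With the true IRM the masked mixture recovers the clean magnitudes almost exactly; the learning problem is estimating such a mask without access to `S` and `N`.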

Task

Modify the EnhancementModel class in speechbrain/custom_speech_enhancement.py (lines 56-140). The class must implement:

  • __init__(self, n_fft=512, hop_length=256, d_model=256, n_layers=4, n_head=4, dropout=0.1)
  • forward(self, noisy_waveform) -> enhanced_waveform

Input: noisy 16 kHz waveform of shape (B, T). Output: enhanced waveform of shape (B, T), the same length as the input.

Training loss (SI-SNR + multi-resolution STFT), data loading, and evaluation are fixed.
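One minimal way to satisfy the required interface is the hypothetical STFT-masking baseline below: a Transformer encoder estimates a magnitude mask that is applied to the noisy complex STFT, and the inverse STFT restores the original length. This is a sketch, not the repository's `EnhancementModel`; all design choices (masking vs. direct mapping, the sigmoid mask, the Hann window) are the example's own assumptions.

```python
import torch
import torch.nn as nn

class EnhancementModel(nn.Module):
    """Hypothetical STFT-masking sketch matching the required interface."""

    def __init__(self, n_fft=512, hop_length=256, d_model=256,
                 n_layers=4, n_head=4, dropout=0.1):
        super().__init__()
        self.n_fft, self.hop_length = n_fft, hop_length
        n_freq = n_fft // 2 + 1
        self.in_proj = nn.Linear(n_freq, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_head, dim_feedforward=4 * d_model,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mask_proj = nn.Linear(d_model, n_freq)

    def forward(self, noisy_waveform):
        T = noisy_waveform.shape[-1]
        window = torch.hann_window(self.n_fft, device=noisy_waveform.device)
        spec = torch.stft(noisy_waveform, self.n_fft, self.hop_length,
                          window=window, return_complex=True)   # (B, F, frames)
        mag = spec.abs().transpose(1, 2)                        # (B, frames, F)
        mask = torch.sigmoid(self.mask_proj(self.encoder(self.in_proj(mag))))
        enhanced_spec = spec * mask.transpose(1, 2)             # mask the complex STFT
        return torch.istft(enhanced_spec, self.n_fft, self.hop_length,
                           window=window, length=T)             # (B, T)
```

Passing `length=T` to `torch.istft` guarantees the output matches the input length, which the task requires.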

Evaluation

  • Metrics: SI-SNR improvement (higher is better), PESQ (Perceptual Evaluation of Speech Quality; higher is better), STOI (Short-Time Objective Intelligibility; higher is better)
  • Environments: VoiceBank-DEMAND (stationary noise), Noisy-VCTK-56spk (multi-speaker diverse noise), LibriMix 2-speaker (separation)
  • Baselines: Conv-TasNet (time-domain), DCCRN (complex frequency-domain), DPT-Net (dual-path Transformer)
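SI-SNR improvement is the SI-SNR of the enhanced signal against the clean reference minus the SI-SNR of the unprocessed noisy input. The NumPy sketch below shows the metric for intuition only; the fixed training loss and evaluation use the repository's own implementation.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB (higher is better) for 1-D signals."""
    estimate = estimate - estimate.mean()   # remove DC offset
    target = target - target.mean()
    # Project the estimate onto the target to get the "target" component.
    s_target = np.dot(estimate, target) / (np.dot(target, target) + eps) * target
    e_noise = estimate - s_target           # residual treated as noise
    return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```

Because the estimate is projected onto the target, rescaling the estimate leaves the metric unchanged, hence "scale-invariant".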

Code

custom_speech_enhancement.py
```python
#!/usr/bin/env python3
"""Self-contained speech enhancement training script.

The EnhancementModel class (EDITABLE region) is the complete enhancement model:
 - STFT analysis/synthesis
 - Mask estimation or direct mapping network
 - Encoder-decoder architecture
 - Any time-domain or frequency-domain approach

FIXED: Noise mixing, training loss (SI-SNR + multi-resolution STFT), data loading, evaluation.

Environment variables:
 DATA_DIR — path to dataset root
 OUTPUT_DIR — output directory
 SEED — random seed (default: 42)
"""
```

Additional context files (read-only):

  • speechbrain/speechbrain/nnet/losses.py

Results

| Model | Type | SI-SNR (VoiceBank-DEMAND) | PESQ (VoiceBank-DEMAND) | STOI (VoiceBank-DEMAND) | SI-SNR (Noisy-VCTK-56spk) | PESQ (Noisy-VCTK-56spk) | STOI (Noisy-VCTK-56spk) |
|---|---|---|---|---|---|---|---|
| conv_tasnet | baseline | 19.536 | 2.683 | 0.943 | 18.845 | 2.736 | 0.943 |
| dccrn | baseline | 18.665 | 2.614 | 0.939 | 18.679 | 2.710 | 0.941 |
| dptnet | baseline | 19.040 | 2.482 | 0.936 | 18.992 | 2.445 | 0.937 |