speech-asr-encoder
Description
Speech ASR Encoder Design
Research Question
Design an improved end-to-end speech recognition encoder architecture. Given raw audio waveform input, the encoder must produce frame-level representations for CTC (Connectionist Temporal Classification) decoding. The entire encoder design — including feature extraction, subsampling, encoder blocks, attention mechanism, position encoding, and normalization — is open for modification.
Background
Modern ASR encoders have evolved from simple RNN-based models to Transformer variants. The Conformer (Gulati et al., 2020) introduced convolution-augmented Transformers that capture both local and global patterns in speech. The Branchformer (Peng et al., 2022) uses parallel attention and convolution branches instead of sequential composition. Key design decisions include:
- Feature extraction: Log-mel filterbanks are standard, but learnable frontends (SincNet, LEAF) can adapt to the data
- Subsampling: CNN-based downsampling reduces the long speech sequences (100 frames/sec) to manageable lengths
- Encoder blocks: The arrangement of attention, convolution, and feed-forward modules significantly affects performance
- Position encoding: Speech sequences are long and variable-length, requiring effective positional information
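To make the block-design decision concrete, here is a minimal PyTorch sketch of one Conformer-style block (half-step feed-forward, self-attention, depthwise-convolution module, half-step feed-forward, each pre-normalized with a residual connection). It is illustrative only: it omits the relative positional encoding of the original paper, and the class name and hyperparameter defaults are this sketch's own, not the baseline implementation's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConformerBlockSketch(nn.Module):
    """Illustrative Conformer-style block: FFN/2 -> MHSA -> conv -> FFN/2.
    Omits relative positional encoding; hyperparameters are placeholders."""

    def __init__(self, d_model=256, n_head=4, kernel_size=31, d_ffn=1024, dropout=0.1):
        super().__init__()

        def ffn():
            return nn.Sequential(
                nn.LayerNorm(d_model), nn.Linear(d_model, d_ffn), nn.SiLU(),
                nn.Dropout(dropout), nn.Linear(d_ffn, d_model), nn.Dropout(dropout))

        self.ffn1, self.ffn2 = ffn(), ffn()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, dropout=dropout,
                                          batch_first=True)
        self.conv_norm = nn.LayerNorm(d_model)
        self.pw1 = nn.Conv1d(d_model, 2 * d_model, 1)          # pointwise, expands for GLU
        self.dw = nn.Conv1d(d_model, d_model, kernel_size,
                            padding=kernel_size // 2, groups=d_model)  # depthwise
        self.bn = nn.BatchNorm1d(d_model)
        self.pw2 = nn.Conv1d(d_model, d_model, 1)              # pointwise projection
        self.out_norm = nn.LayerNorm(d_model)

    def forward(self, x):                                      # x: (B, T, d_model)
        x = x + 0.5 * self.ffn1(x)                             # half-step FFN
        y = self.attn_norm(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]      # global context
        y = self.conv_norm(x).transpose(1, 2)                  # (B, d_model, T)
        y = F.glu(self.pw1(y), dim=1)
        y = self.pw2(F.silu(self.bn(self.dw(y))))              # local context
        x = x + y.transpose(1, 2)
        x = x + 0.5 * self.ffn2(x)                             # half-step FFN
        return self.out_norm(x)
```

The depthwise convolution (`groups=d_model`) is what gives the block local modeling at low cost; `padding=kernel_size // 2` keeps the frame count unchanged so residual connections line up. The Branchformer replaces this sequential attention-then-conv composition with two parallel branches that are merged.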
Task
Modify the SpeechEncoder class in speechbrain/custom_asr_encoder.py (lines 61-220). The class must implement:
__init__(self, n_vocab, d_model=256, n_head=4, n_layers=4, d_ffn=1024, kernel_size=31, dropout=0.1)
forward(self, waveform, wav_lens) -> (log_probs, out_lens)
Input: raw 16kHz waveform (B, T) and relative lengths (B,).
Output: CTC log-probabilities (B, T', n_vocab) and relative output lengths (B,).
The CTC decoder head, data loading, training loop, and evaluation are fixed.
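A minimal skeleton that satisfies the required interface might look like the following. This is a sketch, not the baseline: the strided-Conv1d frontend, the plain `nn.TransformerEncoder` stack, and the class name are placeholder choices (it even omits positional encoding, one of the design axes listed above); only the constructor arguments and the `forward` contract come from the task spec.

```python
import torch
import torch.nn as nn

class SpeechEncoderSketch(nn.Module):
    """Placeholder encoder matching the required interface.
    Frontend, block type, and position handling are all up for redesign."""

    def __init__(self, n_vocab, d_model=256, n_head=4, n_layers=4,
                 d_ffn=1024, kernel_size=31, dropout=0.1):
        super().__init__()
        # Frontend + subsampling on the raw 16 kHz waveform:
        # 400-sample window / 160-sample hop (~10 ms), then 2x downsampling.
        self.frontend = nn.Sequential(
            nn.Conv1d(1, d_model, kernel_size=400, stride=160), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, n_head, d_ffn, dropout,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_vocab)   # projection to CTC vocabulary

    def forward(self, waveform, wav_lens):
        # waveform: (B, T) raw audio; wav_lens: (B,) relative lengths in [0, 1]
        feats = self.frontend(waveform.unsqueeze(1)).transpose(1, 2)  # (B, T', d_model)
        enc = self.encoder(feats)
        log_probs = self.head(enc).log_softmax(dim=-1)                # (B, T', n_vocab)
        # Subsampling is (approximately) proportional, so relative lengths carry over.
        return log_probs, wav_lens
```

Returning `wav_lens` unchanged works because lengths are expressed relative to the longest utterance in the batch and the subsampling ratio is (nearly) uniform across utterances.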
Evaluation
- Metric: Word Error Rate (WER, lower is better) for English/Spanish; Character Error Rate (CER, lower is better) for Chinese
- Environments: LibriSpeech train-clean-100 (English), AISHELL-1 (Mandarin Chinese), MLS Spanish
- Baselines: Conformer (near state of the art), pure Transformer (simpler), Branchformer (parallel attention/convolution branches)
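For reference, WER is the word-level Levenshtein distance (substitutions + insertions + deletions) divided by the number of reference words; CER is the same computation over characters. A dependency-free sketch (the function name is this example's own; the fixed evaluation harness computes the metric itself):

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """WER = (substitutions + insertions + deletions) / #reference words,
    via standard Levenshtein dynamic programming. For CER, pass
    list(ref) / list(hyp) characters instead of split() words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j                       # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why the Results table values near 0.7 for the weaker baselines on AISHELL-1 are plausible.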
Code
```python
#!/usr/bin/env python3
"""Self-contained ASR training script for speech-asr-encoder task.

The SpeechEncoder class (EDITABLE region) is the complete encoder architecture:
- Audio feature extraction (waveform → frame-level features)
- Subsampling / downsampling
- Encoder blocks (attention, convolution, feed-forward, normalization)
- Position encoding

FIXED: CTC decoder head, data loading, training loop, evaluation.

Environment variables:
    DATA_DIR — path to dataset root (e.g. /data/speech/asr/librispeech-100)
    OUTPUT_DIR — output directory for checkpoints
    SEED — random seed (default: 42)
```
Additional context files (read-only):
speechbrain/speechbrain/nnet/attention.py
Results
| Model | Type | WER librispeech-100h ↓ | CER librispeech-100h ↓ | WER aishell-1 ↓ | CER aishell-1 ↓ | WER mls-spanish ↓ | CER mls-spanish ↓ |
|---|---|---|---|---|---|---|---|
| branchformer | baseline | 0.305 | 0.103 | 0.664 | 0.116 | 0.305 | 0.079 |
| conformer | baseline | 0.231 | 0.079 | 0.663 | 0.113 | 0.240 | 0.065 |
| transformer | baseline | 0.478 | 0.163 | 0.703 | 0.128 | 0.588 | 0.135 |