speech-asr-encoder
Description
Speech ASR Encoder Design
Research Question
Design an improved end-to-end speech recognition encoder architecture. Given raw audio waveform input, the encoder must produce frame-level representations for CTC (Connectionist Temporal Classification) decoding. The entire encoder design — including feature extraction, subsampling, encoder blocks, attention mechanism, position encoding, and normalization — is open for modification.
Background
Modern ASR encoders have evolved from simple RNN-based models to Transformer variants. The Conformer (Gulati et al., 2020) introduced convolution-augmented Transformers that capture both local and global patterns in speech. The Branchformer (Peng et al., 2022) uses parallel attention and convolution branches instead of sequential composition. Key design decisions include:
- Feature extraction: Log-mel filterbanks are standard, but learnable frontends (SincNet, LEAF) can adapt to the data
- Subsampling: CNN-based downsampling reduces the long speech sequences (100 frames/sec) to manageable lengths
- Encoder blocks: The arrangement of attention, convolution, and feed-forward modules significantly affects performance
- Position encoding: Speech sequences are long and variable-length, requiring effective positional information
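To make the block-design decision concrete, here is a minimal PyTorch sketch of one Conformer-style block (half-step feed-forward, self-attention, depthwise-convolution module, half-step feed-forward, each pre-normalized with a residual connection). It is illustrative only: it omits the relative positional encoding of the original paper, and the class name and hyperparameter defaults are this sketch's own, not the baseline implementation's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConformerBlockSketch(nn.Module):
    """Illustrative Conformer-style block: FFN/2 -> MHSA -> conv -> FFN/2.
    Omits relative positional encoding; hyperparameters are placeholders."""

    def __init__(self, d_model=256, n_head=4, kernel_size=31, d_ffn=1024, dropout=0.1):
        super().__init__()

        def ffn():
            return nn.Sequential(
                nn.LayerNorm(d_model), nn.Linear(d_model, d_ffn), nn.SiLU(),
                nn.Dropout(dropout), nn.Linear(d_ffn, d_model), nn.Dropout(dropout))

        self.ffn1, self.ffn2 = ffn(), ffn()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, dropout=dropout,
                                          batch_first=True)
        self.conv_norm = nn.LayerNorm(d_model)
        self.pw1 = nn.Conv1d(d_model, 2 * d_model, 1)          # pointwise, expands for GLU
        self.dw = nn.Conv1d(d_model, d_model, kernel_size,
                            padding=kernel_size // 2, groups=d_model)  # depthwise
        self.bn = nn.BatchNorm1d(d_model)
        self.pw2 = nn.Conv1d(d_model, d_model, 1)              # pointwise projection
        self.out_norm = nn.LayerNorm(d_model)

    def forward(self, x):                                      # x: (B, T, d_model)
        x = x + 0.5 * self.ffn1(x)                             # half-step FFN
        y = self.attn_norm(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]      # global context
        y = self.conv_norm(x).transpose(1, 2)                  # (B, d_model, T)
        y = F.glu(self.pw1(y), dim=1)
        y = self.pw2(F.silu(self.bn(self.dw(y))))              # local context
        x = x + y.transpose(1, 2)
        x = x + 0.5 * self.ffn2(x)                             # half-step FFN
        return self.out_norm(x)
```

The depthwise convolution (`groups=d_model`) is what gives the block local modeling at low cost; `padding=kernel_size // 2` keeps the frame count unchanged so residual connections line up. The Branchformer replaces this sequential attention-then-conv composition with two parallel branches that are merged.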
Task
Modify the SpeechEncoder class in speechbrain/custom_asr_encoder.py (lines 61-220). The class must implement:
__init__(self, n_vocab, d_model=256, n_head=4, n_layers=4, d_ffn=1024, kernel_size=31, dropout=0.1)
forward(self, waveform, wav_lens) -> (log_probs, out_lens)
Input: raw 16kHz waveform (B, T) and relative lengths (B,).
Output: CTC log-probabilities (B, T', n_vocab) and relative output lengths (B,).
The CTC decoder head, data loading, training loop, and evaluation are fixed.
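A minimal skeleton that satisfies the required interface might look like the following. This is a sketch, not the baseline: the strided-Conv1d frontend, the plain `nn.TransformerEncoder` stack, and the class name are placeholder choices (it even omits positional encoding, one of the design axes listed above); only the constructor arguments and the `forward` contract come from the task spec.

```python
import torch
import torch.nn as nn

class SpeechEncoderSketch(nn.Module):
    """Placeholder encoder matching the required interface.
    Frontend, block type, and position handling are all up for redesign."""

    def __init__(self, n_vocab, d_model=256, n_head=4, n_layers=4,
                 d_ffn=1024, kernel_size=31, dropout=0.1):
        super().__init__()
        # Frontend + subsampling on the raw 16 kHz waveform:
        # 400-sample window / 160-sample hop (~10 ms), then 2x downsampling.
        self.frontend = nn.Sequential(
            nn.Conv1d(1, d_model, kernel_size=400, stride=160), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, n_head, d_ffn, dropout,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_vocab)   # projection to CTC vocabulary

    def forward(self, waveform, wav_lens):
        # waveform: (B, T) raw audio; wav_lens: (B,) relative lengths in [0, 1]
        feats = self.frontend(waveform.unsqueeze(1)).transpose(1, 2)  # (B, T', d_model)
        enc = self.encoder(feats)
        log_probs = self.head(enc).log_softmax(dim=-1)                # (B, T', n_vocab)
        # Subsampling is (approximately) proportional, so relative lengths carry over.
        return log_probs, wav_lens
```

Returning `wav_lens` unchanged works because lengths are expressed relative to the longest utterance in the batch and the subsampling ratio is (nearly) uniform across utterances.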
Evaluation
- Metric: Word Error Rate (WER, lower is better) for English/Spanish; Character Error Rate (CER, lower is better) for Chinese
- Environments: LibriSpeech train-clean-100 (English), AISHELL-1 (Mandarin Chinese), MLS Spanish
- Baselines: Conformer (near state of the art), pure Transformer (simpler), Branchformer (parallel attention/convolution branches)
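For reference, WER is the word-level Levenshtein distance (substitutions + insertions + deletions) divided by the number of reference words; CER is the same computation over characters. A dependency-free sketch (the function name is this example's own; the fixed evaluation harness computes the metric itself):

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """WER = (substitutions + insertions + deletions) / #reference words,
    via standard Levenshtein dynamic programming. For CER, pass
    list(ref) / list(hyp) characters instead of split() words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j                       # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why the Results table values near 0.7 for the weaker baselines on AISHELL-1 are plausible.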
Code
```python
#!/usr/bin/env python3
"""Self-contained ASR training script for speech-asr-encoder task.

The SpeechEncoder class (EDITABLE region) is the complete encoder architecture:
- Audio feature extraction (waveform → frame-level features)
- Subsampling / downsampling
- Encoder blocks (attention, convolution, feed-forward, normalization)
- Position encoding

FIXED: CTC decoder head, data loading, training loop, evaluation.

Environment variables:
    DATA_DIR — path to dataset root (e.g. /data/speech/asr/librispeech-100)
    OUTPUT_DIR — output directory for checkpoints
    SEED — random seed (default: 42)
```
Additional context files (read-only):
speechbrain/speechbrain/nnet/attention.py
Results
| Model | Type | WER librispeech-100h ↓ | CER librispeech-100h ↓ | WER aishell-1 ↓ | CER aishell-1 ↓ | WER mls-spanish ↓ | CER mls-spanish ↓ |
|---|---|---|---|---|---|---|---|
| branchformer | baseline | 0.305 | 0.103 | 0.664 | 0.116 | 0.305 | 0.079 |
| conformer | baseline | 0.231 | 0.079 | 0.663 | 0.113 | 0.240 | 0.065 |
| transformer | baseline | 0.478 | 0.163 | 0.703 | 0.128 | 0.588 | 0.135 |