llm-scaling-law-discovery
Description
SLDBench Scaling Law Discovery
Research Question
Can you design a scaling-law model that extrapolates accurately on held-out SLDBench scaling tasks while keeping a single functional form per task and fitting group-specific coefficients from the observed trials?
Background
This task is a pure SLDBench benchmark inspired by "Can Language Models Discover Scaling Laws?" (arXiv:2507.21184).
It keeps three representative and harder subsets (less saturated than the original parallel/moe/sft trio):
- sld-vocab: vocabulary scaling law — unigram-normalised loss as a function of non-vocabulary parameters N, vocabulary size V, and training characters D (see Tao et al., "Scaling Laws with Vocabulary").
- sld-lrbsz: learning-rate & batch-size scaling law — LM loss as a joint function of learning rate, batch size, training tokens, and non-embedding parameters.
- sld-dataconstrained: data-constrained scaling law — loss as a function of unique tokens U, parameters N, and total tokens D, where D can exceed U (data repetition). See Muennighoff et al. 2023.
The goal is not generic tabular regression. The intended object is a scaling law: a shared functional form for each benchmark, with coefficients that can vary by experimental group.
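For intuition, a Chinchilla-style form L(N, D) = E + A / N^alpha + B / D^beta is one example of such an object: a single expression whose coefficients (E, A, B, alpha, beta) can be refit per experimental group. A minimal sketch of fitting it with scipy follows; the functional form and the initial guesses are illustrative assumptions, not the intended answer for any benchmark here.

```python
import numpy as np
from scipy.optimize import least_squares

def chinchilla_loss(theta, N, D):
    """One shared form: L(N, D) = E + A / N**alpha + B / D**beta."""
    E, logA, logB, alpha, beta = theta
    return E + np.exp(logA) / N**alpha + np.exp(logB) / D**beta

def fit_group(N, D, y):
    """Refit the shared form's coefficients on one experimental group."""
    result = least_squares(
        lambda th: chinchilla_loss(th, N, D) - y,
        x0=np.array([1.0, 5.0, 5.0, 0.3, 0.3]),  # illustrative starting point
        max_nfev=10_000,
    )
    return result.x
```

Parameterizing A and B through their logs keeps both terms positive without hard constraints.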
Task
Edit the ScalingLawModel class in custom_scaling_law.py.
Your model receives:
- X_num: raw numeric inputs (see the per-benchmark list below)
- X_cat: categorical metadata, primarily the group column
- y: observed target losses on the training split
The runtime already loads the official SLDBench train/test splits from /data/scaling_law/*.jsonl.
The observed training trials are also mirrored into the editable workspace as read-only files:
- scaling-law-lab/observed_trials/sld_vocab_train.jsonl
- scaling-law-lab/observed_trials/sld_lrbsz_train.jsonl
- scaling-law-lab/observed_trials/sld_dataconstrained_train.jsonl
You are expected to inspect these raw train trials directly and discover benchmark-specific symbolic laws. Large pretrained LMs are not allowed.
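A minimal way to inspect the mirrored trials is a plain JSON Lines loader; the sketch below assumes only that each file contains one JSON trial object per line, as in the excerpts further down.

```python
import json

def load_trials(path):
    """Read one observed-trials file: JSON Lines, one trial per line."""
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

def split_by_group(records):
    """Bucket trials by their `group` field for per-group fitting."""
    groups = {}
    for record in records:
        groups.setdefault(record["group"], []).append(record)
    return groups
```

For example, `load_trials("scaling-law-lab/observed_trials/sld_vocab_train.jsonl")` followed by `split_by_group(...)` gives per-group trial lists to plot and probe for candidate laws.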
Benchmarks
sld-vocab
- numeric inputs: non_vocab_parameters, vocab_size, num_characters
- categorical input: group
- target: unigram_normalized_loss (can be negative)

sld-lrbsz
- numeric inputs: lr, bsz, data_size, non_embedding_param_size
- categorical input: group
- target: lm_loss

sld-dataconstrained
- numeric inputs: unique_tokens, params, tokens
- categorical input: group
- target: loss
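Since benchmark_name selects the benchmark, one hedged starting point is a different candidate family per benchmark. The two sketches below are illustrative guesses, not the laws the benchmark expects: additive inverse power laws for the vocab inputs, and power laws in data and parameters plus a quadratic bowl in log learning rate whose optimum is allowed to shift with batch size for lrbsz.

```python
import numpy as np

def vocab_form(theta, N, V, D):
    """Guess for sld-vocab: additive inverse power laws in N, V, D.
    The target can be negative, so nothing is clipped."""
    c0, a, alpha, b, beta, g, gamma = theta
    return c0 + a / N**alpha + b / V**beta + g / D**gamma

def lrbsz_form(theta, lr, bsz, D, N):
    """Guess for sld-lrbsz: power laws in D and N plus a quadratic bowl
    in log(lr) whose optimum shifts with log(bsz)."""
    c0, a, alpha, b, beta, k, mu, nu = theta
    opt_log_lr = mu + nu * np.log(bsz)
    return c0 + a / D**alpha + b / N**beta + k * (np.log(lr) - opt_log_lr) ** 2
```

Plotting residuals of a simple form against each input, one group at a time, is the usual way to decide which extra terms these sketches are missing.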
Interface
Implement:
class ScalingLawModel:
def __init__(self, benchmark_name, numeric_names, categorical_names):
...
def fit(self, X_num, X_cat, y):
return self
def predict(self, X_num, X_cat):
return y_pred
benchmark_name lets you use different law families for vocab, lrbsz, and dataconstrained. You are free to write a different symbolic form for each benchmark, as long as you keep one shared expression within each benchmark and fit group-specific coefficients.
Note: for sld-vocab the target (unigram_normalized_loss) can be negative, so do not clip your predictions to positive values.
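A skeleton that satisfies this interface, keeping one shared expression and fitting coefficients per group, might look as follows. The placeholder law (an offset plus one inverse-power term per numeric input) is an assumption to show the plumbing, not a recommended form, and it does not clip predictions, per the note above. It also assumes `group` is the first categorical column.

```python
import numpy as np
from scipy.optimize import least_squares

class ScalingLawModel:
    """One shared functional form; coefficients fit separately per group."""

    def __init__(self, benchmark_name, numeric_names, categorical_names):
        self.benchmark_name = benchmark_name
        self.numeric_names = numeric_names
        self.categorical_names = categorical_names
        self.coefs = {}  # group label -> fitted parameter vector

    def _form(self, theta, X_num):
        # Placeholder law: c0 + sum_i a_i * x_i**(-alpha_i).
        out = np.full(X_num.shape[0], theta[0], dtype=float)
        for i in range(X_num.shape[1]):
            a, alpha = theta[1 + 2 * i], theta[2 + 2 * i]
            out += a * X_num[:, i] ** (-alpha)
        return out

    def fit(self, X_num, X_cat, y):
        X_num = np.asarray(X_num, dtype=float)
        y = np.asarray(y, dtype=float)
        groups = np.asarray(X_cat)[:, 0]  # assumes `group` is column 0
        for g in np.unique(groups):
            m = groups == g
            res = least_squares(
                lambda th: self._form(th, X_num[m]) - y[m],
                x0=np.concatenate(
                    [[y[m].mean()], np.tile([1.0, 0.3], X_num.shape[1])]
                ),
                max_nfev=5000,
            )
            self.coefs[g] = res.x
        return self

    def predict(self, X_num, X_cat):
        X_num = np.asarray(X_num, dtype=float)
        groups = np.asarray(X_cat)[:, 0]
        y_pred = np.empty(X_num.shape[0], dtype=float)
        for g in np.unique(groups):
            m = groups == g
            theta = self.coefs.get(g)
            if theta is None:  # unseen group: reuse any fitted coefficients
                theta = next(iter(self.coefs.values()))
            y_pred[m] = self._form(theta, X_num[m])
        return y_pred
```

The per-group loop in fit is what distinguishes this from collapsing all groups into one regression; the shared `_form` keeps a single expression per benchmark.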
Evaluation
Primary metric: held-out test R^2 for each benchmark.
Secondary metrics:
- MAE
- RMSE
- NMAE
Strong solutions usually have two properties:
- they fit coefficients per group instead of collapsing all groups together
- they preserve sensible asymptotics on larger or denser test points
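The metrics can be computed as below; note that the NMAE normalization here (MAE divided by the mean absolute target) is one common convention and an assumption about the harness, not a confirmed definition.

```python
import numpy as np

def scores(y_true, y_pred):
    """Held-out R^2 plus the secondary MAE / RMSE / NMAE metrics."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = float(np.mean(np.abs(err)))
    return {
        "r2": 1.0 - np.sum(err**2) / np.sum((y_true - y_true.mean()) ** 2),
        "mae": mae,
        "rmse": float(np.sqrt(np.mean(err**2))),
        "nmae": mae / float(np.mean(np.abs(y_true))),  # assumed normalization
    }
```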
Code
#!/usr/bin/env python3
"""Pure SLDBench scaling-law discovery benchmark."""

import argparse
import json
import os
import random
from dataclasses import dataclass
from pathlib import Path

import numpy as np
from scipy.optimize import least_squares


DATA_DIR = Path(os.environ.get("SCALING_LAW_DATA_DIR", "/data/scaling_law"))
observed_trials/sld_vocab_train.jsonl (excerpt):

{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 180820540.85207617, "unigram_normalized_loss": -1.3494449853897097}
{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 361641081.7041524, "unigram_normalized_loss": -2.0973777770996094}
{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 542461622.5562285, "unigram_normalized_loss": -2.502056121826172}
{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 723282163.4083047, "unigram_normalized_loss": -2.7609572410583483}
{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 904102704.2603807, "unigram_normalized_loss": -2.8973426818847656}
{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 1084923245.112457, "unigram_normalized_loss": -3.036267042160034}
{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 1265743785.964533, "unigram_normalized_loss": -3.1521780490875244}
{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 1446564326.8166094, "unigram_normalized_loss": -3.2433104515075684}
{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 1627384867.6686854, "unigram_normalized_loss": -3.4037160873413086}
{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 1808205408.5207615, "unigram_normalized_loss": -3.5318007469177246}
{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 1989025949.3728375, "unigram_normalized_loss": -3.621314525604248}
{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 2169846490.224914, "unigram_normalized_loss": -3.7126035690307617}
{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 2350667031.07699, "unigram_normalized_loss": -3.743427038192749}
{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 2531487571.929066, "unigram_normalized_loss": -3.7540900707244873}
{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 2712308112.7811427, "unigram_normalized_loss": -3.828134059906006}
observed_trials/sld_lrbsz_train.jsonl (excerpt):

{"group": "all_data", "lr": 0.0003453, "bsz": 736.0, "data_size": 100000000000.0, "non_embedding_param_size": 214663680.0, "lm_loss": 2.3976109618296744}
{"group": "all_data", "lr": 0.005524, "bsz": 736.0, "data_size": 50000000000.0, "non_embedding_param_size": 429260800.0, "lm_loss": 2.2778836983119093}
{"group": "all_data", "lr": 0.0003453, "bsz": 736.0, "data_size": 80000000000.0, "non_embedding_param_size": 268304384.0, "lm_loss": 2.36255410523428}
{"group": "all_data", "lr": 0.002762, "bsz": 736.0, "data_size": 50000000000.0, "non_embedding_param_size": 429260800.0, "lm_loss": 2.26677528064734}
{"group": "all_data", "lr": 0.0004883, "bsz": 1024.0, "data_size": 50000000000.0, "non_embedding_param_size": 429260800.0, "lm_loss": 2.307012281611546}
{"group": "all_data", "lr": 0.0009766, "bsz": 32.0, "data_size": 100000000000.0, "non_embedding_param_size": 214663680.0, "lm_loss": 2.4137805813253053}
{"group": "all_data", "lr": 0.001381, "bsz": 128.0, "data_size": 100000000000.0, "non_embedding_param_size": 214663680.0, "lm_loss": 2.3583542548184586}
{"group": "all_data", "lr": 0.003906, "bsz": 2048.0, "data_size": 50000000000.0, "non_embedding_param_size": 429260800.0, "lm_loss": 2.312878249783144}
{"group": "all_data", "lr": 0.002762, "bsz": 32.0, "data_size": 50000000000.0, "non_embedding_param_size": 429260800.0, "lm_loss": 2.380411149629176}
{"group": "all_data", "lr": 0.0009766, "bsz": 352.0, "data_size": 50000000000.0, "non_embedding_param_size": 429260800.0, "lm_loss": 2.2687204593838177}
{"group": "all_data", "lr": 0.001381, "bsz": 1024.0, "data_size": 80000000000.0, "non_embedding_param_size": 268304384.0, "lm_loss": 2.326585942511133}
{"group": "all_data", "lr": 0.0009766, "bsz": 192.0, "data_size": 50000000000.0, "non_embedding_param_size": 429260800.0, "lm_loss": 2.265579835154685}
{"group": "all_data", "lr": 0.0006905, "bsz": 128.0, "data_size": 80000000000.0, "non_embedding_param_size": 268304384.0, "lm_loss": 2.323627755139312}
{"group": "all_data", "lr": 0.01105, "bsz": 192.0, "data_size": 80000000000.0, "non_embedding_param_size": 268304384.0, "lm_loss": 2.37908273262946}
{"group": "all_data", "lr": 0.002762, "bsz": 64.0, "data_size": 80000000000.0, "non_embedding_param_size": 268304384.0, "lm_loss": 2.367986859925716}
observed_trials/sld_dataconstrained_train.jsonl (excerpt):

{"group": "all_data", "unique_tokens": 4000000000.0, "params": 2810000000.0, "tokens": 32000000000.0, "loss": 2.722962}
{"group": "all_data", "unique_tokens": 4000000000.0, "params": 2810000000.0, "tokens": 40000000000.0, "loss": 2.706547}
{"group": "all_data", "unique_tokens": 4000000000.0, "params": 2810000000.0, "tokens": 55000000000.0, "loss": 2.696432}
{"group": "all_data", "unique_tokens": 9000000000.0, "params": 2810000000.0, "tokens": 55000000000.0, "loss": 2.611045}
{"group": "all_data", "unique_tokens": 11000000000.0, "params": 2810000000.0, "tokens": 55000000000.0, "loss": 2.598793}
{"group": "all_data", "unique_tokens": 14000000000.0, "params": 2810000000.0, "tokens": 55000000000.0, "loss": 2.589427}
{"group": "all_data", "unique_tokens": 18000000000.0, "params": 2810000000.0, "tokens": 55000000000.0, "loss": 2.584592}
{"group": "all_data", "unique_tokens": 28000000000.0, "params": 2810000000.0, "tokens": 55000000000.0, "loss": 2.579361}
{"group": "all_data", "unique_tokens": 55000000000.0, "params": 2810000000.0, "tokens": 55000000000.0, "loss": 2.574117}
{"group": "all_data", "unique_tokens": 100000000.0, "params": 7098752.0, "tokens": 100000000.0, "loss": 8.102005}
{"group": "all_data", "unique_tokens": 100000000.0, "params": 7098752.0, "tokens": 200000000.0, "loss": 7.36236}
{"group": "all_data", "unique_tokens": 100000000.0, "params": 1096300000.0, "tokens": 100000000.0, "loss": 6.611002}
{"group": "all_data", "unique_tokens": 100000000.0, "params": 14100000.0, "tokens": 100000000.0, "loss": 7.278144}
{"group": "all_data", "unique_tokens": 400000000.0, "params": 19703712.0, "tokens": 400000000.0, "loss": 6.096268}
{"group": "all_data", "unique_tokens": 400000000.0, "params": 35500000.0, "tokens": 400000000.0, "loss": 5.79413}
Results
No results yet.