llm-scaling-law-discovery

Language Models · scaling-law-lab · rigorous codebase

Description

SLDBench Scaling Law Discovery

Research Question

Can you design a better scaling-law model that extrapolates accurately to held-out SLDBench scaling tasks, while keeping a single functional form per task and fitting group-specific coefficients from the observed trials?

Background

This task is a pure SLDBench benchmark inspired by "Can Language Models Discover Scaling Laws?" (arXiv:2507.21184).

It keeps three representative, harder subsets (less saturated than the original parallel/moe/sft trio):

  • sld-vocab: vocabulary scaling law — unigram-normalised loss as a function of non-vocabulary parameters N, vocabulary size V, and training characters D (see Tao et al. "Scaling Laws with Vocabulary").
  • sld-lrbsz: learning-rate & batch-size scaling law — LM loss as a joint function of learning rate, batch size, training tokens, and non-embedding parameters.
  • sld-dataconstrained: data-constrained scaling law — loss as a function of unique tokens U, parameters N, and total tokens D, where D can exceed U (data repetition). See Muennighoff et al. 2023.

The goal is not generic tabular regression. The intended object is a scaling law: a shared functional form for each benchmark, with coefficients that can vary by experimental group.
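
For intuition, the canonical example of such an object is the Chinchilla form of Hoffmann et al. (2022), L(N, D) = E + A/N^alpha + B/D^beta. The SLDBench laws above are different, but they share that shape: one symbolic expression with a handful of fitted constants. A sketch, for illustration only (the constants are the published Chinchilla fits, not ground truth for any task here):

import numpy as np

def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    # Irreducible loss plus power-law penalties in parameters N and
    # training tokens D (Hoffmann et al., 2022); illustrative only.
    return E + A / np.power(N, alpha) + B / np.power(D, beta)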

Task

Edit the ScalingLawModel class in custom_scaling_law.py.

Your model receives:

  • X_num: raw numeric inputs (see per-benchmark list below)
  • X_cat: categorical metadata, primarily the group
  • y: observed target losses on the training split

The runtime already loads the official SLDBench train/test splits from /data/scaling_law/*.jsonl.

The observed training trials are also mirrored into the editable workspace as read-only files:

  • scaling-law-lab/observed_trials/sld_vocab_train.jsonl
  • scaling-law-lab/observed_trials/sld_lrbsz_train.jsonl
  • scaling-law-lab/observed_trials/sld_dataconstrained_train.jsonl

You are expected to inspect these raw training trials directly and discover benchmark-specific symbolic laws. Large pretrained LMs are not allowed.
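
A minimal way to load one of these files for inspection (a sketch using only the standard library; the path assumes the workspace layout above):

import json
from pathlib import Path

def load_trials(path):
    # Each line of the JSONL file is one observed training trial.
    with Path(path).open() as f:
        return [json.loads(line) for line in f if line.strip()]

trials = load_trials("scaling-law-lab/observed_trials/sld_vocab_train.jsonl")
print(len(trials), sorted(trials[0]))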

Benchmarks

  • sld-vocab
    • numeric inputs: non_vocab_parameters, vocab_size, num_characters
    • categorical input: group
    • target: unigram_normalized_loss (can be negative)
  • sld-lrbsz
    • numeric inputs: lr, bsz, data_size, non_embedding_param_size
    • categorical input: group
    • target: lm_loss
  • sld-dataconstrained
    • numeric inputs: unique_tokens, params, tokens
    • categorical input: group
    • target: loss

Interface

Implement:

class ScalingLawModel:
    def __init__(self, benchmark_name, numeric_names, categorical_names):
        ...

    def fit(self, X_num, X_cat, y):
        return self

    def predict(self, X_num, X_cat):
        return y_pred

benchmark_name lets you use different law families for vocab, lrbsz, and dataconstrained: you are free to write a different symbolic form per benchmark, as long as each benchmark keeps one shared expression whose coefficients are fit per group.

Note: for sld-vocab the target (unigram_normalized_loss) can be negative, so do not clip your predictions to positive values.
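
For orientation, here is a minimal sketch of one way to satisfy the interface. The functional form is a placeholder (a sum of per-input power laws), not the intended law for any benchmark; it assumes X_num arrives as a 2-D float array with columns in the order listed above and that the first column of X_cat is the group:

import numpy as np
from scipy.optimize import least_squares

class ScalingLawModel:
    """One shared functional form per benchmark; coefficients refit per group."""

    def __init__(self, benchmark_name, numeric_names, categorical_names):
        self.benchmark_name = benchmark_name
        self.numeric_names = numeric_names
        self.categorical_names = categorical_names
        self.params_by_group = {}

    def _form(self, theta, X):
        # Placeholder law: E + sum_i A_i / x_i**alpha_i over numeric inputs.
        k = X.shape[1]
        E, A, alpha = theta[0], theta[1:1 + k], theta[1 + k:1 + 2 * k]
        return E + np.sum(A / np.power(X, alpha), axis=1)

    def fit(self, X_num, X_cat, y):
        X_num = np.asarray(X_num, dtype=float)
        y = np.asarray(y, dtype=float)
        groups = np.asarray(X_cat)[:, 0]
        k = X_num.shape[1]
        for g in np.unique(groups):
            m = groups == g
            theta0 = np.concatenate([[y[m].mean()], np.ones(k), 0.3 * np.ones(k)])
            res = least_squares(lambda t: self._form(t, X_num[m]) - y[m],
                                theta0, max_nfev=5000)
            self.params_by_group[g] = res.x
        # Average fit as a fallback for groups unseen at test time.
        self._default = np.mean(list(self.params_by_group.values()), axis=0)
        return self

    def predict(self, X_num, X_cat):
        X_num = np.asarray(X_num, dtype=float)
        groups = np.asarray(X_cat)[:, 0]
        y_pred = np.empty(len(X_num), dtype=float)
        for g in np.unique(groups):
            m = groups == g
            y_pred[m] = self._form(self.params_by_group.get(g, self._default),
                                   X_num[m])
        return y_pred

Note that for sld-lrbsz in particular, a monotone power law in lr cannot capture the U-shaped loss-versus-learning-rate curve, so this placeholder is a deliberately weak baseline to improve on.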

Evaluation

Primary metric: held-out test R^2 for each benchmark.

Secondary metrics (a computation sketch follows this list):

  • MAE (mean absolute error)
  • RMSE (root mean squared error)
  • NMAE (normalised mean absolute error)
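
For local sanity checks, these can be computed with numpy. NMAE is taken here as MAE divided by the mean absolute target, which is one common convention and may not match the harness's exact definition:

import numpy as np

def metrics(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    nmae = mae / float(np.mean(np.abs(y_true)))  # assumed normalisation
    return {"R2": float(r2), "MAE": mae, "RMSE": rmse, "NMAE": nmae}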

Strong solutions usually have two properties:

  • they fit coefficients per group instead of collapsing all groups together
  • they preserve sensible asymptotics on larger or denser test points (a quick check is sketched below)
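
One quick, informal check of the second property, assuming the ScalingLawModel sketch from the Interface section and a fitted instance named model ("all_data" is the group seen in the mirrored trials):

import numpy as np

# Scale the largest observed numeric point up by 10x, 100x, 1000x and
# confirm predictions stay finite and drift toward a plateau rather
# than diverging. `model` and `X_num` come from the sketch above.
scales = np.array([[10.0], [100.0], [1000.0]])
X_big = X_num.max(axis=0, keepdims=True) * scales
X_cat_big = np.array([["all_data"]] * len(scales))
print(model.predict(X_big, X_cat_big))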

Code

custom_scaling_law.py
#!/usr/bin/env python3
"""Pure SLDBench scaling-law discovery benchmark."""

import argparse
import json
import os
import random
from dataclasses import dataclass
from pathlib import Path

import numpy as np
from scipy.optimize import least_squares


DATA_DIR = Path(os.environ.get("SCALING_LAW_DATA_DIR", "/data/scaling_law"))
sld_vocab_train.jsonl
1{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 180820540.85207617, "unigram_normalized_loss": -1.3494449853897097}
2{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 361641081.7041524, "unigram_normalized_loss": -2.0973777770996094}
3{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 542461622.5562285, "unigram_normalized_loss": -2.502056121826172}
4{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 723282163.4083047, "unigram_normalized_loss": -2.7609572410583483}
5{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 904102704.2603807, "unigram_normalized_loss": -2.8973426818847656}
6{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 1084923245.112457, "unigram_normalized_loss": -3.036267042160034}
7{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 1265743785.964533, "unigram_normalized_loss": -3.1521780490875244}
8{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 1446564326.8166094, "unigram_normalized_loss": -3.2433104515075684}
9{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 1627384867.6686854, "unigram_normalized_loss": -3.4037160873413086}
10{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 1808205408.5207615, "unigram_normalized_loss": -3.5318007469177246}
11{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 1989025949.3728375, "unigram_normalized_loss": -3.621314525604248}
12{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 2169846490.224914, "unigram_normalized_loss": -3.7126035690307617}
13{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 2350667031.07699, "unigram_normalized_loss": -3.743427038192749}
14{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 2531487571.929066, "unigram_normalized_loss": -3.7540900707244873}
15{"group": "all_data", "non_vocab_parameters": 33222784.0, "vocab_size": 4096.0, "num_characters": 2712308112.7811427, "unigram_normalized_loss": -3.828134059906006}
sld_lrbsz_train.jsonl
1{"group": "all_data", "lr": 0.0003453, "bsz": 736.0, "data_size": 100000000000.0, "non_embedding_param_size": 214663680.0, "lm_loss": 2.3976109618296744}
2{"group": "all_data", "lr": 0.005524, "bsz": 736.0, "data_size": 50000000000.0, "non_embedding_param_size": 429260800.0, "lm_loss": 2.2778836983119093}
3{"group": "all_data", "lr": 0.0003453, "bsz": 736.0, "data_size": 80000000000.0, "non_embedding_param_size": 268304384.0, "lm_loss": 2.36255410523428}
4{"group": "all_data", "lr": 0.002762, "bsz": 736.0, "data_size": 50000000000.0, "non_embedding_param_size": 429260800.0, "lm_loss": 2.26677528064734}
5{"group": "all_data", "lr": 0.0004883, "bsz": 1024.0, "data_size": 50000000000.0, "non_embedding_param_size": 429260800.0, "lm_loss": 2.307012281611546}
6{"group": "all_data", "lr": 0.0009766, "bsz": 32.0, "data_size": 100000000000.0, "non_embedding_param_size": 214663680.0, "lm_loss": 2.4137805813253053}
7{"group": "all_data", "lr": 0.001381, "bsz": 128.0, "data_size": 100000000000.0, "non_embedding_param_size": 214663680.0, "lm_loss": 2.3583542548184586}
8{"group": "all_data", "lr": 0.003906, "bsz": 2048.0, "data_size": 50000000000.0, "non_embedding_param_size": 429260800.0, "lm_loss": 2.312878249783144}
9{"group": "all_data", "lr": 0.002762, "bsz": 32.0, "data_size": 50000000000.0, "non_embedding_param_size": 429260800.0, "lm_loss": 2.380411149629176}
10{"group": "all_data", "lr": 0.0009766, "bsz": 352.0, "data_size": 50000000000.0, "non_embedding_param_size": 429260800.0, "lm_loss": 2.2687204593838177}
11{"group": "all_data", "lr": 0.001381, "bsz": 1024.0, "data_size": 80000000000.0, "non_embedding_param_size": 268304384.0, "lm_loss": 2.326585942511133}
12{"group": "all_data", "lr": 0.0009766, "bsz": 192.0, "data_size": 50000000000.0, "non_embedding_param_size": 429260800.0, "lm_loss": 2.265579835154685}
13{"group": "all_data", "lr": 0.0006905, "bsz": 128.0, "data_size": 80000000000.0, "non_embedding_param_size": 268304384.0, "lm_loss": 2.323627755139312}
14{"group": "all_data", "lr": 0.01105, "bsz": 192.0, "data_size": 80000000000.0, "non_embedding_param_size": 268304384.0, "lm_loss": 2.37908273262946}
15{"group": "all_data", "lr": 0.002762, "bsz": 64.0, "data_size": 80000000000.0, "non_embedding_param_size": 268304384.0, "lm_loss": 2.367986859925716}
sld_dataconstrained_train.jsonl
1{"group": "all_data", "unique_tokens": 4000000000.0, "params": 2810000000.0, "tokens": 32000000000.0, "loss": 2.722962}
2{"group": "all_data", "unique_tokens": 4000000000.0, "params": 2810000000.0, "tokens": 40000000000.0, "loss": 2.706547}
3{"group": "all_data", "unique_tokens": 4000000000.0, "params": 2810000000.0, "tokens": 55000000000.0, "loss": 2.696432}
4{"group": "all_data", "unique_tokens": 9000000000.0, "params": 2810000000.0, "tokens": 55000000000.0, "loss": 2.611045}
5{"group": "all_data", "unique_tokens": 11000000000.0, "params": 2810000000.0, "tokens": 55000000000.0, "loss": 2.598793}
6{"group": "all_data", "unique_tokens": 14000000000.0, "params": 2810000000.0, "tokens": 55000000000.0, "loss": 2.589427}
7{"group": "all_data", "unique_tokens": 18000000000.0, "params": 2810000000.0, "tokens": 55000000000.0, "loss": 2.584592}
8{"group": "all_data", "unique_tokens": 28000000000.0, "params": 2810000000.0, "tokens": 55000000000.0, "loss": 2.579361}
9{"group": "all_data", "unique_tokens": 55000000000.0, "params": 2810000000.0, "tokens": 55000000000.0, "loss": 2.574117}
10{"group": "all_data", "unique_tokens": 100000000.0, "params": 7098752.0, "tokens": 100000000.0, "loss": 8.102005}
11{"group": "all_data", "unique_tokens": 100000000.0, "params": 7098752.0, "tokens": 200000000.0, "loss": 7.36236}
12{"group": "all_data", "unique_tokens": 100000000.0, "params": 1096300000.0, "tokens": 100000000.0, "loss": 6.611002}
13{"group": "all_data", "unique_tokens": 100000000.0, "params": 14100000.0, "tokens": 100000000.0, "loss": 7.278144}
14{"group": "all_data", "unique_tokens": 400000000.0, "params": 19703712.0, "tokens": 400000000.0, "loss": 6.096268}
15{"group": "all_data", "unique_tokens": 400000000.0, "params": 35500000.0, "tokens": 400000000.0, "loss": 5.79413}

Results

No results yet.