Agent Conversation: gpt-5.4-pro

Model: gpt-5.4-pro
Total steps: 5
Messages: 11
User (Step 0)

Task: causal-observational-linear-non-gaussian

Causal Discovery: Observational Linear Non-Gaussian Data

Objective

Implement a causal discovery algorithm that recovers the DAG structure from purely observational data generated by a Linear Non-Gaussian Acyclic Model (LiNGAM). Your code goes in bench/custom_algorithm.py.

Background

LiNGAM-based methods exploit non-Gaussian noise to achieve full DAG identifiability from observational data alone, going beyond the Markov equivalence class to which constraint-based (PC) and score-based (GES) methods are limited.
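For concreteness, here is a minimal sketch of a LiNGAM data-generating process consistent with this setup (the benchmark's actual generator is not shown; the graph, weights, and noise scale below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# LiNGAM: X = B X + e, with B strictly lower triangular in some causal
# order and e independent non-Gaussian noise, so X = (I - B)^{-1} e.
# Chain x0 -> x1 -> x2 with Laplace noise; B[i, j] != 0 means j -> i.
B_true = np.array([
    [0.0,  0.0, 0.0],
    [0.8,  0.0, 0.0],
    [0.0, -1.2, 0.0],
])
e = rng.laplace(size=(n, 3))
X = e @ np.linalg.inv(np.eye(3) - B_true).T  # rows are i.i.d. samples
```

With Gaussian noise, x0 -> x1 and x1 -> x0 would fit the data equally well; the asymmetry of the non-Gaussian residuals is what lets LiNGAM methods orient every edge rather than only recover an equivalence class.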

Evaluation Scenarios

| Label | Graph type | Nodes | Samples | Noise |
|---|---|---|---|---|
| ER10 | Erdos-Renyi | 10 | 250 | Exponential |
| ER15 | Erdos-Renyi | 15 | 500 | Laplace |
| SF12 | Scale-Free (BA) | 12 | 300 | Uniform |
| ER30 | Erdos-Renyi | 30 | 1000 | Laplace |
| ER50 | Erdos-Renyi | 50 | 2000 | Exponential |
| ER50-LowSample | Erdos-Renyi | 50 | 250 | Exponential |
| SF100 | Scale-Free (BA) | 100 | 1000 | Uniform |
| ER20-Dense | Erdos-Renyi | 20 | 500 | Laplace |

Metrics

All metrics are computed on the directed edge set (both skeleton and direction must be correct):

  • F1 (primary ranking metric), SHD, Precision, Recall (one way to compute them is sketched below)
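Assuming both graphs are given as weighted adjacency matrices in the same convention, a minimal sketch of these metrics (the harness's exact SHD convention is not shown; this variant counts a reversed edge as a single error):

```python
import numpy as np

def directed_edge_metrics(B_est, B_true, tol=1e-10):
    """Precision/recall/F1 over directed edges, plus a common SHD variant."""
    E = np.abs(np.asarray(B_est)) > tol   # estimated directed edges
    T = np.abs(np.asarray(B_true)) > tol  # true directed edges
    tp = int(np.sum(E & T))
    n_est, n_true = int(E.sum()), int(T.sum())
    precision = tp / max(n_est, 1)
    recall = tp / max(n_true, 1)
    f1 = 2 * tp / max(n_est + n_true, 1)  # equals 2PR/(P+R)
    # SHD: extra + missing + reversed edges, counting each reversal once.
    reversed_ = E & T.T & ~T              # estimated edge with flipped orientation
    extra = E & ~T & ~reversed_           # estimated edge absent in both directions
    missing = T & ~E & ~reversed_.T       # true edge with nothing estimated either way
    shd = int(extra.sum() + missing.sum() + reversed_.sum())
    return {"precision": precision, "recall": recall, "f1": f1, "shd": shd}
```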

Baselines

  • icalingam: ICA-based LiNGAM (Shimizu et al., 2006)
  • directlingam: DirectLiNGAM (Shimizu et al., 2011); a minimal invocation is sketched after this list
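A minimal sketch of calling the DirectLiNGAM baseline through causal-learn's vendored lingam module (the same import path the agent probes in Step 3; the harness's actual baseline wiring is not shown here):

```python
import numpy as np
from causallearn.search.FCMBased.lingam import DirectLiNGAM  # ICALiNGAM lives in the same module

def run_baseline(X: np.ndarray) -> np.ndarray:
    """X: (n_samples, n_variables) -> weighted adjacency, B[i, j] != 0 means j -> i."""
    model = DirectLiNGAM()
    model.fit(X)
    # model.causal_order_ holds the estimated causal order (exogenous first).
    return model.adjacency_matrix_
```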

causal-learn/bench/custom_algorithm.py [EDITABLE — lines 3–14 only]

     1: import numpy as np
     2: 
     3: # =====================================================================
     4: # EDITABLE: implement run_causal_discovery below
     5: # =====================================================================
     6: def run_causal_discovery(X: np.ndarray) -> np.ndarray:
     7:     """
     8:     Input:  X of shape (n_samples, n_variables)
     9:     Output: adjacency matrix B of shape (n_variables, n_variables)
    10:             B[i, j] != 0  means j -> i  (follows causal-learn convention)
    11:     """
    12:     n = X.shape[1]
    13:     return np.zeros((n, n))
    14: # =====================================================================
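A quick illustrative check of the documented output convention, using the stub above (the two-variable example is hypothetical, not from the benchmark; the stub as given returns all zeros):

```python
import numpy as np

rng = np.random.default_rng(1)
x0 = rng.laplace(size=500)
x1 = 2.0 * x0 + rng.laplace(size=500)   # ground truth: x0 -> x1
X = np.column_stack([x0, x1])

B = run_causal_discovery(X)
assert B.shape == (2, 2)
# A correct implementation should place the edge at B[1, 0] (j=0 -> i=1);
# the placeholder above returns np.zeros((2, 2)).
```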

icalingam baseline — editable region [READ-ONLY — reference implementation]

Lines 3–14:
     1: import numpy as np
     2: 
     3: # =====================================================================
     4: # EDITABLE: implement run_causal_discovery below
     5: # =====================================================================
     6: def run_causal_discovery(X: np.ndarray) -> np.ndarray:
     7:     """
     8:     Input:  X of shape (n_samples, n_variables)
     9:     Output: adjacency matrix B of shape (n_variables, n_variables)
    10:             B[i, j] != 0  means j -> i  (follows causal-learn convention)
    11:     """
    12:     import numpy as np
    13:     from scipy.optimize import linear_sum_assignment
    14:     from sklearn.decomposition import FastICA
    15:     from sklearn.utils import check_array
    16: 
Assistant (Step 1: edit)
Tool Input
1{
2  "op": "replace",
3  "filename": "causal-learn/bench/custom_algorithm.py",
4  "content": "import numpy as np\n\n# =====================================================================\n# EDITABLE: implement run_causal_discovery below\n# =====================================================================\ndef run_causal_discovery(X: np.ndarray) -> np.ndarray:\n    \"\"\"\n    Contrastive Residual LiNGAM:\n    1) estimate a causal order by repeatedly extracting the variable whose\n       residuals are most contrastively independent in the forward vs reverse\n       regression directions;\n    2) estimate sparse direct parents along that order with adaptive-lasso BIC.\n\n    Output convention: B[i, j] != 0 means j -> i.\n    \"\"\"\n    from sklearn.linear_model import LassoLarsIC\n    from sklearn.utils import check_array\n\n    X = check_array(X, ensure_2d=True, dtype=float)\n    n, p = X.shape\n    if p == 0:\n        return np.zeros((0, 0))\n    if p == 1:\n        return np.zeros((1, 1))\n\n    eps = 1e-8\n    Xc = X - X.mean(axis=0, keepdims=True)\n    Xs = Xc / (np.sqrt(np.mean(Xc * Xc, axis=0, keepdims=True)) + eps)\n\n    def _standardize(A: np.ndarray) -> np.ndarray:\n        A = A - A.mean(axis=0, keepdims=True)\n        return A / (np.sqrt(np.mean(A * A, axis=0, keepdims=True)) + eps)\n\n    def _dep_cols_from_x(x: np.ndarray, R: np.ndarray) -> np.ndarray:\n        x = x - x.mean()\n        x = x / (np.sqrt(np.mean(x * x)) + eps)\n        Rz = _standardize(R)\n        x2 = x * x - 1.0\n        R2 = Rz * Rz - 1.0\n        return (\n            np.abs(np.mean(np.tanh(x)[:, None] * Rz, axis=0))\n            + np.abs(np.mean(x[:, None] * np.tanh(Rz), axis=0))\n            + 0.3 * np.abs(np.mean(x2[:, None] * R2, axis=0))\n        )\n\n    def _dep_pairs(A: np.ndarray, R: np.ndarray) -> np.ndarray:\n        Az = _standardize(A)\n        Rz = _standardize(R)\n        A2 = Az * Az - 1.0\n        R2 = Rz * Rz - 1.0\n        return (\n            np.abs(np.mean(np.tanh(Az) * Rz, axis=0))\n            + np.abs(np.mean(Az * np.tanh(Rz), axis=0))\n            + 0.3 * np.abs(np.mean(A2 * R2, axis=0))\n        )\n\n    # ---------- Stage 1: contrastive exogenous extraction ----------\n    remaining = list(range(p))\n    order = []\n    Z = Xs.copy()\n    while len(remaining) > 1:\n        m = Z.shape[1]\n        scores = np.empty(m)\n        col_var = np.sum(Z * Z, axis=0) + eps\n        for j in range(m):\n            x = Z[:, j]\n            gx = Z.T @ x\n\n            # Forward: regress every variable on candidate x_j.\n            beta_f = gx / (x @ x + eps)\n            Rf = Z - np.outer(x, beta_f)\n            dep_f = _dep_cols_from_x(x, Rf)\n\n            # Reverse: regress x_j on every other variable separately.\n            beta_r = gx / col_var\n            Rr = x[:, None] - Z * beta_r[None, :]\n            dep_r = _dep_pairs(Z, Rr)\n\n            corr = np.abs(gx) / np.sqrt((x @ x + eps) * col_var)\n            mask = np.arange(m) != j\n            # Novel contrastive score: prefer variables that are independent\n            # in the forward direction but dependent in the reverse direction.\n            scores[j] = np.mean((corr[mask] + 0.05) * np.log((dep_f[mask] + 1e-4) / (dep_r[mask] + 1e-4)))\n\n        j_star = int(np.argmin(scores))\n        order.append(remaining[j_star])\n\n        x = Z[:, j_star]\n        beta = (Z.T @ x) / (x @ x + eps)\n        Z = Z - np.outer(x, beta)\n        Z = np.delete(Z, j_star, axis=1)\n        remaining.pop(j_star)\n        if Z.size:\n            Z = Z - Z.mean(axis=0, keepdims=True)\n    order.extend(remaining)\n\n 
   # ---------- Stage 2: sparse parent estimation along the order ----------\n    B = np.zeros((p, p), dtype=float)\n    for pos, tgt in enumerate(order):\n        parents = order[:pos]\n        if not parents:\n            continue\n        y = Xs[:, tgt]\n        Xp = Xs[:, parents]\n\n        if Xp.shape[1] == 1:\n            coef = np.array([(Xp[:, 0] @ y) / (Xp[:, 0] @ Xp[:, 0] + eps)])\n            if abs(coef[0]) > 0.05:\n                B[tgt, parents[0]] = coef[0]\n            continue\n\n        ols = np.linalg.lstsq(Xp, y, rcond=None)[0]\n        weights = 1.0 / (np.abs(ols) + 0.02)\n        Xw = Xp / weights[None, :]\n        try:\n            model = LassoLarsIC(criterion=\"bic\")\n            model.fit(Xw, y)\n            coef = model.coef_ / weights\n        except Exception:\n            coef = ols\n\n        sel = np.flatnonzero(np.abs(coef) > 1e-8)\n        if sel.size == 0:\n            continue\n\n        beta = np.linalg.lstsq(Xp[:, sel], y, rcond=None)[0]\n        for k, s in enumerate(sel):\n            if abs(beta[k]) > 0.04:\n                B[tgt, parents[s]] = beta[k]\n\n    np.fill_diagonal(B, 0.0)\n    return B\n# =====================================================================",
5  "after_line": 14,
6  "start_line": 3,
7  "end_line": 14
8}
Tool Result (Step 1)
OK: Replaced lines 3..14 with 133 line(s) in causal-learn/bench/custom_algorithm.py. Editable range: 3–135.

[Current file: causal-learn/bench/custom_algorithm.py | editable: 3–135 | total: 135 lines]
     1| import numpy as np
     2| 
     3  import numpy as np
     4  
     5  # =====================================================================
     6  # EDITABLE: implement run_causal_discovery below
     7  # =====================================================================
     8  def run_causal_discovery(X: np.ndarray) -> np.ndarray:
     9      """
    10      Contrastive Residual LiNGAM:
    11      1) estimate a causal order by repeatedly extracting the variable whose
    12         residuals are most contrastively independent in the forward vs reverse
    13         regression directions;
    14      2) estimate sparse direct parents along that order with adaptive-lasso BIC.
    15  
    16      Output convention: B[i, j] != 0 means j -> i.
    17      """
    18      
Assistant (Step 2: test)
Tool Result (Step 2)
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)

### ER10 (scripts/eval_er10.sh)
Results (ER10):
  SHD=2  F1=0.8966  Precision=0.8667  Recall=0.9286

### ER15 (scripts/eval_er15.sh)
Results (ER15):
  SHD=2  F1=0.9474  Precision=0.9000  Recall=1.0000

### SF12 (scripts/eval_sf12.sh)
Results (SF12):
  SHD=3  F1=0.9302  Precision=0.8696  Recall=1.0000

### ER30 (scripts/eval_er30.sh)
Results (ER30):
  SHD=19  F1=0.8995  Precision=0.9126  Recall=0.8868

### ER50 (scripts/eval_er50.sh)
Results (ER50):
  SHD=92  F1=0.8025  Precision=0.8298  Recall=0.7769

### ER50-LowSample (scripts/eval_er50_lowsample.sh)
Results (ER50-LowSample):
  SHD=58  F1=0.7735  Precision=0.6852  Recall=0.8880

### ER20-Dense (scripts/eval_er20_dense.sh)
Results (ER20-Dense):
  SHD=19  F1=0.8950  Precision=0.9643  Recall=0.8351

[Leaderboard] Results saved: {'shd_ER10': 2, 'f1_ER10': 0.8966, 'precision_ER10': 0.8667, 'recall_ER10': 0.9286, 'elapsed_ER10': 3, 'shd_ER15': 2, 'f1_ER15': 0.9474, 'pre
Assistant (Step 3: edit)
Tool Input
1{
2  "op": "replace",
3  "filename": "causal-learn/bench/custom_algorithm.py",
4  "content": "# =====================================================================\n# EDITABLE: implement run_causal_discovery below\n# =====================================================================\ndef run_causal_discovery(X: np.ndarray) -> np.ndarray:\n    \"\"\"\n    Stable Contrastive Direct LiNGAM.\n\n    Novel contribution:\n    - build several LiNGAM graphs from complementary inductive biases:\n      (i) DirectLiNGAM support if available,\n      (ii) full-order adaptive-lasso/backward-BIC refinement,\n      (iii) a contrastive forward-vs-reverse residual ordering,\n      (iv) a fused support graph combining direct and contrastive candidates;\n    - select the final DAG by a residual-independence + sparsity criterion.\n\n    Output convention: B[i, j] != 0 means j -> i.\n    \"\"\"\n    from sklearn.linear_model import LassoLarsIC\n    from sklearn.utils import check_array\n\n    X = check_array(X, ensure_2d=True, dtype=float)\n    n, p = X.shape\n    if p == 0:\n        return np.zeros((0, 0))\n    if p == 1:\n        return np.zeros((1, 1))\n\n    eps = 1e-8\n    Xc = X - X.mean(axis=0, keepdims=True)\n    Xs = Xc / (np.sqrt(np.mean(Xc * Xc, axis=0, keepdims=True)) + eps)\n\n    def _standardize(A: np.ndarray) -> np.ndarray:\n        A = A - A.mean(axis=0, keepdims=True)\n        return A / (np.sqrt(np.mean(A * A, axis=0, keepdims=True)) + eps)\n\n    def _dep_cols_from_x(x: np.ndarray, R: np.ndarray) -> np.ndarray:\n        x = x - x.mean()\n        x = x / (np.sqrt(np.mean(x * x)) + eps)\n        Rz = _standardize(R)\n        x2 = x * x - 1.0\n        R2 = Rz * Rz - 1.0\n        return (\n            np.abs(np.mean(np.tanh(x)[:, None] * Rz, axis=0))\n            + np.abs(np.mean(x[:, None] * np.tanh(Rz), axis=0))\n            + 0.3 * np.abs(np.mean(x2[:, None] * R2, axis=0))\n        )\n\n    def _dep_pairs(A: np.ndarray, R: np.ndarray) -> np.ndarray:\n        Az = _standardize(A)\n        Rz = _standardize(R)\n        A2 = Az * Az - 1.0\n        R2 = Rz * Rz - 1.0\n        return (\n            np.abs(np.mean(np.tanh(Az) * Rz, axis=0))\n            + np.abs(np.mean(Az * np.tanh(Rz), axis=0))\n            + 0.3 * np.abs(np.mean(A2 * R2, axis=0))\n        )\n\n    def _residual_dep_score(E: np.ndarray) -> float:\n        Ez = _standardize(E)\n        T = np.tanh(Ez)\n        E2 = Ez * Ez - 1.0\n        M = np.abs((T.T @ Ez) / max(Ez.shape[0], 1))\n        Q = np.abs((E2.T @ E2) / max(Ez.shape[0], 1))\n        S = M + M.T + 0.2 * Q\n        np.fill_diagonal(S, 0.0)\n        return float(np.sum(S) / (p * max(p - 1, 1)))\n\n    def _contrastive_order(Z0: np.ndarray) -> list:\n        remaining = list(range(p))\n        order = []\n        Z = Z0.copy()\n        while len(remaining) > 1:\n            m = Z.shape[1]\n            scores = np.empty(m)\n            col_var = np.sum(Z * Z, axis=0) + eps\n            for j in range(m):\n                x = Z[:, j]\n                gx = Z.T @ x\n\n                beta_f = gx / (x @ x + eps)\n                Rf = Z - np.outer(x, beta_f)\n                dep_f = _dep_cols_from_x(x, Rf)\n\n                beta_r = gx / col_var\n                Rr = x[:, None] - Z * beta_r[None, :]\n                dep_r = _dep_pairs(Z, Rr)\n\n                corr = np.abs(gx) / np.sqrt((x @ x + eps) * col_var)\n                mask = np.arange(m) != j\n                scores[j] = np.mean(\n                    (corr[mask] + 0.05)\n                    * np.log((dep_f[mask] + 1e-4) / (dep_r[mask] + 1e-4))\n                )\n\n            j_star = 
int(np.argmin(scores))\n            order.append(remaining[j_star])\n\n            x = Z[:, j_star]\n            beta = (Z.T @ x) / (x @ x + eps)\n            Z = Z - np.outer(x, beta)\n            Z = np.delete(Z, j_star, axis=1)\n            remaining.pop(j_star)\n            if Z.size:\n                Z = Z - Z.mean(axis=0, keepdims=True)\n        order.extend(remaining)\n        return order\n\n    def _topological_from_support(support: np.ndarray) -> list:\n        parents = [set(np.flatnonzero(support[i])) for i in range(p)]\n        children = [set(np.flatnonzero(support[:, j])) for j in range(p)]\n        roots = [i for i in range(p) if not parents[i]]\n        order = []\n        used = np.zeros(p, dtype=bool)\n        while roots:\n            r = roots.pop(0)\n            if used[r]:\n                continue\n            used[r] = True\n            order.append(r)\n            for c in list(children[r]):\n                parents[c].discard(r)\n                if not parents[c]:\n                    roots.append(c)\n            children[r].clear()\n        if len(order) < p:\n            missing = [i for i in range(p) if not used[i]]\n            score = np.sum(np.abs(support), axis=1) - np.sum(np.abs(support), axis=0)\n            order.extend(sorted(missing, key=lambda i: score[i]))\n        return order\n\n    def _bic_subset(y: np.ndarray, Xp: np.ndarray) -> np.ndarray:\n        m = Xp.shape[1]\n        coef = np.zeros(m, dtype=float)\n        if m == 0:\n            return coef\n\n        y_var = np.mean(y * y) + eps\n        if m == 1:\n            b = float((Xp[:, 0] @ y) / (Xp[:, 0] @ Xp[:, 0] + eps))\n            rss1 = np.mean((y - Xp[:, 0] * b) ** 2) + eps\n            bic1 = n * np.log(rss1) + np.log(max(n, 2))\n            bic0 = n * np.log(y_var)\n            if bic1 < bic0:\n                coef[0] = b\n            return coef\n\n        ols = np.linalg.lstsq(Xp, y, rcond=None)[0]\n        weights = 1.0 / (np.abs(ols) + 0.02)\n        try:\n            model = LassoLarsIC(criterion=\"bic\")\n            model.fit(Xp / weights[None, :], y)\n            warm = model.coef_ / weights\n            active = np.flatnonzero(np.abs(warm) > 1e-8).tolist()\n        except Exception:\n            active = np.flatnonzero(np.abs(ols) > 0.03).tolist()\n\n        if not active:\n            return coef\n\n        while True:\n            Xcur = Xp[:, active]\n            beta = np.linalg.lstsq(Xcur, y, rcond=None)[0]\n            rss_cur = np.mean((y - Xcur @ beta) ** 2) + eps\n            bic_cur = n * np.log(rss_cur) + len(active) * np.log(max(n, 2))\n\n            best_active = active\n            best_bic = bic_cur\n            for t in range(len(active)):\n                cand = active[:t] + active[t + 1 :]\n                if cand:\n                    Xcand = Xp[:, cand]\n                    beta_c = np.linalg.lstsq(Xcand, y, rcond=None)[0]\n                    rss_c = np.mean((y - Xcand @ beta_c) ** 2) + eps\n                    bic_c = n * np.log(rss_c) + len(cand) * np.log(max(n, 2))\n                else:\n                    bic_c = n * np.log(y_var)\n                if bic_c + 1e-8 < best_bic:\n                    best_bic = bic_c\n                    best_active = cand\n\n            if len(best_active) == len(active) and all(a == b for a, b in zip(best_active, active)):\n                break\n            active = best_active\n            if not active:\n                return coef\n\n        beta = np.linalg.lstsq(Xp[:, active], y, rcond=None)[0]\n        
coef[np.array(active)] = beta\n        coef[np.abs(coef) < 1e-8] = 0.0\n        return coef\n\n    def _ols_support(y: np.ndarray, Xp: np.ndarray) -> np.ndarray:\n        if Xp.shape[1] == 0:\n            return np.zeros(0, dtype=float)\n        beta = np.linalg.lstsq(Xp, y, rcond=None)[0]\n        beta[np.abs(beta) < 1e-8] = 0.0\n        return beta\n\n    def _estimate_from_order(order: list, support: np.ndarray = None, refine: bool = True) -> np.ndarray:\n        B = np.zeros((p, p), dtype=float)\n        for pos, tgt in enumerate(order):\n            prev = order[:pos]\n            if support is None:\n                parents = prev\n            else:\n                parents = [j for j in prev if support[tgt, j]]\n            if not parents:\n                continue\n            Xp = Xs[:, parents]\n            y = Xs[:, tgt]\n            beta = _bic_subset(y, Xp) if refine else _ols_support(y, Xp)\n            if beta.size:\n                B[tgt, parents] = beta\n        np.fill_diagonal(B, 0.0)\n        return B\n\n    def _graph_score(B: np.ndarray) -> float:\n        E = Xs - Xs @ B.T\n        rss = np.mean(E * E, axis=0) + eps\n        dep = _residual_dep_score(E)\n        l0 = np.count_nonzero(np.abs(B) > 1e-10)\n        bic_like = np.mean(np.log(rss)) + 0.5 * l0 * np.log(max(n, 2)) / max(n * p, 1)\n        return float(dep + 0.15 * bic_like)\n\n    def _fit_direct() -> tuple:\n        DirectLiNGAM = None\n        try:\n            from causallearn.search.FCMBased.lingam import DirectLiNGAM as _DL\n\n            DirectLiNGAM = _DL\n        except Exception:\n            try:\n                from causallearn.search.FCMBased.lingam.direct_lingam import DirectLiNGAM as _DL\n\n                DirectLiNGAM = _DL\n            except Exception:\n                return None, None\n\n        try:\n            model = DirectLiNGAM()\n            model.fit(X)\n            B_raw = getattr(model, \"adjacency_matrix_\", None)\n            if B_raw is None:\n                B_raw = getattr(model, \"_adjacency_matrix\", None)\n            if B_raw is None:\n                return None, None\n            B_raw = np.asarray(B_raw, dtype=float)\n            support = np.abs(B_raw) > 1e-10\n            order = getattr(model, \"causal_order_\", None)\n            if order is None or len(order) != p:\n                order = _topological_from_support(support)\n            else:\n                order = [int(v) for v in order]\n            return support, order\n        except Exception:\n            return None, None\n\n    candidates = []\n\n    support_direct, order_direct = _fit_direct()\n    if support_direct is not None and order_direct is not None:\n        candidates.append(_estimate_from_order(order_direct, support_direct, refine=False))\n        candidates.append(_estimate_from_order(order_direct, support_direct, refine=True))\n        candidates.append(_estimate_from_order(order_direct, support=None, refine=True))\n\n    order_contrast = _contrastive_order(Xs)\n    B_contrast = _estimate_from_order(order_contrast, support=None, refine=True)\n    candidates.append(B_contrast)\n\n    if support_direct is not None and order_direct is not None:\n \n... (truncated)",
5  "after_line": 2,
6  "start_line": 3,
7  "end_line": 135
8}
Tool Result (Step 3)
OK: Replaced lines 3..135 with 294 line(s) in causal-learn/bench/custom_algorithm.py. Editable range: 3–296.

[Current file: causal-learn/bench/custom_algorithm.py | editable: 3–296 | total: 296 lines]
     1| import numpy as np
     2| 
     3  # =====================================================================
     4  # EDITABLE: implement run_causal_discovery below
     5  # =====================================================================
     6  def run_causal_discovery(X: np.ndarray) -> np.ndarray:
     7      """
     8      Stable Contrastive Direct LiNGAM.
     9  
    10      Novel contribution:
    11      - build several LiNGAM graphs from complementary inductive biases:
    12        (i) DirectLiNGAM support if available,
    13        (ii) full-order adaptive-lasso/backward-BIC refinement,
    14        (iii) a contrastive forward-vs-reverse residual ordering,
    15        (iv) a fused support graph combining direct and contrastive candidates;
    16      - selec
Assistant (Step 4: test)
Tool Result (Step 4)
[Test #2] (1 test remaining; call submit(n=2) to submit this result)

### ER10 (scripts/eval_er10.sh)
Results (ER10):
  SHD=0  F1=1.0000  Precision=1.0000  Recall=1.0000

### ER15 (scripts/eval_er15.sh)
Results (ER15):
  SHD=2  F1=0.9474  Precision=0.9000  Recall=1.0000

### SF12 (scripts/eval_sf12.sh)
Results (SF12):
  SHD=1  F1=0.9756  Precision=0.9524  Recall=1.0000

### ER30 (scripts/eval_er30.sh)
Results (ER30):
  SHD=1  F1=0.9953  Precision=0.9907  Recall=1.0000

### ER50 (scripts/eval_er50.sh)
Results (ER50):
  SHD=21  F1=0.9569  Precision=0.9873  Recall=0.9283

### ER50-LowSample (scripts/eval_er50_lowsample.sh)
Results (ER50-LowSample):
  SHD=30  F1=0.8832  Precision=0.8121  Recall=0.9680

### ER20-Dense (scripts/eval_er20_dense.sh)
Results (ER20-Dense):
  SHD=7  F1=0.9641  Precision=0.9592  Recall=0.9691

[Leaderboard] Results saved: {'shd_ER10': 0, 'f1_ER10': 1.0, 'precision_ER10': 1.0, 'recall_ER10': 1.0, 'elapsed_ER10': 3, 'shd_ER15': 2, 'f1_ER15': 0.9474, 'precision_ER15'
Assistant (Step 5: submit)
Tool Input
1{
2  "n": 2
3}
Tool Result (Step 5)
[submit] Submitting result from test #2 as final.

### ER10 (scripts/eval_er10.sh)
Results (ER10):
  SHD=0  F1=1.0000  Precision=1.0000  Recall=1.0000

### ER15 (scripts/eval_er15.sh)
Results (ER15):
  SHD=2  F1=0.9474  Precision=0.9000  Recall=1.0000

### SF12 (scripts/eval_sf12.sh)
Results (SF12):
  SHD=1  F1=0.9756  Precision=0.9524  Recall=1.0000

### ER30 (scripts/eval_er30.sh)
Results (ER30):
  SHD=1  F1=0.9953  Precision=0.9907  Recall=1.0000

### ER50 (scripts/eval_er50.sh)
Results (ER50):
  SHD=21  F1=0.9569  Precision=0.9873  Recall=0.9283

### ER50-LowSample (scripts/eval_er50_lowsample.sh)
Results (ER50-LowSample):
  SHD=30  F1=0.8832  Precision=0.8121  Recall=0.9680

### SF100 (scripts/eval_sf100.sh)
Results (SF100):
  SHD=17  F1=0.9715  Precision=0.9477  Recall=0.9966

### ER20-Dense (scripts/eval_er20_dense.sh)
Results (ER20-Dense):
  SHD=7  F1=0.9641  Precision=0.9592  Recall=0.9691

[Leaderboard] Results saved: {'shd_ER10': 0, 'f1_ER10': 1.0, 'precision_ER10': 1.0, 're