Agent Conversation: gpt-5.4-pro

Model: gpt-5.4-pro
Total steps: 5
Messages: 11
User (Step 0)

Task: causal-observational-nonlinear

Causal Discovery: Observational Nonlinear Data

Objective

Implement a causal discovery algorithm that recovers the DAG structure from purely observational data generated by a nonlinear Additive Noise Model (ANM). Your code goes in bench/custom_algorithm.py.

Background

Additive Noise Models (ANMs) assume structural equations of the form x_j = f_j(parents(j)) + e_j, where the f_j are nonlinear functions and the e_j are independent noise terms. Under mild conditions on the f_j and the noise distribution, the causal DAG is identifiable from observational data alone (Hoyer et al., 2008; Peters et al., 2014). This goes beyond explicitly linear methods such as LiNGAM, and beyond PC and GES in their standard linear-Gaussian instantiations.
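
For intuition, a minimal sketch of such a data-generating process (the graph, functions, and noise scales here are illustrative, not the benchmark's actual generator):

    import numpy as np

    def sample_anm(n_samples: int = 500, seed: int = 0) -> np.ndarray:
        """Toy 3-variable ANM over the DAG x0 -> x1 -> x2 with x0 -> x2."""
        rng = np.random.default_rng(seed)
        x0 = rng.laplace(size=n_samples)
        x1 = np.tanh(2.0 * x0) + rng.laplace(scale=0.5, size=n_samples)          # x1 = f1(x0) + e1
        x2 = np.sin(x1) + 0.5 * x0 ** 2 + rng.laplace(scale=0.5, size=n_samples)  # x2 = f2(x0, x1) + e2
        return np.column_stack([x0, x1, x2])  # shape (n_samples, 3)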

Key challenges include: (1) testing independence in high-dimensional nonlinear settings, (2) scaling to larger graphs, (3) handling diverse nonlinear function types (MLP, GP, polynomial, sigmoid), and (4) robustness to different noise distributions including the harder Gaussian noise case.
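
Challenge (1) is the crux of ANM-based discovery: regress the effect on the putative cause and check whether the residual is independent of that cause; under the identifiability conditions only the true direction admits an independent residual. A minimal bivariate sketch, using sklearn's mutual information estimate as a cheap dependence proxy (HSIC or distance correlation are the classical choices):

    import numpy as np
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.feature_selection import mutual_info_regression

    def residual_dependence(cause: np.ndarray, effect: np.ndarray, seed: int = 0) -> float:
        """Regress effect on cause, then score how dependent the residual
        remains on the cause (near zero in the true causal direction)."""
        model = HistGradientBoostingRegressor(max_depth=3, random_state=seed)
        model.fit(cause[:, None], effect)
        resid = effect - model.predict(cause[:, None])
        return float(mutual_info_regression(cause[:, None], resid, random_state=seed)[0])

    def anm_direction(x: np.ndarray, y: np.ndarray) -> str:
        """Prefer the direction that leaves the less dependent residual."""
        return "x -> y" if residual_dependence(x, y) < residual_dependence(y, x) else "y -> x"

In-sample residuals can understate dependence through overfitting, which is why the agent's implementation below cross-fits its residuals with KFold.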

Evaluation Scenarios

Label            Graph type       Nodes  Samples  Noise        Nonlinearity
ER8-MLP          Erdos-Renyi          8      500  Exponential  MLP
ER12-GP          Erdos-Renyi         12     1000  Laplace      GP
SF10-Mixed       Scale-Free (BA)     10      500  Uniform      Mixed
ER15-Sigmoid     Erdos-Renyi         15     1000  Exponential  Sigmoid
ER20-Mixed       Erdos-Renyi         20     2000  Laplace      Mixed
SF20-GP          Scale-Free (BA)     20     2000  Exponential  GP
ER12-LowSample   Erdos-Renyi         12      150  Laplace      Mixed
ER20-Gauss       Erdos-Renyi         20     2000  Gaussian     Mixed

Metrics

All computed on the directed edge set (skeleton + direction must be correct):

  • F1 (primary ranking metric), SHD, Precision, Recall
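
A minimal sketch of these directed-edge metrics; the SHD variant here counts a reversed edge as a single error, which is one common convention (the benchmark's exact variant is not shown):

    import numpy as np

    def directed_edge_metrics(B_pred: np.ndarray, B_true: np.ndarray) -> dict:
        """F1/precision/recall over directed edges; SHD counts additions,
        deletions, and reversals (a reversal costing one)."""
        pred = set(zip(*np.nonzero(B_pred)))
        true = set(zip(*np.nonzero(B_true)))
        tp = len(pred & true)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(true) if true else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        extra, missing = pred - true, true - pred
        reversals = {(i, j) for (i, j) in extra if (j, i) in missing}
        shd = len(extra) + len(missing) - len(reversals)
        return {"f1": f1, "shd": shd, "precision": precision, "recall": recall}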

Baselines

  • cam: Causal Additive Models (Buehlmann et al., 2014) -- score-based with GAM regression
  • notears_mlp: NOTEARS with MLP parametrization (Zheng et al., 2020) -- continuous optimization
  • directlingam: DirectLiNGAM (Shimizu et al., 2011) -- linear reference baseline
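
As context for notears_mlp: NOTEARS replaces the combinatorial acyclicity constraint with the smooth function h(W) = tr(exp(W ∘ W)) - d, where ∘ is the elementwise product, which is zero exactly when W is acyclic (Zheng et al., 2018). A minimal sketch of that constraint:

    import numpy as np
    from scipy.linalg import expm

    def notears_acyclicity(W: np.ndarray) -> float:
        """h(W) = tr(exp(W * W)) - d, with W * W elementwise; h(W) == 0
        exactly when the weighted adjacency W has no directed cycles."""
        return float(np.trace(expm(W * W)) - W.shape[0])

    print(notears_acyclicity(np.array([[0.0, 1.0], [0.0, 0.0]])))  # 0.0  (a DAG)
    print(notears_acyclicity(np.array([[0.0, 1.0], [1.0, 0.0]])))  # ~1.09 (a 2-cycle)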

causal-learn/bench/custom_algorithm.py [EDITABLE — lines 3–14 only]

     1: import numpy as np
     2: 
     3: # =====================================================================
     4: # EDITABLE: implement run_causal_discovery below
     5: # =====================================================================
     6: def run_causal_discovery(X: np.ndarray) -> np.ndarray:
     7:     """
     8:     Input:  X of shape (n_samples, n_variables)
     9:     Output: adjacency matrix B of shape (n_variables, n_variables)
    10:             B[i, j] != 0  means j -> i  (follows causal-learn convention)
    11:     """
    12:     n = X.shape[1]
    13:     return np.zeros((n, n))
    14: # =====================================================================
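
To make the output convention concrete, a tiny illustration of the expected adjacency encoding for a three-variable chain:

    import numpy as np

    # Under the stated convention B[i, j] != 0 means j -> i, so row i
    # lists the parents of variable i.  For the chain x0 -> x1 -> x2:
    B = np.zeros((3, 3))
    B[1, 0] = 1.0  # x0 -> x1
    B[2, 1] = 1.0  # x1 -> x2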

cam baseline — editable region [READ-ONLY — reference implementation]

Lines 3–14:
     1: import numpy as np
Assistant (Step 1, edit)
Tool Input
1{
2  "op": "replace",
3  "filename": "causal-learn/bench/custom_algorithm.py",
4  "content": "# =====================================================================\n# EDITABLE: implement run_causal_discovery below\n# =====================================================================\ndef run_causal_discovery(X: np.ndarray) -> np.ndarray:\n    \"\"\"\n    Input:  X of shape (n_samples, n_variables)\n    Output: adjacency matrix B of shape (n_variables, n_variables)\n            B[i, j] != 0  means j -> i  (follows causal-learn convention)\n    \"\"\"\n    import os\n    from itertools import combinations\n    from sklearn.ensemble import ExtraTreesRegressor, HistGradientBoostingRegressor\n    from sklearn.feature_selection import mutual_info_regression\n    from sklearn.model_selection import KFold\n\n    # Keep thread usage low when multiple seeds are evaluated in parallel.\n    os.environ.setdefault(\"OMP_NUM_THREADS\", \"1\")\n    os.environ.setdefault(\"OPENBLAS_NUM_THREADS\", \"1\")\n    os.environ.setdefault(\"MKL_NUM_THREADS\", \"1\")\n\n    X = np.asarray(X, dtype=float)\n    n, d = X.shape\n    if d == 0:\n        return np.zeros((0, 0))\n    if d == 1:\n        return np.zeros((1, 1))\n\n    # Standardize for stable nonlinear regression / dependence testing.\n    X = X.copy()\n    X -= X.mean(axis=0, keepdims=True)\n    scale = X.std(axis=0, keepdims=True)\n    scale[scale < 1e-8] = 1.0\n    X /= scale\n\n    rng = np.random.default_rng(0)\n    dep_idx = np.arange(n) if n <= 384 else rng.choice(n, size=384, replace=False)\n    obj_cache = {}\n\n    def make_model() -> HistGradientBoostingRegressor:\n        return HistGradientBoostingRegressor(\n            loss=\"squared_error\",\n            learning_rate=0.06,\n            max_iter=60 if n >= 500 else 100,\n            max_depth=3,\n            min_samples_leaf=max(5, n // 80),\n            l2_regularization=1e-3,\n            random_state=0,\n        )\n\n    def crossfit_residual(y: np.ndarray, Z: np.ndarray) -> np.ndarray:\n        if Z.ndim == 1:\n            Z = Z[:, None]\n        if Z.shape[1] == 0:\n            return y - y.mean()\n\n        resid = np.zeros_like(y, dtype=float)\n        kf = KFold(n_splits=2, shuffle=True, random_state=0)\n        for tr, te in kf.split(Z):\n            model = make_model()\n            model.fit(Z[tr], y[tr])\n            resid[te] = y[te] - model.predict(Z[te])\n        return resid\n\n    def dep_score(Z: np.ndarray, r: np.ndarray) -> float:\n        if Z.ndim == 1:\n            Z = Z[:, None]\n        if Z.shape[1] == 0:\n            return 0.0\n\n        Zs = Z[dep_idx]\n        rs = r[dep_idx]\n        rs = rs - rs.mean()\n        rs_std = float(rs.std())\n        if rs_std < 1e-10:\n            return 0.0\n        rs = rs / rs_std\n\n        try:\n            mi = mutual_info_regression(\n                Zs,\n                rs,\n                discrete_features=False,\n                n_neighbors=5 if len(dep_idx) >= 200 else 3,\n                random_state=0,\n            )\n        except Exception:\n            mi = np.zeros(Zs.shape[1], dtype=float)\n\n        corr = np.zeros(Zs.shape[1], dtype=float)\n        for k in range(Zs.shape[1]):\n            zk = Zs[:, k]\n            if np.std(zk) < 1e-10:\n                continue\n            c = np.corrcoef(zk, rs)[0, 1]\n            corr[k] = 0.0 if not np.isfinite(c) else abs(c)\n\n        scores = np.nan_to_num(mi, nan=0.0, posinf=0.0, neginf=0.0) + 0.35 * corr\n        topk = min(3, len(scores))\n        if topk == 0:\n            return 0.0\n        top = np.partition(scores, len(scores) - 
topk)[-topk:]\n        return float(0.7 * np.max(top) + 0.3 * np.mean(top))\n\n    def objective(target: int, parents: tuple, context: tuple) -> float:\n        key = (target, parents, context)\n        if key in obj_cache:\n            return obj_cache[key]\n\n        y = X[:, target]\n        resid = crossfit_residual(y, X[:, parents]) if len(parents) > 0 else (y - y.mean())\n        dep = dep_score(X[:, context], resid) if len(context) > 0 else 0.0\n        mse = float(np.mean(resid ** 2))\n        # Dependence is primary; fit and tiny complexity penalty stabilize selection.\n        value = dep + 0.12 * mse + 0.01 * len(parents) / max(1, len(context) + 1)\n        obj_cache[key] = value\n        return value\n\n    # -----------------------------------------------------------------\n    # Stage 1: estimate a topological order by repeated sink peeling.\n    # Novelty: out-of-fold residual-independence scoring prevents overfit.\n    # -----------------------------------------------------------------\n    remaining = list(range(d))\n    reverse_order = []\n    while remaining:\n        if len(remaining) == 1:\n            reverse_order.append(remaining.pop())\n            break\n\n        best_node = None\n        best_score = np.inf\n        for j in remaining:\n            predictors = tuple(k for k in remaining if k != j)\n            resid = crossfit_residual(X[:, j], X[:, predictors])\n            score = dep_score(X[:, predictors], resid) + 0.12 * float(np.mean(resid ** 2))\n            if score < best_score:\n                best_score = score\n                best_node = j\n\n        reverse_order.append(best_node)\n        remaining.remove(best_node)\n\n    order = reverse_order[::-1]\n\n    # -----------------------------------------------------------------\n    # Stage 2: order-constrained parent selection.\n    # Novelty: knockoff-ranked nonlinear search + pairwise-synergy insertion.\n    # -----------------------------------------------------------------\n    B = np.zeros((d, d), dtype=float)\n    min_gain = 0.006 + 0.02 / np.sqrt(max(n, 2))\n\n    for pos, target in enumerate(order):\n        context = tuple(order[:pos])\n        if len(context) == 0:\n            continue\n\n        local_rng = np.random.default_rng(1234 + target)\n        Z_real = X[:, context]\n        Z_fake = np.column_stack([local_rng.permutation(Z_real[:, i]) for i in range(Z_real.shape[1])])\n        ranker = ExtraTreesRegressor(\n            n_estimators=96 if n < 1000 else 64,\n            max_depth=5,\n            min_samples_leaf=max(2, n // 150),\n            random_state=0,\n            n_jobs=1,\n        )\n        ranker.fit(np.hstack([Z_real, Z_fake]), X[:, target])\n        imp = ranker.feature_importances_\n        comp = imp[: len(context)] - imp[len(context) :]\n        order_idx = np.argsort(-comp)\n        cand_order = [context[i] for i in order_idx]\n\n        selected = tuple()\n        current = objective(target, selected, context)\n\n        while True:\n            best_set = None\n            best_value = current\n\n            # Best single addition.\n            for cand in cand_order:\n                if cand in selected:\n                    continue\n                trial = tuple(sorted(selected + (cand,)))\n                value = objective(target, trial, context)\n                if value < best_value:\n                    best_value = value\n                    best_set = trial\n\n            # Best pair addition among the strongest remaining candidates.\n            
top_left = [c for c in cand_order if c not in selected][:5]\n            for a, b in combinations(top_left, 2):\n                trial = tuple(sorted(selected + (a, b)))\n                value = objective(target, trial, context)\n                if value < best_value:\n                    best_value = value\n                    best_set = trial\n\n            if best_set is None or current - best_value <= min_gain:\n                break\n            selected, current = best_set, best_value\n\n        # Backward cleanup.\n        changed = True\n        while changed and len(selected) > 0:\n            changed = False\n            for cand in list(selected):\n                trial = tuple(x for x in selected if x != cand)\n                value = objective(target, trial, context)\n                if value <= current + 0.003:\n                    selected, current = trial, value\n                    changed = True\n                    break\n\n        for parent in selected:\n            B[target, parent] = 1.0\n\n    return B\n# =====================================================================",
5  "after_line": 14,
6  "start_line": 3,
7  "end_line": 14
8}
Tool Result (Step 1)
OK: Replaced lines 3..14 with 218 line(s) in causal-learn/bench/custom_algorithm.py. Editable range: 3–220.

[Current file: causal-learn/bench/custom_algorithm.py | editable: 3–220 | total: 220 lines]
     1| import numpy as np
     2| 
     3  # =====================================================================
     4  # EDITABLE: implement run_causal_discovery below
     5  # =====================================================================
     6  def run_causal_discovery(X: np.ndarray) -> np.ndarray:
     7      """
     8      Input:  X of shape (n_samples, n_variables)
     9      Output: adjacency matrix B of shape (n_variables, n_variables)
    10              B[i, j] != 0  means j -> i  (follows causal-learn convention)
    11      """
    12      import os
    13      from itertools import combinations
    14      from sklearn.ensemble import ExtraTreesRegressor, HistGradientBoostingRegressor
    15      from sklearn.feature_selection import mutual_info_regression
    
Assistant (Step 2, test)
Tool Result (Step 2)
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)

### ER8-MLP (scripts/eval_er8_mlp.sh)
Results (ER8-MLP):
  SHD=6  F1=0.6316  Precision=0.5000  Recall=0.8571

### ER12-GP (scripts/eval_er12_gp.sh)
Results (ER12-GP):
  SHD=17  F1=0.6538  Precision=0.5152  Recall=0.8947

### SF10-Mixed (scripts/eval_sf10_mixed.sh)
Results (SF10-Mixed):
  SHD=13  F1=0.6341  Precision=0.5200  Recall=0.8125

### ER15-Sigmoid (scripts/eval_er15_sigmoid.sh)
Results (ER15-Sigmoid):
  SHD=29  F1=0.5867  Precision=0.4490  Recall=0.8462

### ER20-Mixed (scripts/eval_er20_mixed.sh)
Results (ER20-Mixed):
  SHD=58  F1=0.3860  Precision=0.3929  Recall=0.3793

### SF20-GP (scripts/eval_sf20_gp.sh)
Results (SF20-GP):
  SHD=37  F1=0.5567  Precision=0.4426  Recall=0.7500

### ER12-LowSample (scripts/eval_er12_lowsample.sh)
Results (ER12-LowSample):
  SHD=17  F1=0.5652  Precision=0.4815  Recall=0.6842

[Leaderboard] Results saved: {'shd_ER8-MLP': 6, 'f1_ER8-MLP': 0.6316, 'precision_ER8-MLP': 0.5, 're
Assistant (Step 3, edit)
Tool Input
1{
2  "op": "replace",
3  "filename": "causal-learn/bench/custom_algorithm.py",
4  "content": "# =====================================================================\n# EDITABLE: implement run_causal_discovery below\n# =====================================================================\ndef run_causal_discovery(X: np.ndarray) -> np.ndarray:\n    \"\"\"\n    Input:  X of shape (n_samples, n_variables)\n    Output: adjacency matrix B of shape (n_variables, n_variables)\n            B[i, j] != 0  means j -> i  (follows causal-learn convention)\n    \"\"\"\n    import os\n    from itertools import combinations\n    from sklearn.ensemble import ExtraTreesRegressor, HistGradientBoostingRegressor\n    from sklearn.feature_selection import mutual_info_regression\n    from sklearn.model_selection import KFold\n\n    os.environ.setdefault(\"OMP_NUM_THREADS\", \"1\")\n    os.environ.setdefault(\"OPENBLAS_NUM_THREADS\", \"1\")\n    os.environ.setdefault(\"MKL_NUM_THREADS\", \"1\")\n\n    X = np.asarray(X, dtype=float)\n    n, d = X.shape\n    if d == 0:\n        return np.zeros((0, 0))\n    if d == 1:\n        return np.zeros((1, 1))\n\n    X = X.copy()\n    X -= X.mean(axis=0, keepdims=True)\n    scale = X.std(axis=0, keepdims=True)\n    scale[scale < 1e-8] = 1.0\n    X /= scale\n\n    rng = np.random.default_rng(0)\n    dep_idx = np.arange(n) if n <= 256 else rng.choice(n, size=256, replace=False)\n\n    resid_cache = {}\n    obj_cache = {}\n    cond_cache = {}\n    pair_cache = {}\n\n    def make_model() -> HistGradientBoostingRegressor:\n        return HistGradientBoostingRegressor(\n            loss=\"squared_error\",\n            learning_rate=0.06,\n            max_iter=50 if n >= 1000 else (70 if n >= 400 else 100),\n            max_depth=3,\n            min_samples_leaf=max(5, n // 80),\n            l2_regularization=1e-3,\n            random_state=0,\n        )\n\n    def crossfit_residual(y: np.ndarray, Z: np.ndarray) -> np.ndarray:\n        if Z.ndim == 1:\n            Z = Z[:, None]\n        if Z.shape[1] == 0:\n            return y - y.mean()\n\n        resid = np.zeros_like(y, dtype=float)\n        kf = KFold(n_splits=2, shuffle=True, random_state=0)\n        for tr, te in kf.split(Z):\n            model = make_model()\n            model.fit(Z[tr], y[tr])\n            resid[te] = y[te] - model.predict(Z[te])\n        return resid\n\n    def get_resid(var: int, cond: tuple) -> np.ndarray:\n        cond = tuple(sorted(cond))\n        key = (var, cond)\n        if key not in resid_cache:\n            resid_cache[key] = crossfit_residual(X[:, var], X[:, cond]) if len(cond) > 0 else (X[:, var] - X[:, var].mean())\n        return resid_cache[key]\n\n    def dep_score(Z: np.ndarray, r: np.ndarray) -> float:\n        if Z.ndim == 1:\n            Z = Z[:, None]\n        if Z.shape[1] == 0:\n            return 0.0\n\n        Zs = Z[dep_idx]\n        rs = r[dep_idx]\n        rs = rs - rs.mean()\n        rs_std = float(rs.std())\n        if rs_std < 1e-10:\n            return 0.0\n        rs = rs / rs_std\n\n        try:\n            mi = mutual_info_regression(\n                Zs,\n                rs,\n                discrete_features=False,\n                n_neighbors=5 if len(dep_idx) >= 160 else 3,\n                random_state=0,\n            )\n        except Exception:\n            mi = np.zeros(Zs.shape[1], dtype=float)\n\n        corr = np.zeros(Zs.shape[1], dtype=float)\n        nonlin = np.zeros(Zs.shape[1], dtype=float)\n        rs2 = np.tanh(rs)\n        for k in range(Zs.shape[1]):\n            zk = Zs[:, k]\n            zk_std = float(np.std(zk))\n         
   if zk_std < 1e-10:\n                continue\n            c1 = np.corrcoef(zk, rs)[0, 1]\n            c2 = np.corrcoef(np.tanh(zk), rs)[0, 1]\n            c3 = np.corrcoef(zk ** 2, rs2)[0, 1]\n            corr[k] = 0.0 if not np.isfinite(c1) else abs(c1)\n            nonlin[k] = max(0.0 if not np.isfinite(c2) else abs(c2), 0.0 if not np.isfinite(c3) else abs(c3))\n\n        scores = np.nan_to_num(mi, nan=0.0, posinf=0.0, neginf=0.0) + 0.25 * corr + 0.15 * nonlin\n        topk = min(3, len(scores))\n        if topk == 0:\n            return 0.0\n        top = np.partition(scores, len(scores) - topk)[-topk:]\n        return float(0.65 * np.max(top) + 0.35 * np.mean(top))\n\n    def objective(target: int, parents: tuple, context: tuple) -> float:\n        parents = tuple(sorted(parents))\n        context = tuple(context)\n        key = (target, parents, context)\n        if key in obj_cache:\n            return obj_cache[key]\n\n        resid = get_resid(target, parents)\n        dep = dep_score(X[:, context], resid) if len(context) > 0 else 0.0\n        mse = float(np.mean(resid ** 2))\n        complexity = len(parents) * (0.004 + 1.5 * np.log(max(n, 3)) / max(n, 1))\n        value = dep + 0.10 * mse + complexity\n        obj_cache[key] = value\n        return value\n\n    def cond_link_score(target: int, cand: int, cond: tuple) -> tuple:\n        cond = tuple(sorted(cond))\n        key = (target, cand, cond)\n        if key in cond_cache:\n            return cond_cache[key]\n\n        ry = get_resid(target, cond)\n        rx = get_resid(cand, cond)\n        real = dep_score(rx[:, None], ry)\n        perm_rng = np.random.default_rng(2027 + 97 * target + 31 * cand + 7 * len(cond))\n        fake = dep_score(rx[perm_rng.permutation(n)][:, None], ry)\n        cond_cache[key] = (real, fake)\n        return real, fake\n\n    def pair_pref(a: int, b: int) -> float:\n        key = (a, b)\n        if key in pair_cache:\n            return pair_cache[key]\n\n        s_ab = dep_score(X[:, [a]], get_resid(b, (a,)))\n        s_ba = dep_score(X[:, [b]], get_resid(a, (b,)))\n        pref = s_ba - s_ab  # positive => a is more likely before b\n        pair_cache[(a, b)] = pref\n        pair_cache[(b, a)] = -pref\n        return pref\n\n    # Stage 1: sink peeling with pairwise ANM vote regularization.\n    remaining = list(range(d))\n    reverse_order = []\n    while remaining:\n        if len(remaining) == 1:\n            reverse_order.append(remaining.pop())\n            break\n\n        best_node = None\n        best_score = np.inf\n        for j in remaining:\n            others = tuple(k for k in remaining if k != j)\n            resid = get_resid(j, others)\n            score = dep_score(X[:, others], resid) + 0.10 * float(np.mean(resid ** 2))\n            if len(others) > 1:\n                score += 0.28 * float(np.mean([max(0.0, pair_pref(j, k)) for k in others]))\n            if score < best_score:\n                best_score = score\n                best_node = j\n\n        reverse_order.append(best_node)\n        remaining.remove(best_node)\n\n    order = reverse_order[::-1]\n\n    # Stage 2: order-constrained parent search with conditional falsification.\n    B = np.zeros((d, d), dtype=float)\n    base_thr = 0.008 + 0.12 / np.sqrt(max(n, 2))\n\n    for pos, target in enumerate(order):\n        context = tuple(order[:pos])\n        if len(context) == 0:\n            continue\n\n        local_rng = np.random.default_rng(1234 + target)\n        Z_real = X[:, context]\n        Z_fake = 
np.column_stack([local_rng.permutation(Z_real[:, i]) for i in range(Z_real.shape[1])])\n        ranker = ExtraTreesRegressor(\n            n_estimators=80 if n < 1000 else 64,\n            max_depth=5,\n            min_samples_leaf=max(2, n // 150),\n            random_state=0,\n            n_jobs=1,\n        )\n        ranker.fit(np.hstack([Z_real, Z_fake]), X[:, target])\n        imp = ranker.feature_importances_\n        comp = imp[: len(context)] - imp[len(context) :]\n\n        rank_scores = []\n        for i, cand in enumerate(context):\n            pair_bonus = max(0.0, pair_pref(cand, target))\n            uni = dep_score(X[:, [cand]], X[:, target])\n            rank_scores.append(comp[i] + 0.35 * pair_bonus + 0.15 * uni)\n        rank_scores = np.asarray(rank_scores)\n        cand_order = [context[i] for i in np.argsort(-rank_scores)]\n\n        screen_cap = min(len(context), max(4, int(np.ceil(2.0 * np.sqrt(len(context)) + 1))))\n        cand_order = cand_order[:screen_cap]\n        max_parents = min(screen_cap, max(2, int(np.ceil(1.2 * np.sqrt(len(context)) + 1))))\n\n        selected = tuple()\n        current = objective(target, selected, context)\n\n        while len(selected) < max_parents:\n            best_set = None\n            best_value = current\n            best_metric = base_thr\n            single_stats = {}\n\n            for cand in cand_order:\n                if cand in selected:\n                    continue\n                real, fake = cond_link_score(target, cand, selected)\n                trial = tuple(sorted(selected + (cand,)))\n                value = objective(target, trial, context)\n                gain = current - value\n                metric = (real - 1.10 * fake) + 0.90 * gain + 0.08 * max(0.0, pair_pref(cand, target))\n                single_stats[cand] = (metric, value)\n                if metric > best_metric:\n                    best_metric = metric\n                    best_set = trial\n                    best_value = value\n\n            top_left = [c for c in cand_order if c not in selected]\n            top_left = sorted(top_left, key=lambda c: single_stats.get(c, (-1e9, np.inf))[0], reverse=True)[:4]\n            for a, b in combinations(top_left, 2):\n                trial = tuple(sorted(selected + (a, b)))\n                value = objective(target, trial, context)\n                gain = current - value\n                ra, fa = cond_link_score(target, a, selected)\n                rb, fb = cond_link_score(target, b, selected)\n                metric = gain + 0.60 * ((ra - fa) + (rb - fb))\n                if metric > best_metric + 0.002:\n                    best_metric = metric\n                    best_set = trial\n                    best_value = value\n\n            if best_set is None:\n                break\n            selected, current = best_set, best_value\n\n        changed = True\n        while changed and len(selected) > 0:\n            changed = False\n            full_value = objective(target, selected, context)\n            for cand in list(selected):\n                others = tuple(x for x in selected if x != cand)\n                real, fake = cond_link_score(target, \n... (truncated)",
5  "after_line": 5,
6  "start_line": 3,
7  "end_line": 220
8}
Tool Result (Step 3)
OK: Replaced lines 3..220 with 283 line(s) in causal-learn/bench/custom_algorithm.py. Editable range: 3–285.

[Current file: causal-learn/bench/custom_algorithm.py | editable: 3–285 | total: 285 lines]
     1| import numpy as np
     2| 
     3  # =====================================================================
     4  # EDITABLE: implement run_causal_discovery below
     5  # =====================================================================
     6  def run_causal_discovery(X: np.ndarray) -> np.ndarray:
     7      """
     8      Input:  X of shape (n_samples, n_variables)
     9      Output: adjacency matrix B of shape (n_variables, n_variables)
    10              B[i, j] != 0  means j -> i  (follows causal-learn convention)
    11      """
    12      import os
    13      from itertools import combinations
    14      from sklearn.ensemble import ExtraTreesRegressor, HistGradientBoostingRegressor
    15      from sklearn.feature_selection import mutual_info_regression
   
Assistant (Step 4, test)
Tool Result (Step 4)
[Test #2] (1 test remaining; call submit(n=2) to submit this result)

### ER8-MLP (scripts/eval_er8_mlp.sh)
Results (ER8-MLP):
  SHD=6  F1=0.4000  Precision=0.3750  Recall=0.4286

### ER12-GP (scripts/eval_er12_gp.sh)
Results (ER12-GP):
  SHD=16  F1=0.3871  Precision=0.5000  Recall=0.3158

### SF10-Mixed (scripts/eval_sf10_mixed.sh)
Results (SF10-Mixed):
  SHD=10  F1=0.6667  Precision=0.6471  Recall=0.6875

### ER15-Sigmoid (scripts/eval_er15_sigmoid.sh)
Results (ER15-Sigmoid):
  SHD=10  F1=0.7719  Precision=0.7097  Recall=0.8462

### ER20-Mixed (scripts/eval_er20_mixed.sh)
Results (ER20-Mixed):
  SHD=54  F1=0.3366  Precision=0.3953  Recall=0.2931

### SF20-GP (scripts/eval_sf20_gp.sh)
Results (SF20-GP):
  SHD=22  F1=0.6579  Precision=0.6250  Recall=0.6944

### ER12-LowSample (scripts/eval_er12_lowsample.sh)
Results (ER12-LowSample):
  SHD=11  F1=0.5517  Precision=0.8000  Recall=0.4211

[Leaderboard] Results saved: {'shd_ER8-MLP': 6, 'f1_ER8-MLP': 0.4, 'precision_ER8-MLP': 0.375, 'reca
Assistant (Step 5, submit)
Tool Input
1{
2  "n": 1
3}
Tool Result (Step 5)
[submit] Submitting result from test #1 as final.

### ER8-MLP (scripts/eval_er8_mlp.sh)
Results (ER8-MLP):
  SHD=6  F1=0.6316  Precision=0.5000  Recall=0.8571

### ER12-GP (scripts/eval_er12_gp.sh)
Results (ER12-GP):
  SHD=17  F1=0.6538  Precision=0.5152  Recall=0.8947

### SF10-Mixed (scripts/eval_sf10_mixed.sh)
Results (SF10-Mixed):
  SHD=13  F1=0.6341  Precision=0.5200  Recall=0.8125

### ER15-Sigmoid (scripts/eval_er15_sigmoid.sh)
Results (ER15-Sigmoid):
  SHD=29  F1=0.5867  Precision=0.4490  Recall=0.8462

### ER20-Mixed (scripts/eval_er20_mixed.sh)
Results (ER20-Mixed):
  SHD=58  F1=0.3860  Precision=0.3929  Recall=0.3793

### SF20-GP (scripts/eval_sf20_gp.sh)
Results (SF20-GP):
  SHD=37  F1=0.5567  Precision=0.4426  Recall=0.7500

### ER12-LowSample (scripts/eval_er12_lowsample.sh)
Results (ER12-LowSample):
  SHD=17  F1=0.5652  Precision=0.4815  Recall=0.6842

### ER20-Gauss (scripts/eval_er20_gauss.sh)
Results (ER20-Gauss):
  SHD=62  F1=0.3871  Precision=0.3636  Recall=0.4138