Agent Conversation: gpt-5.4-pro

Model: gpt-5.4-pro
Total steps: 5
Messages: 11
User (Step 0)

Task: causal-observational-nonlinear

Causal Discovery: Observational Nonlinear Data

Objective

Implement a causal discovery algorithm that recovers the DAG structure from purely observational data generated by a nonlinear Additive Noise Model (ANM). Your code goes in bench/custom_algorithm.py.

Background

Additive Noise Models (ANMs) assume structural equations of the form x_j = f_j(parents(j)) + e_j, where the f_j are nonlinear functions and the e_j are independent noise terms. Under mild conditions on the f_j and the noise distribution, the causal DAG is identifiable from observational data alone (Hoyer et al., 2008; Peters et al., 2014). This goes beyond explicitly linear methods such as LiNGAM, and beyond PC and GES in their standard linear-Gaussian instantiations.
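
For intuition, a minimal sketch of such a data-generating process (the graph, functions, and noise scales here are illustrative, not the benchmark's actual generator):

    import numpy as np

    def sample_anm(n_samples: int = 500, seed: int = 0) -> np.ndarray:
        """Toy 3-variable ANM over the DAG x0 -> x1 -> x2 with x0 -> x2."""
        rng = np.random.default_rng(seed)
        x0 = rng.laplace(size=n_samples)
        x1 = np.tanh(2.0 * x0) + rng.laplace(scale=0.5, size=n_samples)          # x1 = f1(x0) + e1
        x2 = np.sin(x1) + 0.5 * x0 ** 2 + rng.laplace(scale=0.5, size=n_samples)  # x2 = f2(x0, x1) + e2
        return np.column_stack([x0, x1, x2])  # shape (n_samples, 3)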

Key challenges include: (1) testing independence in high-dimensional nonlinear settings, (2) scaling to larger graphs, (3) handling diverse nonlinear function types (MLP, GP, polynomial, sigmoid), and (4) robustness to different noise distributions including the harder Gaussian noise case.
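
Challenge (1) is the crux of ANM-based discovery: regress the effect on the putative cause and check whether the residual is independent of that cause; under the identifiability conditions only the true direction admits an independent residual. A minimal bivariate sketch, using sklearn's mutual information estimate as a cheap dependence proxy (HSIC or distance correlation are the classical choices):

    import numpy as np
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.feature_selection import mutual_info_regression

    def residual_dependence(cause: np.ndarray, effect: np.ndarray, seed: int = 0) -> float:
        """Regress effect on cause, then score how dependent the residual
        remains on the cause (near zero in the true causal direction)."""
        model = HistGradientBoostingRegressor(max_depth=3, random_state=seed)
        model.fit(cause[:, None], effect)
        resid = effect - model.predict(cause[:, None])
        return float(mutual_info_regression(cause[:, None], resid, random_state=seed)[0])

    def anm_direction(x: np.ndarray, y: np.ndarray) -> str:
        """Prefer the direction that leaves the less dependent residual."""
        return "x -> y" if residual_dependence(x, y) < residual_dependence(y, x) else "y -> x"

In-sample residuals can understate dependence through overfitting, which is why the agent's implementation below cross-fits its residuals with KFold.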

Evaluation Scenarios

Label            Graph type       Nodes  Samples  Noise        Nonlinearity
ER8-MLP          Erdos-Renyi          8      500  Exponential  MLP
ER12-GP          Erdos-Renyi         12     1000  Laplace      GP
SF10-Mixed       Scale-Free (BA)     10      500  Uniform      Mixed
ER15-Sigmoid     Erdos-Renyi         15     1000  Exponential  Sigmoid
ER20-Mixed       Erdos-Renyi         20     2000  Laplace      Mixed
SF20-GP          Scale-Free (BA)     20     2000  Exponential  GP
ER12-LowSample   Erdos-Renyi         12      150  Laplace      Mixed
ER20-Gauss       Erdos-Renyi         20     2000  Gaussian     Mixed

Metrics

All computed on the directed edge set (skeleton + direction must be correct):

  • F1 (primary ranking metric), SHD, Precision, Recall
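
A minimal sketch of these directed-edge metrics; the SHD variant here counts a reversed edge as a single error, which is one common convention (the benchmark's exact variant is not shown):

    import numpy as np

    def directed_edge_metrics(B_pred: np.ndarray, B_true: np.ndarray) -> dict:
        """F1/precision/recall over directed edges; SHD counts additions,
        deletions, and reversals (a reversal costing one)."""
        pred = set(zip(*np.nonzero(B_pred)))
        true = set(zip(*np.nonzero(B_true)))
        tp = len(pred & true)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(true) if true else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        extra, missing = pred - true, true - pred
        reversals = {(i, j) for (i, j) in extra if (j, i) in missing}
        shd = len(extra) + len(missing) - len(reversals)
        return {"f1": f1, "shd": shd, "precision": precision, "recall": recall}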

Baselines

  • cam: Causal Additive Models (Buehlmann et al., 2014) -- score-based with GAM regression
  • notears_mlp: NOTEARS with MLP parametrization (Zheng et al., 2020) -- continuous optimization
  • directlingam: DirectLiNGAM (Shimizu et al., 2011) -- linear reference baseline
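
As context for notears_mlp: NOTEARS replaces the combinatorial acyclicity constraint with the smooth function h(W) = tr(exp(W ∘ W)) - d, where ∘ is the elementwise product, which is zero exactly when W is acyclic (Zheng et al., 2018). A minimal sketch of that constraint:

    import numpy as np
    from scipy.linalg import expm

    def notears_acyclicity(W: np.ndarray) -> float:
        """h(W) = tr(exp(W * W)) - d, with W * W elementwise; h(W) == 0
        exactly when the weighted adjacency W has no directed cycles."""
        return float(np.trace(expm(W * W)) - W.shape[0])

    print(notears_acyclicity(np.array([[0.0, 1.0], [0.0, 0.0]])))  # 0.0  (a DAG)
    print(notears_acyclicity(np.array([[0.0, 1.0], [1.0, 0.0]])))  # ~1.09 (a 2-cycle)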

causal-learn/bench/custom_algorithm.py [EDITABLE — lines 3–14 only]

     1: import numpy as np
     2: 
     3: # =====================================================================
     4: # EDITABLE: implement run_causal_discovery below
     5: # =====================================================================
     6: def run_causal_discovery(X: np.ndarray) -> np.ndarray:
     7:     """
     8:     Input:  X of shape (n_samples, n_variables)
     9:     Output: adjacency matrix B of shape (n_variables, n_variables)
    10:             B[i, j] != 0  means j -> i  (follows causal-learn convention)
    11:     """
    12:     n = X.shape[1]
    13:     return np.zeros((n, n))
    14: # =====================================================================
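
To make the output convention concrete, a tiny illustration of the expected adjacency encoding for a three-variable chain:

    import numpy as np

    # Under the stated convention B[i, j] != 0 means j -> i, so row i
    # lists the parents of variable i.  For the chain x0 -> x1 -> x2:
    B = np.zeros((3, 3))
    B[1, 0] = 1.0  # x0 -> x1
    B[2, 1] = 1.0  # x1 -> x2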

cam baseline — editable region [READ-ONLY — reference implementation]

Lines 3–14:
     1: import numpy as np
Assistant (Step 1, edit)
Tool Input
1{
2  "op": "replace",
3  "filename": "causal-learn/bench/custom_algorithm.py",
4  "content": "# =====================================================================\n# EDITABLE: implement run_causal_discovery below\n# =====================================================================\ndef run_causal_discovery(X: np.ndarray) -> np.ndarray:\n    \"\"\"\n    Input:  X of shape (n_samples, n_variables)\n    Output: adjacency matrix B of shape (n_variables, n_variables)\n            B[i, j] != 0  means j -> i  (follows causal-learn convention)\n    \"\"\"\n    import os\n    from itertools import combinations\n    from sklearn.ensemble import ExtraTreesRegressor, HistGradientBoostingRegressor\n    from sklearn.feature_selection import mutual_info_regression\n    from sklearn.model_selection import KFold\n\n    # Keep thread usage low when multiple seeds are evaluated in parallel.\n    os.environ.setdefault(\"OMP_NUM_THREADS\", \"1\")\n    os.environ.setdefault(\"OPENBLAS_NUM_THREADS\", \"1\")\n    os.environ.setdefault(\"MKL_NUM_THREADS\", \"1\")\n\n    X = np.asarray(X, dtype=float)\n    n, d = X.shape\n    if d == 0:\n        return np.zeros((0, 0))\n    if d == 1:\n        return np.zeros((1, 1))\n\n    # Standardize for stable nonlinear regression / dependence testing.\n    X = X.copy()\n    X -= X.mean(axis=0, keepdims=True)\n    scale = X.std(axis=0, keepdims=True)\n    scale[scale < 1e-8] = 1.0\n    X /= scale\n\n    rng = np.random.default_rng(0)\n    dep_idx = np.arange(n) if n <= 384 else rng.choice(n, size=384, replace=False)\n    obj_cache = {}\n\n    def make_model() -> HistGradientBoostingRegressor:\n        return HistGradientBoostingRegressor(\n            loss=\"squared_error\",\n            learning_rate=0.06,\n            max_iter=60 if n >= 500 else 100,\n            max_depth=3,\n            min_samples_leaf=max(5, n // 80),\n            l2_regularization=1e-3,\n            random_state=0,\n        )\n\n    def crossfit_residual(y: np.ndarray, Z: np.ndarray) -> np.ndarray:\n        if Z.ndim == 1:\n            Z = Z[:, None]\n        if Z.shape[1] == 0:\n            return y - y.mean()\n\n        resid = np.zeros_like(y, dtype=float)\n        kf = KFold(n_splits=2, shuffle=True, random_state=0)\n        for tr, te in kf.split(Z):\n            model = make_model()\n            model.fit(Z[tr], y[tr])\n            resid[te] = y[te] - model.predict(Z[te])\n        return resid\n\n    def dep_score(Z: np.ndarray, r: np.ndarray) -> float:\n        if Z.ndim == 1:\n            Z = Z[:, None]\n        if Z.shape[1] == 0:\n            return 0.0\n\n        Zs = Z[dep_idx]\n        rs = r[dep_idx]\n        rs = rs - rs.mean()\n        rs_std = float(rs.std())\n        if rs_std < 1e-10:\n            return 0.0\n        rs = rs / rs_std\n\n        try:\n            mi = mutual_info_regression(\n                Zs,\n                rs,\n                discrete_features=False,\n                n_neighbors=5 if len(dep_idx) >= 200 else 3,\n                random_state=0,\n            )\n        except Exception:\n            mi = np.zeros(Zs.shape[1], dtype=float)\n\n        corr = np.zeros(Zs.shape[1], dtype=float)\n        for k in range(Zs.shape[1]):\n            zk = Zs[:, k]\n            if np.std(zk) < 1e-10:\n                continue\n            c = np.corrcoef(zk, rs)[0, 1]\n            corr[k] = 0.0 if not np.isfinite(c) else abs(c)\n\n        scores = np.nan_to_num(mi, nan=0.0, posinf=0.0, neginf=0.0) + 0.35 * corr\n        topk = min(3, len(scores))\n        if topk == 0:\n            return 0.0\n        top = np.partition(scores, len(scores) - 
topk)[-topk:]\n        return float(0.7 * np.max(top) + 0.3 * np.mean(top))\n\n    def objective(target: int, parents: tuple, context: tuple) -> float:\n        key = (target, parents, context)\n        if key in obj_cache:\n            return obj_cache[key]\n\n        y = X[:, target]\n        resid = crossfit_residual(y, X[:, parents]) if len(parents) > 0 else (y - y.mean())\n        dep = dep_score(X[:, context], resid) if len(context) > 0 else 0.0\n        mse = float(np.mean(resid ** 2))\n        # Dependence is primary; fit and tiny complexity penalty stabilize selection.\n        value = dep + 0.12 * mse + 0.01 * len(parents) / max(1, len(context) + 1)\n        obj_cache[key] = value\n        return value\n\n    # -----------------------------------------------------------------\n    # Stage 1: estimate a topological order by repeated sink peeling.\n    # Novelty: out-of-fold residual-independence scoring prevents overfit.\n    # -----------------------------------------------------------------\n    remaining = list(range(d))\n    reverse_order = []\n    while remaining:\n        if len(remaining) == 1:\n            reverse_order.append(remaining.pop())\n            break\n\n        best_node = None\n        best_score = np.inf\n        for j in remaining:\n            predictors = tuple(k for k in remaining if k != j)\n            resid = crossfit_residual(X[:, j], X[:, predictors])\n            score = dep_score(X[:, predictors], resid) + 0.12 * float(np.mean(resid ** 2))\n            if score < best_score:\n                best_score = score\n                best_node = j\n\n        reverse_order.append(best_node)\n        remaining.remove(best_node)\n\n    order = reverse_order[::-1]\n\n    # -----------------------------------------------------------------\n    # Stage 2: order-constrained parent selection.\n    # Novelty: knockoff-ranked nonlinear search + pairwise-synergy insertion.\n    # -----------------------------------------------------------------\n    B = np.zeros((d, d), dtype=float)\n    min_gain = 0.006 + 0.02 / np.sqrt(max(n, 2))\n\n    for pos, target in enumerate(order):\n        context = tuple(order[:pos])\n        if len(context) == 0:\n            continue\n\n        local_rng = np.random.default_rng(1234 + target)\n        Z_real = X[:, context]\n        Z_fake = np.column_stack([local_rng.permutation(Z_real[:, i]) for i in range(Z_real.shape[1])])\n        ranker = ExtraTreesRegressor(\n            n_estimators=96 if n < 1000 else 64,\n            max_depth=5,\n            min_samples_leaf=max(2, n // 150),\n            random_state=0,\n            n_jobs=1,\n        )\n        ranker.fit(np.hstack([Z_real, Z_fake]), X[:, target])\n        imp = ranker.feature_importances_\n        comp = imp[: len(context)] - imp[len(context) :]\n        order_idx = np.argsort(-comp)\n        cand_order = [context[i] for i in order_idx]\n\n        selected = tuple()\n        current = objective(target, selected, context)\n\n        while True:\n            best_set = None\n            best_value = current\n\n            # Best single addition.\n            for cand in cand_order:\n                if cand in selected:\n                    continue\n                trial = tuple(sorted(selected + (cand,)))\n                value = objective(target, trial, context)\n                if value < best_value:\n                    best_value = value\n                    best_set = trial\n\n            # Best pair addition among the strongest remaining candidates.\n            
top_left = [c for c in cand_order if c not in selected][:5]\n            for a, b in combinations(top_left, 2):\n                trial = tuple(sorted(selected + (a, b)))\n                value = objective(target, trial, context)\n                if value < best_value:\n                    best_value = value\n                    best_set = trial\n\n            if best_set is None or current - best_value <= min_gain:\n                break\n            selected, current = best_set, best_value\n\n        # Backward cleanup.\n        changed = True\n        while changed and len(selected) > 0:\n            changed = False\n            for cand in list(selected):\n                trial = tuple(x for x in selected if x != cand)\n                value = objective(target, trial, context)\n                if value <= current + 0.003:\n                    selected, current = trial, value\n                    changed = True\n                    break\n\n        for parent in selected:\n            B[target, parent] = 1.0\n\n    return B\n# =====================================================================",
5  "after_line": 14,
6  "start_line": 3,
7  "end_line": 14
8}
Tool Result (Step 1)
OK: Replaced lines 3..14 with 218 line(s) in causal-learn/bench/custom_algorithm.py. Editable range: 3–220.

[Current file: causal-learn/bench/custom_algorithm.py | editable: 3–220 | total: 220 lines]
     1| import numpy as np
     2| 
     3  # =====================================================================
     4  # EDITABLE: implement run_causal_discovery below
     5  # =====================================================================
     6  def run_causal_discovery(X: np.ndarray) -> np.ndarray:
     7      """
     8      Input:  X of shape (n_samples, n_variables)
     9      Output: adjacency matrix B of shape (n_variables, n_variables)
    10              B[i, j] != 0  means j -> i  (follows causal-learn convention)
    11      """
    12      import os
    13      from itertools import combinations
    14      from sklearn.ensemble import ExtraTreesRegressor, HistGradientBoostingRegressor
    15      from sklearn.feature_selection import mutual_info_regression
    
Assistant (Step 2, test)
Tool Result (Step 2)
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)

### ER8-MLP (scripts/eval_er8_mlp.sh)
Results (ER8-MLP):
  SHD=6  F1=0.6316  Precision=0.5000  Recall=0.8571

### ER12-GP (scripts/eval_er12_gp.sh)
Results (ER12-GP):
  SHD=17  F1=0.6538  Precision=0.5152  Recall=0.8947

### SF10-Mixed (scripts/eval_sf10_mixed.sh)
Results (SF10-Mixed):
  SHD=13  F1=0.6341  Precision=0.5200  Recall=0.8125

### ER15-Sigmoid (scripts/eval_er15_sigmoid.sh)
Results (ER15-Sigmoid):
  SHD=29  F1=0.5867  Precision=0.4490  Recall=0.8462

### ER20-Mixed (scripts/eval_er20_mixed.sh)
Results (ER20-Mixed):
  SHD=58  F1=0.3860  Precision=0.3929  Recall=0.3793

### SF20-GP (scripts/eval_sf20_gp.sh)
Results (SF20-GP):
  SHD=37  F1=0.5567  Precision=0.4426  Recall=0.7500

### ER12-LowSample (scripts/eval_er12_lowsample.sh)
Results (ER12-LowSample):
  SHD=17  F1=0.5652  Precision=0.4815  Recall=0.6842

[Leaderboard] Results saved: {'shd_ER8-MLP': 6, 'f1_ER8-MLP': 0.6316, 'precision_ER8-MLP': 0.5, 're
Assistant (Step 3, edit)
Tool Input
1{
2  "op": "replace",
3  "filename": "causal-learn/bench/custom_algorithm.py",
4  "content": "# =====================================================================\n# EDITABLE: implement run_causal_discovery below\n# =====================================================================\ndef run_causal_discovery(X: np.ndarray) -> np.ndarray:\n    \"\"\"\n    Input:  X of shape (n_samples, n_variables)\n    Output: adjacency matrix B of shape (n_variables, n_variables)\n            B[i, j] != 0  means j -> i  (follows causal-learn convention)\n    \"\"\"\n    import os\n    from itertools import combinations\n    from sklearn.ensemble import ExtraTreesRegressor, HistGradientBoostingRegressor\n    from sklearn.feature_selection import mutual_info_regression\n    from sklearn.model_selection import KFold\n\n    os.environ.setdefault(\"OMP_NUM_THREADS\", \"1\")\n    os.environ.setdefault(\"OPENBLAS_NUM_THREADS\", \"1\")\n    os.environ.setdefault(\"MKL_NUM_THREADS\", \"1\")\n\n    X = np.asarray(X, dtype=float)\n    n, d = X.shape\n    if d == 0:\n        return np.zeros((0, 0))\n    if d == 1:\n        return np.zeros((1, 1))\n\n    X = X.copy()\n    X -= X.mean(axis=0, keepdims=True)\n    scale = X.std(axis=0, keepdims=True)\n    scale[scale < 1e-8] = 1.0\n    X /= scale\n\n    rng = np.random.default_rng(0)\n    dep_idx = np.arange(n) if n <= 256 else rng.choice(n, size=256, replace=False)\n\n    resid_cache = {}\n    obj_cache = {}\n    cond_cache = {}\n    pair_cache = {}\n\n    def make_model() -> HistGradientBoostingRegressor:\n        return HistGradientBoostingRegressor(\n            loss=\"squared_error\",\n            learning_rate=0.06,\n            max_iter=50 if n >= 1000 else (70 if n >= 400 else 100),\n            max_depth=3,\n            min_samples_leaf=max(5, n // 80),\n            l2_regularization=1e-3,\n            random_state=0,\n        )\n\n    def crossfit_residual(y: np.ndarray, Z: np.ndarray) -> np.ndarray:\n        if Z.ndim == 1:\n            Z = Z[:, None]\n        if Z.shape[1] == 0:\n            return y - y.mean()\n\n        resid = np.zeros_like(y, dtype=float)\n        kf = KFold(n_splits=2, shuffle=True, random_state=0)\n        for tr, te in kf.split(Z):\n            model = make_model()\n            model.fit(Z[tr], y[tr])\n            resid[te] = y[te] - model.predict(Z[te])\n        return resid\n\n    def get_resid(var: int, cond: tuple) -> np.ndarray:\n        cond = tuple(sorted(cond))\n        key = (var, cond)\n        if key not in resid_cache:\n            resid_cache[key] = crossfit_residual(X[:, var], X[:, cond]) if len(cond) > 0 else (X[:, var] - X[:, var].mean())\n        return resid_cache[key]\n\n    def dep_score(Z: np.ndarray, r: np.ndarray) -> float:\n        if Z.ndim == 1:\n            Z = Z[:, None]\n        if Z.shape[1] == 0:\n            return 0.0\n\n        Zs = Z[dep_idx]\n        rs = r[dep_idx]\n        rs = rs - rs.mean()\n        rs_std = float(rs.std())\n        if rs_std < 1e-10:\n            return 0.0\n        rs = rs / rs_std\n\n        try:\n            mi = mutual_info_regression(\n                Zs,\n                rs,\n                discrete_features=False,\n                n_neighbors=5 if len(dep_idx) >= 160 else 3,\n                random_state=0,\n            )\n        except Exception:\n            mi = np.zeros(Zs.shape[1], dtype=float)\n\n        corr = np.zeros(Zs.shape[1], dtype=float)\n        nonlin = np.zeros(Zs.shape[1], dtype=float)\n        rs2 = np.tanh(rs)\n        for k in range(Zs.shape[1]):\n            zk = Zs[:, k]\n            zk_std = float(np.std(zk))\n         
   if zk_std < 1e-10:\n                continue\n            c1 = np.corrcoef(zk, rs)[0, 1]\n            c2 = np.corrcoef(np.tanh(zk), rs)[0, 1]\n            c3 = np.corrcoef(zk ** 2, rs2)[0, 1]\n            corr[k] = 0.0 if not np.isfinite(c1) else abs(c1)\n            nonlin[k] = max(0.0 if not np.isfinite(c2) else abs(c2), 0.0 if not np.isfinite(c3) else abs(c3))\n\n        scores = np.nan_to_num(mi, nan=0.0, posinf=0.0, neginf=0.0) + 0.25 * corr + 0.15 * nonlin\n        topk = min(3, len(scores))\n        if topk == 0:\n            return 0.0\n        top = np.partition(scores, len(scores) - topk)[-topk:]\n        return float(0.65 * np.max(top) + 0.35 * np.mean(top))\n\n    def objective(target: int, parents: tuple, context: tuple) -> float:\n        parents = tuple(sorted(parents))\n        context = tuple(context)\n        key = (target, parents, context)\n        if key in obj_cache:\n            return obj_cache[key]\n\n        resid = get_resid(target, parents)\n        dep = dep_score(X[:, context], resid) if len(context) > 0 else 0.0\n        mse = float(np.mean(resid ** 2))\n        complexity = len(parents) * (0.004 + 1.5 * np.log(max(n, 3)) / max(n, 1))\n        value = dep + 0.10 * mse + complexity\n        obj_cache[key] = value\n        return value\n\n    def cond_link_score(target: int, cand: int, cond: tuple) -> tuple:\n        cond = tuple(sorted(cond))\n        key = (target, cand, cond)\n        if key in cond_cache:\n            return cond_cache[key]\n\n        ry = get_resid(target, cond)\n        rx = get_resid(cand, cond)\n        real = dep_score(rx[:, None], ry)\n        perm_rng = np.random.default_rng(2027 + 97 * target + 31 * cand + 7 * len(cond))\n        fake = dep_score(rx[perm_rng.permutation(n)][:, None], ry)\n        cond_cache[key] = (real, fake)\n        return real, fake\n\n    def pair_pref(a: int, b: int) -> float:\n        key = (a, b)\n        if key in pair_cache:\n            return pair_cache[key]\n\n        s_ab = dep_score(X[:, [a]], get_resid(b, (a,)))\n        s_ba = dep_score(X[:, [b]], get_resid(a, (b,)))\n        pref = s_ba - s_ab  # positive => a is more likely before b\n        pair_cache[(a, b)] = pref\n        pair_cache[(b, a)] = -pref\n        return pref\n\n    # Stage 1: sink peeling with pairwise ANM vote regularization.\n    remaining = list(range(d))\n    reverse_order = []\n    while remaining:\n        if len(remaining) == 1:\n            reverse_order.append(remaining.pop())\n            break\n\n        best_node = None\n        best_score = np.inf\n        for j in remaining:\n            others = tuple(k for k in remaining if k != j)\n            resid = get_resid(j, others)\n            score = dep_score(X[:, others], resid) + 0.10 * float(np.mean(resid ** 2))\n            if len(others) > 1:\n                score += 0.28 * float(np.mean([max(0.0, pair_pref(j, k)) for k in others]))\n            if score < best_score:\n                best_score = score\n                best_node = j\n\n        reverse_order.append(best_node)\n        remaining.remove(best_node)\n\n    order = reverse_order[::-1]\n\n    # Stage 2: order-constrained parent search with conditional falsification.\n    B = np.zeros((d, d), dtype=float)\n    base_thr = 0.008 + 0.12 / np.sqrt(max(n, 2))\n\n    for pos, target in enumerate(order):\n        context = tuple(order[:pos])\n        if len(context) == 0:\n            continue\n\n        local_rng = np.random.default_rng(1234 + target)\n        Z_real = X[:, context]\n        Z_fake = 
np.column_stack([local_rng.permutation(Z_real[:, i]) for i in range(Z_real.shape[1])])\n        ranker = ExtraTreesRegressor(\n            n_estimators=80 if n < 1000 else 64,\n            max_depth=5,\n            min_samples_leaf=max(2, n // 150),\n            random_state=0,\n            n_jobs=1,\n        )\n        ranker.fit(np.hstack([Z_real, Z_fake]), X[:, target])\n        imp = ranker.feature_importances_\n        comp = imp[: len(context)] - imp[len(context) :]\n\n        rank_scores = []\n        for i, cand in enumerate(context):\n            pair_bonus = max(0.0, pair_pref(cand, target))\n            uni = dep_score(X[:, [cand]], X[:, target])\n            rank_scores.append(comp[i] + 0.35 * pair_bonus + 0.15 * uni)\n        rank_scores = np.asarray(rank_scores)\n        cand_order = [context[i] for i in np.argsort(-rank_scores)]\n\n        screen_cap = min(len(context), max(4, int(np.ceil(2.0 * np.sqrt(len(context)) + 1))))\n        cand_order = cand_order[:screen_cap]\n        max_parents = min(screen_cap, max(2, int(np.ceil(1.2 * np.sqrt(len(context)) + 1))))\n\n        selected = tuple()\n        current = objective(target, selected, context)\n\n        while len(selected) < max_parents:\n            best_set = None\n            best_value = current\n            best_metric = base_thr\n            single_stats = {}\n\n            for cand in cand_order:\n                if cand in selected:\n                    continue\n                real, fake = cond_link_score(target, cand, selected)\n                trial = tuple(sorted(selected + (cand,)))\n                value = objective(target, trial, context)\n                gain = current - value\n                metric = (real - 1.10 * fake) + 0.90 * gain + 0.08 * max(0.0, pair_pref(cand, target))\n                single_stats[cand] = (metric, value)\n                if metric > best_metric:\n                    best_metric = metric\n                    best_set = trial\n                    best_value = value\n\n            top_left = [c for c in cand_order if c not in selected]\n            top_left = sorted(top_left, key=lambda c: single_stats.get(c, (-1e9, np.inf))[0], reverse=True)[:4]\n            for a, b in combinations(top_left, 2):\n                trial = tuple(sorted(selected + (a, b)))\n                value = objective(target, trial, context)\n                gain = current - value\n                ra, fa = cond_link_score(target, a, selected)\n                rb, fb = cond_link_score(target, b, selected)\n                metric = gain + 0.60 * ((ra - fa) + (rb - fb))\n                if metric > best_metric + 0.002:\n                    best_metric = metric\n                    best_set = trial\n                    best_value = value\n\n            if best_set is None:\n                break\n            selected, current = best_set, best_value\n\n        changed = True\n        while changed and len(selected) > 0:\n            changed = False\n            full_value = objective(target, selected, context)\n            for cand in list(selected):\n                others = tuple(x for x in selected if x != cand)\n                real, fake = cond_link_score(target, \n... (truncated)",
5  "after_line": 5,
6  "start_line": 3,
7  "end_line": 220
8}
Tool Result (Step 3)
OK: Replaced lines 3..220 with 283 line(s) in causal-learn/bench/custom_algorithm.py. Editable range: 3–285.

[Current file: causal-learn/bench/custom_algorithm.py | editable: 3–285 | total: 285 lines]
     1| import numpy as np
     2| 
     3  # =====================================================================
     4  # EDITABLE: implement run_causal_discovery below
     5  # =====================================================================
     6  def run_causal_discovery(X: np.ndarray) -> np.ndarray:
     7      """
     8      Input:  X of shape (n_samples, n_variables)
     9      Output: adjacency matrix B of shape (n_variables, n_variables)
    10              B[i, j] != 0  means j -> i  (follows causal-learn convention)
    11      """
    12      import os
    13      from itertools import combinations
    14      from sklearn.ensemble import ExtraTreesRegressor, HistGradientBoostingRegressor
    15      from sklearn.feature_selection import mutual_info_regression
   
Assistant (Step 4, test)
Tool Result (Step 4)
[Test #2] (1 test remaining; call submit(n=2) to submit this result)

### ER8-MLP (scripts/eval_er8_mlp.sh)
Results (ER8-MLP):
  SHD=6  F1=0.4000  Precision=0.3750  Recall=0.4286

### ER12-GP (scripts/eval_er12_gp.sh)
Results (ER12-GP):
  SHD=16  F1=0.3871  Precision=0.5000  Recall=0.3158

### SF10-Mixed (scripts/eval_sf10_mixed.sh)
Results (SF10-Mixed):
  SHD=10  F1=0.6667  Precision=0.6471  Recall=0.6875

### ER15-Sigmoid (scripts/eval_er15_sigmoid.sh)
Results (ER15-Sigmoid):
  SHD=10  F1=0.7719  Precision=0.7097  Recall=0.8462

### ER20-Mixed (scripts/eval_er20_mixed.sh)
Results (ER20-Mixed):
  SHD=54  F1=0.3366  Precision=0.3953  Recall=0.2931

### SF20-GP (scripts/eval_sf20_gp.sh)
Results (SF20-GP):
  SHD=22  F1=0.6579  Precision=0.6250  Recall=0.6944

### ER12-LowSample (scripts/eval_er12_lowsample.sh)
Results (ER12-LowSample):
  SHD=11  F1=0.5517  Precision=0.8000  Recall=0.4211

[Leaderboard] Results saved: {'shd_ER8-MLP': 6, 'f1_ER8-MLP': 0.4, 'precision_ER8-MLP': 0.375, 'reca
Assistant (Step 5, submit)
Tool Input
1{
2  "n": 1
3}
Tool Result (Step 5)
[submit] Submitting result from test #1 as final.

### ER8-MLP (scripts/eval_er8_mlp.sh)
Results (ER8-MLP):
  SHD=6  F1=0.6316  Precision=0.5000  Recall=0.8571

### ER12-GP (scripts/eval_er12_gp.sh)
Results (ER12-GP):
  SHD=17  F1=0.6538  Precision=0.5152  Recall=0.8947

### SF10-Mixed (scripts/eval_sf10_mixed.sh)
Results (SF10-Mixed):
  SHD=13  F1=0.6341  Precision=0.5200  Recall=0.8125

### ER15-Sigmoid (scripts/eval_er15_sigmoid.sh)
Results (ER15-Sigmoid):
  SHD=29  F1=0.5867  Precision=0.4490  Recall=0.8462

### ER20-Mixed (scripts/eval_er20_mixed.sh)
Results (ER20-Mixed):
  SHD=58  F1=0.3860  Precision=0.3929  Recall=0.3793

### SF20-GP (scripts/eval_sf20_gp.sh)
Results (SF20-GP):
  SHD=37  F1=0.5567  Precision=0.4426  Recall=0.7500

### ER12-LowSample (scripts/eval_er12_lowsample.sh)
Results (ER12-LowSample):
  SHD=17  F1=0.5652  Precision=0.4815  Recall=0.6842

### ER20-Gauss (scripts/eval_er20_gauss.sh)
Results (ER20-Gauss):
  SHD=62  F1=0.3871  Precision=0.3636  Recall=0.4138