# Eval bridges

> An eval bridge maps a third-party benchmark or uploaded result bundle into WorldFlux audit input or manifest-friendly artifacts.

## Module path

```
src/worldflux/eval_bridges/
├── annex_iv_derived.py       # Annex IV derived raw result import
├── cosmos.py                 # Cosmos generation bridge
├── cosmos_predict.py         # Cosmos-Predict raw result import
├── external_demo_harness.py  # external demo harness wrapper
├── gr00t_isaac_lab.py        # GR00T N1.7 demo and raw result import
├── libero.py                 # LIBERO suites and PRO/PLUS perturbation resolver
├── libero_plus.py            # LIBERO-Plus difficulty config helper
├── libero_pro.py             # LIBERO-PRO perturbation demo/import helper
├── minecraft.py              # Minecraft offline RL replay
├── uploaded_eval.py          # uploaded eval raw result import
└── vjepa.py                  # V-JEPA latent embedding eval
```

A bridge is a function, sometimes accompanied by dataclass output types. An adapter calls the bridge from its `run.py`, and the manifest writer picks up the result.

## Pattern

LIBERO is the cleanest example.

```python
# src/worldflux/eval_bridges/libero.py
from dataclasses import dataclass
from pathlib import Path

LIBERO_SUITES = ("libero_spatial", "libero_object", "libero_goal", "libero_10")

@dataclass(frozen=True)
class LIBEROEpisodeRecord:
    suite: str
    task_id: int
    seed: int
    success: bool
    episode_return: float
    steps: int
    duration_seconds: float

@dataclass(frozen=True)
class LIBEROEvalSummary:
    suite: str
    num_episodes: int
    success_rate: float
    mean_episode_return: float
    mean_steps_per_episode: float
    episodes: list[LIBEROEpisodeRecord]

def run_libero_eval(
    *,
    server_command: tuple[str, ...],
    server_cwd: Path,
    server_env: dict[str, str],
    server_port: int = 8000,
    suites: tuple[str, ...] = LIBERO_SUITES,
    num_episodes_per_task: int = 10,
    output_dir: Path | None = None,
) -> dict[str, LIBEROEvalSummary]:
    ...
```

The bridge starts the adapter's WebSocket policy server, waits for it to be ready, runs the LIBERO benchmark client once per suite, and returns a dict that maps each suite name to its `LIBEROEvalSummary`. If `output_dir` is set, it also writes a JSON artifact with the same content.
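The aggregation step, from per-episode records to a per-suite summary and a JSON artifact, can be sketched roughly as follows. This is a minimal illustration and not the code in `libero.py`: the dataclasses are copied from the excerpt above, while `summarize_episodes`, `write_summaries`, and the `libero_eval.json` file name are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from statistics import fmean


@dataclass(frozen=True)
class LIBEROEpisodeRecord:
    suite: str
    task_id: int
    seed: int
    success: bool
    episode_return: float
    steps: int
    duration_seconds: float


@dataclass(frozen=True)
class LIBEROEvalSummary:
    suite: str
    num_episodes: int
    success_rate: float
    mean_episode_return: float
    mean_steps_per_episode: float
    episodes: list[LIBEROEpisodeRecord]


def summarize_episodes(suite: str, episodes: list[LIBEROEpisodeRecord]) -> LIBEROEvalSummary:
    """Hypothetical helper: reduce one suite's episode records to a summary."""
    n = len(episodes)
    if n == 0:
        # Keep every mean_* field a float, even when nothing ran.
        return LIBEROEvalSummary(suite, 0, 0.0, 0.0, 0.0, [])
    return LIBEROEvalSummary(
        suite=suite,
        num_episodes=n,
        success_rate=sum(e.success for e in episodes) / n,
        mean_episode_return=fmean([e.episode_return for e in episodes]),
        mean_steps_per_episode=fmean([e.steps for e in episodes]),
        episodes=list(episodes),
    )


def write_summaries(summaries: dict[str, LIBEROEvalSummary], output_dir: Path) -> Path:
    """Hypothetical helper: dump all summaries to one JSON file (name is assumed)."""
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / "libero_eval.json"
    # asdict() recurses into the nested episode dataclasses.
    payload = {suite: asdict(summary) for suite, summary in summaries.items()}
    path.write_text(json.dumps(payload, indent=2))
    return path
```

Because the summary embeds its episode list, the JSON artifact and the returned dataclasses carry the same information, which matches the behavior described above.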

## What ships

| Bridge                  | Suite or source                                           | Entry function                                                        | Output                                                  |
| ----------------------- | --------------------------------------------------------- | --------------------------------------------------------------------- | ------------------------------------------------------- |
| `annex_iv_derived`      | Annex IV derived public claims                            | `import_annex_iv_derived_raw_results(...)`                            | `audit_input_annex_iv_derived.json`                     |
| `cosmos`                | Cosmos generation                                         | `run_cosmos_generation(...)`                                          | `CosmosGenerationRecord` list plus generation artifacts |
| `cosmos_predict`        | Cosmos-Predict raw results                                | `run_cosmos_predict(...)`, `import_cosmos_predict_raw_results(...)`   | normalized audit input                                  |
| `external_demo_harness` | external demo harness                                     | `run_external_demo_harness(...)`                                      | harness result payload                                  |
| `gr00t_isaac_lab`       | GR00T N1.7 Isaac Lab / official GR00T LIBERO sim evidence | `run_gr00t_isaac_lab(...)`, `import_groot_isaac_lab_raw_results(...)` | `audit_input_groot.json`                                |
| `libero`                | LIBERO 4-suite benchmark                                  | `run_libero_eval(...)`                                                | `dict[str, LIBEROEvalSummary]`                          |
| `libero_plus`           | LIBERO-Plus perturbation config                           | `build_libero_plus_difficulty_config(...)`                            | config payload                                          |
| `libero_pro`            | LIBERO-PRO perturbation demo/import                       | `run_libero_pro(...)`, `import_openvla_libero_raw_results(...)`       | normalized audit input                                  |
| `minecraft`             | Minecraft offline RL replay                               | `run_minecraft_eval(...)`                                             | replay summary                                          |
| `uploaded_eval`         | uploaded eval evidence                                    | `import_uploaded_eval(...)`                                           | normalized audit input                                  |
| `vjepa`                 | V-JEPA latent embedding probe                             | `run_vjepa_embed(...)`                                                | `VJEPAEmbedRecord` list plus embed artifacts            |

Some bridges run a local adapter and return dataclasses; others import raw
external results into the audit-input schema. The manifest writer or audit flow
then owns the final JSON surface. Import bridges do not upload artifacts or
decide whether a claim is publishable.

## Wiring an adapter

A curated adapter's `run.py` typically does three things:

1. Start the adapter's policy server (or call its inference function directly).
2. Call the bridge with the arguments the suite needs; for `run_libero_eval` that is the server command, working directory, and environment.
3. Hand the bridge's return value to the manifest writer.

```python
# src/worldflux/templates/openpi/run.py (sketch)
import os

from worldflux.eval_bridges.libero import run_libero_eval

# adapter_dir, output_dir, and write_manifest are assumed to be defined elsewhere.

results = run_libero_eval(
    server_command=("uv", "run", "python", "serve.py"),
    server_cwd=adapter_dir,
    server_env=os.environ.copy(),
    output_dir=output_dir,
)
write_manifest(results)
```
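What the manifest writer does with the per-suite summaries is not shown here. As a rough illustration only, a caller could flatten them into scalar metrics like this. The `flatten_summaries` helper and the `<suite>/<metric>` key layout are assumptions, not the manifest schema, and the `SimpleNamespace` objects stand in for `LIBEROEvalSummary` instances so the snippet runs without WorldFlux installed.

```python
from types import SimpleNamespace


def flatten_summaries(results):
    """Hypothetical helper: turn {suite: summary} into {metric_name: float}.

    The "<suite>/<metric>" key layout is an assumption for illustration.
    """
    metrics = {}
    total_episodes = 0
    total_successes = 0.0
    for suite, summary in sorted(results.items()):
        metrics[f"{suite}/success_rate"] = float(summary.success_rate)
        metrics[f"{suite}/mean_episode_return"] = float(summary.mean_episode_return)
        metrics[f"{suite}/mean_steps_per_episode"] = float(summary.mean_steps_per_episode)
        total_episodes += summary.num_episodes
        total_successes += summary.success_rate * summary.num_episodes
    # Episode-weighted success rate across suites; 0.0 when nothing ran.
    metrics["overall/success_rate"] = (
        total_successes / total_episodes if total_episodes else 0.0
    )
    return metrics


# Stand-ins exposing the same attribute names as LIBEROEvalSummary.
results = {
    "libero_goal": SimpleNamespace(
        num_episodes=10, success_rate=0.5,
        mean_episode_return=0.5, mean_steps_per_episode=200.0,
    ),
    "libero_spatial": SimpleNamespace(
        num_episodes=30, success_rate=0.9,
        mean_episode_return=0.9, mean_steps_per_episode=150.0,
    ),
}
metrics = flatten_summaries(results)
```

Weighting the overall rate by episode count keeps a small suite from skewing the aggregate; whether the real manifest reports an aggregate at all is not stated in this page.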

## Adding a bridge

<Steps>
  <Step title="Write the function">
    Add `src/worldflux/eval_bridges/<suite>.py`. Keep it function-first; only add a class if state genuinely needs to live across episodes.
  </Step>

  <Step title="Define the output dataclass">
    Use a frozen `@dataclass` with the fields the dashboard needs to render. Avoid optional fields where you can; every `mean_*` field should always be a float, even when it is 0.0.
  </Step>

  <Step title="Write the JSON artifact">
    If the suite produces per-episode detail, write a JSON file under `output_dir` and reference it from the dataclass. The dashboard's run detail panel previews JSON inline.
  </Step>

  <Step title="Snapshot test">
    Add `tests/eval_bridges/test_<suite>.py` plus a snapshot or fixture for the smallest deterministic path. Use dry-run or raw-import fixtures for GPU-only suites; do not label smoke data as Grade A benchmark evidence.
  </Step>
</Steps>
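The four steps above might come together as in the sketch below, written for a made-up suite. Everything named `toy_suite` or `ToySuite*` is hypothetical, as are the `rollout` callback, the artifact file name, and the choice to store the artifact path as a plain string (empty when nothing was written) to avoid an optional field.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class ToySuiteSummary:
    suite: str
    num_episodes: int
    success_rate: float
    mean_episode_return: float  # always a float, 0.0 when nothing ran
    artifact_path: str  # "" when no artifact was written


def run_toy_suite_eval(
    *,
    rollout: Callable[[int], Tuple[bool, float]],
    seeds: Sequence[int] = (0, 1, 2),
    output_dir: Optional[Path] = None,
) -> ToySuiteSummary:
    """Function-first bridge: run one episode per seed and summarize."""
    episodes = []
    for seed in seeds:
        success, episode_return = rollout(seed)
        episodes.append(
            {"seed": seed, "success": bool(success), "episode_return": float(episode_return)}
        )

    n = len(episodes)
    success_rate = sum(e["success"] for e in episodes) / n if n else 0.0
    mean_return = sum(e["episode_return"] for e in episodes) / n if n else 0.0

    # Per-episode detail goes to a JSON file that the summary references.
    artifact_path = ""
    if output_dir is not None:
        output_dir.mkdir(parents=True, exist_ok=True)
        artifact = output_dir / "toy_suite_episodes.json"
        artifact.write_text(json.dumps(episodes, indent=2))
        artifact_path = str(artifact)

    return ToySuiteSummary("toy_suite", n, success_rate, mean_return, artifact_path)


def test_run_toy_suite_eval(tmp_path: Path) -> None:
    # Deterministic fake rollout: even seeds succeed, return equals the seed.
    summary = run_toy_suite_eval(
        rollout=lambda seed: (seed % 2 == 0, float(seed)),
        output_dir=tmp_path,
    )
    assert summary.num_episodes == 3
    assert abs(summary.success_rate - 2 / 3) < 1e-9
    assert summary.mean_episode_return == 1.0
    assert len(json.loads(Path(summary.artifact_path).read_text())) == 3
```

Injecting the rollout as a callable keeps the smallest path deterministic and GPU-free, so the test can run with pytest's `tmp_path` fixture and no simulator.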

## When you do not need a bridge

If the suite already returns JSON in roughly the shape the manifest expects, the adapter can write straight into `manifest.metrics` and skip the bridge layer. Bridges exist for suites whose native output (per-episode logs, per-task success matrices, vector embeddings) does not map onto the manifest 1:1.
