Module path

src/worldflux/eval_bridges/
├── annex_iv_derived.py       # Annex IV derived raw result import
├── cosmos.py                 # Cosmos generation bridge
├── cosmos_predict.py         # Cosmos-Predict raw result import
├── external_demo_harness.py  # external demo harness wrapper
├── gr00t_isaac_lab.py        # GR00T N1.7 demo and raw result import
├── libero.py                 # LIBERO suites and PRO/PLUS perturbation resolver
├── libero_plus.py            # LIBERO-Plus difficulty config helper
├── libero_pro.py             # LIBERO-PRO perturbation demo/import helper
├── minecraft.py              # Minecraft offline RL replay
├── uploaded_eval.py          # uploaded eval raw result import
└── vjepa.py                  # V-JEPA latent embedding eval
A bridge is a function, sometimes accompanied by dataclass output types. Adapters call the bridge from their run.py, and the manifest writer picks up the result.

Pattern

LIBERO is the cleanest example.
# src/worldflux/eval_bridges/libero.py
from dataclasses import dataclass
from pathlib import Path

LIBERO_SUITES = ("libero_spatial", "libero_object", "libero_goal", "libero_10")

@dataclass(frozen=True)
class LIBEROEpisodeRecord:
    suite: str
    task_id: int
    seed: int
    success: bool
    episode_return: float
    steps: int
    duration_seconds: float

@dataclass(frozen=True)
class LIBEROEvalSummary:
    suite: str
    num_episodes: int
    success_rate: float
    mean_episode_return: float
    mean_steps_per_episode: float
    episodes: list[LIBEROEpisodeRecord]

def run_libero_eval(
    *,
    server_command: tuple[str, ...],
    server_cwd: Path,
    server_env: dict[str, str],
    server_port: int = 8000,
    suites: tuple[str, ...] = LIBERO_SUITES,
    num_episodes_per_task: int = 10,
    output_dir: Path | None = None,
) -> dict[str, LIBEROEvalSummary]:
    ...
The bridge starts the adapter’s WebSocket policy server, waits for it to be ready, runs the LIBERO benchmark client per suite, and returns one LIBEROEvalSummary per suite. If output_dir is set, it also writes a JSON artifact with the same content.
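The aggregation and artifact step can be sketched as follows. This is a minimal illustration, not the shipped implementation: the two dataclasses are copied from the listing above, while `summarize_suite`, `write_summaries`, and the `libero_summary.json` file name are hypothetical names chosen for this example.

```python
from __future__ import annotations

import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from statistics import fmean


@dataclass(frozen=True)
class LIBEROEpisodeRecord:
    suite: str
    task_id: int
    seed: int
    success: bool
    episode_return: float
    steps: int
    duration_seconds: float


@dataclass(frozen=True)
class LIBEROEvalSummary:
    suite: str
    num_episodes: int
    success_rate: float
    mean_episode_return: float
    mean_steps_per_episode: float
    episodes: list[LIBEROEpisodeRecord]


def summarize_suite(suite: str, episodes: list[LIBEROEpisodeRecord]) -> LIBEROEvalSummary:
    """Hypothetical helper: fold per-episode records into one suite summary."""
    n = len(episodes)
    # Every mean falls back to 0.0 on an empty suite so the fields stay floats.
    return LIBEROEvalSummary(
        suite=suite,
        num_episodes=n,
        success_rate=fmean(float(e.success) for e in episodes) if n else 0.0,
        mean_episode_return=fmean(e.episode_return for e in episodes) if n else 0.0,
        mean_steps_per_episode=fmean(e.steps for e in episodes) if n else 0.0,
        episodes=list(episodes),
    )


def write_summaries(summaries: dict[str, LIBEROEvalSummary], output_dir: Path) -> Path:
    """Hypothetical helper: dump the per-suite summaries as one JSON file."""
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / "libero_summary.json"  # assumed file name
    # asdict() recurses into the nested episode records.
    payload = {suite: asdict(summary) for suite, summary in summaries.items()}
    path.write_text(json.dumps(payload, indent=2))
    return path


episodes = [
    LIBEROEpisodeRecord("libero_spatial", 0, 0, True, 1.0, 120, 8.5),
    LIBEROEpisodeRecord("libero_spatial", 0, 1, False, 0.0, 300, 21.0),
]
summary = summarize_suite("libero_spatial", episodes)
artifact = write_summaries({"libero_spatial": summary}, Path(tempfile.mkdtemp()))
```

Because the JSON payload is built from the same dataclasses that the function returns, the artifact and the return value cannot drift apart.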

What ships

| Bridge | Suite or source | Entry function | Output |
| --- | --- | --- | --- |
| annex_iv_derived | Annex IV derived public claims | import_annex_iv_derived_raw_results(...) | audit_input_annex_iv_derived.json |
| cosmos | Cosmos generation | run_cosmos_generation(...) | CosmosGenerationRecord list plus generation artifacts |
| cosmos_predict | Cosmos-Predict raw results | run_cosmos_predict(...), import_cosmos_predict_raw_results(...) | normalized audit input |
| external_demo_harness | external demo harness | run_external_demo_harness(...) | harness result payload |
| gr00t_isaac_lab | GR00T N1.7 Isaac Lab / official GR00T LIBERO sim evidence | run_gr00t_isaac_lab(...), import_groot_isaac_lab_raw_results(...) | audit_input_groot.json |
| libero | LIBERO 4-suite benchmark | run_libero_eval(...) | dict[str, LIBEROEvalSummary] |
| libero_plus | LIBERO-Plus perturbation config | build_libero_plus_difficulty_config(...) | config payload |
| libero_pro | LIBERO-PRO perturbation demo/import | run_libero_pro(...), import_openvla_libero_raw_results(...) | normalized audit input |
| minecraft | Minecraft offline RL replay | run_minecraft_eval(...) | replay summary |
| uploaded_eval | uploaded eval evidence | import_uploaded_eval(...) | normalized audit input |
| vjepa | V-JEPA latent embedding probe | run_vjepa_embed(...) | VJEPAEmbedRecord list plus embed artifacts |
Some bridges run a local adapter and return dataclasses; others import raw external results into the audit-input schema. The manifest writer or audit flow then owns the final JSON surface. Import bridges do not upload artifacts or decide whether a claim is publishable.

Wiring an adapter

A curated adapter’s run.py typically does three things:
  1. Start the adapter’s policy server (or call its inference function directly).
  2. Call the bridge with the arguments for the target suite.
  3. Hand the bridge’s return value to the manifest writer.
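# src/worldflux/templates/openpi/run.py (sketch)
import os

from worldflux.eval_bridges.libero import run_libero_eval

# adapter_dir, output_dir, and write_manifest are assumed to be defined
# elsewhere in run.py; this sketch does not show them.
results = run_libero_eval(
    server_command=("uv", "run", "python", "serve.py"),  # launches the policy server
    server_cwd=adapter_dir,
    server_env=os.environ.copy(),  # pass the current environment through
    output_dir=output_dir,  # also write the JSON artifact
)
write_manifest(results)  # hand the per-suite summaries to the manifest writer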
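How the manifest writer consumes the per-suite dictionary is not shown here. As a rough illustration of the hand-off, the sketch below flattens per-suite summaries into scalar metrics. `SuiteSummary` is a stand-in that carries only the aggregate fields of LIBEROEvalSummary, and `flatten_metrics` and the `suite/metric` key scheme are assumptions made for this example, not the manifest's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteSummary:
    """Stand-in for the aggregate fields of LIBEROEvalSummary."""
    suite: str
    num_episodes: int
    success_rate: float
    mean_episode_return: float
    mean_steps_per_episode: float


def flatten_metrics(results: dict[str, SuiteSummary]) -> dict[str, float]:
    """Hypothetical helper: one scalar per suite and metric, plus an overall rate."""
    metrics: dict[str, float] = {}
    for suite, summary in sorted(results.items()):
        metrics[f"{suite}/success_rate"] = summary.success_rate
        metrics[f"{suite}/mean_episode_return"] = summary.mean_episode_return
        metrics[f"{suite}/mean_steps_per_episode"] = summary.mean_steps_per_episode
    # Weight each suite by its episode count so small suites do not dominate.
    total = sum(s.num_episodes for s in results.values())
    weighted = sum(s.success_rate * s.num_episodes for s in results.values())
    metrics["overall/success_rate"] = weighted / total if total else 0.0
    return metrics


results = {
    "libero_spatial": SuiteSummary("libero_spatial", 10, 0.8, 0.8, 150.0),
    "libero_goal": SuiteSummary("libero_goal", 30, 0.5, 0.5, 220.0),
}
metrics = flatten_metrics(results)
```

The overall rate is weighted by episode count rather than averaged across suites, so it matches what pooling all episodes would give.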

Adding a bridge

  1. Write the function
     Add src/worldflux/eval_bridges/<suite>.py. Keep it function-first; only add a class if state genuinely needs to live across episodes.
  2. Define the output dataclass
     A frozen @dataclass with the fields the dashboard will need to render. Avoid optionals where you can; mean_* should always be a float, even if it’s 0.0.
  3. Write the JSON artifact
     If the suite produces per-episode detail, write a JSON file under output_dir and reference it from the dataclass. The dashboard’s run detail panel previews JSON inline.
  4. Snapshot test
     tests/eval_bridges/test_<suite>.py plus a snapshot or fixture for the smallest deterministic path. Use dry-run or raw-import fixtures for GPU-only suites; do not label smoke data as Grade A benchmark evidence.
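The four steps above can be sketched end to end for an imaginary suite. Everything named here (`ToyEvalSummary`, `run_toy_eval`, the `score_episode` callback, and the `toy_episodes.json` file name) is hypothetical and exists only to show the shape; a real bridge would drive the suite's own client instead of a callback.

```python
from __future__ import annotations

import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class ToyEvalSummary:
    suite: str
    num_episodes: int
    mean_score: float   # always a float, 0.0 when there are no episodes
    artifact_path: str  # file name of the per-episode JSON, "" if none was written


def run_toy_eval(
    *,
    score_episode: Callable[[int], float],
    num_episodes: int = 3,
    output_dir: Path | None = None,
) -> ToyEvalSummary:
    """Function-first bridge: no class, no state kept between calls."""
    scores = [float(score_episode(seed)) for seed in range(num_episodes)]
    mean_score = sum(scores) / len(scores) if scores else 0.0

    artifact_path = ""
    if output_dir is not None:
        # Per-episode detail goes into a JSON file that the summary points to.
        output_dir.mkdir(parents=True, exist_ok=True)
        artifact = output_dir / "toy_episodes.json"  # assumed file name
        detail = [{"seed": seed, "score": score} for seed, score in enumerate(scores)]
        artifact.write_text(json.dumps(detail, indent=2))
        artifact_path = artifact.name

    return ToyEvalSummary("toy", num_episodes, mean_score, artifact_path)


def test_run_toy_eval_smallest_path() -> None:
    """Smallest deterministic path: three seeds, no GPU, fixed scores."""
    out = Path(tempfile.mkdtemp())
    summary = run_toy_eval(score_episode=lambda seed: seed * 0.5, output_dir=out)
    assert asdict(summary) == {
        "suite": "toy",
        "num_episodes": 3,
        "mean_score": 0.5,
        "artifact_path": "toy_episodes.json",
    }
    assert len(json.loads((out / "toy_episodes.json").read_text())) == 3
```

Using an empty string instead of None for `artifact_path` keeps the dataclass free of optionals, at the cost of the caller having to check for the empty value.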

When you do not need a bridge

If the suite already returns JSON in roughly the shape the manifest expects, the adapter can write straight into manifest.metrics and skip the bridge layer. Bridges exist for suites whose native output (per-episode logs, per-task success matrices, vector embeddings) does not map onto the manifest 1:1.
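To illustrate the second case, a per-task success matrix has to be reduced before it fits into scalar metrics. The sketch below assumes rows are tasks and columns are seeds; `reduce_success_matrix` and the metric names are hypothetical.

```python
from __future__ import annotations


def reduce_success_matrix(matrix: list[list[bool]]) -> dict[str, float]:
    """Hypothetical reduction: rows are tasks, columns are seeds."""
    per_task = [sum(row) / len(row) if row else 0.0 for row in matrix]
    flat = [cell for row in matrix for cell in row]
    return {
        # Pooled over every episode in the matrix.
        "success_rate": sum(flat) / len(flat) if flat else 0.0,
        # The weakest task, which the pooled rate can hide.
        "worst_task_success_rate": min(per_task, default=0.0),
    }


matrix = [
    [True, True, False, True],    # task 0: 3 of 4 seeds succeed
    [False, False, True, True],   # task 1: 2 of 4 seeds succeed
]
reduced = reduce_success_matrix(matrix)
```

Which reductions to keep is a design decision for the bridge author; a suite that already emits these scalars has nothing to reduce and can skip the bridge.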