Eval bridges

Module path

src/worldflux/eval_bridges/
├── cosmos.py      # NVIDIA Cosmos-Predict / RoboCasa rollouts
├── libero.py      # LIBERO suites (spatial, object, goal, 10)
├── minecraft.py   # Minecraft offline RL
└── vjepa.py       # V-JEPA latent embedding eval

A bridge is a function (sometimes plus dataclass output types). Adapters call the bridge from their run.py; the manifest writer picks up the result.

Pattern

LIBERO is the cleanest example.

# src/worldflux/eval_bridges/libero.py
from dataclasses import dataclass
from pathlib import Path

LIBERO_SUITES = ("libero_spatial", "libero_object", "libero_goal", "libero_10")

@dataclass(frozen=True)
class LIBEROEpisodeRecord:
    suite: str
    task_id: int
    seed: int
    success: bool
    episode_return: float
    steps: int
    duration_seconds: float

@dataclass(frozen=True)
class LIBEROEvalSummary:
    suite: str
    num_episodes: int
    success_rate: float
    mean_episode_return: float
    mean_steps_per_episode: float
    episodes: list[LIBEROEpisodeRecord]

def run_libero_eval(
    *,
    server_command: tuple[str, ...],
    server_cwd: Path,
    server_env: dict[str, str],
    server_port: int = 8000,
    suites: tuple[str, ...] = LIBERO_SUITES,
    num_episodes_per_task: int = 10,
    output_dir: Path | None = None,
) -> dict[str, LIBEROEvalSummary]:
    ...

The bridge starts the adapter’s WebSocket policy server, waits for it to be ready, runs the LIBERO benchmark client per suite, and returns one LIBEROEvalSummary per suite. If output_dir is set, it also writes a JSON artifact with the same content.
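The start, wait, run, and return sequence can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual bridge: `wait_for_port`, `run_with_server`, and the `run_suite` callback are hypothetical names, and readiness is approximated with a plain TCP connect, whereas the real bridge may use a WebSocket handshake or a health check.

```python
from __future__ import annotations

import os
import socket
import subprocess
import time
from pathlib import Path
from typing import Callable


def wait_for_port(host: str, port: int, timeout_s: float = 60.0) -> bool:
    """Poll until a TCP connect succeeds or the timeout expires (hypothetical helper)."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)  # server not listening yet; retry shortly
    return False


def run_with_server(
    *,
    server_command: tuple[str, ...],
    server_cwd: Path,
    server_env: dict[str, str],
    server_port: int,
    suites: tuple[str, ...],
    run_suite: Callable[[str], object],  # hypothetical per-suite client
    ready_timeout_s: float = 60.0,
) -> dict[str, object]:
    """Start the server, wait for readiness, run each suite, then stop the server."""
    proc = subprocess.Popen(server_command, cwd=server_cwd, env=server_env)
    try:
        if not wait_for_port("127.0.0.1", server_port, ready_timeout_s):
            raise TimeoutError(f"server not ready on port {server_port}")
        # One result per suite, keyed by suite name.
        return {suite: run_suite(suite) for suite in suites}
    finally:
        proc.terminate()
        try:
            proc.wait(timeout=10)
        except subprocess.TimeoutExpired:
            proc.kill()  # escalate if the server ignores SIGTERM
```

Stopping the server in a `finally` block means a failing suite does not leave a stray process holding the port.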

What ships

| Bridge | Suite | Entry function | Output |
| --- | --- | --- | --- |
| `cosmos` | RoboCasa rollouts via Cosmos-Predict | `run_cosmos_rollout(...)` | `RolloutSummary` per scene |
| `libero` | LIBERO 4-suite benchmark | `run_libero_eval(...)` | `dict[str, LIBEROEvalSummary]` |
| `minecraft` | Minecraft offline RL replay | `run_minecraft_eval(...)` | `MinecraftReplaySummary` |
| `vjepa` | V-JEPA latent embedding probe | `run_vjepa_embed_eval(...)` | `VJEPAEmbedSummary` |

Each bridge ships its own dataclass output type. The manifest writer flattens them into JSON; the dashboard reads manifest.metrics directly.
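As an illustration of that flattening step, the sketch below turns a mapping of dataclass results into plain dictionaries using the standard library. `flatten_summaries` and the `Summary` stand-in are hypothetical; the real manifest writer may structure its output differently.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass


@dataclass(frozen=True)
class Summary:  # stand-in for a bridge output type
    suite: str
    success_rate: float


def flatten_summaries(results: dict) -> dict:
    """Convert {name: dataclass instance} into JSON-serializable dicts."""
    return {
        key: asdict(value)
        if is_dataclass(value) and not isinstance(value, type)
        else value
        for key, value in results.items()
    }


metrics = flatten_summaries({"libero_goal": Summary("libero_goal", 0.9)})
payload = json.dumps(metrics, sort_keys=True)  # ready to embed in a manifest
```

`asdict` recurses into nested dataclasses and lists, so a summary that carries a list of episode records flattens in one call.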

Wiring an adapter

A curated adapter’s run.py typically does three things:

1. Start the adapter’s policy server (or call its inference function directly).
2. Call the bridge with the right task argument.
3. Hand the bridge’s return value to the manifest writer.

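# src/worldflux/templates/openpi/run.py (sketch)
import os

from worldflux.eval_bridges.libero import run_libero_eval

# adapter_dir, output_dir, and write_manifest are defined elsewhere in the adapter (not shown).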

results = run_libero_eval(
    server_command=("uv", "run", "python", "serve.py"),
    server_cwd=adapter_dir,
    server_env=os.environ.copy(),
    output_dir=output_dir,
)
write_manifest(results)

Adding a bridge

Write the function

Add src/worldflux/eval_bridges/<suite>.py. Keep it function-first; only add a class if state genuinely needs to live across episodes.

Define the output dataclass

Use a frozen @dataclass with the fields the dashboard needs to render. Avoid optional fields where you can; mean_* fields should always be floats, even when the value is 0.0.
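A minimal sketch of that guideline, assuming a hypothetical `ExampleEvalSummary` type and `build_summary` helper: the aggregate fields fall back to 0.0 for an empty run instead of becoming None.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ExampleEvalSummary:  # hypothetical output type for a new bridge
    suite: str
    num_episodes: int
    success_rate: float
    mean_episode_return: float


def build_summary(
    suite: str, successes: list[bool], returns: list[float]
) -> ExampleEvalSummary:
    """Aggregate per-episode values; means are floats even with zero episodes."""
    n = len(returns)
    return ExampleEvalSummary(
        suite=suite,
        num_episodes=n,
        success_rate=sum(successes) / n if n else 0.0,
        mean_episode_return=sum(returns) / n if n else 0.0,
    )
```

Keeping every field non-optional lets downstream code format the values without None checks.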

Write the JSON artifact

If the suite produces per-episode detail, write a JSON file under output_dir and reference it from the dataclass. The dashboard’s run detail panel previews JSON inline.
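One way to do this, sketched with hypothetical names (`EpisodeDetail`, `SuiteSummary`, `write_episode_artifact`, and the `<suite>_episodes.json` file name are all assumptions), is to write the per-episode list to a file and store only its relative path on the summary:

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class EpisodeDetail:  # hypothetical per-episode record
    task_id: int
    success: bool


@dataclass(frozen=True)
class SuiteSummary:  # hypothetical summary that points at the artifact
    suite: str
    success_rate: float
    episodes_artifact: str  # file name relative to output_dir


def write_episode_artifact(
    output_dir: Path, suite: str, episodes: list[EpisodeDetail]
) -> str:
    """Write per-episode detail as JSON and return the file name."""
    output_dir.mkdir(parents=True, exist_ok=True)
    name = f"{suite}_episodes.json"
    body = json.dumps([asdict(e) for e in episodes], indent=2)
    (output_dir / name).write_text(body, encoding="utf-8")
    return name
```

Storing a relative path rather than an absolute one keeps the summary valid if the output directory is moved or archived.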

Snapshot test

Add tests/eval_bridges/test_<suite>.py plus a snapshot under tests/eval_bridges/snapshots/. Run the suite once on the smallest model the project supports and commit the resulting manifest.
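A snapshot comparison can be as small as the helper below. This is a sketch, assuming a hypothetical `assert_matches_snapshot` function; the project may rely on a snapshot plugin or a different file layout instead.

```python
import json
from pathlib import Path


def assert_matches_snapshot(
    actual: dict, snapshot_path: Path, *, update: bool = False
) -> None:
    """Compare a manifest dict against a committed JSON snapshot.

    Writes the snapshot when it is missing or when update=True.
    """
    serialized = json.dumps(actual, indent=2, sort_keys=True) + "\n"
    if update or not snapshot_path.exists():
        snapshot_path.parent.mkdir(parents=True, exist_ok=True)
        snapshot_path.write_text(serialized, encoding="utf-8")
        return
    expected = snapshot_path.read_text(encoding="utf-8")
    assert serialized == expected, f"manifest drifted from {snapshot_path}"
```

Sorting keys and fixing the indent keeps the committed file stable across runs, so a diff in the snapshot reflects a real change in the manifest. Fields that vary between runs, such as durations, would need to be stripped or rounded before comparison.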

When you do not need a bridge

If the suite already returns JSON in roughly the shape the manifest expects, the adapter can write straight into manifest.metrics and skip the bridge layer. Bridges exist for suites whose native output (per-episode logs, per-task success matrices, vector embeddings) does not map onto the manifest 1:1.
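For the bridge-free case, an adapter might do something like the following. The manifest shape is not specified here, so this is only a sketch: `metrics_from_suite_json` and the `prefix/key` naming are assumptions, and the resulting dict stands in for whatever the adapter assigns to manifest.metrics.

```python
from __future__ import annotations

import json


def metrics_from_suite_json(raw: str, *, prefix: str) -> dict[str, float]:
    """Keep numeric top-level fields from a suite's JSON output.

    Keys are namespaced as "<prefix>/<key>" (an assumed convention).
    """
    data = json.loads(raw)
    return {
        f"{prefix}/{key}": float(value)
        for key, value in data.items()
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }
```

If the suite output needs more than this kind of light filtering, that is usually the point at which a bridge becomes worthwhile.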
