
Module path

src/worldflux/eval_bridges/
├── cosmos.py      # NVIDIA Cosmos-Predict / RoboCasa rollouts
├── libero.py      # LIBERO suites (spatial, object, goal, 10)
├── minecraft.py   # Minecraft offline RL
└── vjepa.py       # V-JEPA latent embedding eval

A bridge is a plain function, sometimes accompanied by dataclass output types. Adapters call the bridge from their run.py; the manifest writer picks up the result.

Pattern

LIBERO is the cleanest example.
# src/worldflux/eval_bridges/libero.py
from dataclasses import dataclass
from pathlib import Path

LIBERO_SUITES = ("libero_spatial", "libero_object", "libero_goal", "libero_10")

@dataclass(frozen=True)
class LIBEROEpisodeRecord:
    suite: str
    task_id: int
    seed: int
    success: bool
    episode_return: float
    steps: int
    duration_seconds: float

@dataclass(frozen=True)
class LIBEROEvalSummary:
    suite: str
    num_episodes: int
    success_rate: float
    mean_episode_return: float
    mean_steps_per_episode: float
    episodes: list[LIBEROEpisodeRecord]

def run_libero_eval(
    *,
    server_command: tuple[str, ...],
    server_cwd: Path,
    server_env: dict[str, str],
    server_port: int = 8000,
    suites: tuple[str, ...] = LIBERO_SUITES,
    num_episodes_per_task: int = 10,
    output_dir: Path | None = None,
) -> dict[str, LIBEROEvalSummary]:
    ...
The bridge starts the adapter’s WebSocket policy server, waits for it to be ready, runs the LIBERO benchmark client per suite, and returns one LIBEROEvalSummary per suite. If output_dir is set, it also writes a JSON artifact with the same content.
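
Internally, the launch-and-wait step looks roughly like the sketch below, assuming "ready" means the server's TCP port accepts connections. The helper names here are illustrative, not the bridge's real internals.
# Sketch only: the launch-and-wait step a bridge performs before running episodes.
import socket
import subprocess
import time
from pathlib import Path

def _wait_for_port(port: int, timeout: float = 60.0) -> None:
    """Poll until the policy server accepts TCP connections on `port`."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection(("127.0.0.1", port), timeout=1.0):
                return
        except OSError:
            time.sleep(0.5)
    raise TimeoutError(f"policy server never opened port {port}")

def _start_policy_server(
    command: tuple[str, ...], cwd: Path, env: dict[str, str], port: int
) -> subprocess.Popen:
    """Launch the adapter's policy server and block until it is reachable."""
    proc = subprocess.Popen(command, cwd=cwd, env=env)
    _wait_for_port(port)
    return proc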

What ships

| Bridge | Suite | Entry function | Output |
| --- | --- | --- | --- |
| cosmos | RoboCasa rollouts via Cosmos-Predict | run_cosmos_rollout(...) | RolloutSummary per scene |
| libero | LIBERO 4-suite benchmark | run_libero_eval(...) | dict[str, LIBEROEvalSummary] |
| minecraft | Minecraft offline RL replay | run_minecraft_eval(...) | MinecraftReplaySummary |
| vjepa | V-JEPA latent embedding probe | run_vjepa_embed_eval(...) | VJEPAEmbedSummary |
Each bridge ships its own dataclass output type. The manifest writer flattens them into JSON; the dashboard reads manifest.metrics directly.
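
The flattening is essentially dataclasses.asdict followed by json.dumps; a minimal illustration (the manifest writer's actual field handling may differ):
# Sketch: asdict recurses into nested dataclasses and lists of them.
import json
from dataclasses import asdict

from worldflux.eval_bridges.libero import LIBEROEpisodeRecord, LIBEROEvalSummary

summary = LIBEROEvalSummary(
    suite="libero_spatial",
    num_episodes=1,
    success_rate=1.0,
    mean_episode_return=1.0,
    mean_steps_per_episode=42.0,
    episodes=[
        LIBEROEpisodeRecord(
            suite="libero_spatial", task_id=0, seed=0, success=True,
            episode_return=1.0, steps=42, duration_seconds=3.1,
        )
    ],
)
print(json.dumps(asdict(summary), indent=2))  # JSON-safe nested dict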

Wiring an adapter

A curated adapter’s run.py typically does three things:
  1. Start the adapter’s policy server (or call its inference function directly).
  2. Call the bridge with the right task argument.
  3. Hand the bridge’s return value to the manifest writer.
# src/worldflux/templates/openpi/run.py (sketch)
import os

from worldflux.eval_bridges.libero import run_libero_eval

results = run_libero_eval(
    server_command=("uv", "run", "python", "serve.py"),
    server_cwd=adapter_dir,
    server_env=os.environ.copy(),
    output_dir=output_dir,
)
write_manifest(results)

Adding a bridge

  1. Write the function: add src/worldflux/eval_bridges/<suite>.py. Keep it function-first; only add a class if state genuinely needs to live across episodes.
  2. Define the output dataclass: a frozen @dataclass with the fields the dashboard will need to render. Avoid optionals where you can; mean_* should always be a float, even if it's 0.0.
  3. Write the JSON artifact: if the suite produces per-episode detail, write a JSON file under output_dir and reference it from the dataclass. The dashboard's run detail panel previews JSON inline. (A combined sketch of steps 1-3 follows this list.)
  4. Snapshot test: add tests/eval_bridges/test_<suite>.py plus a snapshot under tests/eval_bridges/snapshots/. Run the suite once on the smallest model the project supports and commit the manifest.
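
A hypothetical skeleton covering steps 1-3; the suite name, dataclass, and field names below are invented for illustration, not a shipped bridge:
# src/worldflux/eval_bridges/example.py (hypothetical)
import json
from dataclasses import dataclass, replace
from pathlib import Path

@dataclass(frozen=True)
class ExampleEvalSummary:
    num_episodes: int
    success_rate: float               # always a float, even when 0.0
    mean_episode_return: float
    episodes_json: str | None = None  # path to per-episode detail, if written

def run_example_eval(*, output_dir: Path | None = None) -> ExampleEvalSummary:
    episodes: list[dict] = []  # run the suite here; append one record per episode
    summary = ExampleEvalSummary(
        num_episodes=len(episodes),
        success_rate=0.0,
        mean_episode_return=0.0,
    )
    if output_dir is not None:
        artifact = output_dir / "example_episodes.json"
        artifact.write_text(json.dumps(episodes, indent=2))
        summary = replace(summary, episodes_json=str(artifact))
    return summary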

When you do not need a bridge

If the suite already returns JSON in roughly the shape the manifest expects, the adapter can write straight into manifest.metrics and skip the bridge layer. Bridges exist for suites whose native output (per-episode logs, per-task success matrices, vector embeddings) does not map onto the manifest 1:1.
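
A sketch of that shortcut, assuming write_manifest accepts a plain metrics dict as in the wiring example above; run_suite_natively is a hypothetical stand-in for the suite's own entry point:
# Sketch: no bridge layer; the suite's output already matches the manifest shape.
metrics = run_suite_natively(output_dir=output_dir)  # e.g. {"success_rate": 0.8, ...}
write_manifest(metrics)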