Module path
run.py; the manifest writer picks up the result.
Pattern
LIBERO is the cleanest example. run_libero_eval(...) returns a LIBEROEvalSummary per suite. If output_dir is set, it also writes a JSON artifact with the same content.
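The pattern described above can be sketched as follows. This is a minimal illustration, not the actual LIBERO bridge: SuiteSummary, run_suite_eval, the episode_results input, and the eval_summary.json filename are all hypothetical stand-ins chosen for the example.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class SuiteSummary:
    """Hypothetical stand-in for a per-suite summary such as LIBEROEvalSummary."""

    suite: str
    num_episodes: int
    mean_success_rate: float  # always a float, 0.0 when there are no episodes


def run_suite_eval(
    episode_results: dict[str, list[bool]],
    output_dir: str | Path | None = None,
) -> dict[str, SuiteSummary]:
    """Summarize per-suite episode outcomes; optionally mirror them to JSON."""
    summaries: dict[str, SuiteSummary] = {}
    for suite, outcomes in episode_results.items():
        mean = sum(outcomes) / len(outcomes) if outcomes else 0.0
        summaries[suite] = SuiteSummary(suite, len(outcomes), float(mean))

    if output_dir is not None:
        out = Path(output_dir)
        out.mkdir(parents=True, exist_ok=True)
        # The artifact carries the same content as the returned summaries.
        payload = {name: asdict(s) for name, s in summaries.items()}
        (out / "eval_summary.json").write_text(json.dumps(payload, indent=2))

    return summaries
```

Returning the summaries and writing the artifact from the same payload keeps the in-memory result and the file on disk from drifting apart.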
What ships
| Bridge | Suite or source | Entry function | Output |
|---|---|---|---|
| annex_iv_derived | Annex IV derived public claims | import_annex_iv_derived_raw_results(...) | audit_input_annex_iv_derived.json |
| cosmos | Cosmos generation | run_cosmos_generation(...) | CosmosGenerationRecord list plus generation artifacts |
| cosmos_predict | Cosmos-Predict raw results | run_cosmos_predict(...), import_cosmos_predict_raw_results(...) | normalized audit input |
| external_demo_harness | external demo harness | run_external_demo_harness(...) | harness result payload |
| gr00t_isaac_lab | GR00T N1.7 Isaac Lab / official GR00T LIBERO sim evidence | run_gr00t_isaac_lab(...), import_groot_isaac_lab_raw_results(...) | audit_input_groot.json |
| libero | LIBERO 4-suite benchmark | run_libero_eval(...) | dict[str, LIBEROEvalSummary] |
| libero_plus | LIBERO-Plus perturbation config | build_libero_plus_difficulty_config(...) | config payload |
| libero_pro | LIBERO-PRO perturbation demo/import | run_libero_pro(...), import_openvla_libero_raw_results(...) | normalized audit input |
| minecraft | Minecraft offline RL replay | run_minecraft_eval(...) | replay summary |
| uploaded_eval | uploaded eval evidence | import_uploaded_eval(...) | normalized audit input |
| vjepa | V-JEPA latent embedding probe | run_vjepa_embed(...) | VJEPAEmbedRecord list plus embed artifacts |
Wiring an adapter
A curated adapter’s run.py typically does three things:
- Start the adapter’s policy server (or call its inference function directly).
- Call the bridge with the right task argument.
- Hand the bridge’s return value to the manifest writer.
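The three steps above can be sketched as a small run.py. Everything here is a hypothetical stand-in: start_policy, run_bridge, and write_manifest are placeholders for the adapter's own policy startup, whichever bridge entry function applies, and the real manifest writer, whose actual signatures are not shown in this page.

```python
from __future__ import annotations

from typing import Any, Callable


def start_policy(checkpoint: str) -> Callable[[dict], list[float]]:
    """Hypothetical: stands in for starting a policy server or loading an inference function."""

    def policy(observation: dict) -> list[float]:
        return [0.0] * 7  # placeholder action vector

    return policy


def run_bridge(policy: Callable[[dict], list[float]], task: str) -> dict[str, Any]:
    """Hypothetical bridge entry point; a real one would roll out episodes for `task`."""
    action = policy({"task": task})
    return {"task": task, "action_dim": len(action)}


def write_manifest(result: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical manifest writer; here it only wraps the bridge result."""
    return {"metrics": result}


def main(checkpoint: str = "checkpoint.pt", task: str = "example_task") -> dict[str, Any]:
    policy = start_policy(checkpoint)       # 1. start the policy
    result = run_bridge(policy, task=task)  # 2. call the bridge with the task argument
    return write_manifest(result)           # 3. hand the result to the manifest writer


if __name__ == "__main__":
    print(main())
```

Keeping run.py this thin means suite-specific logic stays in the bridge and the adapter only decides which policy and which task to pass in.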
Adding a bridge
Write the function
Add src/worldflux/eval_bridges/<suite>.py. Keep it function-first; only add a class if state genuinely needs to live across episodes.
Define the output dataclass
A frozen @dataclass with the fields the dashboard will need to render. Avoid optionals where you can; mean_* should always be a float, even if it’s 0.0.
Write the JSON artifact
If the suite produces per-episode detail, write a JSON file under output_dir and reference it from the dataclass. The dashboard’s run detail panel previews JSON inline.
When you do not need a bridge
If the suite already returns JSON in roughly the shape the manifest expects, the adapter can write straight into manifest.metrics and skip the bridge layer. Bridges exist for suites whose native output (per-episode logs, per-task success matrices, vector embeddings) does not map onto the manifest 1:1.
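One way to picture the decision is the sketch below. It assumes, purely for illustration, that the manifest is a plain dict with a "metrics" key and that "roughly the shape the manifest expects" means a flat mapping of names to numbers; maps_directly and record_metrics are hypothetical helpers, and the real manifest schema may differ.

```python
from __future__ import annotations

from numbers import Real
from typing import Any


def maps_directly(suite_output: dict[str, Any]) -> bool:
    """True when every value is a plain number, so nothing needs summarizing."""
    return all(
        isinstance(value, Real) and not isinstance(value, bool)
        for value in suite_output.values()
    )


def record_metrics(manifest: dict[str, Any], suite_output: dict[str, Any]) -> bool:
    """Copy scalar output into manifest["metrics"]; return False if a bridge is needed."""
    if not maps_directly(suite_output):
        return False  # nested logs, matrices, or embeddings: route through a bridge
    manifest.setdefault("metrics", {}).update(
        {name: float(value) for name, value in suite_output.items()}
    )
    return True
```

With this sketch, an output like {"success_rate": 0.8} is written directly, while a per-task success matrix is rejected and left for a bridge to summarize first.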