Evidence Triage Rules

WorldFlux treats existing run folders as untrusted input. The run-folder importer is read-only: it inventories evidence, writes a private operator report, and emits audit_input.json only when the selected candidate is claim-safe.

Accepted For Claim Packaging

WorldFlux can emit audit_input.json when one selected candidate has explicit per-episode JSON or JSONL with:
  • boolean success per episode;
  • task, suite, or episode identity;
  • a metric contract that matches the claim/protocol;
  • denominator information that prevents silently dropping failed or missing episodes;
  • no secrets, private credentials, signed URLs, or raw customer-only payloads in public fields.
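As a rough illustration of the first, second, and fourth criteria, the sketch below checks a per-episode JSONL payload. The field names (success, task, suite, episode_id) and the declared_total parameter are assumptions made for this example; the source does not specify the WorldFlux schema, and the metric-contract and secret-scanning checks are not covered here.

```python
import json

# Hypothetical identity field names; the real schema is not given in this document.
IDENTITY_KEYS = ("task", "suite", "episode_id")


def check_episodes(jsonl_text, declared_total):
    """Return a list of problems; an empty list means these basic checks passed."""
    problems = []
    records = []
    for n, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {n}: not valid JSON")
            continue
        if not isinstance(rec, dict):
            problems.append(f"line {n}: not a JSON object")
            continue
        # Strict boolean check: 0/1 or "true" strings are not accepted.
        if not isinstance(rec.get("success"), bool):
            problems.append(f"line {n}: success is not a boolean")
        if not any(key in rec for key in IDENTITY_KEYS):
            problems.append(f"line {n}: no task, suite, or episode identity")
        records.append(rec)
    # Denominator check: every declared episode must be present, including failures.
    if len(records) != declared_total:
        problems.append(
            f"denominator mismatch: {len(records)} records, {declared_total} declared"
        )
    return problems
```

Comparing the record count against a declared total is one simple way to notice silently dropped episodes; a real importer would likely also compare episode identities against a manifest.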

Report-Only Evidence

WorldFlux inventories these signals for review but does not treat them as comparable claim metrics by default:
  • W&B, MLflow, tracker summaries, and aggregate dashboards;
  • simulator, ROS/MCAP, Isaac, Cosmos, and LeRobot/GR00T dataset signals;
  • custom metrics without an explicit metric contract;
  • model names or benchmark labels without episode-level outcomes;
  • logs, screenshots, videos, and narrative notes.

Rejected For audit_input.json

WorldFlux writes the private report and refuses to emit audit_input.json when the folder is:
  • ambiguous;
  • partial;
  • archive-only;
  • CSV-only;
  • numeric-score-only;
  • aggregate-only;
  • missing episode-level success outcomes;
  • missing denominator policy;
  • dominated by raw binaries;
  • likely to expose secrets.
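The rejection rule can be sketched as a set intersection: if any blocking finding is present, the private report is still written but audit_input.json is withheld. The finding labels and the triage function below are hypothetical names chosen for this example, not WorldFlux identifiers. Note that an empty list of blocking reasons is necessary but not sufficient; the candidate must still meet the acceptance criteria.

```python
# Hypothetical labels for the rejection conditions listed above.
REJECT_REASONS = frozenset({
    "ambiguous",
    "partial",
    "archive_only",
    "csv_only",
    "numeric_score_only",
    "aggregate_only",
    "missing_episode_success",
    "missing_denominator_policy",
    "raw_binary_dominated",
    "possible_secrets",
})


def triage(findings):
    """Decide whether any finding blocks audit_input.json."""
    blocking = sorted(set(findings) & REJECT_REASONS)
    return {
        "write_private_report": True,  # the report is written in every case
        "audit_input_blocked": bool(blocking),
        "blocking_reasons": blocking,
    }
```

For example, triage(["csv_only", "has_wandb_summary"]) reports one blocking reason, while the tracker summary is merely report-only evidence and does not block on its own.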

VLA Benchmark Claims

For LIBERO, OpenPI, OpenVLA, GR00T, and related VLA benchmarks, choose the evidence grade before execution whenever WorldFlux is involved in the run. Imported results that lack pre-run model identity attestation, frozen episode manifests, an attempt policy, or a denominator policy must be labeled with the weaker Grade B/C/D wording from the VLA apple-to-apple definition.
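A minimal sketch of the imported-results check follows. It only reports which pre-run items are missing; it does not pick between Grade B, C, and D, because that mapping lives in the VLA apple-to-apple definition and is not reproduced here. The dictionary keys are hypothetical names for the four items named above.

```python
# Hypothetical keys for the four pre-run items named in the rule above.
REQUIRED_PRE_RUN = (
    "model_identity_attestation",
    "frozen_episode_manifest",
    "attempt_policy",
    "denominator_policy",
)


def missing_pre_run_items(evidence):
    """Return the pre-run items that are absent or falsy in the evidence record."""
    return [key for key in REQUIRED_PRE_RUN if not evidence.get(key)]


def needs_weaker_wording(evidence):
    """True when any pre-run item is missing, so weaker grade wording applies."""
    return bool(missing_pre_run_items(evidence))
```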

Safe Customer Wording

Use:
  • “WorldFlux packaged imported evaluation output.”
  • “The package is signed and tamper-evident.”
  • “The claim is limited to the recorded protocol and evidence scope.”

Do not say any of the following unless separately proven:
  • “official benchmark score”
  • “deployment-safe”
  • “regulatory certified”
  • “fully Apple-to-Apple”
  • “live provider runtime supported”
  • “tamper-proof”
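The wording rule lends itself to a simple lint over draft customer text. The sketch below is an illustration only, with a hypothetical function name; it uses case-insensitive substring matching, so it flags a phrase even where separate proof exists and a human reviewer still has to make that call.

```python
# Phrases restricted by the wording rule above, stored in lowercase for matching.
RESTRICTED_PHRASES = (
    "official benchmark score",
    "deployment-safe",
    "regulatory certified",
    "fully apple-to-apple",
    "live provider runtime supported",
    "tamper-proof",
)


def find_restricted_phrases(text):
    """Return the restricted phrases that appear in text, ignoring case."""
    lowered = text.lower()
    return [phrase for phrase in RESTRICTED_PHRASES if phrase in lowered]
```

The approved sentence “The package is signed and tamper-evident.” passes this check, since “tamper-evident” is not on the restricted list.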