Anti-cherrypick eval packets are local-private artifacts for robotics and VLA startups that need to package credible evaluation evidence for internal review or a private reviewer handoff. They are evidence organization tools, not public certifications and not deployment acceptability decisions.

Boundary

WorldFlux separates two objects:
  • A Protocol Plan is pre-score. It freezes the robot/policy profile, candidate probe registry digest, selected probes, selected tasks, axes, episode indexes, seeds, denominator policy, and missing-evidence questions.
  • An Eval Packet exists only after normalized AuditInput evidence is attached and checked against the frozen plan. Packet generation validates plan/evidence conformance, including matched cells, missing cells, out-of-protocol episodes, digest mismatches, and metadata mismatches.
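The plan/evidence conformance check can be pictured as set arithmetic over protocol cells. The sketch below is illustrative only: the cell key (task, axis, episode index, seed), the ConformanceReport class, and check_conformance are hypothetical names chosen for this example, not the WorldFlux schema or API, and digest and metadata mismatches are left out.

```python
from dataclasses import dataclass

# Hypothetical cell key: (task, axis, episode_index, seed).
# This is an assumption for illustration, not the WorldFlux schema.
Cell = tuple[str, str, int, int]


@dataclass(frozen=True)
class ConformanceReport:
    matched: frozenset          # cells in the frozen plan with attached evidence
    missing: frozenset          # cells in the frozen plan with no evidence
    out_of_protocol: frozenset  # evidence for cells the plan never selected


def check_conformance(planned: set[Cell], observed: set[Cell]) -> ConformanceReport:
    """Compare cells frozen in the plan against cells found in the evidence."""
    return ConformanceReport(
        matched=frozenset(planned & observed),
        missing=frozenset(planned - observed),
        out_of_protocol=frozenset(observed - planned),
    )


planned = {
    ("pick_block", "lighting", 0, 1234),
    ("pick_block", "lighting", 1, 1234),
}
observed = {
    ("pick_block", "lighting", 0, 1234),
    ("pick_block", "lighting", 7, 1234),  # episode index not in the plan
}
report = check_conformance(planned, observed)
```

Because the plan is frozen before any score exists, an extra episode counts as out-of-protocol evidence rather than quietly enlarging the denominator.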
Generated artifacts are LOCAL_PRIVATE by default. The MVP does not run benchmarks, upload data, host models, proxy credentials, or decide whether a robot can be deployed. The intended distribution is limited to internal review and preparation, plus a local-private handoff to a private reviewer. Private reviewer briefs are not public-share-ready: they keep redaction/consent warnings, unsupported claims, missing evidence, and reviewer questions visible. Public sharing, third-party publication, or endorsement-style use requires a separate path for review, redaction, customer consent, signing, and verification.

Flow

uv run worldflux eval-profile create \
  --profile-id customer-vla-v3-tabletop \
  --policy-id customer_vla_v3 \
  --use-case "fixed-camera tabletop pick-and-place demo" \
  --embodiment-class single_arm_tabletop \
  --robot-model franka-panda \
  --simulator-family libero \
  --action-space end_effector_delta_pose \
  --action-space gripper_action \
  --action-control-mode delta_pose \
  --coordinate-frame end_effector \
  --degrees-of-freedom 6 \
  --gripper-control binary \
  --observation-space rgb_static_camera \
  --observation-space language_instruction \
  --camera-topology static \
  --camera-config-id static-front-v1 \
  --language-input \
  --control-frequency-hz 20 \
  --reset-policy episode_reset \
  --metric-contract boolean_success_per_episode \
  --real-to-sim-calibration calibration-v1 \
  --environment-version libero-local \
  --adapter-version worldflux-libero-bridge \
  --training-exposure-possible-benchmark-family libero \
  --training-exposure-caveat "operator did not provide a full training-data manifest" \
  --claim-intent "constrained tabletop manipulation evidence" \
  --reviewer-next-action "review conformance, missing cells, and unsupported claims" \
  --output profile.json
uv run worldflux eval-portfolio plan --profile profile.json --seed 1234 --seed-source operator_supplied --custom-eval-contract contract.json --output plan.json
uv run worldflux eval-portfolio render-plan --plan plan.json --output plan.md
uv run worldflux eval-portfolio render-run-sheet --plan plan.json --output run_sheet.md --run-sheet-json run_sheet.json
uv run worldflux eval-portfolio packet --plan plan.json --audit-input libero-pro=audit_libero.json --audit-input libero-plus=audit_libero_plus.json --output packet.md --packet-json packet.json
uv run worldflux eval-portfolio reviewer-brief --packet packet.json --packet-markdown-ref packet.md --output reviewer_brief.md --brief-json reviewer_brief.json
The run sheet is the handoff for external eval execution. The packet step consumes already-normalized AuditInput files, including outputs from worldflux audit import run-folder. Add --strict-missing-evidence to make packet generation fail on missing or underpowered evidence instead of leaving it as a reviewer-visible warning.
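The difference between the default and strict behavior can be sketched as follows. This is a hypothetical illustration of the warn-versus-fail semantics described above, not WorldFlux's implementation: MissingEvidenceError, finalize_packet, and the warning strings are all invented for this example.

```python
class MissingEvidenceError(Exception):
    """Hypothetical error raised when strict mode rejects a packet."""


def finalize_packet(missing_cells, underpowered_cells, strict=False):
    """Collect evidence gaps; warn by default, fail when strict is set."""
    problems = [f"missing evidence: {cell}" for cell in missing_cells]
    problems += [f"underpowered evidence: {cell}" for cell in underpowered_cells]

    if strict and problems:
        # Strict mode: gaps block packet generation entirely.
        raise MissingEvidenceError("; ".join(problems))

    # Default mode: gaps stay in the packet as reviewer-visible warnings.
    return {"warnings": problems}
```

In the default mode the reviewer decides how much weight a gap deserves. Strict mode suits cases where the operator has already decided that an incomplete packet should not be handed off.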

Custom eval contracts

Use --custom-eval-contract when the selected probe is customer-owned rather than a public benchmark family. The contract is a private review input and must include:
  • a task manifest reference and digest
  • a metric schema reference
  • inclusion and exclusion rules
  • replay or audit metadata keys
  • a reviewer-owned task source
  • a customer consent marker
  • the customer use case, acceptance question, workflow claim, and task owner
WorldFlux records the declared consent marker but does not verify legal consent validity. Custom contract values are pinned into selected protocol cells and checked again when evidence is attached.
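As a rough pre-flight check before passing contract.json to the planner, an operator could confirm that every required item is present and that the task manifest digest still matches the file. The sketch below makes several assumptions: the key names in REQUIRED_FIELDS, the "sha256:" digest format, and both helper functions are hypothetical and are not taken from the WorldFlux contract schema.

```python
import hashlib

# Hypothetical key names mirroring the required items listed above.
REQUIRED_FIELDS = (
    "task_manifest_ref",
    "task_manifest_digest",
    "metric_schema_ref",
    "inclusion_rules",
    "exclusion_rules",
    "replay_metadata_keys",
    "task_source",
    "customer_consent_marker",
    "customer_use_case",
    "acceptance_question",
    "workflow_claim",
    "task_owner",
)


def missing_fields(contract: dict) -> list[str]:
    """Return required keys that are absent or empty in the contract."""
    return [key for key in REQUIRED_FIELDS if not contract.get(key)]


def manifest_digest_matches(contract: dict, manifest_bytes: bytes) -> bool:
    """Check the declared digest against the manifest (SHA-256 is an assumption)."""
    actual = "sha256:" + hashlib.sha256(manifest_bytes).hexdigest()
    return contract.get("task_manifest_digest") == actual


manifest_bytes = b'{"tasks": ["pick_block", "place_block"]}'
contract = {key: "placeholder" for key in REQUIRED_FIELDS}
contract["task_manifest_digest"] = "sha256:" + hashlib.sha256(manifest_bytes).hexdigest()

print(missing_fields(contract))
print(manifest_digest_matches(contract, manifest_bytes))
```

A check like this only confirms that a consent marker was declared. It says nothing about whether the consent is legally valid.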

What reviewers should read

Reviewers should treat the packet as a structured acceptance discussion, not a binary pass/fail verdict. The packet shows observed score signals next to unsupported claims, training-exposure caveats, missing protocol cells, out-of-protocol evidence, infrastructure failures, and next falsification axes.