# Anti-cherrypick eval packets

> Local-private protocol plans, run sheets, eval packets, and reviewer briefs for robotics/VLA evidence review.

Anti-cherrypick eval packets are local-private artifacts for robotics and VLA startups that need to package credible evaluation evidence for internal review or a private reviewer handoff. They are evidence-organization tools, not public certifications or deployment-acceptability decisions.

## Boundary

WorldFlux separates two objects:

* A **Protocol Plan** is pre-score. It freezes the robot/policy profile, candidate probe registry digest, selected probes, selected tasks, axes, episode indexes, seeds, denominator policy, and missing-evidence questions.
* An **Eval Packet** exists only after normalized `AuditInput` evidence is attached and checked against the frozen plan. Packet generation validates plan/evidence conformance, including matched cells, missing cells, out-of-protocol episodes, digest mismatches, and metadata mismatches.
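
The conformance categories above can be pictured as set comparisons between the cells frozen in the plan and the cells found in the attached evidence. The following is a minimal sketch under stated assumptions, not WorldFlux's implementation: the `Cell` key shape, the `ConformanceReport` type, and `check_conformance` are hypothetical names introduced only for illustration, and metadata checks are omitted.

```python
from dataclasses import dataclass, field

# Hypothetical cell key: (probe, task, axis, episode_index, seed).
# The real plan schema may identify cells differently.
Cell = tuple[str, str, str, int, int]


@dataclass
class ConformanceReport:
    matched: set[Cell] = field(default_factory=set)
    missing: set[Cell] = field(default_factory=set)
    out_of_protocol: set[Cell] = field(default_factory=set)
    digest_mismatch: bool = False


def check_conformance(planned_cells, plan_digest, evidence_cells, evidence_plan_digest):
    """Compare evidence against a frozen plan (illustrative only)."""
    planned = set(planned_cells)
    observed = set(evidence_cells)
    return ConformanceReport(
        matched=planned & observed,          # planned and evidenced
        missing=planned - observed,          # planned but no evidence
        out_of_protocol=observed - planned,  # evidence the plan never selected
        digest_mismatch=plan_digest != evidence_plan_digest,
    )


planned = [
    ("libero-pro", "pick_place", "lighting", 0, 1234),
    ("libero-pro", "pick_place", "lighting", 1, 1234),
]
observed = [
    ("libero-pro", "pick_place", "lighting", 0, 1234),
    ("libero-pro", "pick_place", "lighting", 7, 1234),  # episode 7 was never planned
]
report = check_conformance(planned, "sha256:aaa", observed, "sha256:aaa")
```

In this example, episode 0 is matched, episode 1 is missing, and episode 7 is out of protocol. Because the plan is frozen before any score exists, extra favorable episodes cannot be folded into the denominator after the fact.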

Generated artifacts are `LOCAL_PRIVATE` by default. The MVP does not run benchmarks, upload data, host models, proxy credentials, or decide whether a robot can be deployed. The intended distribution is internal review and preparation, plus local-private reviewer handoff only.

Private reviewer briefs are not public-share-ready. They keep redaction/consent warnings, unsupported claims, missing evidence, and reviewer questions visible. Public sharing, third-party publication, or endorsement-style use requires a separate path for review, redaction, customer consent, signing, and verification.

## Flow

```bash
uv run worldflux eval-profile create \
  --profile-id customer-vla-v3-tabletop \
  --policy-id customer_vla_v3 \
  --use-case "fixed-camera tabletop pick-and-place demo" \
  --embodiment-class single_arm_tabletop \
  --robot-model franka-panda \
  --simulator-family libero \
  --action-space end_effector_delta_pose \
  --action-space gripper_action \
  --action-control-mode delta_pose \
  --coordinate-frame end_effector \
  --degrees-of-freedom 6 \
  --gripper-control binary \
  --observation-space rgb_static_camera \
  --observation-space language_instruction \
  --camera-topology static \
  --camera-config-id static-front-v1 \
  --language-input \
  --control-frequency-hz 20 \
  --reset-policy episode_reset \
  --metric-contract boolean_success_per_episode \
  --real-to-sim-calibration calibration-v1 \
  --environment-version libero-local \
  --adapter-version worldflux-libero-bridge \
  --training-exposure-possible-benchmark-family libero \
  --training-exposure-caveat "operator did not provide a full training-data manifest" \
  --claim-intent "constrained tabletop manipulation evidence" \
  --reviewer-next-action "review conformance, missing cells, and unsupported claims" \
  --output profile.json
uv run worldflux eval-portfolio plan --profile profile.json --seed 1234 --seed-source operator_supplied --custom-eval-contract contract.json --output plan.json
uv run worldflux eval-portfolio render-plan --plan plan.json --output plan.md
uv run worldflux eval-portfolio render-run-sheet --plan plan.json --output run_sheet.md --run-sheet-json run_sheet.json
uv run worldflux eval-portfolio packet --plan plan.json --audit-input libero-pro=audit_libero.json --audit-input libero-plus=audit_libero_plus.json --output packet.md --packet-json packet.json
uv run worldflux eval-portfolio reviewer-brief --packet packet.json --packet-markdown-ref packet.md --output reviewer_brief.md --brief-json reviewer_brief.json
```

The run sheet is the handoff for external eval execution. The packet step consumes already-normalized `AuditInput` files, including outputs from `worldflux audit import run-folder`. By default, missing or underpowered evidence stays in the packet as a reviewer-visible warning; add `--strict-missing-evidence` to fail packet generation instead.
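
The difference between warning mode and strict mode can be sketched as follows. This illustrates the behavior described above and is not the CLI's internal code: `MissingEvidenceError`, `resolve_missing_evidence`, and the cell label format are hypothetical, and underpowered evidence is left out for brevity.

```python
class MissingEvidenceError(Exception):
    """Hypothetical error type used only in this sketch."""


def resolve_missing_evidence(missing_cells, strict=False):
    """Return reviewer-visible warnings, or raise when strict mode is on."""
    warnings = [f"missing evidence for cell: {cell}" for cell in sorted(missing_cells)]
    if strict and warnings:
        # Strict mode: refuse to produce a packet with known gaps.
        raise MissingEvidenceError("; ".join(warnings))
    # Default mode: gaps stay visible to the reviewer as warnings.
    return warnings


warnings = resolve_missing_evidence({"libero-plus/task-3/seed-1234"})

try:
    resolve_missing_evidence({"libero-plus/task-3/seed-1234"}, strict=True)
    strict_failed = False
except MissingEvidenceError:
    strict_failed = True
```

Either way the gap is surfaced. Strict mode only changes whether a packet with known gaps can be produced at all.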

## Custom eval contracts

Use `--custom-eval-contract` when the selected probe is customer-owned rather than a public benchmark family. The contract is a private review input and must include:

* a task manifest reference and digest
* a metric schema reference
* inclusion and exclusion rules
* replay or audit metadata keys
* a reviewer-owned task source
* a customer consent marker
* the customer use case, acceptance question, workflow claim, and task owner

WorldFlux records the declared consent marker but does not verify legal consent validity. Custom contract values are pinned into selected protocol cells and checked again when evidence is attached.
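
As a rough illustration of the required contents and of pinning, the sketch below checks a contract dictionary for the listed items and computes a digest that could be compared again when evidence is attached. The field names, `missing_fields`, and `contract_digest` are assumptions made for this example; the actual contract schema and pinning mechanism may differ.

```python
import hashlib
import json

# Hypothetical field names mirroring the list above; the real schema may differ.
REQUIRED_FIELDS = (
    "task_manifest_ref",
    "task_manifest_digest",
    "metric_schema_ref",
    "inclusion_rules",
    "exclusion_rules",
    "audit_metadata_keys",
    "task_source",
    "customer_consent_marker",
    "use_case",
    "acceptance_question",
    "workflow_claim",
    "task_owner",
)


def missing_fields(contract: dict) -> list[str]:
    """List required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not contract.get(name)]


def contract_digest(contract: dict) -> str:
    """Digest of a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(contract, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


contract = {name: f"example-{name}" for name in REQUIRED_FIELDS}
pinned = contract_digest(contract)  # recorded when the plan is frozen

# Later, when evidence is attached, an edited claim no longer matches the pin.
edited = dict(contract, workflow_claim="a broader claim than was planned")
changed = contract_digest(edited) != pinned
```

The presence check only confirms that a consent marker was declared. As noted above, it says nothing about whether the consent is legally valid.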

## What reviewers should read

Reviewers should treat the packet as a structured acceptance discussion, not a binary pass/fail verdict. The packet shows observed score signals next to unsupported claims, training-exposure caveats, missing protocol cells, out-of-protocol evidence, infrastructure failures, and next falsification axes.
