What to evaluate
A good evaluation set covers more than final prose output. Depending on the company, test:
- import preview quality
- company graph resolution
- skill attachment behavior
- task execution quality
- output artifacts
- token and time cost
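A minimal sketch of tracking which of these dimensions an eval set actually touches; the dimension names and the `dimensions` field on each case are illustrative assumptions, not part of any specific harness.

```python
# Illustrative catalog of evaluation dimensions (names are assumptions).
DIMENSIONS = [
    "import_preview_quality",
    "company_graph_resolution",
    "skill_attachment_behavior",
    "task_execution_quality",
    "output_artifacts",
    "token_and_time_cost",
]

def coverage(eval_set):
    """Return (covered, missing) dimensions for a list of case dicts."""
    touched = {d for case in eval_set for d in case.get("dimensions", [])}
    covered = sorted(set(DIMENSIONS) & touched)
    missing = sorted(set(DIMENSIONS) - touched)
    return covered, missing

covered, missing = coverage([{"dimensions": ["import_preview_quality"]}])
print(missing)  # the five dimensions this one-case set does not touch
```

A quick coverage report like this makes gaps visible before you invest in grading.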
Start with realistic test cases
Each eval case should include:
- a realistic user or operator prompt
- the company or repo path being evaluated
- expected outputs or behaviors
- optional input files
Good starting scenarios:
- import an Agent Company into a new environment and inspect the preview tree
- attach engineering skills to an agent and compare desired vs actual state
- execute a recurring planning task with and without the company
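The case structure above can be sketched as a small dataclass; the field names are illustrative, not a required schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical shape for one eval case (field names are assumptions).
@dataclass
class EvalCase:
    prompt: str                       # realistic user or operator prompt
    target_path: str                  # company or repo path under test
    expected: dict                    # expected outputs or behaviors
    input_files: List[str] = field(default_factory=list)  # optional inputs

case = EvalCase(
    prompt="Import the company and show the preview tree",
    target_path="companies/planning",
    expected={"preview": ["create", "update"]},
)
print(case.input_files)  # defaults to an empty list
```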
Compare against a baseline
Run each case at least two ways:
- with the current company
- without the company or with the previous version
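A minimal A/B harness sketch under these assumptions: every case runs once with the candidate company and once against the baseline (no company, or the previous version). `run_case` is a stand-in for your real executor, not an existing API.

```python
# Placeholder executor: a real runner would execute the task and grade it.
def run_case(case, company):
    return {"case": case, "company": company, "passed": company is not None}

def compare(cases, candidate, baseline=None):
    """Run each case with the candidate company and with the baseline."""
    return [
        {
            "case": c,
            "candidate": run_case(c, candidate),
            "baseline": run_case(c, baseline),
        }
        for c in cases
    ]

rows = compare(["case-1", "case-2"], candidate="company-v2")
print(len(rows))  # one row per case, each with both runs
```

Keeping both runs in one row makes side-by-side grading and delta reporting trivial later.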
Write objective assertions first
Prefer checks like:
- expected manifests were discovered
- skill shortnames resolved correctly
- import preview shows the intended create or update actions
- a report file or artifact exists
- the output includes required sections
Track cost and drift
Collect per-run data such as:
- pass rate
- failure category
- duration
- total tokens
- whether the adapter or runtime state matched the company intent
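One row of that per-run telemetry could look like the dataclass below; this is a sketch, not a fixed schema, and every field name is an assumption.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# One row of per-run telemetry (field names are illustrative).
@dataclass
class RunRecord:
    case_id: str
    passed: bool
    failure_category: Optional[str]  # None when the run passed
    duration_s: float
    total_tokens: int
    state_matched_intent: bool       # adapter/runtime state vs company intent

rec = RunRecord("import-preview-01", True, None, 12.4, 8300, True)
print(asdict(rec)["total_tokens"])
```

`asdict` gives a flat dict per run, which drops straight into a CSV or a dashboard for drift tracking.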
Use failures to refine the company
Read failures at three levels:
- company design: wrong boundary between company, team, agent, and skill
- instructions: unclear role behavior or missing defaults
- tooling: weak import preview, weak pinning, weak sync visibility
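A crude first pass at this triage can be keyword-based; the keywords below are illustrative only, and real triage still needs a human read.

```python
# Keyword buckets for the three review levels (keywords are assumptions).
LEVELS = {
    "company design": ("boundary", "team", "agent scope"),
    "instructions": ("unclear", "missing default", "role"),
    "tooling": ("preview", "pinning", "sync"),
}

def triage(note):
    """Bucket a free-text failure note into one of the three levels."""
    note = note.lower()
    for level, keywords in LEVELS.items():
        if any(k in note for k in keywords):
            return level
    return "unclassified"

print(triage("import preview missed a manifest"))
```

Anything landing in "unclassified" is a signal to read the trace directly rather than trust the buckets.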
The loop
- run the eval set with and without the company
- grade objective assertions
- review outputs and execution traces
- tighten manifests, descriptions, or bundled resources
- rerun and compare the delta
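The loop above can be sketched as a small driver; `run_and_grade` is a placeholder for executing a case and grading its objective assertions, and the manual review/tightening step happens between rounds.

```python
def eval_loop(cases, run_and_grade, rounds=2):
    """Track pass rate per round so the delta between reruns is visible."""
    history = []
    for _ in range(rounds):
        passed = sum(1 for case in cases if run_and_grade(case))
        history.append(passed / len(cases))
        # between rounds: review traces, tighten manifests or descriptions
    return history

# Toy grader stands in for real assertion grading.
history = eval_loop(["a", "bb", "ccc"], run_and_grade=lambda c: len(c) > 1)
print(history)
```

Comparing `history[-1]` against `history[0]` tells you whether the manifest and description changes actually moved the pass rate.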