# Evaluating Agent Companies

> How to test whether an Agent Company improves outcomes using eval-driven iteration.

An Agent Company is only useful if it improves outcomes in practice.
The right eval loop tests both company structure and actual task execution.

## What to evaluate

A good evaluation set covers more than final prose output.
Depending on the company, test:

* import preview quality
* company graph resolution
* skill attachment behavior
* task execution quality
* output artifacts
* token and time cost

## Start with realistic test cases

Each eval case should include:

* a realistic user or operator prompt
* the company or repo path being evaluated
* expected outputs or behaviors
* optional input files

Examples:

* import an Agent Company into a new environment and inspect the preview tree
* attach engineering skills to an agent and compare desired vs actual state
* execute a recurring planning task with and without the company
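
One way to keep cases consistent is to store them as small structured records. The sketch below is illustrative only: the `EvalCase` class, its field names, and the `companies/example-co` path are assumptions made for this example, not a schema defined by Agent Companies.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class EvalCase:
    """One eval case. Field names are hypothetical, not a defined schema."""
    name: str
    prompt: str               # realistic user or operator prompt
    company_path: Path        # company or repo path being evaluated
    expected: list[str]       # expected outputs or behaviors, in plain language
    input_files: list[Path] = field(default_factory=list)  # optional inputs

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the case is usable."""
        problems = []
        if not self.prompt.strip():
            problems.append("prompt is empty")
        if not self.expected:
            problems.append("no expected outputs or behaviors listed")
        return problems


case = EvalCase(
    name="import-preview",
    prompt="Import this Agent Company into a new environment and show the preview tree.",
    company_path=Path("companies/example-co"),  # hypothetical path
    expected=["preview tree lists every agent", "no unexpected delete actions"],
)
print(case.validate())
```

Validating cases before a run keeps malformed cases from showing up later as confusing failures.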

## Compare against a baseline

Run each case at least two ways:

* with the current company
* without the company, or with the previous version of the company

This tells you whether the company is adding value rather than just consuming more context.
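
A minimal sketch of that comparison, assuming each run is reduced to a pass/fail flag and a token count. The `RunResult` type, the function names, and the sample numbers are all hypothetical and exist only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    case: str
    passed: bool
    total_tokens: int


def summarize(results):
    """Return (pass_rate, mean_tokens) for a non-empty list of RunResult."""
    n = len(results)
    pass_rate = sum(r.passed for r in results) / n
    mean_tokens = sum(r.total_tokens for r in results) / n
    return pass_rate, mean_tokens


def compare(with_company, baseline):
    """Report how the company arm differs from the baseline arm."""
    rate_c, tokens_c = summarize(with_company)
    rate_b, tokens_b = summarize(baseline)
    return {
        "pass_rate_delta": rate_c - rate_b,  # positive means the company helped
        "token_ratio": tokens_c / tokens_b,  # above 1.0 means it cost more context
    }


# Made-up sample data for two cases run in both arms.
with_company = [RunResult("plan", True, 3000), RunResult("import", True, 5000)]
baseline = [RunResult("plan", True, 2000), RunResult("import", False, 2000)]
delta = compare(with_company, baseline)
print(delta)
```

Reporting the pass-rate delta next to the token ratio makes the trade-off explicit: a higher pass rate that costs twice the tokens may or may not be worth it for a given workload.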

## Write objective assertions first

Prefer checks like:

* expected manifests were discovered
* skill shortnames resolved correctly
* import preview shows the intended create or update actions
* a report file or artifact exists
* the output includes required sections

Add human review after that for broader questions like usefulness, clarity, or whether the output reflects the intended company behavior.
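
Two of the checks above (an artifact exists, and the output includes required sections) can be automated with a few lines of standard-library Python. This is a sketch under the assumption that the report is a markdown file with `#`-style headings; the function names and the section titles are hypothetical.

```python
import tempfile
from pathlib import Path


def missing_sections(text, required):
    """Return the required headings that do not appear as markdown headings in text."""
    headings = {
        line.lstrip("#").strip().lower()
        for line in text.splitlines()
        if line.startswith("#")
    }
    return [s for s in required if s.lower() not in headings]


def check_report(path, required):
    """Objective checks: the file exists and every required section is present."""
    if not path.is_file():
        return [f"missing artifact: {path.name}"]
    text = path.read_text(encoding="utf-8")
    return [f"missing section: {s}" for s in missing_sections(text, required)]


# Demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "report.md"
    report.write_text("# Summary\n\ntext\n\n## Risks\n", encoding="utf-8")
    found = check_report(report, ["Summary", "Risks", "Next steps"])
    absent = check_report(Path(tmp) / "nope.md", ["Summary"])

print(found)
print(absent)
```

Returning a list of failure strings rather than a single boolean keeps the grading output readable when a case fails for more than one reason.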

## Track cost and drift

Collect per-run data such as:

* pass rate
* failure category
* duration
* total tokens
* whether the adapter or runtime state matched the company intent

That last point matters because the desired state declared in the manifests may diverge from the actual runtime state, and a run can pass while the company is not configured as intended.
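
A drift check can be as simple as a set comparison. The sketch below assumes both the desired and the actual state can be reduced to a mapping from agent name to a set of skill shortnames; how that data is obtained from manifests or from a runtime is not shown, and the agent and skill names are made up.

```python
def find_drift(desired, actual):
    """Compare desired vs actual skill attachments per agent.

    Both arguments map agent name -> set of skill shortnames.
    Only agents whose state differs appear in the result.
    """
    drift = {}
    for agent in sorted(desired.keys() | actual.keys()):
        want = desired.get(agent, set())
        have = actual.get(agent, set())
        if want != have:
            drift[agent] = {
                "missing": sorted(want - have),     # declared but not attached
                "unexpected": sorted(have - want),  # attached but not declared
            }
    return drift


# Hypothetical states for illustration.
desired = {"planner": {"roadmap", "standup"}, "engineer": {"code-review"}}
actual = {"planner": {"roadmap"}, "engineer": {"code-review"}, "intern": {"standup"}}
drift = find_drift(desired, actual)
print(drift)
```

Recording the drift result with each run makes it possible to separate failures caused by the company design from failures caused by a runtime that never matched the manifests.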

## Use failures to refine the company

Read failures at three levels:

* company design: wrong boundary between company, team, agent, and skill
* instructions: unclear role behavior or missing defaults
* tooling: weak import preview, weak pinning, weak sync visibility

If the same logic is being reinvented in every run, that is usually a sign to improve the company structure, instructions, or bundled references.
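
To see which level dominates, failures can be tallied after each run. This is a small sketch assuming a reviewer has already labeled each failed case with one of the three levels above; the case names are hypothetical.

```python
from collections import Counter

LEVELS = ("company design", "instructions", "tooling")


def tally(failures):
    """Count failures per level, most frequent first.

    `failures` is a list of (case_name, level) pairs.
    """
    counts = Counter(level for _, level in failures)
    unknown = set(counts) - set(LEVELS)
    if unknown:
        raise ValueError(f"unknown failure level(s): {sorted(unknown)}")
    return counts.most_common()


failures = [
    ("recurring-plan", "instructions"),
    ("import-preview", "tooling"),
    ("skill-attach", "instructions"),
]
print(tally(failures))
```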

## The loop

1. run the eval set with and without the company
2. grade objective assertions
3. review outputs and execution traces
4. tighten manifests, descriptions, or bundled resources
5. rerun and compare the delta

Stop when the Agent Company improves outcomes consistently and the extra context cost is justified.
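
The stopping rule can be made explicit so it is applied the same way on every iteration. The sketch below is one possible reading of "consistently" and "justified": the last few iterations must all beat the baseline by a margin while staying under a token budget. The thresholds (`min_delta`, `max_token_ratio`, `window`) are assumptions to tune per project, not recommended values.

```python
def should_stop(deltas, token_ratios, min_delta=0.1, max_token_ratio=2.0, window=3):
    """Decide whether to stop iterating.

    deltas:        pass-rate delta vs baseline, one entry per iteration
    token_ratios:  tokens with company / tokens at baseline, one per iteration
    Stops only when the last `window` iterations all clear both thresholds.
    """
    if len(deltas) < window or len(token_ratios) < window:
        return False
    recent = zip(deltas[-window:], token_ratios[-window:])
    return all(d >= min_delta and t <= max_token_ratio for d, t in recent)


# Made-up history of four iterations.
print(should_stop([0.0, 0.2, 0.2, 0.3], [1.5, 1.5, 1.6, 1.4]))
```

Requiring several consecutive good iterations guards against stopping on a single lucky run.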
