Quality & Evals¶
Overview¶
Quality metrics help package consumers evaluate packages before installation. AAM supports two types of quality indicators:
- **Tests** - Standard automated tests (unit, integration, lint)
- **Evals** - AI-specific evaluations measuring accuracy, latency, and quality

Both are declared in `aam.yaml` and can be run with `aam test` and `aam eval`.
Why Quality Metrics Matter¶
For consumers:
- Choose high-quality packages with confidence
- Compare packages based on objective metrics
- Understand performance characteristics
For authors:
- Demonstrate package quality
- Build trust with users
- Track quality over time
- Catch regressions
Tests¶
Tests are standard automated checks using existing tools (pytest, ruff, mypy, etc.).
Declaring Tests¶
```yaml
# aam.yaml
quality:
  tests:
    - name: "unit-tests"
      command: "pytest tests/ -v"
      description: "Unit tests for skills and agents"
    - name: "lint-check"
      command: "ruff check ."
      description: "Code quality and style checks"
    - name: "type-check"
      command: "mypy src/"
      description: "Static type checking"
```
Running Tests¶
```bash
# Run all declared tests
aam test

# Run a specific test
aam test unit-tests

# Run in CI/CD (fails if any test fails)
aam test --ci
```
Output:
```
Running tests for @author/my-agent@1.0.0...

unit-tests: pytest tests/ -v
  ✓ Passed (2.3s)

lint-check: ruff check .
  ✓ Passed (0.8s)

type-check: mypy src/
  ✓ Passed (1.5s)

All tests passed (3/3)
```
Test Requirements¶
- `command` must be executable from the package root
- Exit code 0 = pass, non-zero = fail
- Output is captured and displayed
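Because the only contract is the exit code, custom checks can be plain scripts. A minimal sketch (the `scripts/check_manifest.sh` name and its logic are illustrative, not part of AAM):

```bash
#!/usr/bin/env bash
# scripts/check_manifest.sh: illustrative custom check for `aam test`.
# Exits non-zero (fail) if aam.yaml has no quality section, zero (pass) otherwise.
set -euo pipefail

if ! grep -q '^quality:' aam.yaml; then
  echo "error: aam.yaml has no quality section" >&2
  exit 1
fi

echo "manifest ok"
```

Such a script would be declared like any other test entry, e.g. `command: "bash scripts/check_manifest.sh"`.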
Evals¶
Evals are AI-specific evaluations that measure quality, accuracy, latency, and other metrics.
Declaring Evals¶
```yaml
# aam.yaml
quality:
  evals:
    - name: "accuracy-eval"
      path: "evals/accuracy.yaml"
      description: "Accuracy on benchmark tasks"
      metrics:
        - name: "accuracy"
          type: "percentage"
          target: 90.0  # Target value
        - name: "latency_p95"
          type: "duration_ms"
          target: 500
```
Eval Definition Files¶
Create eval definitions in the `evals/` directory:
```yaml
# evals/accuracy.yaml
name: accuracy-eval
description: "Evaluate agent accuracy on benchmark dataset"

dataset: evals/benchmark-dataset.jsonl
model: gpt-4                    # Model to evaluate
judge: evals/judge-prompt.md    # LLM-as-judge prompt

tasks:
  - name: "code-review"
    weight: 0.5
    inputs:
      - type: "code"
        source: "dataset.code"
    expected_output:
      - type: "review"
        source: "dataset.expected_review"

metrics:
  - name: "accuracy"
    aggregation: "mean"
    description: "Percentage of correct responses"
  - name: "latency_p95"
    aggregation: "percentile_95"
    description: "95th percentile latency"
```
Benchmark Dataset¶
Each line of `evals/benchmark-dataset.jsonl` supplies the fields referenced by the `dataset.*` sources above:

```jsonl
{"code": "def add(a, b): return a + b", "expected_review": "Looks good"}
{"code": "def divide(a, b): return a / b", "expected_review": "Missing zero check"}
```
LLM-as-Judge Prompt¶
```markdown
<!-- evals/judge-prompt.md -->
Compare the agent's review with the expected review.

Agent review: {agent_output}
Expected review: {expected_output}

Score 1 if the reviews match in substance, 0 otherwise.
```
Running Evals¶
```bash
# Run all declared evals
aam eval

# Run a specific eval
aam eval accuracy-eval

# Run and publish results to the registry
aam eval --publish
```
Output:
```
Running evals for @author/my-agent@1.0.0...

accuracy-eval: Accuracy on benchmark tasks
  → Running 50 tasks...
  ✓ accuracy: 94.2% (target: 90.0%)
  ✓ latency_p95: 245ms (target: 500ms)

Eval passed (2/2 metrics met targets)
```
Publishing Results¶
Publish eval results to the registry with `aam eval --publish` so consumers can see them. Results are attached to the published version and appear in `aam info`:

```bash
aam info @author/my-agent
```

The output includes:

```
Quality Metrics:
  accuracy-eval:
    accuracy: 94.2% (✓ target: 90.0%)
    latency_p95: 245ms (✓ target: 500ms)
  Tests: 3/3 passed
  Last evaluated: 2026-02-09T14:30:00Z
```
Metric Types¶
| Type | Description | Example | Target Format |
|---|---|---|---|
| `percentage` | 0-100 scale | 94.2% | `90.0` |
| `score` | Arbitrary number | 8.5/10 | `8.0` |
| `duration_ms` | Milliseconds | 245ms | `500` |
| `duration_s` | Seconds | 1.2s | `2.0` |
| `boolean` | Pass/fail | true | `true` |
| `count` | Integer count | 42 | `50` |
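As a sketch of how different metric types sit together in one declaration (the eval name, file path, and the `passes_safety_check` metric are illustrative, not defined by AAM):

```yaml
# aam.yaml (illustrative metric declarations)
quality:
  evals:
    - name: "mixed-metrics"          # hypothetical eval
      path: "evals/mixed.yaml"       # hypothetical eval definition file
      metrics:
        - name: "accuracy"
          type: "percentage"         # 0-100 scale
          target: 90.0
        - name: "latency_p95"
          type: "duration_ms"        # milliseconds
          target: 500
        - name: "passes_safety_check"
          type: "boolean"            # pass/fail
          target: true
```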
Quality Badges¶
Display quality badges in your package:
```markdown
<!-- README.md -->


```

Badges are generated by the HTTP registry from the latest eval results.
CI/CD Integration¶
Run Tests in CI¶
```yaml
# .github/workflows/test.yml
name: Test

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install AAM
        run: pip install aam
      - name: Run tests
        run: aam test --ci
```
Run Evals in CI¶
```yaml
# .github/workflows/eval.yml
name: Eval

on:
  push:
    branches: [main]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install AAM
        run: pip install aam
      - name: Run evals
        run: aam eval
      - name: Publish results
        if: success()
        run: aam eval --publish
        env:
          AAM_TOKEN: ${{ secrets.AAM_TOKEN }}
```
Publish with Quality Gate¶
```yaml
# .github/workflows/publish.yml
name: Publish

on:
  release:
    types: [created]

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install AAM
        run: pip install aam
      - name: Run tests
        run: aam test --ci
      - name: Run evals
        run: aam eval --ci   # Fail if targets are not met
      - name: Publish package
        if: success()
        run: aam pkg publish --sign
        env:
          AAM_TOKEN: ${{ secrets.AAM_TOKEN }}
```
Best Practices¶
For Tests¶
- **Include multiple test types:**
    - Unit tests for code correctness
    - Lint for code quality
    - Type checking for safety
- **Make tests fast:**
    - Tests run on every `aam test`
    - Keep under 30 seconds total
- **Fail clearly:**
    - Exit non-zero on failure
    - Provide useful error messages
For Evals¶
- **Use representative datasets:**
    - Cover common use cases
    - Include edge cases
    - Keep the dataset version-controlled
- **Set realistic targets:**
    - Base them on actual performance
    - Don't make targets too easy or too hard
    - Update targets as you improve
- **Publish regularly:**
    - Publish results for each version
    - Track quality over time
    - Build trust with consumers
For Consumers¶
- **Check quality before installing** (see the sketch after this list)
- **Compare packages** based on their published metrics
- **Verify recent evaluation:**
    - Check the "Last evaluated" timestamp
    - Outdated evals may not reflect current quality
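A quick pre-install check using the `aam info` command shown earlier (the second package name is a placeholder for whatever alternative you are comparing against):

```bash
# Inspect published quality metrics before installing
aam info @author/my-agent

# Compare with an alternative package (placeholder name)
aam info @other/alternative-agent
```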
Examples¶
Simple Testing Setup¶
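A minimal setup might declare a single test; this sketch assumes a pytest-based project:

```yaml
# aam.yaml
quality:
  tests:
    - name: "tests"
      command: "pytest tests/"
      description: "Run the test suite"
```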
Comprehensive Testing¶
```yaml
# aam.yaml
quality:
  tests:
    - name: "unit-tests"
      command: "pytest tests/unit/ -v"
      description: "Unit tests"
    - name: "integration-tests"
      command: "pytest tests/integration/ -v"
      description: "Integration tests"
    - name: "lint"
      command: "ruff check ."
      description: "Linting"
    - name: "format"
      command: "black --check ."
      description: "Code formatting"
    - name: "types"
      command: "mypy src/"
      description: "Type checking"
    - name: "security"
      command: "bandit -r src/"
      description: "Security checks"
```
Agent Evaluation¶
```yaml
# aam.yaml
quality:
  evals:
    - name: "accuracy"
      path: "evals/accuracy.yaml"
      description: "Accuracy on code review tasks"
      metrics:
        - name: "accuracy"
          type: "percentage"
          target: 90.0
        - name: "false_positive_rate"
          type: "percentage"
          target: 5.0
```
Troubleshooting¶
Tests Failing¶
**Symptom:** `aam test` returns a non-zero exit code.

**Solution:**
```bash
# Run tests individually
aam test unit-tests
aam test lint-check

# Check the test commands
cat aam.yaml  # Verify test commands

# Run the command manually
pytest tests/ -v
```
Evals Not Running¶
**Symptom:** `aam eval` reports "No evals defined".

**Solution:** Check that `aam.yaml` declares evals under `quality.evals`.
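A minimal declaration, mirroring the schema used earlier; make sure the file referenced by `path` actually exists:

```yaml
# aam.yaml
quality:
  evals:
    - name: "accuracy-eval"
      path: "evals/accuracy.yaml"
      description: "Accuracy on benchmark tasks"
```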
Results Not Publishing¶
**Symptom:** `aam eval --publish` fails.

**Causes:**

- Not logged in
- Not the package owner
- Network error

**Solutions:** Authenticate with a valid registry token, confirm you own the package, and retry once the network is reachable (see the sketch below).
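A rough retry sketch, assuming the CLI reads the `AAM_TOKEN` environment variable as in the CI workflows above:

```bash
# Provide a registry token (assumption: the CLI honors AAM_TOKEN outside CI too)
export AAM_TOKEN="<your-registry-token>"

# Retry publishing the eval results
aam eval --publish
```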
Metric Target Not Met¶
**Symptom:** An eval fails because a metric doesn't meet its target.

**Solution:** Either improve the package or adjust the targets in `aam.yaml`.
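For example, a target could be lowered to match measured performance (the values here are illustrative):

```yaml
# aam.yaml
quality:
  evals:
    - name: "accuracy-eval"
      path: "evals/accuracy.yaml"
      metrics:
        - name: "accuracy"
          type: "percentage"
          target: 85.0   # lowered from 90.0 to reflect measured performance
```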
Next Steps¶
- HTTP Registry - Publish eval results
- Package Signing - Sign quality-assured packages
- Dist-Tags - Tag high-quality versions
- Configuration Reference - Quality section details