mirror of https://github.com/harivansh-afk/sandbox-agent.git synced 2026-04-15 18:01:30 +00:00

Nathan Flurry b49776145b fix: add docker-setup action, runtime Dockerfile, and align release workflow

- Add .github/actions/docker-setup composite action (from rivet)
- Add docker/runtime/Dockerfile for Docker image builds
- Update release.yaml to match rivet patterns:
  - Use corepack enable instead of pnpm/action-setup
  - Add reuse_engine_version input
  - Add Docker job with Depot runners
  - Use --no-frozen-lockfile for pnpm install
  - Add id-token permission for setup job

2026-01-27 19:29:54 -08:00

4.8 KiB

Raw Blame History

Server Testing

Test placement

Place all new tests under server/packages/**/tests/ (or a package-specific tests/ folder). Avoid inline tests inside source files unless there is no viable alternative.

Test locations (overview)

Sandbox-agent integration tests live under server/packages/sandbox-agent/tests/:
- Agent flow coverage in agent-flows/
- Agent management coverage in agent-management/
- Shared server manager coverage in server-manager/
- HTTP endpoint snapshots in http/ (snapshots in http/snapshots/)
- Session capability snapshots in sessions/ (one file per capability, e.g. session_lifecycle.rs, permissions.rs, questions.rs, reasoning.rs, status.rs; snapshots in sessions/snapshots/)
- UI coverage in ui/
- Shared helpers in common/
Extracted agent schema roundtrip tests live under server/packages/extracted-agent-schemas/tests/

Snapshot tests

HTTP endpoint snapshot entrypoint:

server/packages/sandbox-agent/tests/http_endpoints.rs

Session snapshot entrypoint:

server/packages/sandbox-agent/tests/sessions.rs

Snapshots are written to:

server/packages/sandbox-agent/tests/http/snapshots/ (HTTP endpoint snapshots)
server/packages/sandbox-agent/tests/sessions/snapshots/ (session/capability snapshots)

Agent selection

SANDBOX_TEST_AGENTS controls which agents run. It accepts a comma-separated list or all. If it is not set, tests will auto-detect installed agents by checking:

binaries on PATH, and
the default install dir ($XDG_DATA_HOME/sandbox-agent/bin or ./.sandbox-agent/bin)

If no agents are found, tests fail with a clear error.

Credential handling

Credentials are pulled from the host by default via extract_all_credentials:

environment variables (e.g. ANTHROPIC_API_KEY, OPENAI_API_KEY)
local CLI configs (Claude/Codex/Amp/OpenCode)

You can override host credentials for tests with:

SANDBOX_TEST_ANTHROPIC_API_KEY
SANDBOX_TEST_OPENAI_API_KEY

If SANDBOX_TEST_AGENTS includes an agent that requires a provider credential and it is missing, tests fail before starting.

Credential health checks

Before running agent tests, credentials are validated with minimal API calls:

Anthropic: GET https://api.anthropic.com/v1/models
- x-api-key for API keys
- Authorization: Bearer for OAuth tokens
- anthropic-version: 2023-06-01
OpenAI: GET https://api.openai.com/v1/models with Authorization: Bearer

401/403 yields a hard failure (invalid credentials). Other non-2xx responses or network errors fail with a health-check error.

Health checks run in a blocking thread to avoid Tokio runtime drop errors inside async tests.

Snapshot stability

To keep snapshots deterministic:

Use the mock agent as the master event sequence; all other agents must match its behavior 1:1.
Snapshots should compare a canonical event skeleton (event order matters) with strict ordering across:
- item.started → item.delta → item.completed
- presence/absence of session.ended
- permission/question request and resolution flows
Scrub non-deterministic fields from snapshots:
- IDs, timestamps, native IDs
- text content, tool inputs/outputs, provider-specific metadata
- source and synthetic flags (these are implementation details)
Scrub reasoning and status content from session-baseline snapshots to keep the core event skeleton consistent across agents; validate those content types separately in their capability-specific tests.
The sandbox-agent is responsible for emitting synthetic events so that real agents match the mock sequence exactly.
Event streams are truncated after the first assistant or error event.
Permission flow snapshots are truncated after the permission request (or first assistant) event.
Unknown events are preserved as kind: unknown (raw payload in universal schema).
Prefer snapshot-based event skeleton assertions over manual event-order assertions in tests.
Never update snapshots based on any agent that is not the mock agent. The mock agent is the source of truth for snapshots; other agents must be compared against the mock snapshots without regenerating them.
Agent-specific endpoints keep per-agent snapshots; any session-related snapshots must use the mock baseline as the single source of truth.

Typical commands

Run only Claude session snapshots:

SANDBOX_TEST_AGENTS=claude cargo test -p sandbox-agent --test sessions

Run all detected session snapshots:

cargo test -p sandbox-agent --test sessions

Run HTTP endpoint snapshots:

cargo test -p sandbox-agent --test http_endpoints

Universal Schema

When modifying agent conversion code in server/packages/universal-agent-schema/src/agents/ or adding/changing properties on the universal schema, update the feature matrix in README.md to reflect which agents support which features.

4.8 KiB Raw Blame History