Strategy Testing
Multi-episode testing and the recommendation engine
Testing runs your strategy against multiple randomized historical periods and gives you aggregate performance statistics. It is the step between creating a strategy and deploying it live.
What Testing Does
A single backtest on one time period can be misleading — the strategy might simply have been lucky in that particular window. Multi-episode testing addresses this.
When you start a test run with N episodes, the platform:
- Randomly selects N non-overlapping date ranges within your specified window
- Runs a full backtest for each episode (same strategy definition, different dates)
- Aggregates the results across all episodes
- Runs the Recommendation Engine on the aggregated data
- Saves per-episode metrics and per-pair breakdowns
The result is a statistically meaningful picture of how your strategy behaves across different market conditions.
Test runs are executed by Celery workers. Depending on episode count and date range, a run may take seconds to several minutes. Poll for completion or watch the UI.
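Conceptually, the episode selection in the first step can be pictured as greedy rejection sampling over candidate start dates. The sketch below is illustrative only (`sample_episodes` is a hypothetical helper, not the platform's actual sampler):

```python
import random
from datetime import date, timedelta

def sample_episodes(start: date, end: date, n: int, duration_days: int,
                    seed=None) -> list[tuple[date, date]]:
    """Greedily sample n non-overlapping [start, end) windows of duration_days."""
    rng = random.Random(seed)
    latest_start = (end - start).days - duration_days
    if latest_start < 0:
        raise ValueError("date range is shorter than one episode")
    episodes: list[tuple[date, date]] = []
    candidates = list(range(latest_start + 1))
    rng.shuffle(candidates)
    for offset in candidates:
        ep_start = start + timedelta(days=offset)
        ep_end = ep_start + timedelta(days=duration_days)
        # keep only windows that don't overlap an already-chosen episode
        if all(ep_end <= s or ep_start >= e for s, e in episodes):
            episodes.append((ep_start, ep_end))
            if len(episodes) == n:
                return sorted(episodes)
    raise ValueError(f"could only fit {len(episodes)} non-overlapping episodes")
```

Each accepted window excludes nearby start dates, so a wide date range relative to episodes × duration keeps the sampler from running out of room.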
Running a Test
```python
from agentexchange import AgentExchangeClient
import time

client = AgentExchangeClient(api_key="ak_live_...")

# Start the test
test = client.run_test(
    strategy_id="strat_abc123",
    version=1,
    episodes=20,
    date_range={"start": "2023-01-01", "end": "2025-08-01"},
    episode_duration_days=30,
)

# Poll for completion
while True:
    status = client.get_test_status("strat_abc123", test["test_run_id"])
    print(f"Progress: {status['progress_pct']:.0f}%")
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(5)

# Read results
results = client.get_test_results("strat_abc123", test["test_run_id"])
```

```bash
# Start a test run
curl -X POST http://localhost:8000/api/v1/strategies/strat_abc123/test \
  -H "Authorization: Bearer $JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "version": 1,
    "episodes": 20,
    "date_range": {"start": "2023-01-01", "end": "2025-08-01"},
    "episode_duration_days": 30
  }'

# Poll status
curl http://localhost:8000/api/v1/strategies/strat_abc123/tests/{test_id} \
  -H "Authorization: Bearer $JWT"

# Get results
curl http://localhost:8000/api/v1/strategies/strat_abc123/test-results \
  -H "Authorization: Bearer $JWT"
```

Test Configuration
| Parameter | Default | Description |
|---|---|---|
| version | required | Strategy version number to test |
| episodes | 10 | Number of test episodes to run |
| date_range.start | required | Earliest date for episode selection |
| date_range.end | required | Latest date for episode selection |
| episode_duration_days | 30 | Length of each episode in days |
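Because episodes are non-overlapping, the date range must span at least episodes × episode_duration_days days. A quick client-side pre-flight check (an assumed constraint inferred from the non-overlap rule, not an official API validation):

```python
from datetime import date

def episodes_fit(start: str, end: str, episodes: int, duration_days: int) -> bool:
    """True if `episodes` non-overlapping windows of `duration_days` fit in [start, end)."""
    span_days = (date.fromisoformat(end) - date.fromisoformat(start)).days
    return span_days >= episodes * duration_days
```

For example, 20 episodes of 30 days need at least 600 days of history in the window.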
Monitoring Test Progress
GET /api/v1/strategies/{id}/tests/{test_id}

```json
{
  "test_run_id": "run_xyz789",
  "strategy_id": "strat_abc123",
  "version": 1,
  "status": "running",
  "progress_pct": 65.0,
  "episodes_completed": 13,
  "episodes_total": 20
}
```
Status values: pending, running, completed, failed, cancelled.
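A bare polling loop can hang forever if a run stalls; a sketch of a wait helper with a deadline (`wait_for_test` is not part of the SDK, it just wraps the `get_test_status` call shown earlier):

```python
import time

def wait_for_test(client, strategy_id: str, test_run_id: str,
                  timeout_s: float = 600.0, poll_s: float = 5.0) -> dict:
    """Poll until the run reaches a terminal status; raise if the deadline passes."""
    terminal = {"completed", "failed", "cancelled"}
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = client.get_test_status(strategy_id, test_run_id)
        if status["status"] in terminal:
            return status
        time.sleep(poll_s)
    raise TimeoutError(f"test run {test_run_id} not finished after {timeout_s}s")
```

Using `time.monotonic()` keeps the deadline immune to wall-clock adjustments during long runs.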
Test Results
Once status is completed, fetch the full results:
```json
{
  "test_run_id": "run_xyz789",
  "strategy_id": "strat_abc123",
  "version": 1,
  "status": "completed",
  "results": {
    "episodes_completed": 20,
    "episodes_profitable": 14,
    "episodes_profitable_pct": 70.0,
    "avg_roi_pct": 4.2,
    "median_roi_pct": 3.8,
    "best_roi_pct": 12.1,
    "worst_roi_pct": -4.5,
    "std_roi_pct": 3.1,
    "avg_sharpe": 1.4,
    "avg_max_drawdown_pct": 6.8,
    "avg_trades_per_episode": 18,
    "total_trades": 360
  },
  "by_pair": [
    {
      "symbol": "BTCUSDT",
      "avg_roi_pct": 5.1,
      "avg_sharpe": 1.6,
      "episodes_profitable_pct": 75.0
    },
    {
      "symbol": "ETHUSDT",
      "avg_roi_pct": 3.3,
      "avg_sharpe": 1.2,
      "episodes_profitable_pct": 65.0
    }
  ],
  "recommendations": [
    "ETHUSDT underperforms BTCUSDT by 1.8% avg ROI — consider removing it",
    "TP/SL ratio is 2.7:1 — good risk/reward balance"
  ]
}
```
Aggregate Metrics
| Metric | Description |
|---|---|
| episodes_completed | Number of episodes that ran to completion |
| episodes_profitable | Episodes with positive final ROI |
| episodes_profitable_pct | Win rate across episodes |
| avg_roi_pct | Average ROI across all episodes |
| median_roi_pct | Median ROI — less sensitive to outliers |
| best_roi_pct / worst_roi_pct | Best and worst single-episode ROI |
| std_roi_pct | Standard deviation of ROI — measures consistency |
| avg_sharpe | Average Sharpe ratio across episodes |
| avg_max_drawdown_pct | Average worst drawdown per episode |
| avg_trades_per_episode | Average trade count per episode |
| total_trades | Total trades across all episodes |
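The ROI aggregates in the table are straightforward to recompute from per-episode ROIs, which is handy for sanity-checking results or building custom reports. A sketch whose field names mirror the response above (assumes at least one completed episode):

```python
import statistics

def aggregate(rois: list[float]) -> dict:
    """Recompute the headline ROI aggregates from per-episode ROI percentages."""
    profitable = sum(1 for r in rois if r > 0)
    return {
        "episodes_completed": len(rois),
        "episodes_profitable": profitable,
        "episodes_profitable_pct": 100.0 * profitable / len(rois),
        "avg_roi_pct": statistics.mean(rois),
        "median_roi_pct": statistics.median(rois),
        "best_roi_pct": max(rois),
        "worst_roi_pct": min(rois),
        # sample standard deviation; a single episode has no spread to measure
        "std_roi_pct": statistics.stdev(rois) if len(rois) > 1 else 0.0,
    }
```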
Per-Pair Breakdown
Results also include per-pair performance, so you can identify which pairs in your pairs list are contributing to results and which are dragging them down. Each entry carries the same metrics as the aggregate, grouped by symbol.
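A common use of the breakdown is flagging laggards automatically. The sketch below applies a 5% ROI-disparity threshold, the same figure the Recommendation Engine uses for pair disparity; the helper name and default are illustrative:

```python
def lagging_pairs(by_pair: list[dict], roi_gap_pct: float = 5.0) -> list[str]:
    """Symbols whose avg ROI trails the best pair's by more than roi_gap_pct."""
    best = max(p["avg_roi_pct"] for p in by_pair)
    return [p["symbol"] for p in by_pair if best - p["avg_roi_pct"] > roi_gap_pct]
```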
The Recommendation Engine
After a test run completes, the Recommendation Engine analyzes the results and generates plain-English suggestions. There are 11 rules:
| Trigger | Recommendation |
|---|---|
| Pair ROI disparity > 5% | Remove the underperforming pair |
| Win rate < 50% | Tighten entry conditions or widen take-profit |
| Win rate > 75% | Relax entry conditions to capture more opportunities |
| Max drawdown > 15% | Tighten stop-loss |
| Max drawdown < 3% | Stop-loss may be too tight — consider loosening |
| Avg trades < 3 per episode | Entry conditions too restrictive — loosen them |
| Avg trades > 50 per episode | Add ADX filter to reduce overtrading |
| Sharpe < 0.5 | Reduce position size or improve entry timing |
| ADX threshold > 30 | Consider lowering to 20–25 |
| ADX threshold < 15 | Raise ADX threshold to 20+ for better trend filtering |
| TP/SL ratio < 1.5:1 | Widen take-profit or tighten stop-loss |
```python
results = client.get_test_results(strategy_id, test_run_id)
for rec in results["recommendations"]:
    print(f"  - {rec}")
```
Recommendations are advisory — you decide whether to apply them. Create a new version for each change and compare test results before committing to a direction.
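Mechanically, each rule is a threshold check against the aggregate results. A sketch of a few of them, with thresholds taken from the table above (this is not the engine's actual code, and the wording is illustrative):

```python
def recommend(results: dict) -> list[str]:
    """Apply a handful of the engine's threshold rules to aggregate results."""
    r = results["results"]
    recs = []
    if r["episodes_profitable_pct"] < 50:
        recs.append("Win rate < 50%: tighten entry conditions or widen take-profit")
    if r["avg_max_drawdown_pct"] > 15:
        recs.append("Max drawdown > 15%: tighten stop-loss")
    if r["avg_sharpe"] < 0.5:
        recs.append("Sharpe < 0.5: reduce position size or improve entry timing")
    if r["avg_trades_per_episode"] < 3:
        recs.append("Entry conditions too restrictive: loosen them")
    return recs
```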
Comparing Versions
After testing multiple versions, compare them side by side:
```python
comparison = client.compare_versions(
    strategy_id="strat_abc123",
    v1=1,
    v2=2,
)

print(comparison["v1"])            # aggregate metrics for version 1
print(comparison["v2"])            # aggregate metrics for version 2
print(comparison["improvements"])  # % improvement per metric
print(comparison["verdict"])       # e.g. "Version 2 outperforms on 3/4 metrics"
```

GET /api/v1/strategies/strat_abc123/compare-versions?v1=1&v2=2

The Testing Workflow
```text
1. Create strategy (version 1)
        |
        v
2. Run test (20 episodes over a multi-year date range)
        |
        v
3. Check aggregate results:
   - episodes_profitable_pct > 60%? Good baseline.
   - avg_sharpe > 1.0? Acceptable risk-adjusted return.
   - avg_max_drawdown_pct < 10%? Manageable risk.
        |
        v
4. Read recommendations
        |
        v
5. Create version 2 with improvements
        |
        v
6. Run test on version 2 with same date range
        |
        v
7. Compare versions → deploy the winner
```
Always test on the same date range when comparing versions. Different periods introduce market regime bias and make comparisons meaningless.
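The comparison verdict in step 7 can also be recomputed locally from two aggregate metric dicts; a sketch assuming higher-is-better metrics (`compare_metrics` is a hypothetical helper, not the SDK's `compare_versions`):

```python
def compare_metrics(m1: dict, m2: dict,
                    metrics=("avg_roi_pct", "avg_sharpe")) -> dict:
    """Per-metric % change from version 1 to version 2, plus a win-count verdict."""
    wins = sum(1 for m in metrics if m2[m] > m1[m])
    improvements = {m: round(100.0 * (m2[m] - m1[m]) / abs(m1[m]), 1)
                    for m in metrics if m1[m] != 0}
    return {"improvements": improvements,
            "verdict": f"Version 2 outperforms on {wins}/{len(metrics)} metrics"}
```

Metrics where lower is better, such as avg_max_drawdown_pct, would need their comparison direction flipped before being counted as wins.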
Test Endpoint Reference
| Method | Path | Description |
|---|---|---|
| POST | /api/v1/strategies/{id}/test | Start a test run |
| GET | /api/v1/strategies/{id}/tests | List all test runs |
| GET | /api/v1/strategies/{id}/tests/{test_id} | Get status and results for a run |
| POST | /api/v1/strategies/{id}/tests/{test_id}/cancel | Cancel a running test |
| GET | /api/v1/strategies/{id}/test-results | Latest completed test results |
| GET | /api/v1/strategies/{id}/compare-versions?v1=N&v2=M | Side-by-side version comparison |
Next Steps
- Deploying Strategies — deploy a validated strategy to live trading
- Gymnasium Environments — train an RL agent to optimize beyond rule-based conditions