Strategy Testing
Multi-episode testing and the recommendation engine
Testing runs your strategy against multiple randomized historical periods and gives you aggregate performance statistics. It is the step between creating a strategy and deploying it live.
What Testing Does
A single backtest on one time period can be misleading — the strategy might simply have been lucky in that particular window. Multi-episode testing addresses this.
When you start a test run with N episodes, the platform:
- Randomly selects N non-overlapping date ranges within your specified window
- Runs a full backtest for each episode (same strategy definition, different dates)
- Aggregates the results across all episodes
- Runs the Recommendation Engine on the aggregated data
- Saves per-episode metrics and per-pair breakdowns
The result is a statistically meaningful picture of how your strategy behaves across different market conditions.
Test runs are executed by Celery workers. Depending on episode count and date range, a run may take seconds to several minutes. Poll for completion or watch the UI.
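Conceptually, the episode selection in the first step can be pictured as greedy rejection sampling over candidate start dates. The sketch below is illustrative only (`sample_episodes` is a hypothetical helper, not the platform's actual sampler):

```python
import random
from datetime import date, timedelta

def sample_episodes(start: date, end: date, n: int, duration_days: int,
                    seed=None) -> list[tuple[date, date]]:
    """Greedily sample n non-overlapping [start, end) windows of duration_days."""
    rng = random.Random(seed)
    latest_start = (end - start).days - duration_days
    if latest_start < 0:
        raise ValueError("date range is shorter than one episode")
    episodes: list[tuple[date, date]] = []
    candidates = list(range(latest_start + 1))
    rng.shuffle(candidates)
    for offset in candidates:
        ep_start = start + timedelta(days=offset)
        ep_end = ep_start + timedelta(days=duration_days)
        # keep only windows that don't overlap an already-chosen episode
        if all(ep_end <= s or ep_start >= e for s, e in episodes):
            episodes.append((ep_start, ep_end))
            if len(episodes) == n:
                return sorted(episodes)
    raise ValueError(f"could only fit {len(episodes)} non-overlapping episodes")
```

Each accepted window excludes nearby start dates, so a wide date range relative to episodes × duration keeps the sampler from running out of room.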
Running a Test
```python
from agentexchange import AgentExchangeClient
import time

client = AgentExchangeClient(api_key="ak_live_...")

# Start the test
test = client.run_test(
    strategy_id="strat_abc123",
    version=1,
    episodes=20,
    date_range={"start": "2023-01-01", "end": "2025-08-01"},
    episode_duration_days=30,
)

# Poll for completion
while True:
    status = client.get_test_status("strat_abc123", test["test_run_id"])
    print(f"Progress: {status['progress_pct']:.0f}%")
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(5)

# Read results
results = client.get_test_results("strat_abc123", test["test_run_id"])
```

```bash
# Start a test run
curl -X POST http://localhost:8000/api/v1/strategies/strat_abc123/test \
  -H "Authorization: Bearer $JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "version": 1,
    "episodes": 20,
    "date_range": {"start": "2023-01-01", "end": "2025-08-01"},
    "episode_duration_days": 30
  }'

# Poll status
curl http://localhost:8000/api/v1/strategies/strat_abc123/tests/{test_id} \
  -H "Authorization: Bearer $JWT"

# Get results
curl http://localhost:8000/api/v1/strategies/strat_abc123/test-results \
  -H "Authorization: Bearer $JWT"
```

Test Configuration
| Parameter | Default | Description |
|---|---|---|
| version | required | Strategy version number to test |
| episodes | 10 | Number of test episodes to run |
| date_range.start | required | Earliest date for episode selection |
| date_range.end | required | Latest date for episode selection |
| episode_duration_days | 30 | Length of each episode in days |
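Because episodes are non-overlapping, the date range must span at least episodes × episode_duration_days days. A quick client-side pre-flight check (an assumed constraint inferred from the non-overlap rule, not an official API validation):

```python
from datetime import date

def episodes_fit(start: str, end: str, episodes: int, duration_days: int) -> bool:
    """True if `episodes` non-overlapping windows of `duration_days` fit in [start, end)."""
    span_days = (date.fromisoformat(end) - date.fromisoformat(start)).days
    return span_days >= episodes * duration_days
```

For example, 20 episodes of 30 days need at least 600 days of history in the window.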
Monitoring Test Progress
GET /api/v1/strategies/{id}/tests/{test_id}

```json
{
  "test_run_id": "run_xyz789",
  "strategy_id": "strat_abc123",
  "version": 1,
  "status": "running",
  "progress_pct": 65.0,
  "episodes_completed": 13,
  "episodes_total": 20
}
```
Status values: pending, running, completed, failed, cancelled.
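A bare polling loop can hang forever if a run stalls; a sketch of a wait helper with a deadline (`wait_for_test` is not part of the SDK, it just wraps the `get_test_status` call shown earlier):

```python
import time

def wait_for_test(client, strategy_id: str, test_run_id: str,
                  timeout_s: float = 600.0, poll_s: float = 5.0) -> dict:
    """Poll until the run reaches a terminal status; raise if the deadline passes."""
    terminal = {"completed", "failed", "cancelled"}
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = client.get_test_status(strategy_id, test_run_id)
        if status["status"] in terminal:
            return status
        time.sleep(poll_s)
    raise TimeoutError(f"test run {test_run_id} not finished after {timeout_s}s")
```

Using `time.monotonic()` keeps the deadline immune to wall-clock adjustments during long runs.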
Test Results
Once status is completed, fetch the full results:
```json
{
  "test_run_id": "run_xyz789",
  "strategy_id": "strat_abc123",
  "version": 1,
  "status": "completed",
  "results": {
    "episodes_completed": 20,
    "episodes_profitable": 14,
    "episodes_profitable_pct": 70.0,
    "avg_roi_pct": 4.2,
    "median_roi_pct": 3.8,
    "best_roi_pct": 12.1,
    "worst_roi_pct": -4.5,
    "std_roi_pct": 3.1,
    "avg_sharpe": 1.4,
    "avg_max_drawdown_pct": 6.8,
    "avg_trades_per_episode": 18,
    "total_trades": 360
  },
  "by_pair": [
    {
      "symbol": "BTCUSDT",
      "avg_roi_pct": 5.1,
      "avg_sharpe": 1.6,
      "episodes_profitable_pct": 75.0
    },
    {
      "symbol": "ETHUSDT",
      "avg_roi_pct": 3.3,
      "avg_sharpe": 1.2,
      "episodes_profitable_pct": 65.0
    }
  ],
  "recommendations": [
    "ETHUSDT underperforms BTCUSDT by 1.8% avg ROI — consider removing it",
    "TP/SL ratio is 2.7:1 — good risk/reward balance"
  ]
}
```
Aggregate Metrics
| Metric | Description |
|---|---|
| episodes_completed | Number of episodes that ran to completion |
| episodes_profitable | Episodes with positive final ROI |
| episodes_profitable_pct | Win rate across episodes |
| avg_roi_pct | Average ROI across all episodes |
| median_roi_pct | Median ROI — less sensitive to outliers |
| best_roi_pct / worst_roi_pct | Best and worst single-episode ROI |
| std_roi_pct | Standard deviation of ROI — measures consistency |
| avg_sharpe | Average Sharpe ratio across episodes |
| avg_max_drawdown_pct | Average worst drawdown per episode |
| avg_trades_per_episode | Average trade count per episode |
| total_trades | Total trades across all episodes |
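The ROI aggregates in the table are straightforward to recompute from per-episode ROIs, which is handy for sanity-checking results or building custom reports. A sketch whose field names mirror the response above (assumes at least one completed episode):

```python
import statistics

def aggregate(rois: list[float]) -> dict:
    """Recompute the headline ROI aggregates from per-episode ROI percentages."""
    profitable = sum(1 for r in rois if r > 0)
    return {
        "episodes_completed": len(rois),
        "episodes_profitable": profitable,
        "episodes_profitable_pct": 100.0 * profitable / len(rois),
        "avg_roi_pct": statistics.mean(rois),
        "median_roi_pct": statistics.median(rois),
        "best_roi_pct": max(rois),
        "worst_roi_pct": min(rois),
        # sample standard deviation; a single episode has no spread to measure
        "std_roi_pct": statistics.stdev(rois) if len(rois) > 1 else 0.0,
    }
```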
Per-Pair Breakdown
Results also include per-pair performance, so you can identify which pairs in your pairs list are contributing to results and which are dragging them down. Each entry carries the same metrics as the aggregate, grouped by symbol.
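A common use of the breakdown is flagging laggards automatically. The sketch below applies a 5% ROI-disparity threshold, the same figure the Recommendation Engine uses for pair disparity; the helper name and default are illustrative:

```python
def lagging_pairs(by_pair: list[dict], roi_gap_pct: float = 5.0) -> list[str]:
    """Symbols whose avg ROI trails the best pair's by more than roi_gap_pct."""
    best = max(p["avg_roi_pct"] for p in by_pair)
    return [p["symbol"] for p in by_pair if best - p["avg_roi_pct"] > roi_gap_pct]
```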
The Recommendation Engine
After a test run completes, the Recommendation Engine analyzes the results and generates plain-English suggestions. There are 11 rules:
| Trigger | Recommendation |
|---|---|
| Pair ROI disparity > 5% | Remove the underperforming pair |
| Win rate < 50% | Tighten entry conditions or widen take-profit |
| Win rate > 75% | Relax entry conditions to capture more opportunities |
| Max drawdown > 15% | Tighten stop-loss |
| Max drawdown < 3% | Stop-loss may be too tight — consider loosening |
| Avg trades < 3 per episode | Entry conditions too restrictive — loosen them |
| Avg trades > 50 per episode | Add ADX filter to reduce overtrading |
| Sharpe < 0.5 | Reduce position size or improve entry timing |
| ADX threshold > 30 | Consider lowering to 20–25 |
| ADX threshold < 15 | Raise ADX threshold to 20+ for better trend filtering |
| TP/SL ratio < 1.5:1 | Widen take-profit or tighten stop-loss |
```python
results = client.get_test_results(strategy_id, test_run_id)
for rec in results["recommendations"]:
    print(f"  - {rec}")
```
Recommendations are advisory — you decide whether to apply them. Create a new version for each change and compare test results before committing to a direction.
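Mechanically, each rule is a threshold check against the aggregate results. A sketch of a few of them, with thresholds taken from the table above (this is not the engine's actual code, and the wording is illustrative):

```python
def recommend(results: dict) -> list[str]:
    """Apply a handful of the engine's threshold rules to aggregate results."""
    r = results["results"]
    recs = []
    if r["episodes_profitable_pct"] < 50:
        recs.append("Win rate < 50%: tighten entry conditions or widen take-profit")
    if r["avg_max_drawdown_pct"] > 15:
        recs.append("Max drawdown > 15%: tighten stop-loss")
    if r["avg_sharpe"] < 0.5:
        recs.append("Sharpe < 0.5: reduce position size or improve entry timing")
    if r["avg_trades_per_episode"] < 3:
        recs.append("Entry conditions too restrictive: loosen them")
    return recs
```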
Comparing Versions
After testing multiple versions, compare them side by side:
```python
comparison = client.compare_versions(
    strategy_id="strat_abc123",
    v1=1,
    v2=2,
)

print(comparison["v1"])            # aggregate metrics for version 1
print(comparison["v2"])            # aggregate metrics for version 2
print(comparison["improvements"])  # % improvement per metric
print(comparison["verdict"])       # e.g. "Version 2 outperforms on 3/4 metrics"
```

GET /api/v1/strategies/strat_abc123/compare-versions?v1=1&v2=2

The Testing Workflow
```text
1. Create strategy (version 1)
        |
        v
2. Run test (20 episodes over a multi-year date range)
        |
        v
3. Check aggregate results:
   - episodes_profitable_pct > 60%? Good baseline.
   - avg_sharpe > 1.0? Acceptable risk-adjusted return.
   - avg_max_drawdown_pct < 10%? Manageable risk.
        |
        v
4. Read recommendations
        |
        v
5. Create version 2 with improvements
        |
        v
6. Run test on version 2 with same date range
        |
        v
7. Compare versions → deploy the winner
```
Always test on the same date range when comparing versions. Different periods introduce market regime bias and make comparisons meaningless.
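The comparison verdict in step 7 can also be recomputed locally from two aggregate metric dicts; a sketch assuming higher-is-better metrics (`compare_metrics` is a hypothetical helper, not the SDK's `compare_versions`):

```python
def compare_metrics(m1: dict, m2: dict,
                    metrics=("avg_roi_pct", "avg_sharpe")) -> dict:
    """Per-metric % change from version 1 to version 2, plus a win-count verdict."""
    wins = sum(1 for m in metrics if m2[m] > m1[m])
    improvements = {m: round(100.0 * (m2[m] - m1[m]) / abs(m1[m]), 1)
                    for m in metrics if m1[m] != 0}
    return {"improvements": improvements,
            "verdict": f"Version 2 outperforms on {wins}/{len(metrics)} metrics"}
```

Metrics where lower is better, such as avg_max_drawdown_pct, would need their comparison direction flipped before being counted as wins.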
Test Endpoint Reference
| Method | Path | Description |
|---|---|---|
| POST | /api/v1/strategies/{id}/test | Start a test run |
| GET | /api/v1/strategies/{id}/tests | List all test runs |
| GET | /api/v1/strategies/{id}/tests/{test_id} | Get status and results for a run |
| POST | /api/v1/strategies/{id}/tests/{test_id}/cancel | Cancel a running test |
| GET | /api/v1/strategies/{id}/test-results | Latest completed test results |
| GET | /api/v1/strategies/{id}/compare-versions?v1=N&v2=M | Side-by-side version comparison |
Next Steps
- Deploying Strategies — deploy a validated strategy to live trading
- Gymnasium Environments — train an RL agent to optimize beyond rule-based conditions