Reward Functions
5 built-in reward functions and how to write custom ones
The reward function shapes what your agent learns to optimize. Choosing the right one is one of the most important training decisions.
Available Reward Functions
```python
from tradeready_gym.rewards import (
    PnLReward,
    LogReturnReward,
    SharpeReward,
    SortinoReward,
    DrawdownPenaltyReward,
    CustomReward,
)
```
Pass a reward function instance to any environment via the reward_function parameter:
```python
env = gym.make(
    "TradeReady-BTC-v0",
    api_key="ak_live_...",
    reward_function=SharpeReward(window=50),
)
```
PnLReward
Formula: current_equity - previous_equity
The simplest reward: the absolute dollar change in equity per step.
```python
env = gym.make(
    "TradeReady-BTC-v0",
    reward_function=PnLReward(),  # default
)
```
Best for: Baselines, sanity checks, and simple environments. Easy to debug because the reward is directly interpretable.
Drawback: An agent can achieve high PnL with very high variance — it learns to chase returns without penalizing risk. This often produces strategies with large drawdowns.
LogReturnReward
Formula: log(current_equity / previous_equity)
Log returns produce more stable gradients than raw PnL because they are symmetric and additive across time steps.
```python
env = gym.make(
    "TradeReady-BTC-v0",
    reward_function=LogReturnReward(),
)
```
Best for: Initial training runs where gradient stability is a priority.
Note: Log returns require current_equity > 0. The environment handles the edge case of equity reaching zero (episode truncated).
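The additivity claim is easy to verify: per-step log rewards sum to the log return of the whole episode, which raw PnL percentages do not. A quick standalone sketch (plain Python, not library code):

```python
import math

# A made-up equity path over three steps
equity = [10_000.0, 10_200.0, 10_098.0, 10_300.0]

# Per-step log returns, as LogReturnReward would emit them
step_rewards = [
    math.log(curr / prev)
    for prev, curr in zip(equity, equity[1:])
]

# Summing the per-step rewards recovers the whole-episode log return
total = sum(step_rewards)
episode_return = math.log(equity[-1] / equity[0])
assert abs(total - episode_return) < 1e-12
```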
SharpeReward
Formula: Rolling Sharpe ratio delta over the last N steps.
The agent is rewarded for improvements to its risk-adjusted return, not just raw return. An agent that earns 0.5% with low volatility gets a higher reward than one that earns 0.5% with high volatility.
```python
env = gym.make(
    "TradeReady-BTC-v0",
    reward_function=SharpeReward(window=50),
)
```
| Parameter | Default | Description |
|---|---|---|
| window | 50 | Number of steps in the rolling Sharpe calculation |
Best for: Training agents that should manage risk — strategies meant for live deployment where drawdown control matters.
Drawback: Can converge slowly because the reward signal is noisy at the start of training when the rolling window is not yet full.
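The "rolling Sharpe delta" idea can be sketched in a few lines. This is an illustrative standalone approximation, not the library's internal implementation, and it also shows why early rewards are noisy: with fewer than two returns in the window the ratio is undefined, so the sketch emits zero.

```python
from collections import deque
import statistics

class RollingSharpeDelta:
    """Sketch: reward each step by the change in the rolling Sharpe ratio."""

    def __init__(self, window: int = 50):
        self.returns = deque(maxlen=window)
        self.prev_sharpe = 0.0

    def _sharpe(self) -> float:
        # Undefined (treated as 0) until the window has at least 2 returns
        if len(self.returns) < 2:
            return 0.0
        mean = statistics.mean(self.returns)
        std = statistics.stdev(self.returns)
        return mean / std if std > 0 else 0.0

    def step(self, prev_equity: float, curr_equity: float) -> float:
        self.returns.append(curr_equity / prev_equity - 1.0)
        sharpe = self._sharpe()
        reward = sharpe - self.prev_sharpe  # the "delta"
        self.prev_sharpe = sharpe
        return reward
```

Because each reward is a delta, the rewards telescope: their sum over an episode equals the final rolling Sharpe ratio.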
SortinoReward
Formula: Rolling Sortino ratio delta — like Sharpe but only penalizes downside volatility.
The Sortino ratio ignores upside variance, which means an agent is not penalized for large positive returns. Only downside swings reduce the reward.
```python
env = gym.make(
    "TradeReady-BTC-v0",
    reward_function=SortinoReward(window=50),
)
```
| Parameter | Default | Description |
|---|---|---|
| window | 50 | Rolling window size |
Best for: Strategies where the goal is specifically to minimize losses while allowing for asymmetric upside.
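The asymmetry comes from the downside deviation used in the Sortino denominator: only returns below a target contribute to risk. A standalone sketch (not library code) makes the contrast with standard deviation concrete:

```python
import math

def downside_deviation(returns, target=0.0):
    # Only returns below the target contribute, so upside swings
    # do not inflate the risk estimate
    downside = [min(r - target, 0.0) ** 2 for r in returns]
    return math.sqrt(sum(downside) / len(returns))

# Two return streams: one spikes upward, the other downward
upside_spiky = [0.01, 0.05, 0.01, 0.05]
downside_spiky = [0.05, -0.02, 0.05, -0.02]

# Sortino-style risk ignores the upside spikes entirely
assert downside_deviation(upside_spiky) == 0.0
assert downside_deviation(downside_spiky) > 0.0
```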
DrawdownPenaltyReward
Formula: PnL - penalty_coeff * current_drawdown * equity
Combines raw PnL with a penalty proportional to the current drawdown from the equity peak. The agent is penalized more as the drawdown deepens.
```python
env = gym.make(
    "TradeReady-BTC-v0",
    reward_function=DrawdownPenaltyReward(penalty_coeff=1.0),
)
```
| Parameter | Default | Description |
|---|---|---|
| penalty_coeff | 1.0 | How much to penalize drawdowns. Higher = more conservative. |
Best for: Capital preservation strategies where surviving is more important than maximizing returns.
Tuning: A penalty_coeff of 0.5 mildly discourages drawdowns. A value of 2.0 aggressively punishes any drawdown and tends to produce very conservative agents.
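To make the formula concrete, here is a standalone re-implementation with made-up numbers (for illustration only; the library computes this internally):

```python
def drawdown_penalty_reward(prev_equity, curr_equity, peak_equity,
                            penalty_coeff=1.0):
    # PnL - penalty_coeff * current_drawdown * equity
    pnl = curr_equity - prev_equity
    drawdown = (peak_equity - curr_equity) / peak_equity
    return pnl - penalty_coeff * drawdown * curr_equity

# Equity fell from 10,200 to 10,000 against a peak of 10,500:
# pnl = -200, drawdown ~ 4.76%, penalty ~ 476, so reward ~ -676
r = drawdown_penalty_reward(10_200.0, 10_000.0, 10_500.0, penalty_coeff=1.0)
```

Note that the penalty applies even on profitable steps while the agent remains below its equity peak, which is what pushes it toward recovering drawdowns quickly.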
Reward Comparison
| Reward | Signal type | Convergence speed | Risk awareness |
|---|---|---|---|
| PnLReward | Return | Fast | None |
| LogReturnReward | Return | Fast | None |
| SharpeReward | Risk-adjusted | Slow | High |
| SortinoReward | Downside risk | Slow | High (asymmetric) |
| DrawdownPenaltyReward | Return - penalty | Medium | Medium |
Custom Rewards
Subclass CustomReward to define any reward you want. You must implement:

- compute(prev_equity, curr_equity, info) -> float — called on every step
- reset() — called at the start of each episode to clear accumulated state
```python
from tradeready_gym.rewards import CustomReward

class RiskAdjustedReward(CustomReward):
    def __init__(self, risk_penalty: float = 0.5):
        self.risk_penalty = risk_penalty
        self._peak = 0.0

    def compute(self, prev_equity: float, curr_equity: float, info: dict) -> float:
        pnl = curr_equity - prev_equity
        # Track running peak for drawdown calculation
        self._peak = max(self._peak, curr_equity)
        drawdown = (self._peak - curr_equity) / self._peak if self._peak > 0 else 0.0
        # Reward = PnL minus a drawdown penalty
        return pnl - self.risk_penalty * drawdown * curr_equity

    def reset(self) -> None:
        self._peak = 0.0
```
```python
env = gym.make(
    "TradeReady-BTC-v0",
    api_key="ak_live_...",
    reward_function=RiskAdjustedReward(risk_penalty=0.5),
)
```
The info dict passed to compute contains the full step info including filled_orders, unrealized_pnl, position_value, and virtual_time. You can use any of these to craft reward signals.
The reset() method is called automatically by the environment at the start of each new episode. Forgetting to implement it causes state to bleed between episodes, which will corrupt training.
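As one illustration of reward shaping with the info dict, the sketch below charges a small cost per filled order to discourage overtrading. The class name is hypothetical, and it assumes info["filled_orders"] is a list of fills; check your environment's actual info schema before relying on that. It is written standalone here so the logic is self-contained; in practice you would subclass CustomReward as above.

```python
class TurnoverPenaltyReward:
    """Sketch: PnL minus a per-fill cost read from the step info dict.
    Assumes info["filled_orders"] is a list of fills (an assumption,
    not a documented guarantee)."""

    def __init__(self, cost_per_fill: float = 1.0):
        self.cost_per_fill = cost_per_fill

    def compute(self, prev_equity: float, curr_equity: float, info: dict) -> float:
        pnl = curr_equity - prev_equity
        n_fills = len(info.get("filled_orders", []))
        return pnl - self.cost_per_fill * n_fills

    def reset(self) -> None:
        pass  # no state accumulated across steps
```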
Reward Shaping Tips
- Start with PnLReward to verify the environment and training loop are working before switching to more complex rewards.
- Use SharpeReward for production-quality agents — risk-adjusted returns produce more stable live performance.
- Tune the DrawdownPenaltyReward coefficient by watching the equity curve in training. If the agent barely trades, lower the coefficient.
- Combine ideas in a custom reward — e.g. log return (stable gradient) plus a drawdown penalty (risk control).
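The last tip, combining log return with a drawdown penalty, might look like the sketch below. The class name and penalty form are illustrative choices; it is written standalone so it runs on its own, but in practice you would subclass CustomReward.

```python
import math

class LogReturnWithDrawdownPenalty:
    """Sketch: log return (stable gradient) minus a drawdown
    penalty (risk control). In practice, subclass
    tradeready_gym.rewards.CustomReward."""

    def __init__(self, penalty_coeff: float = 0.5):
        self.penalty_coeff = penalty_coeff
        self._peak = 0.0

    def compute(self, prev_equity: float, curr_equity: float, info: dict) -> float:
        log_ret = math.log(curr_equity / prev_equity)
        self._peak = max(self._peak, curr_equity)
        drawdown = (self._peak - curr_equity) / self._peak if self._peak > 0 else 0.0
        return log_ret - self.penalty_coeff * drawdown

    def reset(self) -> None:
        self._peak = 0.0
```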
Next Steps
- Training Tracking — visualize learning curves in the dashboard
- Examples — complete PPO and custom reward training scripts