
Reward Functions

5 built-in reward functions and how to write custom ones


The reward function shapes what your agent learns to optimize. Choosing the right one is one of the most important training decisions.


Available Reward Functions

from tradeready_gym.rewards import (
    PnLReward,
    LogReturnReward,
    SharpeReward,
    SortinoReward,
    DrawdownPenaltyReward,
    CustomReward,
)

Pass a reward function instance to any environment via the reward_function parameter:

env = gym.make(
    "TradeReady-BTC-v0",
    api_key="ak_live_...",
    reward_function=SharpeReward(window=50),
)

PnLReward

Formula: current_equity - previous_equity

The simplest reward: the absolute dollar change in equity per step.

env = gym.make("TradeReady-BTC-v0",
    reward_function=PnLReward()  # default
)

Best for: Baselines, sanity checks, and simple environments. Easy to debug because the reward is directly interpretable.

Drawback: An agent can achieve high PnL with very high variance — it learns to chase returns without penalizing risk. This often produces strategies with large drawdowns.
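Expressed through the custom-reward interface described later on this page, PnLReward reduces to a one-line compute. This is a sketch for intuition, not the library's actual implementation:

```python
class SimplePnLReward:
    """Sketch of PnLReward via the CustomReward interface (illustrative only)."""

    def compute(self, prev_equity: float, curr_equity: float, info: dict) -> float:
        # Absolute dollar change in equity for this step
        return curr_equity - prev_equity

    def reset(self) -> None:
        pass  # stateless: nothing to clear between episodes
```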


LogReturnReward

Formula: log(current_equity / previous_equity)

Log returns produce more stable gradients than raw PnL because they are symmetric and additive across time steps.

env = gym.make("TradeReady-BTC-v0",
    reward_function=LogReturnReward()
)

Best for: Initial training runs where gradient stability is a priority.

Note: Log returns require current_equity > 0. If equity reaches zero, the environment truncates the episode rather than computing an undefined reward.
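The additivity claim is easy to verify: per-step log returns sum to the log return over the whole span, so the cumulative reward stays consistent with total performance.

```python
import math

equity = [100.0, 105.0, 99.75, 110.0]  # a hypothetical equity curve

# Per-step rewards as LogReturnReward would compute them
step_rewards = [math.log(b / a) for a, b in zip(equity, equity[1:])]

# The sum of per-step log returns equals the log return over the full episode
assert math.isclose(sum(step_rewards), math.log(equity[-1] / equity[0]))
```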


SharpeReward

Formula: Rolling Sharpe ratio delta over the last N steps.

The agent is rewarded for improvements to its risk-adjusted return, not just raw return. An agent that earns 0.5% with low volatility gets a higher reward than one that earns 0.5% with high volatility.

env = gym.make("TradeReady-BTC-v0",
    reward_function=SharpeReward(window=50)
)

Parameter | Default | Description
--------- | ------- | -----------
window    | 50      | Number of steps in the rolling Sharpe calculation

Best for: Training agents that should manage risk — strategies meant for live deployment where drawdown control matters.

Drawback: Can converge slowly because the reward signal is noisy at the start of training when the rolling window is not yet full.
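A minimal sketch of the rolling-Sharpe-delta idea, assuming Sharpe is computed as the mean over the standard deviation of per-step returns within the window (the library's actual implementation may differ in details such as annualization or epsilon handling):

```python
from collections import deque
import statistics

class RollingSharpeDelta:
    """Sketch: reward = change in the rolling Sharpe ratio after each step."""

    def __init__(self, window: int = 50):
        self.returns = deque(maxlen=window)  # rolling buffer of per-step returns
        self._prev_sharpe = 0.0

    def _sharpe(self) -> float:
        if len(self.returns) < 2:
            return 0.0  # not enough data: the signal is noisy early on
        mean = statistics.mean(self.returns)
        std = statistics.pstdev(self.returns)
        return mean / std if std > 0 else 0.0

    def compute(self, prev_equity: float, curr_equity: float, info: dict) -> float:
        self.returns.append(curr_equity / prev_equity - 1.0)
        sharpe = self._sharpe()
        reward = sharpe - self._prev_sharpe  # reward the *improvement*, not the level
        self._prev_sharpe = sharpe
        return reward

    def reset(self) -> None:
        self.returns.clear()
        self._prev_sharpe = 0.0
```

Note that the first few steps yield zero reward until the window has enough samples, which is exactly the slow-start drawback described above.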


SortinoReward

Formula: Rolling Sortino ratio delta — like Sharpe but only penalizes downside volatility.

The Sortino ratio ignores upside variance, which means an agent is not penalized for large positive returns. Only downside swings reduce the reward.

env = gym.make("TradeReady-BTC-v0",
    reward_function=SortinoReward(window=50)
)

Parameter | Default | Description
--------- | ------- | -----------
window    | 50      | Rolling window size

Best for: Strategies where the goal is specifically to minimize losses while allowing for asymmetric upside.
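The downside-deviation denominator that distinguishes Sortino from Sharpe can be sketched with one common definition (root-mean-square of below-target returns); the library's exact convention may differ:

```python
import math

def downside_deviation(returns: list[float], target: float = 0.0) -> float:
    """Root-mean-square of returns below target; upside returns contribute zero."""
    return math.sqrt(sum(min(r - target, 0.0) ** 2 for r in returns) / len(returns))

calm  = [0.01, 0.02, -0.01, 0.03]
spiky = [0.01, 0.10, -0.01, 0.30]  # same downside, much larger upside

# Larger upside swings do not increase the denominator -- only losses do
assert downside_deviation(calm) == downside_deviation(spiky)
```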


DrawdownPenaltyReward

Formula: PnL - penalty_coeff * current_drawdown * equity

Combines raw PnL with a penalty proportional to the current drawdown from the equity peak. The agent is penalized more as the drawdown deepens.

env = gym.make("TradeReady-BTC-v0",
    reward_function=DrawdownPenaltyReward(penalty_coeff=1.0)
)

Parameter     | Default | Description
------------- | ------- | -----------
penalty_coeff | 1.0     | How much to penalize drawdowns. Higher = more conservative.

Best for: Capital preservation strategies where surviving is more important than maximizing returns.

Tuning: A penalty_coeff of 0.5 mildly discourages drawdowns. A value of 2.0 aggressively punishes any drawdown and tends to produce very conservative agents.
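Working the formula through with concrete numbers (assuming drawdown is measured as the fractional distance from the equity peak, as in the custom-reward example later on this page):

```python
penalty_coeff = 1.0
peak, equity = 10_000.0, 9_500.0   # equity sits 5% below its peak
pnl = -100.0                       # this step's equity change, for illustration

drawdown = (peak - equity) / peak                 # 0.05
reward = pnl - penalty_coeff * drawdown * equity
# -100 - 1.0 * 0.05 * 9500 = -575: the drawdown term dominates the raw PnL
```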


Reward Comparison

Reward                | Signal type      | Convergence speed | Risk awareness
--------------------- | ---------------- | ----------------- | -----------------
PnLReward             | Return           | Fast              | None
LogReturnReward       | Return           | Fast              | None
SharpeReward          | Risk-adjusted    | Slow              | High
SortinoReward         | Downside risk    | Slow              | High (asymmetric)
DrawdownPenaltyReward | Return - penalty | Medium            | Medium

Custom Rewards

Subclass CustomReward to define any reward you want. You must implement:

  • compute(prev_equity, curr_equity, info) -> float — called on every step
  • reset() — called at the start of each episode to clear accumulated state

from tradeready_gym.rewards import CustomReward

class RiskAdjustedReward(CustomReward):
    def __init__(self, risk_penalty: float = 0.5):
        self.risk_penalty = risk_penalty
        self._peak = 0.0

    def compute(self, prev_equity: float, curr_equity: float, info: dict) -> float:
        pnl = curr_equity - prev_equity
        # Track running peak for drawdown calculation
        self._peak = max(self._peak, curr_equity)
        drawdown = (self._peak - curr_equity) / self._peak if self._peak > 0 else 0.0
        # Reward = PnL minus a drawdown penalty
        return pnl - self.risk_penalty * drawdown * curr_equity

    def reset(self) -> None:
        self._peak = 0.0

env = gym.make(
    "TradeReady-BTC-v0",
    api_key="ak_live_...",
    reward_function=RiskAdjustedReward(risk_penalty=0.5),
)

The info dict passed to compute contains the full step info including filled_orders, unrealized_pnl, position_value, and virtual_time. You can use any of these to craft reward signals.
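For example, a custom reward could give paper (unrealized) gains only partial credit by reading unrealized_pnl from info. It is shown here as a plain class so the sketch is self-contained; in practice it would subclass CustomReward:

```python
class UnrealizedDiscountReward:
    """Sketch: full credit for realized PnL, partial credit for paper gains."""

    def __init__(self, unrealized_weight: float = 0.5):
        self.unrealized_weight = unrealized_weight
        self._prev_unrealized = 0.0

    def compute(self, prev_equity: float, curr_equity: float, info: dict) -> float:
        pnl = curr_equity - prev_equity
        unrealized = info.get("unrealized_pnl", 0.0)
        delta_unrealized = unrealized - self._prev_unrealized
        self._prev_unrealized = unrealized
        # Realized component at full weight, unrealized component discounted
        return (pnl - delta_unrealized) + self.unrealized_weight * delta_unrealized

    def reset(self) -> None:
        self._prev_unrealized = 0.0
```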

The reset() method is called automatically by the environment at the start of each new episode. Forgetting to implement it causes state to bleed between episodes, which will corrupt training.


Reward Shaping Tips

  • Start with PnLReward to verify the environment and training loop are working before switching to more complex rewards.
  • Use SharpeReward for production-quality agents — risk-adjusted returns produce more stable live performance.
  • Tune DrawdownPenaltyReward coefficient by watching the equity curve in training. If the agent barely trades, lower the coefficient.
  • Combine ideas in a custom reward — e.g. log return (stable gradient) plus a drawdown penalty (risk control).
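The last tip can be sketched as a single custom reward combining a log-return signal with a fractional-drawdown penalty (coefficients are illustrative and would need tuning):

```python
import math

class LogReturnWithDrawdownPenalty:
    """Sketch: stable log-return signal minus a fractional-drawdown penalty."""

    def __init__(self, penalty_coeff: float = 0.1):
        self.penalty_coeff = penalty_coeff
        self._peak = 0.0

    def compute(self, prev_equity: float, curr_equity: float, info: dict) -> float:
        log_ret = math.log(curr_equity / prev_equity)   # stable gradient component
        self._peak = max(self._peak, curr_equity)
        drawdown = (self._peak - curr_equity) / self._peak if self._peak > 0 else 0.0
        return log_ret - self.penalty_coeff * drawdown  # risk-control component

    def reset(self) -> None:
        self._peak = 0.0
```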

Next Steps

  • Training Tracking — visualize learning curves in the dashboard
  • Examples — complete PPO and custom reward training scripts
