Historical performance analysis of our trading strategies. All results are based on out-of-sample backtesting with no lookahead bias.
Our system uses three independent CatBoost machine learning models, each optimized for a different forecast horizon (3-month, 6-month, and 12-month). Each model independently produces a binary signal: LONG (expecting the S&P 500 to rise) or CASH (expecting it to fall or stay flat).
3-month: Short-term signal capturing quarterly earnings cycles, sentiment shifts, and momentum.
6-month: Medium-term signal balancing responsiveness with stability across business cycles.
12-month: Long-term signal focused on fundamental economic trends and macro regime shifts.
How signals become strategies: These raw signals feed into two trading strategies. The Consensus Model uses a majority-vote rule (2+ models LONG = 100% invested), while the Position-Sized Model scales exposure gradually based on the degree of agreement.
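The two rules above can be sketched in a few lines. This is only an illustration: the majority-vote rule follows the description, but the page does not say how the Position-Sized Model maps agreement to exposure, so the linear mapping here (exposure = share of models that are LONG) is an assumption, as are the function names and the 0% exposure when fewer than two models are LONG.

```python
from typing import Sequence

LONG, CASH = "LONG", "CASH"


def consensus_exposure(signals: Sequence[str]) -> float:
    """Majority vote: 100% invested when 2 or more models are LONG.

    ASSUMPTION: exposure is 0% (all cash) otherwise.
    """
    n_long = sum(1 for s in signals if s == LONG)
    return 1.0 if n_long >= 2 else 0.0


def position_sized_exposure(signals: Sequence[str]) -> float:
    """ASSUMPTION: exposure equals the fraction of models that are LONG.

    The real scaling rule is not given on this page; a linear mapping
    is the simplest choice consistent with "scales gradually".
    """
    n_long = sum(1 for s in signals if s == LONG)
    return n_long / len(signals)


# Hypothetical signals in the order 3M, 6M, 12M.
signals = [LONG, CASH, LONG]
print(consensus_exposure(signals))       # majority is LONG
print(position_sized_exposure(signals))  # two of three models LONG
```

Under these assumptions the two strategies differ only when the models disagree: a single LONG signal leaves the Consensus Model in cash while the position-sized rule still holds partial exposure.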
Distribution of consensus signals with average 3-month forward S&P 500 returns
Classification performance on test set and cross-validation results
| Horizon | AUC | Accuracy | Precision | Recall | F1 Score | CV AUC | Features |
|---|---|---|---|---|---|---|---|
| 3M | 0.672 | 0.737 | 0.737 | 1.000 | 0.848 | 0.657 ± 0.136 | 10 |
| 6M | 0.570 | 0.433 | 0.683 | 0.394 | 0.500 | 0.664 ± 0.210 | 12 |
| 12M | 0.841 | 0.722 | 0.947 | 0.636 | 0.761 | 0.534 ± 0.089 | 12 |
AUC: Target ≥ 0.60 (higher = better predictive power)
Precision: Accuracy of LONG predictions (higher = fewer false signals)
Recall: Ability to identify opportunities (higher = catches more good periods)
CV AUC: Cross-validation performance (lower std = more stable)
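To make the metric definitions concrete, here is a minimal sketch that derives accuracy, precision, recall, and F1 from confusion-matrix counts, treating LONG as the positive class. The counts in the example are an assumption: the page does not publish them, and they were chosen only because they round to the 3M row of the table. They also show why accuracy and precision coincide when recall is 1.000: a model that predicts LONG for every period has no negatives, so both metrics reduce to the share of periods that were actually LONG.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary metrics with LONG as the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# ASSUMED counts (not from the page): every period predicted LONG,
# 14 of 19 periods actually LONG.
metrics = classification_metrics(tp=14, fp=5, fn=0, tn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```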
How well each model classifies LONG vs CASH signals
Accuracy (3M): 73.7%
Accuracy (6M): 43.3%
Accuracy (12M): 72.2%
Train vs Validation AUC per fold — detecting overfitting
3M: 5 folds completed
6M: 4 folds completed
12M: 3 folds completed
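A per-fold comparison like this can be summarized as a validation mean ± standard deviation plus the train-minus-validation gap for each fold. The sketch below is illustrative only: the AUC values are hypothetical, the 0.15 gap threshold is an arbitrary choice, and the page does not say whether its "±" figure is a sample or population standard deviation (sample is assumed here).

```python
from statistics import mean, stdev
from typing import Sequence


def summarize_folds(
    train_auc: Sequence[float],
    val_auc: Sequence[float],
    gap_threshold: float = 0.15,  # ASSUMPTION: arbitrary cut-off
) -> dict:
    """Summarize cross-validation AUCs and flag folds that look overfit."""
    gaps = [t - v for t, v in zip(train_auc, val_auc)]
    return {
        "val_mean": mean(val_auc),
        "val_std": stdev(val_auc),  # sample standard deviation
        "gaps": gaps,
        # Indices of folds where train AUC exceeds validation AUC
        # by more than the threshold.
        "overfit_folds": [i for i, g in enumerate(gaps) if g > gap_threshold],
    }


# HYPOTHETICAL fold results, not taken from the page.
train = [0.91, 0.88, 0.93, 0.90, 0.89]
val = [0.70, 0.52, 0.81, 0.66, 0.60]

summary = summarize_folds(train, val)
print(f"CV AUC: {summary['val_mean']:.3f} ± {summary['val_std']:.3f}")
print("Folds with a large train/validation gap:", summary["overfit_folds"])
```

A large, consistent gap suggests the model memorizes the training window; a wide spread across folds suggests the result depends heavily on which period is held out.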
Top 10 most important features for each prediction horizon
Most important features for prediction
Confidence levels for each time horizon (50% threshold for LONG signal). Vertical markers show consensus signal changes.
Dot color reflects number of models LONG
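The thresholding and the consensus-change markers described above can be sketched as follows. The probabilities are hypothetical, and two details are assumptions because the page does not state them: a confidence of exactly 50% is treated as LONG, and the first observation is not counted as a change.

```python
from typing import List, Sequence, Tuple


def to_signal(prob_long: float, threshold: float = 0.5) -> str:
    """ASSUMPTION: a probability exactly at the threshold counts as LONG."""
    return "LONG" if prob_long >= threshold else "CASH"


def consensus_changes(
    rows: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[Tuple[int, int, str]]:
    """Return (row index, models LONG, new consensus) at each consensus flip.

    Each row holds the LONG probabilities for the 3M, 6M, and 12M models.
    """
    changes = []
    previous = None
    for i, probs in enumerate(rows):
        n_long = sum(1 for p in probs if to_signal(p, threshold) == "LONG")
        consensus = "LONG" if n_long >= 2 else "CASH"
        if previous is not None and consensus != previous:
            changes.append((i, n_long, consensus))
        previous = consensus
    return changes


# HYPOTHETICAL confidence levels per period (3M, 6M, 12M).
history = [
    (0.60, 0.40, 0.70),
    (0.55, 0.45, 0.52),
    (0.30, 0.40, 0.80),
    (0.20, 0.60, 0.90),
]
print(consensus_changes(history))
```

The `n_long` count returned with each change corresponds to the quantity the dot color encodes.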