- PART 0 - INTRO
- PART 1 - DATA ANALYSIS
- PART 2 - MODELLING
- PART 3 - BACKTESTING
- PART 4 - USEFUL FINANCIAL FEATURES
- PART 5 - HIGH-PERFORMANCE COMPUTING RECIPES
Advances in Financial Machine Learning is a highly technical book that uses advanced mathematics throughout, so the concepts introduced in each chapter need to be studied to get the maximum benefit. With that said, this repository attempts to reduce that density by highlighting the most important concepts and providing chapter summaries as well as exercise solutions using sample Bitcoin data.
Exercises for the following chapters are not included:
Chapter 6 - Ensemble Methods: The exercises in this chapter are entirely theoretical; the ChatGPT chat link covering them is available.
Chapter 11 - The Dangers of Backtesting: This chapter is a warning about common backtesting sins, and its exercises cover cases where certain sins are committed.
Chapter 16 - Machine Learning Asset Allocation: Skipped, since this study concentrates initially on trading applications.
Chapter 20/21/22 - These chapters belong to the High-Performance Computing Recipes part. The mpPandasObj parallelization function provided in Chapter 20 was used in earlier chapters. It is best to refer to these chapters when training models on genuinely large datasets rather than the exercise samples.
The book consists of 5 parts (Data Analysis, Modelling, Backtesting, Useful Financial Features and High-Performance Computing Recipes), with multiple chapters under each part.
The Problem: Most quantitative strategies fail because they are "false discoveries" resulting from a flawed, unscientific research process.
The Cause: Standard ML tools fail because financial data is unique: it has a low signal-to-noise ratio, is not IID, and is subject to structural breaks.
The Culprit: The traditional backtest is the main tool for self-deception, leading to massive overfitting.
The Solution: A paradigm shift is needed towards a collaborative, theory-driven, and industrialized "strategy factory" approach that treats financial ML as its own scientific discipline.
The Path Forward: The rest of the book is dedicated to building the components of this factory, providing specific, practical tools to overcome the challenges identified in this chapter.
De Prado suggests the following members for creating a team for building a Strategy Factory:
- Data Curators: Acquire, clean, and structure raw market data into robust, analysis-ready formats.
- Feature Analysts: Transform structured data into informative variables (features) that have potential predictive power.
- Strategists: Develop and train machine learning models that generate predictive signals based on the engineered features.
- Backtesting Team: Rigorously evaluate a model's historical performance, focusing on preventing backtest overfitting and assessing its true viability.
- Deployment Team: Integrate the validated model into the live trading infrastructure, managing execution and operational risk.
- Portfolio Managers: Allocate capital across a portfolio of multiple strategies and manage the overall combined risk.
This chapter argues that standard time bars (e.g., daily, hourly) are a poor choice for financial ML. Because market activity is not uniform, time-based sampling leads to data with undesirable statistical properties. The solution is to use information-driven bars, which sample data based on market activity (like trade volume or price changes), resulting in series that are much closer to being IID (Independent and Identically Distributed) and better suited for modeling.
These bars are formed by sampling data whenever a certain amount of market information has been exchanged.
- Tick Bars: Sample every N transactions (ticks).
- Volume Bars: Sample every N units of asset traded (e.g., shares).
- Dollar Bars: Sample every N dollars of value traded. !!! DOLLAR BARS ARE SIGNIFICANT AND CONVENIENT. THEY ARE USED FOR MOST EXERCISES IN THE REMAINING CHAPTERS !!!
- Tick Imbalance Bars (TIBs): Sample when the imbalance between buy vs. sell ticks exceeds a threshold.
- Volume/Dollar Imbalance Bars (VIBs/DIBs): Sample when the volume/dollar imbalance between buys vs. sells exceeds a threshold.
- Tick Run Bars (TRBs): Sample at the end of a "run," a sequence of consecutive buyer- or seller-initiated ticks.
- Volume/Dollar Run Bars (VRBs/DRBs): Sample at the end of runs based on volume or dollar value.
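Below is a minimal sketch of how dollar bars can be built from tick data, assuming a hypothetical DataFrame with a DatetimeIndex and `price`/`volume` columns; it is an illustration rather than the book's exact implementation.

```python
import pandas as pd

def dollar_bars(ticks: pd.DataFrame, bar_size: float) -> pd.DataFrame:
    """Aggregate tick data into dollar bars of roughly `bar_size` traded value.
    Assumes `ticks` has a DatetimeIndex and 'price' / 'volume' columns (hypothetical names)."""
    dv = ticks['price'] * ticks['volume']           # dollars traded per tick
    bar_id = (dv.cumsum() // bar_size).astype(int)  # which bar each tick falls into
    g = ticks.groupby(bar_id)
    bars = g['price'].agg(open='first', high='max', low='min', close='last')
    bars['volume'] = g['volume'].sum()
    bars['dollars'] = dv.groupby(bar_id).sum()
    bars['close_time'] = g.apply(lambda x: x.index[-1])  # timestamp of the closing tick
    return bars

# usage (hypothetical): bars = dollar_bars(tick_df, bar_size=1_000_000)
```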
The Sampling section also introduces the CUSUM (Cumulative Sum) filter, another important trick used in a few places across the chapters.
The CUSUM Filter: An event-based sampling technique that triggers when the cumulative sum of price deviations from a mean crosses a predefined threshold, effectively capturing significant market events.
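A sketch of the symmetric CUSUM filter along the lines of the book's Snippet 2.4, applied here to log prices (the column name and threshold are assumptions):

```python
import numpy as np
import pandas as pd

def cusum_filter(close: pd.Series, threshold: float) -> pd.DatetimeIndex:
    """Symmetric CUSUM filter: return the timestamps where the cumulative
    up- or down-move in log prices exceeds `threshold`."""
    events, s_pos, s_neg = [], 0.0, 0.0
    diff = np.log(close).diff().dropna()
    for t, d in diff.items():
        s_pos = max(0.0, s_pos + d)   # cumulative upward drift
        s_neg = min(0.0, s_neg + d)   # cumulative downward drift
        if s_pos > threshold:
            s_pos = 0.0
            events.append(t)
        elif s_neg < -threshold:
            s_neg = 0.0
            events.append(t)
    return pd.DatetimeIndex(events)

# usage (hypothetical): sample_times = cusum_filter(bars['close'], threshold=0.02)
```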
Labelling is one of the most important sections as it introduces the Triple-Barrier Method and Meta-Labeling concepts.
A sophisticated labeling technique that mimics how a human trader thinks about a position. For each data point, we simulate a trade and see which of three "barriers" it hits first.
- Upper Barrier (Profit-Take): A price level above the entry for taking profit.
- Lower Barrier (Stop-Loss): A price level below the entry for cutting losses.
- Vertical Barrier (Time Limit): A maximum holding period for the trade.
The resulting label is not just direction, but outcome:
- 1: The profit-take horizontal barrier was hit.
- -1: The stop-loss horizontal barrier was hit.
- 0: The time limit (vertical barrier) was hit without touching the other barriers. (This can also be labelled with -1 for binary classification)
This method creates labels that are directly tied to a trading strategy's P&L and risk profile.
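A simplified sketch of triple-barrier labelling with fixed, symmetric horizontal barriers (the book scales the barriers by a rolling volatility estimate per event; names such as `close` and `events` are hypothetical):

```python
import pandas as pd

def triple_barrier_label(close: pd.Series, events: pd.DatetimeIndex,
                         pt: float, sl: float, max_hold: int) -> pd.Series:
    """Simplified triple-barrier labels: +1 profit-take, -1 stop-loss,
    0 if the vertical (time) barrier is hit first.
    `pt` / `sl` are return thresholds; `max_hold` is the holding period in bars."""
    labels = {}
    for t0 in events:
        start = close.index.get_loc(t0)
        path = close.iloc[start:start + max_hold + 1]
        returns = path / close.loc[t0] - 1.0
        hit_pt = returns[returns >= pt].index.min()    # first profit-take touch
        hit_sl = returns[returns <= -sl].index.min()   # first stop-loss touch
        if pd.isna(hit_pt) and pd.isna(hit_sl):
            labels[t0] = 0                             # vertical barrier hit first
        elif pd.isna(hit_sl) or (not pd.isna(hit_pt) and hit_pt < hit_sl):
            labels[t0] = 1
        else:
            labels[t0] = -1
    return pd.Series(labels, name='label')
```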
Meta-labeling is a powerful technique for improving an existing strategy. It separates the decision of what direction to bet from whether to bet at all.
It works in two stages:
- Primary Model: An initial, simpler model generates a prediction for the side of the bet (e.g., a mean-reversion model predicts "long").
- Meta-Model (The "Confidence" Model): A secondary ML model is trained to predict the probability that the primary model's bet will be successful (i.e., hit the profit-take barrier). It uses the binary outcome ({0, 1} where 1 = success) from the Triple-Barrier method as its target.
- Improves Precision: It acts as a sophisticated filter, screening out the primary model's low-confidence bets.
- Reduces False Positives: By avoiding bad trades, it significantly increases the strategy's Sharpe ratio.
- Controls Sizing: The probability output from the meta-model can be used to determine the size of the bet (bet more on high-confidence predictions).
Simple Workflow
- -> [Primary Model] -> Suggests a Bet (e.g., Long)
- -> [Meta-Model] -> Predicts P(Success) > threshold?
- -> YES: Place the trade. / -> NO: Pass on the trade.
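A minimal sketch of the meta-labelling stage, assuming hypothetical inputs `features`, `primary_side` (the primary model's +1/-1 bets) and `tb_label` (triple-barrier labels):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def fit_meta_model(features: pd.DataFrame, primary_side: pd.Series,
                   tb_label: pd.Series) -> RandomForestClassifier:
    """Train a meta-model that predicts whether the primary model's bet will pay off."""
    meta_y = ((primary_side * tb_label) > 0).astype(int)  # 1 = primary bet succeeded
    meta_X = features.assign(side=primary_side)           # let the meta-model see the side
    model = RandomForestClassifier(n_estimators=500, max_depth=5)
    model.fit(meta_X, meta_y)
    return model

# At prediction time, act only on confident calls (threshold is illustrative):
# p = model.predict_proba(meta_X)[:, 1]; take_trade = p > 0.55
```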
This chapter addresses a critical side-effect of the Triple-Barrier Method which is overlapping outcomes. It explains why this is a major problem for ML models and introduces methods for measuring this overlap (uniqueness) and correcting for it using sample weights. The core idea is that we have far fewer unique observations than our dataset size suggests.
In financial data, labels are not IID (independent and identically distributed). Specifically:
Some events (e.g., trading signals) overlap in time, so their outcomes are not independent.
If you train a model on such overlapping events without adjusting for that, your model will overfit, double-count some information, and perform poorly out-of-sample.
To fix this, the chapter introduces a way to assign weights to samples (events) so that:
Events with greater uniqueness get higher weight.
Events that overlap a lot get down-weighted.
This ensures your model is trained on more independent information.
This is the simplest way to measure redundancy. It asks: At any given point in time, how many different triple-barrier windows are active?
This count, say c_t, measures the degree of overlap at a specific time t. A high c_t means that a price movement at this moment will affect many different labels, making it disproportionately influential.
The instantaneous uniqueness at time t is defined as 1 / c_t. This forms the building block for the more advanced methods.
This section moves from measuring overlap at a single point in time to measuring the overall uniqueness of an entire label.
A label's "average uniqueness" is calculated by averaging the instantaneous uniqueness (1 / c_t) over all the time steps in its evaluation window.
For a label i spanning T_i time steps, its average uniqueness u_i is:
u_i = (Sum of 1/c_t for all t in label i's window) / T_i
This u_i value gives us a single, powerful number representing how redundant or unique a specific training example is. This is the value we can directly use as a sample weight during model training.
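A small sketch of both steps (the concurrency count c_t, then the per-label average of 1/c_t), assuming `t1` maps each event's start time to the time its barrier was touched:

```python
import pandas as pd

def average_uniqueness(close_index: pd.DatetimeIndex, t1: pd.Series) -> pd.Series:
    """Average uniqueness per label. `t1` maps each event's start time to the
    time its barrier was touched (the label's end time)."""
    # c_t: number of concurrent (overlapping) labels at each bar
    concurrency = pd.Series(0, index=close_index)
    for start, end in t1.items():
        concurrency.loc[start:end] += 1
    # u_i: mean of 1/c_t over each label's own window
    avg_u = pd.Series(index=t1.index, dtype=float)
    for start, end in t1.items():
        avg_u.loc[start] = (1.0 / concurrency.loc[start:end]).mean()
    return avg_u

# usage (hypothetical): sample_weights = average_uniqueness(bars.index, events['t1'])
```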
This section provides a crucial application of the uniqueness concept, specifically for ensemble methods like Random Forest or Bagging.
Standard bagging (bootstrapping) samples data points with replacement. In finance, where labels overlap, this is dangerous. You are highly likely to select many redundant, non-unique samples, which makes your bootstrapped training sets very similar to each other. This defeats the purpose of bagging, which relies on model diversity from diverse sub-samples.
The solution, Sequential Bootstrapping, suggests that instead of sampling purely at random, we use our uniqueness measure to guide the process: draw samples sequentially; for each sample drawn, assign it a uniqueness value (e.g., the average uniqueness from 4.4); and adjust the probability of drawing the next sample based on the uniqueness of the samples already drawn, ensuring the final bootstrapped set is more diverse.
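A compact (and deliberately unoptimized) sketch of the idea, operating on an indicator matrix whose rows are bars and whose columns are label identifiers (1 where a label's window covers a bar):

```python
import numpy as np
import pandas as pd

def sequential_bootstrap(ind_matrix: pd.DataFrame, n_draws: int) -> list:
    """Draw labels sequentially, favouring those that would remain most unique
    given the labels already drawn."""
    drawn = []
    for _ in range(n_draws):
        avg_u = pd.Series(index=ind_matrix.columns, dtype=float)
        for col in ind_matrix.columns:
            # Concurrency of each bar if `col` were added to the labels drawn so far
            trial = ind_matrix[drawn + [col]].sum(axis=1)
            window = ind_matrix[col] > 0
            avg_u[col] = (ind_matrix.loc[window, col] / trial[window]).mean()
        prob = (avg_u / avg_u.sum()).to_numpy()  # more unique labels are more likely
        pick = np.random.choice(len(ind_matrix.columns), p=prob)
        drawn.append(ind_matrix.columns[pick])
    return drawn
```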
This chapter introduces a technique for feature engineering that solves the stationarity vs memory dilemma.
When preparing data for a machine learning model, we face a critical trade-off:
-
Stationarity: ML models require data whose statistical properties (like mean and variance) are constant over time. Raw price series are non-stationary and will break most models.
-
Memory: The original price series contains valuable, long-term trend information (its "memory") that is essential for making accurate predictions.
The dilemma is that the standard methods (i.e. working with returns) used to make a series stationary completely destroy its memory. We are forced to choose between a series that is statistically valid but predictively useless, or one that is predictive but statistically invalid.
The textbook approach is to compute integer differences, most commonly by calculating returns (price_t - price_{t-1}).
What it does: This is an "all-or-nothing" operation. It successfully transforms a non-stationary price series into a stationary returns series.
The Problem: In doing so, it erases almost all of the original series' memory. We "throw the baby out with the bathwater," leaving a series that is largely noise and very difficult to predict.
De Prado offers a far more elegant solution that avoids the harsh trade-off. Instead of a binary on/off switch, Fractional Differentiation acts like a "dimmer switch" for memory.
FracDiff is a technique that generalizes differentiation to a fractional order, controlled by a parameter d.
- d = 0: The original series (full memory, non-stationary).
- d = 1: Standard returns (no memory, stationary).
- 0 < d < 1: A new series that balances both properties.
The Goal: The key insight is to find the minimum d that is just high enough to make the series stationary. We apply just enough differentiation to satisfy our model's statistical needs, while preserving the maximum possible amount of the original series' predictive memory. This allows us to create features that have the best of both worlds: they are stationary enough to be used in ML models, but still rich with the long-term memory needed to build a powerful predictive strategy.
🎯 What it does in simple terms:
Instead of subtracting only the previous value (like in P_t - P_{t-1}),
It subtracts a weighted sum of previous values.
For example:
new_value_t = P_t - 0.9 * P_{t-1} - 0.8 * P_{t-2} - 0.7 * P_{t-3} ... (the weights shown here are only illustrative; the actual FracDiff weights follow the binomial expansion of (1 - B)^d)
The weights decay over time.
Smaller d → slower decay → more memory kept.
Larger d → faster decay → less memory, more like regular differencing.
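A sketch of the actual weighting scheme, where the weights come from the binomial expansion of (1 - B)^d (the fixed window size below is an arbitrary choice):

```python
import numpy as np
import pandas as pd

def frac_diff_weights(d: float, n_weights: int) -> np.ndarray:
    """Weights of the fractional-difference operator (1 - B)^d:
    w_0 = 1 and w_k = -w_{k-1} * (d - k + 1) / k."""
    w = [1.0]
    for k in range(1, n_weights):
        w.append(-w[-1] * (d - k + 1) / k)
    return np.array(w)

def frac_diff(series: pd.Series, d: float, n_weights: int = 100) -> pd.Series:
    """Fixed-width-window fractional differencing of a price series."""
    w = frac_diff_weights(d, n_weights)[::-1]       # oldest weight first
    values = series.to_numpy(dtype=float)
    out = [np.dot(w, values[i - n_weights + 1:i + 1])
           for i in range(n_weights - 1, len(values))]
    return pd.Series(out, index=series.index[n_weights - 1:])

# d = 0 reproduces the series and d = 1 gives plain first differences; in practice
# one searches for the smallest d whose output passes an ADF stationarity test.
```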
Chapter 6 provides a timeless foundation on the principles of bagging and boosting, which remain essential knowledge. However, its toolkit is dated, as it predates the modern era of machine learning that began around its publication. The chapter underrepresents the now-ubiquitous, hyper-optimized gradient boosting libraries like XGBoost and LightGBM, and entirely omits the rise of deep learning architectures like Transformers for handling sequential data. Furthermore, it lacks modern interpretability frameworks like SHAP, which are now critical for explaining these complex models. While its concepts are fundamental, a practitioner today must supplement this chapter with these more powerful, contemporary methods.
There are two main families of ensemble methods, each with a different philosophy.
Bagging focuses on reducing variance and creating stability by averaging out errors. It's a parallel approach where models are trained independently.
The Idea: Create a "committee" of diverse models by training each one on a slightly different, random subset of the data.
How it Works:
- Bootstrap: Create multiple training datasets by sampling with replacement from the original data.
- Train: Train one model (e.g., a Decision Tree) on each of these bootstrapped datasets.
- Aggregate: For a new prediction, let all the models "vote" (for classification) or average their outputs (for regression). The majority/average decision is the final output.
Why it Works: Individual models might be unstable and overfit to noise in their specific dataset. By averaging them, their individual errors tend to cancel each other out, leading to a much more stable and reliable final prediction.
Random Forest is a powerful and popular extension of bagging, specifically for decision trees. It introduces an extra layer of randomness to make the models even more diverse.
The Idea: It's bagging, but with a twist to prevent the models from becoming too similar.
The Key Addition: When building each decision tree, at every split point, the algorithm is only allowed to consider a random subset of the features. This process is called feature bagging. It forces the trees to be different from one another. Without it, every tree might learn to rely on the same one or two "super-predictive" features. By restricting the feature choice, Random Forest builds a more diverse committee of experts, making the overall model more robust if a key feature's signal fades.
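In scikit-learn this corresponds to the `max_features` argument; a small illustrative configuration (parameter values are arbitrary):

```python
from sklearn.ensemble import RandomForestClassifier

# Feature bagging: `max_features` caps how many features each split may consider,
# forcing the trees in the forest apart from one another.
clf = RandomForestClassifier(
    n_estimators=1000,      # size of the committee
    max_features='sqrt',    # random subset of features per split
    min_samples_leaf=50,    # coarse guard against fitting noise
)
# clf.fit(X_train, y_train)   # X_train / y_train are hypothetical
```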
Boosting is a sequential approach that focuses on reducing bias and building a single, highly accurate model by learning from mistakes. The Idea: Build a "chain" of weak models, where each new model is trained to correct the errors made by the previous ones.
How it Works:
- Train a simple, "weak" base model on the data.
- Identify which observations the model got wrong.
- Train the next model, giving more weight and focus to the observations that the previous model misclassified.
- Repeat this process, with each new model focusing on the hardest remaining cases.
- The final prediction is a weighted sum of all the models' predictions.
Why it Works: It converts a series of weak learners (models that are only slightly better than random guessing) into a single, powerful "strong learner." It's an expert at finding and modeling complex, non-linear patterns.
By using these ensemble techniques, we move away from the fragile search for a single perfect model and toward building robust, diversified, and more reliable predictive systems.
This chapter addresses a common mistake in quantitative finance: using standard cross-validation (CV) techniques on financial data. A flawed validation process is the primary reason why so many strategies that look brilliant in backtests fail in live trading. This chapter provides a robust solution to prevent this.
Standard K-Fold CV shuffles data and splits it randomly. This works for IID (Independent and Identically Distributed) data, but financial time series are not IID. Applying standard CV to financial data leads to a critical flaw: data leakage.
What is Data Leakage?
The training set becomes contaminated with information from the testing set. This happens because the labels (e.g., from the Triple-Barrier Method) are a function of future data. Shuffling can place a training observation before a testing observation, while its label was determined by information that occurred after that testing observation.
The Consequence: The model is inadvertently trained on information from the future. Its performance in the backtest is artificially inflated because it's being evaluated on data it has already "seen." This leads to catastrophic overfitting and strategies that are guaranteed to fail.
The Solution: Purged and Embargoed K-Fold Cross-Validation
De Prado introduces a purpose-built CV method that respects the temporal nature of financial data and systematically eliminates leakage. It has two key components:
The first step is to clean the training set.
The Idea: Go through the training set and remove ("purge") any observation whose label's evaluation period overlaps with the testing period. The Result: This ensures that the model is not trained on any data that could provide a hint about the testing set's outcomes.
The second step is to prevent leakage from serial correlation (when one observation influences the next). The Idea: Place a small time gap or "embargo" period immediately after the end of the training data. This data is not used for either training or testing.
The Result: This creates a buffer zone, ensuring that the performance on the first few test samples is not contaminated by information from the last few training samples (e.g., due to features with a look-ahead window like moving averages).
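A simplified sketch of a purged and embargoed K-fold splitter, assuming a chronologically sorted `t1` Series that maps each observation's start time to its label's end time (the book's PurgedKFold class is more general):

```python
import numpy as np
import pandas as pd

def purged_kfold_splits(t1: pd.Series, n_splits: int = 5, embargo_pct: float = 0.01):
    """Yield (train_idx, test_idx) pairs with purging and an embargo applied."""
    indices = np.arange(len(t1))
    embargo = int(len(t1) * embargo_pct)
    for test_idx in np.array_split(indices, n_splits):
        test_start = t1.index[test_idx[0]]
        test_end = t1.iloc[test_idx].max()              # latest label end in the test block
        train_mask = np.ones(len(t1), dtype=bool)
        train_mask[test_idx] = False
        # Purge: drop training labels whose evaluation window overlaps the test span
        overlaps = (t1.index <= test_end) & (t1.values >= test_start)
        train_mask &= ~np.asarray(overlaps)
        # Embargo: drop a buffer of observations immediately after the test block
        train_mask[test_idx[-1] + 1: min(test_idx[-1] + 1 + embargo, len(t1))] = False
        yield indices[train_mask], test_idx
```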
MDA vs SFI: Which is more expensive?
SFI becomes more computationally expensive than MDA as the number of features (N) becomes large.
While MDA has a high, fixed upfront cost (training the main model), SFI's cost scales directly with the number of features. In modern financial ML where datasets can have hundreds or thousands of potential features, the requirement to train a separate model for each one makes SFI the more resource-intensive method in terms of total CPU time. However, because SFI is perfectly parallelizable, its wall-clock time can be drastically reduced if you have a multi-core machine. Even so, for very large feature sets, MDA is generally the more computationally tractable approach.
This chapter tackles one of the final and most dangerous sources of backtest overfitting: hyper-parameter tuning. Choosing the right hyper-parameters (e.g., the number of trees in a Random Forest, the learning rate in a GBM) is critical for model performance.
The go-to method for hyper-parameter tuning in many ML libraries is Grid Search with K-Fold Cross-Validation (GridSearchCV). This approach is doubly flawed in finance.
Combinatorial Explosion (The "Curse of Dimensionality"):
Grid search is a brute-force method that exhaustively tests every possible combination of parameters. With more than a few parameters, the number of models to train becomes computationally astronomical, making it impractical.
Data Leakage (The Fatal Flaw):
Standard GridSearchCV uses standard K-Fold CV, which, as we learned in Chapter 7, is completely inappropriate for financial data. It leaks information from the future into the past, causing the search to select hyper-parameters that are not genuinely robust but simply overfit to the test sets used during cross-validation. This is a primary cause of strategies failing in the real world.
De Prado advocates for a two-part solution that is both more efficient and, crucially, more robust.
1. The Search Method: From Brute-Force to Intelligent Search. Instead of an exhaustive grid search, use a randomized approach.
- Randomized Search (RandomizedSearchCV):
  - Instead of testing every combination, this method randomly samples a fixed number of parameter combinations from the specified distributions.
  - It is far more efficient and often finds equally good (or better) parameters than grid search in a fraction of the time.
- The Coarse-to-Fine Workflow (Recommended):
  - Random Search: Begin with a randomized search across a wide range of parameter values.
  - Analyze: Identify the "promising regions" where the best-performing parameters were found.
  - Grid Search: Perform a much smaller, focused grid search only within those promising regions to fine-tune the final selection.
2. The Validation Method: The Foundation of Reliability
This is the most critical part. The search method (random or grid) must be combined with the robust cross-validation technique from Chapter 7.
- Use Purged K-Fold Cross-Validation:
- When performing the randomized or grid search, you must use a Purged K-Fold CV object as the cross-validation splitter.
- This ensures that every evaluation performed during the hyper-parameter search is free from data leakage. Each fold is properly purged and embargoed.
The Final Workflow
Hyper-Parameter Tuning = (Random Search + Focused Grid Search) + Purged K-Fold CV
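A sketch of how the pieces fit together, using scikit-learn's RandomizedSearchCV with a purged splitter passed in as precomputed (train, test) index pairs; the parameter ranges are arbitrary examples:

```python
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

def tune_with_purged_cv(X, y, purged_cv, n_iter=25):
    """Randomized search whose folds come from a purged/embargoed splitter,
    e.g. a list of (train, test) index pairs like the Chapter 7 sketch."""
    search = RandomizedSearchCV(
        estimator=RandomForestClassifier(),
        param_distributions={
            'n_estimators': randint(100, 1000),
            'max_depth': randint(2, 10),
            'max_features': uniform(0.1, 0.9),   # fraction of features per split
        },
        n_iter=n_iter,               # number of sampled combinations
        cv=purged_cv,                # leakage-free folds instead of plain KFold
        scoring='neg_log_loss',
    )
    return search.fit(X, y)
```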
"How much should I bet?" Getting the direction right is only half the battle. A model that makes many correct but low-conviction predictions can still lose money if it bets too much on the wrong trades. The chapter argues that the size of our position should be dynamic and directly proportional to the model's confidence in its prediction.
The central principle remains the same: the size of our position should be a direct function of the model's confidence. High-confidence predictions should lead to larger bets, while low-confidence predictions should lead to smaller bets or no bet at all. This is where the Meta-Labeling technique from Chapter 3 is invaluable, as its purpose is to generate an accurate probability (p) of a strategy's success.
De Prado's Sizing Function: The S-Shaped Curve
Instead of a simple linear mapping, De Prado proposes using a function that generates an S-shaped (sigmoid) curve. This is a much safer and more realistic approach.
The Function: The probability p is first converted into a standardized variable z, and then plugged into the Cumulative Distribution Function (CDF) of the standard Normal distribution.
1. z = (p - 0.5) / sqrt(p * (1-p))
2. Bet Size = 2 * N(z) - 1, where N() is the Normal CDF.
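The same mapping as a small helper (assumes 0 < p < 1):

```python
import numpy as np
from scipy.stats import norm

def bet_size(p: np.ndarray) -> np.ndarray:
    """Map a predicted success probability p into a bet size in [-1, 1]
    using the S-shaped rule above."""
    z = (p - 0.5) / np.sqrt(p * (1.0 - p))   # standardized conviction
    return 2.0 * norm.cdf(z) - 1.0

# p = 0.50 -> 0.00, p = 0.55 -> ~0.08, p = 0.70 -> ~0.34, p = 0.95 -> ~0.96
```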
Why an S-Curve is Superior:
- It's Conservative: For probabilities close to 0.5 (low conviction), the bet size remains very small. It doesn't increase linearly.
- It Ramps Up for High Conviction: The bet size only grows substantially when the model's probability p moves significantly away from 0.5 towards 0 or 1.
- It Prevents Over-Betting: This non-linear mapping prevents the model from taking excessive risk on marginal signals, which is a major source of losses in strategies that use linear sizing.
This chapter is an easier read, yet it is very informative about common pitfalls. The figure below represents the seven sins, but giving the whole chapter a read is still helpful and practical when backtesting ML models.
"Seven Sins of Quantitative Investing" (Luo et al. [2014])
The standard industry approach to backtesting is "walk-forward," where a model is trained on a period of data and tested on the subsequent period, rolling this window through time.
The Flaw (Path-Dependence): This method evaluates the strategy on only one single path through history: the chronological one. It also does not utilize the data to its full potential. A strategy's entire backtest performance can be made or broken by a single lucky or unlucky period (like a crisis). It doesn't tell you if the strategy's logic is fundamentally sound, only how it performed on that one specific sequence of events. This makes it a poor estimator of future performance.
The Solution: Backtesting as a Cross-Validation Problem
De Prado's solution is to re-frame backtesting not as a single simulation, but as a cross-validation exercise. The goal is to test the strategy's logic across many different historical scenarios, not just one.
The Main Tool: Combinatorial Purged Cross-Validation (CPCV)
This is the chapter's core technique for implementing a robust backtest.
- The Idea: Instead of one chronological path, we create many different historical paths by combining different data segments for training and testing.
- The Workflow:
- Split the Data: Divide the entire dataset into N distinct, non-overlapping groups (e.g., 6 groups of 4 months each for a 2-year backtest).
- Form All Combinations: Create all possible train/test splits by taking every combination of k groups for training. For example, from the 6 groups, you would test all 15 combinations of 4 groups for training and 2 for testing.
- Run a Backtest on Each Path: For each of these combinatorial paths, run a full backtest. Crucially, each test split within a path must use the purging and embargoing techniques from Chapter 7 to prevent data leakage.
- Aggregate the Results: The final output is not a single performance metric (like one Sharpe Ratio), but a distribution of performance metrics from all the tested paths.
- Path-Independence: By averaging performance across many paths, the final result is not dependent on the luck of one specific historical sequence. It measures the robustness of the strategy's underlying logic.
- Provides a Distribution of Outcomes: You get a full distribution of Sharpe Ratios. This allows you to assess the strategy's risk profile, such as the probability of failure (P[SR < 0]) and the stability of its performance.
- More Reliable Estimate: The average performance across all combinatorial paths is a much more reliable and unbiased estimator of the strategy's true out-of-sample performance than a single walk-forward test.
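A sketch of how the combinatorial splits can be enumerated (purging and embargoing from Chapter 7 still have to be applied inside each split):

```python
from itertools import combinations

import numpy as np

def cpcv_splits(n_obs: int, n_groups: int = 6, n_test_groups: int = 2):
    """Enumerate the train/test group combinations used by CPCV."""
    groups = np.array_split(np.arange(n_obs), n_groups)
    for test_groups in combinations(range(n_groups), n_test_groups):
        test_idx = np.concatenate([groups[g] for g in test_groups])
        train_idx = np.concatenate([groups[g] for g in range(n_groups)
                                    if g not in test_groups])
        yield train_idx, test_idx

# With 6 groups and 2 held out per split there are C(6, 2) = 15 splits, which
# recombine into 5 distinct backtest paths (each group is tested exactly 5 times).
```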
This chapter addresses a fundamental limitation of all backtesting: we only have one realization of history. A strategy might look great on the single historical path we have, but what if that path was unusually kind? This chapter provides the ultimate stress test by generating thousands of alternative, plausible historical paths—synthetic datasets—and backtesting our strategy on all of them.
To create synthetic data that aligns with the properties of real data, Section 13.4 introduces an Ornstein-Uhlenbeck (O-U) based framework for generating synthetic prices.
Afterwards, Section 13.5 builds the algorithm on top of it via the five-step model below:
Step 1: Model the Price Dynamics
What it does: The algorithm first analyzes the historical price data to understand its mean-reverting behavior. It fits a statistical model (an Ornstein-Uhlenbeck process) to the data. The Output: This yields two key parameters: φ (the speed of mean-reversion) and σ (the volatility of the process). These two numbers effectively become the "DNA" for the price behavior we want to simulate.
Step 2: Define a Grid of Potential Trading Rules
What it does: A comprehensive grid (or "mesh") of potential trading rules is created. Each rule is a pair consisting of a stop-loss level and a profit-taking level, both defined in terms of the volatility σ calculated in Step 1. Example: It might create a 20x20 grid, testing stop-losses from -0.5σ to -10σ against profit-takes from +0.5σ to +10σ, resulting in 400 unique rule combinations.
Step 3: Generate Thousands of Synthetic Price Paths
What it does: Using the mean-reversion speed (φ) and volatility (σ) from Step 1, the algorithm generates a large number of new, synthetic price paths (e.g., 100,000). The Key: Each path starts from the observed initial conditions of a real trading opportunity and simulates what could have happened next, according to the statistical properties of the model. A maximum holding period (a vertical barrier) is also imposed.
Step 4: Run a Massive Backtesting Experiment
What it does: This is the heart of the process. The algorithm takes every single trading rule from the grid in Step 2 and backtests it against every single one of the 100,000 synthetic paths from Step 3. The Output: This doesn't produce one result, but a distribution of outcomes (e.g., 100,000 Sharpe Ratios) for each of the 400 trading rules.
Step 5: Determine the Optimal Trading Rules
What it does: The algorithm analyzes the massive set of results from Step 4 to find the best-performing rule. This can be done in three ways:
- 5a (Unconstrained): Find the single best-performing stop-loss/profit-take pair from the entire grid.
- 5b (Constrained Profit-Take): If your strategy already has a fixed profit target, use the results to find the optimal stop-loss that should accompany it.
- 5c (Constrained Stop-Loss): If your fund has a mandatory maximum stop-loss, use the results to find the optimal profit-taking level to maximize returns for that given level of risk.
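A sketch of Step 3 under one common discretization of the O-U process, P_t = (1 - φ)·E[P] + φ·P_{t-1} + σ·ε_t (parameter names follow the summary above; the exact parameterization in the book may differ slightly):

```python
import numpy as np

def simulate_ou_paths(p0: float, mean: float, phi: float, sigma: float,
                      n_steps: int, n_paths: int, seed: int = 0) -> np.ndarray:
    """Simulate discrete Ornstein-Uhlenbeck price paths starting from p0."""
    rng = np.random.default_rng(seed)
    paths = np.empty((n_paths, n_steps + 1))
    paths[:, 0] = p0
    for t in range(1, n_steps + 1):
        noise = rng.standard_normal(n_paths)
        paths[:, t] = (1 - phi) * mean + phi * paths[:, t - 1] + sigma * noise
    return paths

# Each synthetic path can then be run against a grid of (stop-loss, profit-take)
# pairs expressed in multiples of sigma, yielding a Sharpe-ratio distribution per rule.
```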
This chapter introduces the methods and indicators used in the industry for evaluating a strategy's performance. It also builds on top of the standard Sharpe Ratio by introducing more robust performance evaluation metrics such as the Probabilistic Sharpe Ratio (PSR) and the Deflated Sharpe Ratio (DSR).
This section provides metrics that describe the fundamental operational nature, style, and potential biases of the strategy.
- Time range: The start and end dates of the backtest. A longer period covering multiple market regimes is essential for robustness.
- Average AUM: The average dollar value of Assets Under Management.
- Capacity: The highest AUM the strategy can manage before performance degrades due to transaction costs and market impact.
- Leverage: The amount of borrowing used, measured as the ratio of the average dollar position size to the average AUM.
- Maximum dollar position size: The largest single position taken during the backtest. A value close to the average AUM is preferred, as it indicates the strategy doesn't rely on rare, extreme bets.
- Ratio of longs: The proportion of bets that were long positions. For a market-neutral strategy, this should be close to 0.5. A significant deviation indicates a potential directional bias.
- Frequency of bets: The number of bets per year. A "bet" is a complete cycle from a flat position to another flat position or a flip, not to be confused with the number of trades.
- Average holding period: The average number of days a bet is held.
- Annualized turnover: The ratio of the average dollar amount traded per year to the average annual AUM. This measures how actively the portfolio is managed.
- Correlation to underlying: The correlation between the strategy's returns and the returns of its investment universe. A low correlation is desired to prove the strategy is generating unique alpha.
This section lists the raw, unadjusted metrics that describe the strategy's absolute profitability before risk adjustments.
- PnL: The total profit and loss in dollars (or the currency of denomination) over the entire backtest, including costs.
- PnL from long positions: The portion of the total PnL generated exclusively by long positions. This is useful for assessing directional bias in long-short strategies.
- Annualized rate of return: The time-weighted average annual rate of total return, as calculated by the TWRR method to correctly account for cash flows.
- Hit ratio: The fraction of bets that resulted in a positive PnL.
- Average return from hits: The average PnL for all profitable bets.
- Average return from misses: The average PnL for all losing bets.
This category assesses the path-dependency and risk profile of the returns, which is crucial for non-IID financial series.
- Drawdown (DD): The maximum loss from a portfolio's peak value (high-water mark).
- Time Under Water (TuW): The longest time the strategy spent below a previous high-water mark.
- Returns Concentration (HHI): A key metric inspired by the Herfindahl-Hirschman Index. It measures whether PnL comes from a few massive wins (risky) or is distributed evenly across many small wins (robust). This is measured for both positive and negative returns, as well as concentration in time (e.g., all profits came in one month).
This category grounds the backtest in reality by analyzing its sensitivity to real-world trading costs.
- Costs vs. PnL: Measures performance relative to trading costs (brokerage fees, slippage).
- Return on Execution Costs: A crucial ratio that shows how many dollars of profit are generated for every dollar spent on execution. A high multiple is needed to ensure the strategy can survive worse-than-expected trading conditions.
This is where the chapter introduces its most powerful statistical tools for judging performance after accounting for risk and selection bias.
- Sharpe Ratio (SR): The standard but flawed metric.
- Probabilistic Sharpe Ratio (PSR): A superior metric that adjusts the SR for non-Normal returns (skewness, fat tails). It estimates the probability that the true SR is positive.
- Deflated Sharpe Ratio (DSR): It's a PSR that also corrects for selection bias by penalizing the result based on the number of strategies tried. It answers the question: "What is the probability this result is a fluke?"
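A sketch of the PSR computation following the Bailey and López de Prado formula, which adjusts the observed SR for sample length, skewness and kurtosis:

```python
import numpy as np
from scipy.stats import norm, skew, kurtosis

def probabilistic_sharpe_ratio(returns: np.ndarray, sr_benchmark: float = 0.0) -> float:
    """PSR: probability that the true Sharpe ratio exceeds `sr_benchmark`."""
    sr = returns.mean() / returns.std(ddof=1)
    t = len(returns)
    g3 = skew(returns)
    g4 = kurtosis(returns, fisher=False)          # non-excess kurtosis
    denom = np.sqrt(1 - g3 * sr + (g4 - 1) / 4.0 * sr ** 2)
    return float(norm.cdf((sr - sr_benchmark) * np.sqrt(t - 1) / denom))
```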
These metrics are specifically for evaluating the performance of the meta-labeling model from Chapter 3.
- Accuracy, Precision, Recall: Standard classification metrics.
- F1-Score: The harmonic mean of precision and recall, which is a much better metric than accuracy when dealing with the imbalanced datasets typical in finance (i.e., many more "pass" signals than "bet" signals).
This category seeks to understand where the PnL comes from by decomposing performance across different risk factors (e.g., duration, credit, sector, currency). This helps identify the true source of a portfolio manager's skill.
This is a lighter chapter whose core objective is quantifying strategy risk. It models a strategy as a series of binomial bets (profit or loss outcomes) to understand how sensitive its success is to its core parameters: betting frequency, precision, and the size of its wins and losses. The analysis progresses from a simplified model to a more realistic one.
The model assumes a strategy consists of a series of independent bets where:
- There are n bets per year (frequency).
- Each bet has a probability p of winning (precision).
- The payouts are symmetric: a win yields a profit of +π and a loss yields an identical loss of -π.
Under these assumptions, the chapter derives the formula for the annualized Sharpe Ratio (θ). This derivation leads to a critical insight:
In the symmetric case, the payout size π cancels out of the Sharpe Ratio formula. The strategy's risk-adjusted performance depends only on its precision (p) and frequency (n).
This simplified model shows that a strategy's success is a function of its statistical properties, not necessarily the size of its individual bets.
This section lifts the first model's key constraint that payouts are identical, building a more realistic and powerful framework for evaluating real-world strategies.
The model allows for asymmetric payouts:
- A winning bet yields a profit of π+.
- A losing bet results in a loss of π-.
π+ does not have to be equal to |π-|.
With this model, the Sharpe Ratio (θ) is now a function of all four parameters: precision (p), frequency (n), profit target (π+), and stop-loss (π-). Payouts no longer cancel out. The Sharpe Ratio formula becomes:
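Up to notation, the derivation gives:
θ = [ (π+ - π-) * p + π- ] / [ (π+ - π-) * sqrt(p * (1 - p)) ] * sqrt(n)
Setting π+ = -π- makes the payout cancel again, recovering the symmetric result above.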
From there, the chapter shows visuals of the strategy results for varying combinations of these parameters.
This chapter introduces CUSUM and explosiveness tests to identify structural breaks.
CUSUM tests: The CUSUM (Cumulative Sum) filter was introduced in Chapter 2 to sample bars whenever some variable, such as cumulative prediction errors, exceeded a predefined threshold. That concept is extended here to test for structural breaks.
Explosiveness tests: Beyond deviation from white noise, these test whether the process exhibits exponential growth or collapse, as this is inconsistent with a random walk or stationary process, and it is unsustainable in the long run.
| Test Name | Core Idea | How it Works | Key Advantages | Key Drawbacks / Limitations |
|---|---|---|---|---|
| Brown-Durbin-Evans CUSUM | Detects breaks by testing if the cumulative sum of recursive forecasting errors deviates from a baseline of zero. | Uses Recursive Least Squares (RLS) to get 1-step-ahead prediction errors based on a feature set x_t. The cumulative sum of these standardized errors (St) is tested for statistical significance. | Incorporates the predictive power of external features (x_t) into the break detection process. | The results can be sensitive and inconsistent due to the arbitrary choice of the regression's starting point. |
| Chu-Stinchcombe-White CUSUM | A simplified CUSUM test that works directly on a price series by assuming a "no change" forecast and detecting deviations from a reference point. | Computes the cumulative standardized deviation of the current price from a past reference price (yn). A significant deviation implies a break. | Computationally much simpler than the Brown-Durbin-Evans test as it does not require external features or recursive regressions. | Also suffers from the arbitrary choice of a reference level (yn), which can affect the results. |
| Chow-Type Dickey-Fuller | A basic test designed to detect a single switch from a random walk to an explosive process at a known date. | It fits an autoregressive model using a dummy variable D_t that "activates" an explosive term after a pre-specified break date τ*. | Conceptually simple and easy to implement. | Highly impractical for finance as it requires knowing the break date τ* in advance and assumes only one break occurs. |
| Supremum ADF (SADF) | The chapter's flagship method for detecting periodically collapsing bubbles without prior knowledge of the number or timing of the breaks. | Uses a double-loop algorithm. The outer loop advances the window's endpoint t. The inner loop runs ADF tests on all backward-expanding windows [t0, t]. The SADF statistic is the supremum (maximum) ADF value found in the inner loop. | Highly effective at detecting multiple, overlapping bubbles and their subsequent collapses. Does not require any prior assumptions about break dates. | Extremely computationally expensive (O(T^2)). The supremum statistic is very sensitive to single outliers, which can make it noisy. |
| Quantile ADF (QADF) | A robust enhancement to the SADF test that is less sensitive to single outliers. | Instead of taking the absolute supremum (maximum) ADF statistic from the inner loop, it uses a high quantile (e.g., the 95th percentile) of the distribution of ADF statistics. | Provides a more stable and robust measure of "market explosiveness" compared to the standard SADF test. | Slightly more complex to calculate and requires choosing a quantile level (q). |
| Conditional ADF (CADF) | A further enhancement to SADF that measures the central tendency of the right tail of the ADF distribution, making it even more robust. | It calculates the average of all ADF statistics from the inner loop that fall above a certain high quantile (e.g., above the 95th percentile). | Even more robust to extreme outliers than QADF because it averages the tail rather than picking a single point from the distribution. | Adds another layer of computational complexity. |
| Sub/Super Martingale Tests | A family of alternative explosiveness tests that use different functional forms (not the ADF's autoregressive model) to detect bubbles. | Fits polynomial, exponential, or power-law trends to the data within the same double-loop framework as SADF. Includes a penalty term (t-t0)^φ to adjust the test's sensitivity to long-run vs. short-run bubbles. | Offers greater flexibility by not being tied to ADF's specific model assumptions. The φ parameter allows the test to be tuned for specific investment horizons. | Requires choosing an appropriate functional form (e.g., polynomial, exponential) and tuning the horizon parameter φ. |
This chapter also notes that some studies in the literature run structural-break tests on raw prices. However, log prices are preferable due to their properties:
They Ensure Time-Symmetric Returns: The magnitude of a log return is the same whether a price goes up or down. Simple percentages are not symmetric because the base of the calculation changes.
Example: A move from $10 to $15 is a +50% simple return. The reverse move from $15 to $10 is a -33.3% simple return. With log returns, the move up is ln(15/10) ≈ +0.405, and the move down is ln(10/15) ≈ -0.405. The magnitude is identical.
They Make Returns Additive Over Time: Simple returns are multiplicative, which is mathematically inconvenient. Log returns are additive, making them much easier to aggregate and analyze over time.
Example: A stock goes from $10 -> $15 -> $18. The simple returns are +50% and +20%. The total return is +80%, which is not 50%+20%.
The log returns are ln(1.5)≈0.405 and ln(1.2)≈0.182. The total log return is ln(1.8)≈0.587, which is the sum of the individual log returns.
Statistical Validity (Homoscedasticity): Log prices lead to a statistically valid model where return volatility is assumed to be constant, avoiding the unrealistic assumptions and errors (heteroscedasticity) that arise when testing raw prices. This is especially critical for the ADF/SADF tests.
Example:
With raw prices:
A stock goes from $10 -> $10.50. The return is $0.50.
The same stock a year later goes from $100 -> $105. The return is $5.
Although the percentage move is the same 5%, the returns in dollar terms have changed, which would break the model.
With logs:
$10 -> $10.50: return = log(10.5) - log(10) = log(10.5/10) ≈ 0.0488
$100 -> $105: return = log(105) - log(100) = log(105/100) ≈ 0.0488
The two log returns are identical, so the variance of the series stays stable over time.
This chapter introduces entropy as a way to measure the amount of information or, conversely, the degree of predictability contained within a financial price series.
In perfect markets, prices are unpredictable because they instantly incorporate all available information. However, real markets are imperfect, containing informational asymmetries that create predictable patterns. The goal is to quantify this "informational content" to create features for machine learning models, which might learn, for example, that momentum strategies work best in low-information (predictable) environments and mean-reversion in high-information (random) ones.
Measures the average uncertainty of a data source. A high entropy value signifies a complex, unpredictable series (like a fair coin toss), while low entropy signifies a simple, patterned, and redundant series (like a loaded die). The key intuition is that we learn the most from unexpected events.
To apply this to a financial series, we need a way to estimate entropy. The chapter highlights two methods:
- Plug-in Estimator: A straightforward method that calculates entropy based on the observed frequencies of different patterns in the data.
- Lempel-Ziv (LZ) Estimator: A more powerful approach that measures a series's complexity by its compressibility. The intuition is that a highly random, high-entropy series is difficult to compress, whereas a patterned, low-entropy series is easily compressible. The Kontoyiannis algorithm is a key implementation of this idea.
Please see the exercises for applications of these methods.
Since entropy estimators work on discrete symbols (like letters in a message), a continuous price series must first be converted into a string. This crucial step is called encoding.
- Binary Encoding: Assigning '1' for a positive return and '0' for a negative one.
- Quantile or Sigma Encoding: Discretizing returns into several bins, either by ensuring each bin has the same number of observations (quantile) or by making each bin cover the same range of return values (sigma).
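A sketch of binary encoding followed by a plug-in entropy estimate (the word length and names are arbitrary choices):

```python
import numpy as np
import pandas as pd
from collections import Counter

def plug_in_entropy(message: str, word_len: int = 2) -> float:
    """Plug-in (maximum-likelihood) entropy estimate, in bits per symbol,
    based on the observed frequency of every `word_len`-character pattern."""
    words = [message[i:i + word_len] for i in range(len(message) - word_len + 1)]
    probs = np.array(list(Counter(words).values()), dtype=float) / len(words)
    return float(-(probs * np.log2(probs)).sum() / word_len)

def binary_encode(returns: pd.Series) -> str:
    """Binary encoding: '1' for a positive return, '0' otherwise."""
    return ''.join('1' if r > 0 else '0' for r in returns)

# usage (hypothetical): h = plug_in_entropy(binary_encode(close.pct_change().dropna()))
# values near 1 bit/symbol indicate a hard-to-predict (efficient) series
```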
With these tools, entropy can be applied to finance in several ways:
- Market Efficiency: High entropy suggests an efficient, unpredictable market. Persistently low entropy can signal an inefficient market with redundant patterns, which may be a precursor to a bubble.
- Portfolio Concentration: Entropy can measure the effective diversification of a portfolio's risk, offering a more nuanced view than simply counting assets.
- Market Microstructure (Adverse Selection): By calculating the entropy of the order flow imbalance (the sequence of net buying vs. selling pressure), we can estimate the risk of adverse selection. A predictable, low-entropy order flow is less risky for market makers, while a complex, high-entropy flow suggests the presence of informed traders, providing a powerful feature for predicting toxic market conditions.
This chapter offers a set of practical features, some of which can be calculated directly from order-book-level data and others from higher-level bar data.
For the application of these features and a deeper dive into the key ideas, see the exercises, where a link to the Gemini chat is available.
| Concept / Feature | What It Measures | In Simple Terms | Key Insight for a Trader |
|---|---|---|---|
| Tick Rule (b_t) | The inferred aggressor side of a trade (buy or sell). | A simple rule to guess if a trade was a buy (+1) or a sell (-1) based on price change. | Provides the foundational "signed trade" data needed for almost all other microstructure analysis. |
| Order Flow Imbalance (OFI) | The net buying or selling pressure, measured in volume. | The running score of "Team Buy" vs. "Team Sell." OFI = Σ (sign * volume). | A powerful short-term momentum indicator. A rising OFI suggests prices will likely continue to rise. |
| Kyle's Lambda (λ_K) | The market's price response to order flow imbalance. | The poker dealer's "skepticism level." How much the price moves per dollar of net order flow. | Measures market impact. Helps estimate the cost of executing a large order, and a high value indicates informed trading. |
| Amihud's Lambda (λ_A) | The market's illiquidity, based on total dollar volume. | How "shallow the puddle is." The absolute price change per dollar of total volume traded. | A simple, robust measure of overall market liquidity. A high value means the market is illiquid and fragile. |
| Roll Model (c) | The effective bid-ask spread, inferred from price bounces. | Measures the size of the "hop" a price makes from the bid to the ask. | Useful for estimating transaction costs in markets where explicit bid/ask data is unavailable (e.g., bonds, illiquid stocks). |
| High-Low Volatility (σ_hl) | The price range volatility within a bar. | How wild the price movement was during the formation of a single bar. | A more efficient and information-rich measure of volatility than close-to-close, as it uses more data points. |
| VPIN | The probability of informed trading based on volume imbalance. | A "toxicity meter." A high VPIN suggests order flow is dangerously one-sided, signaling a potential liquidity crisis. | An early-warning system for flash crashes and extreme market instability. |
| Order Size Moments | The statistical properties of trade sizes within a bar. | "Who is trading?" Are they small retail traders or large institutions? | High skew or kurtosis can reveal the presence of "stealth" algorithms hiding large orders among small ones. |
| Volume Concentration (HHI) | The "lumpiness" of trading volume within a bar. | "Was the volume from one giant trade or a thousand tiny ones?" | Measures trade fragmentation. A rising concentration can signal that large players are becoming more aggressive. |
| Signed Volume Autocorrelation | The persistence of buy/sell activity over consecutive trades. | "If I just saw an apple, what's the chance the next item is also an apple?" | The "order splitting" detector. A high positive value is a strong signature of a large institution working a single, massive order. |
| Volatility Ratio (C2C / HL) | The character of volatility: jumps vs. ranges. | "Is the market trending in big jumps or just churning in place?" | A regime filter. A high ratio (>1) signals a trending, "gappy" market. A low ratio (<1) signals a range-bound, mean-reverting market. |
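As an example of how the first rows of the table translate into code, a sketch of the tick rule and signed volume (column names are hypothetical):

```python
import numpy as np
import pandas as pd

def tick_rule(prices: pd.Series) -> pd.Series:
    """Tick rule: +1 if the price ticked up, -1 if down; if unchanged,
    carry forward the previous sign."""
    direction = np.sign(prices.diff())
    direction = direction.replace(0, np.nan).ffill().fillna(1)
    return direction.astype(int)

# Order flow imbalance per bar (assuming a `trades` frame with 'price'/'volume'
# and a `bar_id` grouper as in the Chapter 2 sketch):
# signed_volume = tick_rule(trades['price']) * trades['volume']
# ofi = signed_volume.groupby(bar_id).sum()
```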
This chapter explains why and how to use parallel computing techniques—specifically vectorization and multiprocessing—to handle the computationally intensive nature of machine learning algorithms. It provides a blueprint for building a robust and reusable multiprocessing engine in Python.
ML algorithms are computationally demanding. To be efficient, they must leverage all available CPU cores, whether on a single machine or a distributed cluster. The chapter introduces mpPandasObj, a function used throughout the book, as the primary tool for this.
Vectorization is the simplest form of parallelization, where operations are applied to entire arrays at once instead of using slow, explicit for loops. The chapter shows how to replace a nested loop for a Cartesian product with a more efficient, scalable, and dimension-agnostic solution using Python's itertools.
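A sketch of that idea: building a Cartesian product of parameter values with itertools instead of nested loops (the parameter grid is a made-up example):

```python
import itertools
import pandas as pd

# itertools.product replaces one explicit for-loop per parameter and scales
# to any number of dimensions without changing the code.
params = {'stop_loss': [5, 10, 20], 'profit_take': [5, 10, 20], 'horizon': [1, 5]}
grid = pd.DataFrame(list(itertools.product(*params.values())), columns=list(params))
print(len(grid))   # 3 * 3 * 2 = 18 combinations, with no nested loops
```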
In Python, the Global Interpreter Lock (GIL) limits true parallelism for CPU-bound tasks in multithreading by allowing only one thread to execute at a time per processor. Therefore, multiprocessing, which uses separate processes with their own memory space, is the standard way to achieve parallelism and fully utilize multiple CPU cores.
Atoms: The smallest, indivisible computational tasks. Molecules: A group of atoms assigned to a single processor to be executed sequentially. The goal of parallelization is to partition all atoms into molecules in a way that balances the workload across all available processors, ensuring no single processor becomes a bottleneck.
- linParts (Linear Partitioning): The simplest method, which divides the list of atoms into equally sized molecules. This is suitable when all atomic tasks have similar complexity.
- nestedParts (Nested-Loop Partitioning): A more advanced method for tasks with varying complexity, such as in a nested loop where later iterations do more work (e.g., calculating a lower-triangular matrix). This algorithm creates molecules with a similar total computational load, even if they contain a different number of atoms, leading to much more efficient parallelization.
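A sketch of linear partitioning in the spirit of the book's linParts (it returns the molecule boundaries as indices):

```python
import numpy as np

def lin_parts(n_atoms: int, n_threads: int) -> np.ndarray:
    """Split `n_atoms` tasks into up to `n_threads` roughly equal molecules,
    returning the partition boundaries."""
    parts = np.linspace(0, n_atoms, min(n_threads, n_atoms) + 1)
    return np.ceil(parts).astype(int)

# lin_parts(10, 3) -> array([ 0,  4,  7, 10]): molecules cover atoms 1-4, 5-7, 8-10
```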
The chapter details the construction of a generic engine to parallelize any function.
- Job Preparation (mpPandasObj): This function takes a user function, a list of "atoms" (e.g., a list of dates or tickers), and partitions them into molecules using linParts or nestedParts. It then creates a list of "jobs," where each job is a dictionary containing the function to call and the specific molecule (subset of atoms) to process.
- Asynchronous Execution (processJobs): It uses Python's multiprocessing library, specifically pool.imap_unordered, to distribute these jobs to a pool of worker processes. This method is asynchronous, meaning results are processed as they are completed, which is more efficient.
- Core Mechanics (expandCall, Pickling): The engine uses a helper function (expandCall) to unpack the job dictionary and call the user function with the correct arguments. It also includes a necessary workaround to handle the "pickling" of methods, which is required for sending tasks to other processes.
A crucial enhancement for memory-intensive tasks is to process results as they become available, rather than collecting all outputs in a list and combining them at the end. The processJobsRedux function is introduced, which can apply a "redux" function (e.g., pd.DataFrame.add, list.append) to aggregate results on the fly, significantly reducing memory consumption and preventing crashes with large outputs.
The chapter concludes with a real-world example of calculating principal components on a dataset too large to fit in memory. By breaking the data into column-based files ("molecules") and processing them in parallel while aggregating the results on the fly, the engine solves a problem that is impossible on a single thread due to both time and memory constraints. This highlights that multiprocessing is essential not just for speed, but also for scalability and memory management.
This chapter explores how to solve complex financial problems that are "intractable" for classical computers by reformulating them for quantum computers.
Many financial optimization problems, such as dynamic portfolio allocation with realistic transaction costs, are combinatorial and NP-hard. This means finding the optimal solution with a standard computer would require a brute-force search that is computationally infeasible.
Quantum computers, using qubits and the principle of superposition, can evaluate a vast number of potential solutions simultaneously. This makes them uniquely suited for solving the kind of brute-force optimization problems that overwhelm classical computers.
The chapter presents a strategy to make a problem "quantum-ready":
- Discretize the Problem: Convert a continuous problem (like portfolio weights) into a discrete integer problem. This is done by dividing capital into a set number of units (K) to be allocated among assets (N).
- Generate All Possibilities: Frame the problem as finding the best combination among all possible allocation trajectories. This involves generating every feasible portfolio at each time step and then every possible path (trajectory) through time.
- Formulate for Brute Force: This turns the problem into an exhaustive search, where the goal is to evaluate every single trajectory to find the one with the best Sharpe Ratio.
While this brute-force approach is impossibly slow on a sequential, classical computer, it creates a structure that a quantum computer can solve efficiently. The chapter demonstrates that even the most complex financial ML problems can be tackled by translating them into a format solvable by the next generation of computing hardware.
"High-Performance Computational Intelligence and Forecasting Technologies," written by Kesheng Wu and Horst D. Simon from the Lawrence Berkeley National Laboratory (LBNL). The chapter details the work of their CIFT (Computational Intelligence and Forecasting Technologies) project.
This chapter is useful for researchers and engineers who work with massive, time-sensitive datasets (i.e., streaming data) and face computational bottlenecks. Its principles are applicable to finance (like high-frequency trading analysis), energy management (power grid stability), manufacturing, and scientific research where near real-time analysis of complex data is critical.
The chapter argues that High-Performance Computing (HPC), traditionally used for large-scale scientific simulations, is superior to standard cloud computing for analyzing complex, high-volume streaming data, especially when low latency is required.
While cloud platforms are designed for high-throughput, parallel tasks, their virtualization layer introduces significant performance overhead, making them slower for time-critical, interdependent calculations. HPC systems avoid this overhead and use specialized hardware and software to minimize latency and maximize performance.
The main application of HPC in the context of finance presented here is a 720-fold speedup in calculating the VPIN (Volume-Synchronized Probability of Informed Trading) early-warning indicator.















