Trading performance measurement. The necessary tools and metrics.
One particular question that arises after applying a trading strategy is: how good is it? Goodness here is reflected through three characteristics:
- Consistent low-volatility returns;
- Stable accuracies through time;
- Economically intuitive, robust features.
We will discuss these three characteristics and present some performance and evaluation metrics. After all, we do not want to get excited about a strategy that outperformed during a certain period by luck; we have to make sure that it remains valid across different periods of time and different market conditions.
Consistent low-volatility returns
A strategy that produces volatile results is not a desirable one. It is preferred that the strategy can handle different market fluctuations and regimes, and not provide a 28% return in one period only to lose 54% in the next. The difficulty lies in assessing how consistent it will be in the future, and the most efficient way to address this is through time series cross-validation and calculating the standard deviation of the results (when dealing with systematic strategies).
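As a minimal sketch of the idea, assuming we already have a list of hypothetical per-period strategy returns (the numbers below are illustrative, not real results), consistency can be summarized by the mean and standard deviation of those returns:

```python
import statistics

# Hypothetical per-period returns of a systematic strategy (illustrative only)
period_returns = [0.021, 0.015, -0.008, 0.019, 0.012, -0.004, 0.017]

mean_return = statistics.mean(period_returns)
return_volatility = statistics.stdev(period_returns)  # sample standard deviation

# A consistent strategy has a volatility that is small relative to its mean return
print(f"mean={mean_return:.4f}, stdev={return_volatility:.4f}")
```

In a proper workflow, each entry would come from one out-of-sample fold of a time series cross-validation rather than from a single back-test.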
Stable accuracies through time
This concept is extremely important as it shows how well the strategy is doing on average. You can have a fantastic accuracy of 64% on one back-test but get 31% on another. With random-like financial time series, we are interested in a stable average accuracy of between 52% and 54%, with a risk-reward ratio of at least around 1.8 to 2.0, as that also gives us a margin for when transaction costs and other fees are processed. Cross-validation can also be performed to measure accuracy each time the model is run.
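A sketch of what "stable accuracy" means in practice, assuming hypothetical out-of-sample accuracies from five cross-validation folds (the numbers are invented for illustration):

```python
import statistics

# Hypothetical out-of-sample accuracies from five cross-validation folds
fold_accuracies = [0.53, 0.52, 0.54, 0.52, 0.53]

print(statistics.mean(fold_accuracies))   # sits in the 52%-54% band
print(statistics.stdev(fold_accuracies))  # a small spread indicates stability
```

A wide spread across folds (say, 64% on one and 31% on another) would be a red flag even if the average looks good.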
Economically intuitive robust features
If you want to predict housing prices, you will intuitively think of area, number of bedrooms, parking spaces, region, etc. It is reasonable to assume that these variables impact housing prices: more bedrooms should increase the price of a house, all else equal. Therefore, when creating a model other than an auto-regressive algorithm, variables should be chosen so as to explain the variation in the dependent variable. Chocolate ice cream sales cannot be used to predict default rates at financial institutions; correlation does not imply causation, even though in the field of data science many ignore this rule. In other words, when developing a fundamental strategy, we have to make sure that the explanatory variables are economically sound.
PERFORMANCE & EFFICIENCY MEASUREMENT
The ultimate goal of speculative trading is alpha generation if it is benchmarked to an index, otherwise a positive net return after transaction costs and various fees, as well as a decent return after tax considerations.
— — Gross vs net return
The first measure is undoubtedly the most common and important one. Gross and net returns are what investors and traders are seeking. A 5% return on a $100,000 investment is a $5,000 profit. Gross return is calculated as the price return from the asset's appreciation or depreciation, while net return is gross return minus transaction costs, taxes, and other fees; it is what is left after you have paid off everything.
Let r0 be the starting price of the asset and rt its final (selling) price. We can equivalently take rt as the current portfolio value and r0 as the starting portfolio value.
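A minimal sketch of the two measures, assuming costs are expressed as a single dollar amount (the function names and the example figures are mine, for illustration):

```python
def gross_return(r0, rt):
    """Price return from starting value r0 to final value rt."""
    return (rt - r0) / r0

def net_return(r0, rt, costs):
    """Gross return minus transaction costs, taxes, and fees (a dollar amount)."""
    return (rt - r0 - costs) / r0

# Illustrative numbers: $100,000 grows to $105,000 with $500 of total costs
print(gross_return(100_000, 105_000))     # 0.05
print(net_return(100_000, 105_000, 500))  # 0.045
```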
— — Sharpe, Sortino, and Calmar ratios
The Sharpe ratio is the most commonly used measure of risk-adjusted returns. It is based on the excess return relative to the risk taken, and it is very useful in comparing different strategies. Of course, the higher the Sharpe ratio the better the strategy, provided adjustments are made to make sure the ratio itself is stable. The formula is:
A Sharpe ratio higher than 1 is desirable, while a Sharpe ratio between zero and one suggests that the risks taken might be excessive for that level of return. A negative Sharpe ratio means that the return generated is lower than the risk-free rate; hence, an investment in a risk-free security would provide better results. As useful as the Sharpe ratio can be, it has its limitations, such as:
- It suffers from time dependency.
- Illiquid holdings tend to bias the ratio to the upside.
- It is not appropriate for non-normal returns, i.e. skewed.
- It does not consider correlations among assets.
The Sortino ratio has the same intuition as the Sharpe ratio and is calculated in the same manner, except that we divide by the standard deviation of negative returns (referred to as downside deviation), as opposed to the Sharpe ratio, which uses the standard deviation of all returns, including positive ones. A Sortino ratio of 2 is considered adequate, and a good rule of thumb is to use the Sharpe ratio when evaluating low-volatility investments and the Sortino ratio for high-volatility investments.
The Calmar ratio, however, takes the return of the portfolio and divides it by the maximum drawdown. Generally, it is used to evaluate the performance of hedge funds and CTAs.
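The three ratios can be sketched as follows. This is a simplified, non-annualized version: the risk-free rate defaults to zero, and the downside deviation here is the sample standard deviation of the negative returns only, which is one common simplification rather than the only definition in use:

```python
import statistics

def sharpe(returns, rf=0.0):
    """Mean excess return over the standard deviation of all returns."""
    excess = [r - rf for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)

def sortino(returns, rf=0.0):
    """Mean excess return over the deviation of negative returns only."""
    excess = [r - rf for r in returns]
    downside = [r for r in excess if r < 0]  # needs at least two losses
    return statistics.mean(excess) / statistics.stdev(downside)

def calmar(annual_return, max_drawdown):
    """Annual return divided by the absolute maximum drawdown."""
    return annual_return / abs(max_drawdown)

returns = [0.02, -0.01, 0.03, -0.02, 0.01]  # illustrative period returns
print(sharpe(returns), sortino(returns), calmar(0.1, -0.05))
```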
— — The information ratio
Before discussing the IR, we need to understand two concepts: active return and active risk. Active return is also called excess return, and it is the return of the portfolio minus that of a benchmark, which is sometimes a risk-free security; we can also use an equity benchmark if we are managing an equity portfolio. Active risk is also called tracking error, and it is the standard deviation of active returns. The information ratio is therefore active return divided by active risk.
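A minimal sketch of the definition above, assuming portfolio and benchmark returns are aligned period by period (the input series are invented for illustration):

```python
import statistics

def information_ratio(portfolio_returns, benchmark_returns):
    """Mean active return over active risk (the tracking error)."""
    active = [p - b for p, b in zip(portfolio_returns, benchmark_returns)]
    return statistics.mean(active) / statistics.stdev(active)

# Illustrative: a portfolio against a flat 1%-per-period benchmark
print(information_ratio([0.02, 0.00, 0.03], [0.01, 0.01, 0.01]))
```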
— — Profit factor
This is a relatively quick and straightforward way to gauge the profitability of a strategy. It is calculated as total gross profit over total gross loss in absolute value; hence, the interpretation of the profit factor (also referred to as the profitability index in the jargon of corporate finance) is how much profit is generated per $1 of loss. The formula for the profit factor is:
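A sketch of the calculation, taking a list of hypothetical per-trade dollar results (positive for wins, negative for losses):

```python
def profit_factor(trade_results):
    """Total gross profit divided by the absolute value of total gross loss."""
    gross_profit = sum(r for r in trade_results if r > 0)
    gross_loss = abs(sum(r for r in trade_results if r < 0))
    return gross_profit / gross_loss

# $300 of profit against $200 of loss: $1.50 earned per $1 lost
print(profit_factor([100, -50, 200, -150]))  # 1.5
```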
STRATEGY AND RISK EVALUATION
Risk exists wherever there is an opportunity and managing it means maximizing your chances for ending up in positive territory. We will now explore risk and various metrics that can offer us more insight into our strategy and help us optimize it.
— — Hit ratio
The hit ratio is extremely easy to use. It is simply the number of winning trades over the number of trades taken in total. For example, if we have 1359 trades over the course of 5 years and we have been profitable in 711 of them, then our hit ratio (accuracy) is 711/1359 = 52.31%.
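The calculation above, as a one-line helper (the function name is mine):

```python
def hit_ratio(winning_trades, total_trades):
    """Share of trades that were profitable."""
    return winning_trades / total_trades

print(hit_ratio(711, 1359))  # the roughly 52.3% from the example above
```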
— — Risk-Reward ratio
The risk-reward ratio (or reward-risk ratio) measures, on average, how much reward you expect for every unit of risk you are willing to take. For example, you want to buy a stock at $100, you have a target at $110, and you place your stop-loss order at $95. What is your risk-reward ratio? Clearly, you are risking $5 to gain $10, and thus 10/5 = 2.0. Your risk-reward ratio is therefore 2. It is generally recommended to always have a ratio higher than 1.0, with 2.0 being optimal. In this case, if you trade equal quantities (sizes) and risk half of what you expect to earn, you only need a hit ratio of 33.33% to break even. A good risk-reward ratio takes the stress out of pursuing a high hit ratio.
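The worked example and the 33.33% breakeven figure can be sketched as follows, assuming a long trade and the usual breakeven relation hit ratio = 1 / (1 + risk-reward), which ignores costs:

```python
def risk_reward(entry, target, stop):
    """Expected reward per unit of risk for a long trade."""
    return (target - entry) / (entry - stop)

def breakeven_hit_ratio(rr):
    """Hit ratio needed to break even, ignoring costs: 1 / (1 + rr)."""
    return 1 / (1 + rr)

rr = risk_reward(100, 110, 95)  # risking $5 to gain $10
print(rr)                       # 2.0
print(breakeven_hit_ratio(rr))  # about 0.3333, the 33.33% above
```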
— — RMSE
The root-mean-square error measures how well the model has performed by calculating the difference between the predictions and the actual values. In other words, it is the standard deviation of the errors (residuals). Heavily used in regression models, it is very intuitive: the bigger the RMSE, the worse the model has performed, because of the large distances between predictions and actual values. Calculating it is a simple three-step process: find the residuals (forecasts minus actual values) and square them, calculate their mean, and take its square root, just like in the process of calculating a standard deviation.
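The three steps above can be sketched directly (the input vectors are invented for illustration):

```python
import math

def rmse(forecasts, actuals):
    """Square the residuals, average them, take the square root."""
    residuals = [f - a for f, a in zip(forecasts, actuals)]
    return math.sqrt(sum(e * e for e in residuals) / len(residuals))

# Residuals are 0, -2, 0, so the RMSE is sqrt(4/3)
print(rmse([2.0, 3.0, 4.0], [2.0, 5.0, 4.0]))
```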
Another measure worth mentioning in the case of a linear model is the R2. This goodness-of-fit measure is very common in econometrics. It is the percentage of the variation in the dependent variable that is explained by the independent variable(s). Before we introduce the formula (which is very simple), we must mention two quantities: the SSE (sum of squared errors), which is the part left unexplained by the model, and the SST (total sum of squares), which is the unexplained plus the explained part. Intuitively, from the formula below we can see that the R2 measures the percentage explained by the model.
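A sketch of the SSE/SST decomposition described above, assuming the standard form R2 = 1 - SSE/SST:

```python
def r_squared(predictions, actuals):
    """Share of the variation in the actual values explained by the model."""
    mean_actual = sum(actuals) / len(actuals)
    sse = sum((a - p) ** 2 for p, a in zip(predictions, actuals))  # unexplained
    sst = sum((a - mean_actual) ** 2 for a in actuals)             # total
    return 1 - sse / sst

# Perfect predictions leave nothing unexplained
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
```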
It is obvious that this metric is better suited for systematic strategies that rely on forecasts.
— — Maximum drawdown
This measure, alongside the Calmar ratio, is used heavily in evaluating hedge funds' performance and risk. Maximum drawdown is the distance from a peak to a trough of a portfolio, on the condition that a new peak has not yet been formed. The aim of this risk metric is capital preservation: in intermediate periods, the value of the portfolio can get so low that it becomes impossible to continue without posting more margin or without investors panicking. The closer to zero the better, and a maximum drawdown of zero means that the portfolio's value never fell below a previous peak.
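The peak-to-trough logic can be sketched over a hypothetical equity curve (the values are illustrative):

```python
def max_drawdown(equity_curve):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)                  # track the latest peak
        worst = min(worst, (value - peak) / peak)  # deepest fall from it
    return worst

# The fall from 120 to 90 is the worst episode: -25%
print(max_drawdown([100, 120, 90, 130, 110]))  # -0.25
```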
— — Expectancy
Expectancy is a flexible measure composed of the average win/loss and the hit ratio. It provides the expected profit or loss in dollar terms, weighted by the hit ratio. The win rate is what we refer to as the hit ratio in the formula below, and through that, the loss ratio is 1 minus the hit ratio.
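A sketch, assuming the common form expectancy = hit ratio x average win - (1 - hit ratio) x average loss (the example figures are invented):

```python
def expectancy(hit_ratio, avg_win, avg_loss):
    """Expected dollar profit per trade, weighted by the hit ratio."""
    return hit_ratio * avg_win - (1 - hit_ratio) * abs(avg_loss)

# 55% hit ratio, $200 average win, $150 average loss: about $42.5 per trade
print(expectancy(0.55, 200.0, 150.0))
```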
— — Ratio of longs and shorts
When you buy an ETF, you are buying every share of the index, which means a pure long exposure and translates to a long ratio of 1 (100% of positions taken are long). Similarly, when you perform three trades, long, short, and long again, your long ratio is 66.66% and your short ratio is 33.33%. Therefore, if we create a long/short machine learning algorithm that trades on every signal and find out that it has a long ratio of 1, we should understand that it is not really doing anything but buying. It would be cheaper to buy an ETF at the beginning rather than buy on every trading day. The optimal long (or short) ratio should be around 0.5, with some acceptable deviations; a ratio of 0.8 can be interpreted as a biased predictor.
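The three-trade example above, as a small sketch (the representation of trades as a list of direction labels is my choice):

```python
def long_short_ratios(directions):
    """Share of long and of short positions among all trades taken."""
    total = len(directions)
    return directions.count("long") / total, directions.count("short") / total

# Long, short, and long again: 66.66% long, 33.33% short
print(long_short_ratios(["long", "short", "long"]))
```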
It is obvious that this metric is better suited for systematic unbiased strategies.
— — Average win & loss
A high accuracy rate does not mean a profitable strategy; rather, it means the system has good predictive ability, and the torch should then be passed to the risk management system, whose principal objective is preserving capital. It is absolutely crucial to have an efficient system that allows us to structure the trade in a way that minimizes risk without cutting too deeply into profit expectations. A stop-loss is essential for trading; otherwise, losses can get very big in a short period of time. A way of estimating expected future gains and losses is to calculate the historical values. Of course, this does not mean that the next ones will be the same, but it is the best estimate we have. The average win is obtained by summing the previous winning trades and dividing by the number of winning trades, while the average loss is obtained by summing the previous losing trades and dividing by the number of losing trades. An indication of profitability, given a proper risk system, is when the average win is greater than the average loss, and the bigger the better, although a very big gap between the two may suggest very volatile returns. For example, among the five winning trades $100, $200, $100, $950, and $50, the fourth one seems to be an outlier and might skew the average. That is why a good system will have good accuracy, a good average win, and low volatility of returns.
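The averaging described above can be sketched as follows, using the five winning trades from the example plus two invented losing trades:

```python
def average_win_loss(trade_results):
    """Average winning trade and average losing trade (both positive numbers)."""
    wins = [r for r in trade_results if r > 0]
    losses = [-r for r in trade_results if r < 0]
    return sum(wins) / len(wins), sum(losses) / len(losses)

# The $950 win is the outlier that skews the average win upwards
avg_win, avg_loss = average_win_loss([100, 200, 100, 950, 50, -80, -120])
print(avg_win, avg_loss)  # 280.0 100.0
```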
Therefore, an important conclusion has to be drawn here. It is not absolutely essential for the accuracy to be well above 50% if the average win is bigger than the average loss. That is the power of risk management in generating and keeping profits.
Estimating transaction costs
Something you should always be aware of is that back-testing results are unreliable. You will most likely never get a good estimate of future results, except perhaps by luck. You cannot accurately estimate the actual fees, spreads, slippage, and other unexpected events that will occur during live execution; therefore, when including a proxy for these costs in your back-tests, it is always helpful to bias them upwards. For instance, if the average historical bid-ask spread on the USDCAD pair given by your broker is 0.6 pips, the best thing to do is to assume that the actual spread is at least the historical average plus a margin for all the unexpected costs.
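As a tiny sketch of the biasing idea, with a safety margin whose size (0.5 pips here) is an arbitrary assumption, not a recommendation:

```python
def biased_spread(historical_avg_spread, safety_margin=0.5):
    """Assume the live spread is the historical average plus a margin (pips)."""
    return historical_avg_spread + safety_margin

# Hypothetical USDCAD case: 0.6 pips historical average, biased up to ~1.1
print(biased_spread(0.6))
```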
Here’s a more detailed example in the following table:
The best argument for biasing the costs upwards is to avoid unpleasant surprises during live trading, as well as to test the robustness of the model in a volatile environment. The disadvantage of doing so is that many short-term models will get filtered out; for instance, models that run on M5 time frames are more sensitive to costs than models that run on hourly time frames, and thus cost management is imperative for the model to be able to provide consistent results. If your model depends on maximizing the accuracy of expected transaction costs, then it is helpful to know that they have been shown to be non-linearly related to certain variables, such as actual volatility. A more simplistic approach would be to run a regression using past variables to explain the historical costs and to assume that the relationship will hold over the short term. The back-tests here will use just a small number of these performance metrics, and the transaction costs will be arbitrary.