Product Development: Is the Backtest Trustworthy?
Part one of a series of articles on product development
“I have never seen a bad backtest” is an often-stated criticism of backtesting, and it carries a high degree of truth – many strategies launch with strong backtests yet do not pan out as intended.
In truth, a well-designed backtest is a key tool among the many that financial product manufacturers and asset managers should use to develop and vet investment strategies. Yet many backtests have critical flaws inherent in them. Today, I will explore some of the general flaws frequently found in poorly designed backtests, as well as methods a prospective investor can use to “kick the tires” on a backtest and separate the well designed from the flawed.
Let’s level-set the conversation by introducing two broad statements that we believe are true with regard to product development in general and backtesting in particular:
- Product testing prior to launch is standard practice in most industries – testing is a critical component of new product development, ensuring time and resources aren’t wasted on bad ideas (and in the case of pharmaceutical testing, limiting the number of people harmed by unintended consequences).
- A financial backtest is a specific type of product testing whereby a financial strategy is inserted within an artificial, simulated recreation of the past. Given the abundance of historical financial information, and the relatively controlled way in which securities transactions work, backtesting, done correctly, should be an excellent tool to separate out good investment strategies from weak ones. A correctly performed backtest should really mitigate the launch of “lemons” in the financial services world.
A word of caution though – even a well designed backtest may not be predictive of the future, as there may be a disconnect between historical performance and future performance. Of course, this limitation applies to all predictions based on past performance, not just backtesting – actual historic fund performances present the same weaknesses. And in fact, some of the questions asked below in analyzing backtested performance can easily be asked of a fund manager as well.
With this being said, let’s at least make sure backtesting provides a fair extrapolation of how we can expect a strategy to perform in markets similar to the one analyzed. Unfortunately, there is no standard industry practice of using independent third parties to perform backtests – wouldn’t it be nice if the big four accounting firms would backtest all new funds? Thus, it becomes critical for a would-be investor to “kick the tires” and determine whether a backtest misrepresents what would have happened in the past.
How can one differentiate between a well-executed backtest and a poorly designed one? Here are two high-level issues that can ruin a backtest’s design, along with appropriate questions to ask about each:
- Transactional “Friction”: Most strategies rely upon signals of some sort which trigger transactions – e.g., when the 50 day moving average moves above the 200 day moving average, buy the S&P 500.
One pitfall of backtesting is losing sight of the small practical challenges that arise when placing a transaction. Here’s an example of a flawed strategy – if an S&P 500 member’s earnings announcement beats expectations, buy the stock the same day and short that company’s closest competitor.
Can you spot the flaw? The signal to buy (i.e., the earnings announcement) typically happens after the market closes, yet the strategy assumes the transaction occurs on that day’s close. The strategy may backtest perfectly well, but it would be impossible to implement.
The historical liquidity of the security in question can also be lost in a backtest. For instance, trading out of esoteric fixed income securities, such as CDOs, was extremely difficult in late 2008 – the cost of doing so would have been prohibitive, and would not necessarily be captured in a backtest that uses a mid-market valuation mark.
- Questions for “kicking the tires” on transactional friction
- When and how is a transaction executed within the backtest? How may this differ in the real world?
- What is the liquidity of the underlying securities of the strategy?
- The bid-ask spread is important, as is the depth of the market
- Market makers may take advantage of any material consistent flows, raising transactional costs above backtested assumptions
- What are the general execution costs per transaction, and are they appropriately factored into the backtest?
- How frequently does the backtest signal a transaction? The more frequent the transactions, the higher the frictional costs.
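The friction question above can be made concrete with a minimal sketch. The code below is purely illustrative – it runs the 50-day/200-day moving average crossover mentioned earlier on simulated prices (not real market data), charges a hypothetical per-trade haircut, and reports the gap between gross and net returns, which is exactly the cost a frictionless backtest hides:

```python
import random

def moving_average(prices, window):
    """Trailing simple moving average; None until enough history exists."""
    return [None if i + 1 < window
            else sum(prices[i + 1 - window:i + 1]) / window
            for i in range(len(prices))]

def crossover_backtest(prices, fast=50, slow=200, cost_per_trade=0.001):
    """Long when the fast MA is above the slow MA, flat otherwise.

    cost_per_trade is an assumed combined haircut (half the bid-ask
    spread plus commission) charged each time the position flips; it
    hits only the net figure, so the gross/net gap measures friction.
    """
    fast_ma = moving_average(prices, fast)
    slow_ma = moving_average(prices, slow)
    position, trades, gross, net = 0, 0, 1.0, 1.0
    for t in range(1, len(prices)):
        if position:                      # earn the day's return if long
            daily = prices[t] / prices[t - 1]
            gross *= daily
            net *= daily
        if fast_ma[t] is not None and slow_ma[t] is not None:
            target = 1 if fast_ma[t] > slow_ma[t] else 0
            if target != position:        # flip the position at the close
                trades += 1
                net *= 1 - cost_per_trade
                position = target
    return gross - 1.0, net - 1.0, trades

# Simulated daily prices (a random walk) stand in for real history.
random.seed(0)
prices = [100.0]
for _ in range(750):
    prices.append(prices[-1] * (1 + random.gauss(0.0003, 0.01)))

gross, net, trades = crossover_backtest(prices)
print(f"trades: {trades}, gross: {gross:+.1%}, net: {net:+.1%}")
```

Note that the position set on day t only earns returns from day t+1 onward – the kind of detail (no trading on a signal before it exists) that the earnings-announcement example above gets wrong. Shortening the windows raises the trade count, and the gross/net gap widens accordingly.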
- Overfitting: Backtests use a sample period (“In Sample” data) for initially testing and refining a strategy. It is important that the final refined strategy then be tested with “Out of Sample” data that doesn’t overlap with In Sample data; the Out of Sample data should not be used to refine the strategy.
Overfitting is an abstract concept that can be readily explained by going outside of finance for a moment – Larry is a “trekkie” and has a ton of success at Star Trek conventions (his “In Sample” data) meeting friends and finding dates, often weaving phrases like “live long and prosper” and “beam me up, Scotty” into conversations. After the conventions, he decides to use his successful trekkie dialogue and his Vulcan ears while interviewing for Wall Street positions. This is overfitting, because those tactics were perfectly tailored to one particular, albeit incredibly important, In Sample period of his life. It would have been better for Larry to first test his strategy on an Out of Sample data set (e.g., a monster truck show) before applying it to a real-world situation like a job interview.
Similarly, product designers may be tempted to tweak an investment strategy and “juice returns” using In Sample data, thereby creating a self-fulfilling prophecy of success.
To further illustrate with a finance example: if I use my In Sample data to calculate the optimal number of days to use in a moving average crossover signal, and then test my strategy on the same In Sample data, of course I am going to see favorable returns in the backtest. Data-mining within In Sample data to prove an underlying principle, especially when that data was used to develop the initial strategy, is dangerous. A well-executed Out of Sample test can highlight this design flaw.
Unfortunately, overfitting is a nefarious mistake that can be harder to tease out than the transactional mistakes noted above.
- Questions for “kicking the tires” on overfitting
- Over what time period was the backtesting conducted, and why?
- At a minimum, the time period should cover a market cycle
- How does performance look if we shift the start date in both directions?
- Ask to see performance over dates that you choose, rather than dates selected by the manager
- What was the original principle behind the strategy? If modified from inception to completion, how and why?
- If alpha or superior metrics are claimed: why and how did the strategy capture those metrics, and what is the thesis for how it will continue to do so?
- Has the backtest utilized Out of Sample data or other methods of testing such as a Monte Carlo Simulation to sanity check results?
- If not, why not?
- If so, what were the results?
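The moving-average-crossover example of overfitting described above can be sketched in a few lines. Everything here is illustrative and assumed – simulated prices, an arbitrary grid of window lengths – but the mechanic is the one the article warns about: the window lengths are data-mined on In Sample data, and the prices are a zero-drift random walk, so any “edge” the tuning finds is noise by construction, which an Out of Sample run can expose:

```python
import random

def sma(prices, window):
    """Trailing simple moving average; None until enough history exists."""
    return [None if i + 1 < window
            else sum(prices[i + 1 - window:i + 1]) / window
            for i in range(len(prices))]

def crossover_return(prices, fast, slow):
    """Total return of a long/flat 'fast MA above slow MA' strategy."""
    f, s = sma(prices, fast), sma(prices, slow)
    position, wealth = 0, 1.0
    for t in range(1, len(prices)):
        if position:
            wealth *= prices[t] / prices[t - 1]
        if f[t] is not None and s[t] is not None:
            position = 1 if f[t] > s[t] else 0
    return wealth - 1.0

# Simulated prices: a zero-drift random walk, so there is no real
# signal for any crossover rule to find.
random.seed(42)
prices = [100.0]
for _ in range(1000):
    prices.append(prices[-1] * (1 + random.gauss(0.0, 0.01)))

in_sample, out_of_sample = prices[:500], prices[500:]

# Data-mine the window lengths on In Sample data only...
grid = [(f, s) for f in (5, 10, 20, 50) for s in (60, 100, 150, 200) if f < s]
best = max(grid, key=lambda p: crossover_return(in_sample, *p))

# ...then see whether the tuned strategy survives Out of Sample.
print("tuned windows:", best)
print(f"in-sample return:     {crossover_return(in_sample, *best):+.1%}")
print(f"out-of-sample return: {crossover_return(out_of_sample, *best):+.1%}")
```

The in-sample figure is flattering by construction – the grid search picked the best of sixteen candidates on that very data – while the out-of-sample figure is an honest draw, which is precisely why the two should never share data.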
In conclusion, backtests done correctly are an effective tool in the development of financial strategies. However, going into the past is a challenge – just ask Marty McFly – and many missteps can happen along the way. Hopefully some of the general considerations we have pointed out, as well as the corresponding questions to ask, can help investors properly vet any backtests that may be presented.
In this piece, we have focused largely on the dangers of backtests not correctly assessing how a strategy would have performed in the past. In the next piece in this product development series, we will discuss the challenge of using past information, be it actual or simulated backtest information, to predict future performance.