Podcast: Ep.8 Back, Back, Back it Up (Backtesting)
In this podcast, the following material will help you follow along:
Here is a Google Spreadsheet of a backtest where we buy a stock after it falls 5% in 5 days and then hold it for 10 days. The stock is an S&P 500 ETF (SPY). We use this strategy as an example throughout the podcast, and here is how you can build one simply using Excel/Google Spreadsheets!
Google Spreadsheet of a Sample Backtest
Here are some other useful resources mentioned in the podcast:
- Value and Momentum Everywhere – A famous paper by the founder of AQR, the 2nd largest quantitative hedge fund, showing how a backtest can be run and how strategies don’t need theoretical-physics-level math
- A blog by Brett Steenbarger, a well-respected trading psychologist who helps us form trading processes and understand our biases
- Psychology of Intelligence Analysis – A fantastic book, made public domain by the CIA, that focuses on our biases when studying quantitative information. A trading mentor once told me he considers it the best “trading” book ever written
Forming a backtest is a skill I’ve spent the past 8 years honing, and after many years of toiling, I share with you some of the secrets I’ve uncovered. Hear the lessons I’ve learned the hard way and the biggest mistakes I see traders and investors make, including experienced ones at banks and hedge funds. This is an easy-to-follow episode that discusses different ways to conduct backtests and the gotchas behind them. I also share the rigorous 10-question checklist I always use when running a new study. This episode is applicable even if you’re a purely discretionary/gut trader, as the greatest discretionary traders also rely on historical studies. And if you’re a data scientist, you’ll especially enjoy this episode.
Here is the script that was used in today’s episode.
Note: I don’t follow scripts word-for-word as they can sound unnatural, but the episodes do closely follow them.
Ep.8 Back, back, back, it up (Backtesting)
Listeners! This is possibly going to be one of the most useful episodes for you all, whether or not you know what backtesting is. The reason? This is something I spent many, many years trying to hone and understand, and I was blessed to be mentored by some of the most fantastic people in trading who know this subject well. So this episode is going to be a combination of the past 8 years of my failures, trials, and eventual success in backtesting. Even if you’ve never backtested, or you’re a data scientist and think you know what this is, trust me – this will shape the way you think about investing and trading.
So briefly? What is backtesting? You’ve actually seen backtesting not only on CNBC, which hopefully you watch sparingly, but also on ESPN! So if you think it’s too complicated, trust me – you’ve already been exposed to it.
So backtesting is simply taking an investing or trading strategy, forming it into rules, then seeing how those rules perform historically. A simple example you may see on TV is, “when the S&P was down 3 days in a row, it was also down on the 4th day.” Or on ESPN it may be, “a 1st round seed has never lost in the first round in NCAA basketball.” Just a heads up: I’m making these numbers up.
So what’s the rule in the first example? You want to see if it’s worthwhile to buy an S&P ETF since it’s been down 3 days in a row and you think it’s time for it to come back. So you want to see if this has worked out historically. Typically, somebody would make a test that looks at every time the S&P has been down 3 days in a row and then measures whether it went up on the fourth day. There are a lot more fun nuances we’ll get into about this and how to properly test it.
In the ESPN example, the backtest’s rule is simple: has any #1 seed ever lost in the first round? You go through all the data historically and test to see if that’s ever happened.
Before you shut off the podcast, know that you don’t have to be a programmer anymore to do stuff like this – you can now use tools that look very simple. Tiingo actually has tools to do this, and we’re building more, but this is becoming a trend. This episode will discuss some resources where you can backtest ideas, whether or not you’re a programmer. We will also walk through a backtest example that you can do in Excel.
I made this episode also because I see backtests in news articles and the media, and often they’re done wrong. The tools are becoming much more accessible, even for non-programmers, but we still need to learn how to use them. Having a hammer, nails, and wood won’t build a new house. We still gotta learn how to use the tools!
And before we deep dive into this, I just want to take a quick break to share some Tiingo announcements. The magazine issue of Modern Trader featuring Tiingo in the cover story is available at moderntrader.com or Barnes and Noble as the July issue. If the issue is out of print by the time you’re listening, ask me and I’ll send you a scan of the Tiingo page so you can read it :) It was a huge honor and we are incredibly thankful for it.
Secondly, Tiingo.com is now available in a mobile version, so check it out on your device! It’s pretty surreal to think people now have a high-end financial app in their pockets. I realized I’d taken for granted the fact that I can access Google, my email, or Facebook right from my pocket…but it really is extraordinary! And now you can access awesome data and a portfolio risk system from your pocket too. This wraps up the major UI overhaul; from here, changes will be more incremental.
Thirdly, Tiingo is now using modern cryptography, so when using Tiingo, your data is encrypted using the latest security measures.
And finally, the fundamental data has received a massive, massive update. We now have structured fundamental data for over 4,300 companies, including companies that no longer trade and very small microcap companies. Not only that, but you can see annual statements in addition to quarterly ones, going back over ten years. And to make it even sweeter, you can now see what fundamental data a company reported when it filed, and also see any restatements it made. This is all structured on Tiingo, so it’s pure data – you don’t have to dig through documents anymore.
If you like what Tiingo’s doing, whether it’s the podcast, the website, mission, or so on, we ask that you pay what you can on Tiingo.com/support, that’s Tiingo.com/support (spell out).
That concludes the announcements so let’s get back into it!
So let’s walk through a tradeable backtest and how we can create one. This will be the foundation for the rest of the podcast. You may notice I’m going to spend a lot more time discussing how to test a backtest and the problems with backtests, rather than how to create one. This is because there are so many traps you can fall into as a data scientist in finance, and unlearning then re-learning is so much harder than learning it properly the first time.
First, to continue, we need to distinguish a backtest study from a tradeable backtest. Previously we gave examples of two backtests, but if we think back to them, they are not tradeable. If the S&P falls 3 days in a row, noticing what happens the next day is an interesting study, but not tradeable. In order for a backtest to be tradeable, we need to meet two conditions:
- There has to be a buy condition
- There has to be a sell condition
Another markets example would be, “what would happen if I bought a stock after it fell 5% in one week?” This is an incomplete backtest because it gives us the condition for buying a stock but not for selling it. A complete rule would be, “If a stock falls 5% in 5 days, I will buy and hold the stock for 10 days and then sell.” Here we have both a buy and sell condition. I’m going to use this example for the rest of the episode.
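If you’d rather see this rule in code than in a spreadsheet, here is a minimal Python sketch (my own illustration, not the spreadsheet’s exact formulas), assuming a plain list of daily closing prices:

```python
import math

def backtest_drop_and_hold(closes, drop_pct=0.05, lookback=5, hold=10):
    """Whenever the close falls at least `drop_pct` over the prior
    `lookback` trading days, 'buy' and record the log return of
    holding for `hold` more days. Returns a list of per-trade
    log returns."""
    threshold = math.log(1 - drop_pct)          # e.g. log(0.95)
    trades = []
    for t in range(lookback, len(closes) - hold):
        lookback_ret = math.log(closes[t] / closes[t - lookback])
        if lookback_ret <= threshold:           # fell >= 5% in 5 days
            trades.append(math.log(closes[t + hold] / closes[t]))
    return trades
```

Note this sketch happily takes overlapping positions (it can trigger two days in a row); a real backtest would have to decide whether overlapping entries are allowed.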
To test an idea like this, we can simply use Excel or Google Spreadsheets. On the blog, blog.tiingo.com, I attached a link to this backtested strategy in Google Spreadsheets. Because of the feedback from you all, I’ve learned it’s not very effective to walk through a spreadsheet via podcast haha. So we’re going to skip over it, but the spreadsheet document on the blog is well-annotated. It also explains very simply why we use log returns instead of simple returns when doing backtests. We discussed the differences in a prior episode, so I won’t repeat them here. The spreadsheet does a much better job than I could do over voice.
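As a quick refresher on the point the spreadsheet makes: log returns add up across days, while simple returns don’t, which is what makes them convenient for backtest math. A tiny self-contained illustration:

```python
import math

prices = [100.0, 105.0, 99.75]  # up 5%, then down 5%

# Simple returns: +5% followed by -5% does NOT net out to 0%.
simple_total = prices[-1] / prices[0] - 1          # -0.25%

# Log returns: the daily values sum exactly to the total return,
# so multi-day performance is just a sum.
daily_logs = [math.log(prices[i + 1] / prices[i]) for i in range(2)]
total_log = math.log(prices[-1] / prices[0])
```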
Anyway, with the idea of a tradeable backtest established, I want to dig into something else. I want to now dig into the problems I see all the time in both the news media and publications sent out to hedge funds, banks, and so on. And that’s the topic of poor data science in markets.
A quick story before we move on: there is a general rule in financial backtests, and that’s “if it’s too good to be true, it probably is.” A few months ago I had a company come to me trying to pitch their product. Generally when people do this, I always listen because, as a guy trying to grind out a new business himself, I totally empathize. In fact, I’ll often give advice back to the owners and spend an insane amount of time crafting it. Many of my users and listeners do that for me, so I will do that for others! It’s the golden rule.
Anyway, this company comes to me and pitches a product with innnnsane performance. I mean the performance of this strategy was mind-blowing. And as soon as I saw it, I asked them a few questions and realized they didn’t understand the mechanics of backtesting. That’s okay, because if you’re new to markets, why would you expect anybody to understand backtesting? Heck, this is kind of embarrassing, but I only learned what the Louvre was 3 years ago. I never grew up around art or was exposed to it. Sometimes what seems so obvious to us is not so obvious to others.
But this is kind of an unintuitive concept, isn’t it? A strategy performs so well that you know it can’t be real? This company then told us they had gone to many quant funds and hadn’t won any contracts. And it hit me: it’s because the people who backtest for a living know something is up. My friend who works at a big fund these days saw the company’s business card on my desk and said, “Ah, they spoke to us too. What did you think?” I responded with, “the same thing your company thought.”
My hope is that, for all my listeners, by the end of this episode you will know the gotchas of backtesting. My goal is that if you were the company presenting, you would be able to defend your performance and thesis from people like me. Or if you have a theory on how markets work, you will be able to test it.
The problem with a poorly formed backtest is that you will lose money. Your backtest will work historically, but fail miserably in the future for reasons we’ll get into. You will trade the strategy with confidence when it only loses you money.
Often, even discretionary traders backtest ideas. If you’re a discretionary trader, a backtest will help you understand how much value your baselines give you. For example, you may look for stocks that are undervalued, so you may look at a P/E ratio…basically what a stock’s price is relative to how much money the company makes. A low P/E ratio typically means undervalued, but if you backtest it you can see whether buying low P/E stocks actually works. And if it does work, you can see how often. Maybe it works only 55% of the time? That makes it a much lower conviction trade. So this is why even gut traders like backtests: they put their views and ideas in the context of how they’ve performed in the past.
I make this argument many times, but even if you are a data scientist who doesn’t focus on finance, I believe you will find good value in this episode. The reason is that data science in tech is becoming a hot topic, but finance was forced to innovate and explore this topic long ago. The truth is that in trading, if your backtest or study is even the slightest bit off, you will know pretty soon when you lose money, and you will be out of a job. This has made finance approach studies and data science with an intense rigor, and because of the incentives of trading, it’s often beneficial to keep these secrets as you’re competing with others.
So, let me reveal some of those secrets to you all :)
The main issues I have found are overfitting and model robustness, the dual in-sample problem, and product knowledge.
So what is overfitting? Well, take the above example: if a stock drops 5% in 5 days, we buy the stock and hold it for 10 days. It’s very clear why we chose some of those numbers. 5 days is the number of business days in a week – it’s another way of saying 1 week. 10 days is 2 weeks. 5% is also a nice round number.
What if the above strategy returns, on average, 2% a year? But we think, “What, only 2% a year? That’s nothing, I want more.”
So we start tweaking our model parameters. A parameter is something in our model that we can change. In the example backtest, we have 3 parameters:
- How much the stock drops: the 5%
- How many days we measure that drop over: in this case, we’re measuring the 5% drop over 5 days
- How long we hold the stock before selling: in this case 10 days, or 2 weeks
After our tinkering we find that we can get the strategy to return an average of 9% a year if we do the following:
If a stock drops 7.62% in 12 days, we buy and hold the stock for 16 days.
But looking at these numbers, what do they all mean? We chose 5% in the original backtest because it was a nice round number and a multiple of 5. But what is 7.62%? Where does that number come from? And why are we measuring the drop over 12 days? Where does 12 come from? It’s not 1 week or 2 weeks; it’s 2 weeks and 2 days. And why did we choose to hold for 16 days? That’s not 3 weeks; it’s 3 weeks and 1 day.
All of the parameters above were chosen arbitrarily. And that is the dangerous part.
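To see how dangerous arbitrary tweaking is, here’s a hypothetical experiment in Python (my own illustration): generate a pure random walk, so by construction there is nothing real to find, then grid-search the three parameters and keep whatever looks best. The “winning” parameters are pure chance:

```python
import math, random

random.seed(7)
# A pure random walk: there is nothing to predict here.
prices = [100.0]
for _ in range(2000):
    prices.append(prices[-1] * math.exp(random.gauss(0, 0.01)))

def avg_trade_return(closes, drop_pct, lookback, hold):
    """Average log return per trade of the drop-and-hold rule."""
    thresh = math.log(1 - drop_pct)
    rets = [math.log(closes[t + hold] / closes[t])
            for t in range(lookback, len(closes) - hold)
            if math.log(closes[t] / closes[t - lookback]) <= thresh]
    return sum(rets) / len(rets) if rets else 0.0

# Grid-search all three parameters and keep whatever scores best.
best = max(
    (avg_trade_return(prices, d, lb, h), d, lb, h)
    for d in (0.03, 0.05, 0.0762)
    for lb in (5, 10, 12)
    for h in (10, 16, 20)
)
# `best` now holds oddly specific parameters that "won" on pure
# noise -- an overfit result, not a real edge.
```

With enough parameter combinations, one of them will always look good in hindsight, even on data with no signal at all.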
But you may be wondering, “Rishi, why does that even matter? Who cares – it results in the best performance.” And this is why the problem is so dangerous. With enough tinkering, any model can be made to look profitable or predictive.
Let’s take a look at an example that may make this more obvious. Every week, on Thursday at 8:30am, the government releases the number of people filing for unemployment. This is called initial jobless claims. Many researchers and Wall Street analysts try to predict this number as it can sometimes move markets. After the 2008 recession, traders watched this release because it helped gauge the economic recovery. If the economy was healing faster than people thought, markets would rise. If it was healing slower than people thought, markets would fall – generally speaking.
So Google has a tool called Google Correlate. It allows you to submit a timeseries, and it tells you which search terms were correlated with it. So I fed Google a timeseries of these unemployment claims. When we do that, we see initial jobless claims correlated with the search term “loan modification” with a correlation of 96%. This could make sense: maybe people want to modify their loans because of foreclosure. But we were also going through a housing crisis. What would’ve happened in 2001, when it was a tech bubble bursting rather than a housing bubble?
Also, all of the other correlated search results are nonsense. “laguna beach jeans” correlated 95% with unemployment claims data. Does the search term “laguna beach jeans” predict initial jobless claims, or is that a statistical artifact?
I’ll let you play with Google’s data for this. It’s fun stuff, and Google actually has a paper out showing how Correlate could be a useful tool for predicting economic data. Wow, I’ve plugged Google like 3 times in this podcast…. Google Google Google, use Google, yay. It’s like when I was watching Terminator 2 the other day and noticed the Pepsi cans and vending machines.
Just like our correlation example, if we keep digging into data long enough, we find random relationships. This is called overfitting: tweaking a model until we get the result we want. If you’re reading a financial article or speaking to people on Wall Street, they may refer to overfitting as “data mining.” For anybody in tech or interested in statistics, this is confusing, as data mining means something entirely different there. In finance, though, data mining is almost always used negatively to mean overfitting. That’s just a quick semantic aside.
But even if the relationship makes sense, it may be so specific that it doesn’t work outside of its timeframe. For example, “loan modification” may work for a crisis related to mortgages, but what if it was a tech bubble bursting? Are people googling “loan modification” really a good indicator then? And is that data even applicable today? Google in 2000 was a far different company than it is today. Will Google search results be an indicator in the future?
So how do we counter overfitting? How do we measure model robustness?
We’ve just described overfitting and model robustness; now let’s talk about how to handle them.
As a data scientist you have to question every single one of your inputs and model parameters. Not just the results, but why everything was chosen.
With overfitting, we really have to practice self-discipline. This is the tough answer. We as people can always torture and twist data to get it to tell us what we want. You see this all the time with political issues, where two lobbying groups will each use data to support their position even though they are polar opposites. How can both parties use data to prove their point? Because they take some truth and use the statistics they want to tell their side of the story.
Unfortunately for us, if we do that in markets, the markets will take our money. We have to find the truth and be real with ourselves. If we are dishonest, we will lose our own money. This is harder than you think and there are trading psychology books that go into this. To combat overfitting, we have to hold ourselves accountable.
And to hold ourselves accountable, all – and yes I say all – successful traders, both discretionary and quantitative, have a journal or a process in place. These are individually crafted rules that hold us accountable. Here are a few processes and rules I use to make sure I am being honest with myself. Maybe some will work for you, and some may not. And notice how I don’t include any statistical tests below. Those are my last-stage tests because, like I said, we can use statistics to tell us the picture we want. I first like to make sure my ideas have grounding before getting stats involved, as it prevents me from twisting data and overfitting.
If you ask any experienced trader, all – yes, all – will tell you simplicity is favored over complexity. You absolutely should use specific statistical tests like t-tests, p-values, distributions and so on, but they’re beyond the scope of this episode, and there are really nice, simple visualizations of them online.
Also, if you read the papers published by AQR, the 2nd largest quantitative hedge fund, you will find much of their research is totally accessible, and their math does not really get any more complex than calculus – much of it can be done with algebra.
The truth is, and this is something I see often, that machine learning, advanced statistical analysis, and so on do not make you a better trader. In fact, they give you more creative ways to part with your money. I see it all the time, and you would be surprised how simple many quantitative trading strategies can be. I’ll add some links to AQR’s papers on the blog – blog.tiingo.com – if you don’t believe me.
And an aside for those of you who hear about machine learning: right now machine learning in markets is sexy and sells, but remember it very rarely makes money by itself. It’s not the holy grail of trading. In fact, every quantitative trader I know who uses machine learning adopted it only after many years of getting their models working without it, and often only as a last optimization. And I can count the traders I know who use it on one hand – their profitability did not drastically change once they adopted machine learning. The blog will contain papers by big hedge funds just to show you how simple the math can be.
Anyway, here are some of the snippets I use to hold myself accountable and make sure my models are flexible and robust. Accountability and overfitting really go hand in hand.
- Why would this idea work? What current research and market conditions support why this would or wouldn’t work?
- What is my hypothesis, or null hypothesis – what am I testing?
- Are there any relevant research papers out there? Can I replicate them? My trading mentor told me he’s only been able to replicate 20-30% of papers, and I have found about the same to be true. Some of the errors in research papers out there are horrible
- Should this theory or idea work across markets and/or across stocks? Or does it only work for one stock or one asset class? If it only works for one, why? This is a huge warning sign for me. If looking at stocks, it should at the very, very least work across the sector.
- What is the risk adjusted return of this model? Basically what is the average return and volatility of this model?
- How many times did I run this model and change parameters? How many times did these changes result in better performance? Keeping a tally of how many times you tweaked parameters is a good way to be honest with yourself about how much you tortured the data
- Does the model trade all stocks equally, or is the majority of returns driven by a couple of stocks?
- For all the big gains and losses in the strategy, check them manually for data errors
- When will this strategy fail? This is such an important question. If you don’t know when or why this strategy fails, then you don’t really know the strategy at all, or why it makes money.
- How does the profitability of the strategy change if I slightly tweak a parameter? Is there a relationship between how much I tweak the parameter and how much the profitability changes?
This is an incomplete list, but I think it’s a good starting point.
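That last item – parameter sensitivity – is easy to automate. Here is a hypothetical helper (the names are my own, built on the drop-and-hold example from earlier) that re-runs the backtest at neighboring parameter values. A robust strategy should degrade smoothly as the parameter moves; a spike at exactly one magic value is a classic overfitting warning sign:

```python
import math

def avg_trade_return(closes, drop_pct, lookback=5, hold=10):
    """Average log return per trade of the drop-and-hold rule."""
    thresh = math.log(1 - drop_pct)
    rets = [math.log(closes[t + hold] / closes[t])
            for t in range(lookback, len(closes) - hold)
            if math.log(closes[t] / closes[t - lookback]) <= thresh]
    return sum(rets) / len(rets) if rets else 0.0

def sensitivity_report(closes, drop_pcts=(0.03, 0.04, 0.05, 0.06, 0.07)):
    """Nudge one parameter and report performance at each value,
    so we can eyeball whether results vary smoothly or spike."""
    return {d: avg_trade_return(closes, d) for d in drop_pcts}
```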
One thing people do to help prevent overfitting is the in-sample and out-of-sample backtest. But I’ve found this often results in something I call the dual in-sample error.
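For context: the standard in-sample/out-of-sample approach is just a chronological split – tune parameters on the early data, then evaluate once on the untouched later data. A minimal sketch:

```python
def in_out_split(prices, in_sample_frac=0.7):
    """Chronologically split a price history: tune the model on the
    in-sample piece, then evaluate once on the out-of-sample piece.
    Never shuffle -- the split must respect time order."""
    cut = int(len(prices) * in_sample_frac)
    return prices[:cut], prices[cut:]
```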