The Ultimate Way to Compare Risk Models: "Out of Sample Back-Testing"

How should firms make a sensible choice when selecting between different risk models?

Olivier Le Marois
A logical consequence of the financial crisis is that many institutions are reviewing the tools they use to control market risk. This makes perfect sense: suffering serious losses during the global crisis can fairly be viewed as a failure of the risk management process, and not learning from such an experience will be hard to explain when the next global financial crisis eventually strikes.

Now, how should you make a sensible choice when you have to select between different risk models?

The first option is also the most standard: the rationale is that it is better to be wrong with the crowd than to take the risk of being right with a happy few. When it comes to extreme risk management, this is obviously a very dangerous rationale. Let's reformulate this so-called rational view more explicitly: it amounts to saying "better to die or be severely harmed with the majority than to walk unharmed through the crisis with the minority".

The second option is to select a system with the largest possible choice of models and parameters to play with, and with as many references as possible to key quant terms like "fat tails", "skew", "kurtosis", "GARCH", "copula", "stable distribution", etc. While providers of these approaches will say this is the only way to properly measure risk, that is only true if you were lucky enough to spot and select, among many, the one particular set of parameters that would have produced an effective warning. With luck, it might work better than the credit risk models the experts used to explain that subprime securities were AAA products…

It is a pity that many people in the market are still focused on choosing between these two options, given their lack of effectiveness. Risk modeling is a science, and all scientists will agree with Karl Popper that a scientific model must be falsifiable. The term does not mean that something is false; rather, that if it is false, this can be shown by observation or experiment. There are therefore tests that can demonstrate the validity of a risk model, and, as long as the model passes the tests, it can be considered valid. To define the falsifiable test for a risk system, you must first define the exact meaning of "a valid risk model that works."

The immediate answer is that it should help you avoid surprises: this is precisely where a number of systems failed in 2008. For many portfolio managers, the way their portfolio reacted to the markets was a complete surprise. A risk system that indicates you cannot lose more than 2%, when you then actually lose 20%, obviously has problems. The same conclusion holds for a system that claims you are market neutral while you actually lose 10% when the equity market drops by 20%.

So what is a good risk system? It is first of all a system capable of correctly anticipating the behavior of a portfolio regardless of market conditions. But that is not sufficient: if the system warns that you may lose 10% if the market drops by 20% only one second before the Lehman bankruptcy announcement, that is nice, but clearly too late. And when we are talking about fund investors, "too late" is not one second; it is anything between one week and one month or one quarter. This means the risk system must be able to deliver a timely anticipation, consistent with the liquidity of the portfolio.

Still, this is not sufficient. Let's imagine I warned you in November 2007 that your portfolio could lose 10% in case of a financial crisis. But one month later, in December 2007, on the same portfolio, I changed my mind and claimed you could only lose 2%. And again, the following month, I revised my anticipation and the figure became 20%, and so on. Clearly, my timely anticipation is completely useless, because it is not stable. Stable does not mean static: it means that changes in warnings can be explained either by portfolio changes or by long-term trends. If I get 8% in November 2007, then 10% in December, then 12% in January, that is a clear, effective warning.
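
As a toy illustration of this "stable, not static" criterion, here is a short Python sketch that flags a warning series as unstable when consecutive estimates reverse sharply. The function and its 50% reversal threshold are made-up assumptions for illustration, applied to the hypothetical figures from the paragraph above.

```python
# A toy stability check for a series of monthly loss warnings.
# The 50% relative-reversal threshold is an arbitrary assumption.

def is_stable(warnings, max_reversal=0.5):
    """Return False if any consecutive pair of warnings jumps by
    more than `max_reversal` (50%) in relative terms."""
    for prev, curr in zip(warnings, warnings[1:]):
        if abs(curr - prev) / prev > max_reversal:
            return False
    return True

print(is_stable([0.08, 0.10, 0.12]))  # True: a clear, trending warning
print(is_stable([0.10, 0.02, 0.20]))  # False: an oscillating, useless signal
```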

In conclusion, a risk system works if it is able to anticipate portfolio behavior in a timely and stable manner, whatever the market conditions.

Out-of-sample back-testing is the most indisputable way to test the ability of a risk model to anticipate portfolio behavior. It is easy to implement and provides results that can be validated and compared across various risk solutions.

The test protocol is simple to define:

* Choose a simulation period in the past. It should be sufficiently in advance of the global financial crisis that the delay would have allowed you to realistically rebalance your portfolio, considering the liquidity of your assets and your decision process.

* Run the risk model as it was at that time (i.e. based only on the information available at that time) on the portfolio you held at that time (or a similar one).

* Check the results and compare them with the actual performance of the same portfolio AFTER the simulation period.

This is called the "out of sample" back-testing protocol, and it simply answers the question: if I had used this risk model at the time, would it have helped me in an effective way, i.e. would it have delivered a timely, stable and meaningful warning on my portfolio?
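
To make the protocol concrete, here is a minimal Python sketch of an out-of-sample back-test, using Value-at-Risk computed by plain historical simulation as the risk measure. Everything in it, from the function names to the synthetic return series, is an illustrative assumption rather than any vendor's actual methodology; the essential point is the information barrier: the model sees only pre-cutoff data and is judged on what happens afterwards.

```python
# Minimal out-of-sample back-test sketch. The risk measure (historical
# VaR) and all figures are illustrative assumptions, not a real system.

import numpy as np

def historical_var(returns, confidence=0.99):
    """Value-at-Risk via historical simulation: the empirical
    loss quantile of past returns."""
    return -np.percentile(returns, 100 * (1 - confidence))

def out_of_sample_backtest(returns, cutoff, horizon, confidence=0.99):
    """Fit on data strictly BEFORE `cutoff`, then compare the forecast
    with realized returns AFTER `cutoff` (the out-of-sample window)."""
    in_sample = returns[:cutoff]                   # information available at the time
    out_sample = returns[cutoff:cutoff + horizon]  # what actually happened next

    var_forecast = historical_var(in_sample, confidence)
    breaches = int(np.sum(-out_sample > var_forecast))  # losses beyond VaR
    expected = len(out_sample) * (1 - confidence)       # breaches a valid model allows

    return {
        "var_forecast": var_forecast,
        "breaches": breaches,
        "expected_breaches": expected,
        "worst_realized_loss": float(-out_sample.min()),
    }

# Synthetic example: a calm regime followed by a stressed one.
rng = np.random.default_rng(0)
calm = rng.normal(0.0003, 0.008, 500)    # pre-crisis daily returns
stressed = rng.normal(-0.002, 0.03, 60)  # crisis daily returns
result = out_of_sample_backtest(np.concatenate([calm, stressed]),
                                cutoff=500, horizon=60)
print(result)  # many more breaches than expected: the model fails out of sample
```

A model calibrated only on the calm regime reports a small VaR and is then breached repeatedly once the stressed regime begins, which is exactly the kind of failure the protocol is designed to expose before you rely on the system.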

Not only will such a test help you choose a good risk system, it will also help you quantify its business value. You just need to go a step further in the test: list the decisions the system's warnings would have led you to take, then quantify their impact on your portfolio returns thereafter. Finally, compare that with the cost of the system.
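
As a back-of-the-envelope illustration of that last step, the snippet below compares the loss a timely warning would have helped you avoid with the cost of the system. Every figure in it is an invented assumption for illustration, not data from any real back-test.

```python
# Hypothetical business-value calculation; all numbers are assumptions.

portfolio_value = 100_000_000  # $100M portfolio
loss_avoided = 0.08            # assumed drawdown the warnings let you sidestep
annual_system_cost = 250_000   # assumed cost of the risk system

value_of_warnings = portfolio_value * loss_avoided
print(f"Value of acting on warnings: ${value_of_warnings:,.0f}")
print(f"System cost:                 ${annual_system_cost:,.0f}")
print(f"Net benefit:                 ${value_of_warnings - annual_system_cost:,.0f}")
```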

What if the risk system does not allow you to run such tests, because it is not possible to "back date" it, i.e. to retrieve the model and the data as they were in the past? The answer is simple: a risk system you cannot back-test in this way is not "falsifiable", and therefore you should stay away from it. You could also argue that running the test by yourself could be expensive, while delegating it to a vendor could lead to all sorts of manipulations. The answer is to send "blind" data for the test (hide names, truncate time series) and to secure the ability to reproduce the test yourself in the future: that can be one of the contractual constraints imposed on the vendor!

Running such tests will definitely help narrow the list of risk systems that work. It is also a way to demonstrate to your board, your clients and your regulators that you understand what risk is about, what the best practices in the industry are, and what the limitations of the system you finally select are.

Olivier Le Marois is President of Riskdata, a leading provider of risk management solutions for the global hedge fund and asset management industries.
