Wednesday, August 7, 2019

How to know whether your data is trustworthy

As a consultant and former bank executive, I have worked on many data modeling initiatives, especially in finance and trading.

A key way to test whether your data model is trustworthy is to randomly split your data into several parts. Use one part to develop (train) your model. Then apply the model to one or more of the other parts, and check whether the results match statistically.
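As a minimal sketch of the random split described above (the function name `split_into_parts`, the fixed seed, and the stand-in records are assumptions made for illustration, not part of any particular tool):

```python
import random

def split_into_parts(records, k, seed=0):
    """Randomly split records into k parts of nearly equal size."""
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    # Taking every k-th item yields parts whose sizes differ by at most one.
    return [shuffled[i::k] for i in range(k)]

records = list(range(26))                  # hypothetical stand-in for real records
parts = split_into_parts(records, k=3)
train, holdouts = parts[0], parts[1:]      # develop on one part, test on the rest
```

Every record lands in exactly one part, so the parts used for testing never overlap with the part used for development.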

For example, suppose you wanted to develop a system for trading stocks, and you had data from 1990 to 2015. You might randomly divide it into three parts, each holding roughly 8 years' worth of data.

Then you would develop the system using the data in part 3. Let's say that it returned 3% more than the market. You would then test your model on parts 1 and 2. If the results are similar, then you would know that the data is trustworthy.
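One simple way to judge whether results are "similar" is to ask whether the average excess return on a test part falls within a couple of standard errors of the average on the development part. The sketch below assumes that approach; the function names, the threshold of 2 standard errors, and the return figures are all made up for illustration and are not real market results.

```python
import math
import statistics

def mean_and_stderr(values):
    """Return the sample mean and the standard error of that mean."""
    mean = statistics.mean(values)
    stderr = statistics.stdev(values) / math.sqrt(len(values))
    return mean, stderr

def results_match(dev_values, test_values, z=2.0):
    """True if the two means differ by at most z combined standard errors."""
    m1, se1 = mean_and_stderr(dev_values)
    m2, se2 = mean_and_stderr(test_values)
    return abs(m1 - m2) <= z * math.sqrt(se1 ** 2 + se2 ** 2)

# Hypothetical excess returns over the market (0.03 means 3%), one per period.
part3_dev = [0.04, 0.02, 0.03, 0.05, 0.01, 0.03]
part1_test = [0.03, 0.02, 0.04, 0.03, 0.02, 0.04]
part2_test = [-0.02, -0.03, 0.00, -0.01, -0.02, -0.04]

part1_ok = results_match(part3_dev, part1_test)   # similar averages
part2_ok = results_match(part3_dev, part2_test)   # clearly different averages
```

With these invented numbers, part 1 agrees with the development part and part 2 does not, which is the kind of mismatch that should make you question the data or the model.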

Another example is X-ray data used to build an AI model to detect cancer. You could randomly divide the X-rays into 4 groups and train the model on one group. Next, find out how many false positives and false negatives it produces on that group. Then test the model on the other 3 groups of X-rays. If the numbers of false positives and false negatives are similar, then you know that your data model is trustworthy.
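The false positive and false negative comparison could be sketched as follows. This assumes labels and predictions are coded as 1 (cancer) and 0 (no cancer); the function names, the 0.05 tolerance, and the sample values are hypothetical choices for illustration.

```python
def error_rates(labels, predictions):
    """Return (false_positive_rate, false_negative_rate) for 0/1 labels."""
    pairs = list(zip(labels, predictions))
    false_pos = sum(1 for y, p in pairs if y == 0 and p == 1)
    false_neg = sum(1 for y, p in pairs if y == 1 and p == 0)
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    # Guard against division by zero when a group lacks one of the classes.
    fpr = false_pos / negatives if negatives else 0.0
    fnr = false_neg / positives if positives else 0.0
    return fpr, fnr

def rates_are_similar(reference, other, tolerance=0.05):
    """True if both error rates are within `tolerance` of the reference."""
    return all(abs(a - b) <= tolerance for a, b in zip(reference, other))

# Hypothetical results for the group the model was trained on.
labels = [1, 1, 0, 0, 0, 1, 0, 0]
predictions = [1, 0, 0, 1, 0, 1, 0, 0]
training_rates = error_rates(labels, predictions)
```

You would compute `error_rates` for each of the other groups in the same way and pass each result to `rates_are_similar` along with `training_rates`. Comparing rates rather than raw counts keeps the check fair when the groups differ slightly in size.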

© 2019 Praveen Puri

Praveen Puri is the Strategic Simplicity® expert who has delivered over $400 million in value. He helps clients "weaponize" simplicity and bridge the gap between strategy and execution. Visit