With the tournament drawing near, fans of the World Cup are starting to get excited. Thinking back to recent World Cups, 2010 is remembered as much for the piercing wail of vuvuzelas as for Spain’s dominance. That dominance was perfectly predicted by an octopus named Paul, who rose to fame, and had his moment in the zeitgeist, after correctly calling the winner of 8 straight matches during the tournament. While some fans waited with bated breath as he made choice after choice (often on live TV), others saw the whole circus act as a foolish waste of time.

Paul pretty much encapsulates the dilemma gripping data science (bet you didn’t see that one coming). As a new genre of statistical methods proliferates, data scientists are hotly debating the importance of inference vs prediction. This distinction separates traditional statistical research from new-fangled machine learning techniques. In a recent piece in Nature Methods, researchers Danilo Bzdok, Naomi Altman, and Martin Krzywinski (“BAK”) clearly articulate this distinction.

As BAK state, “despite convincing prediction results, the lack of an explicit model can make ML solutions difficult to directly relate to existing…knowledge.” In the example above, our cephalopod friend Paul is the machine learning algorithm. There is no obvious reason why Paul should be so clinical at guessing World Cup match results. Paul’s selection criteria are a black box where inputs go in and results come out. But results don’t lie, and Paul was perfect in the tournament. Not too many pundits can claim that!
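For a sense of how unusual Paul’s record is, here is a minimal sketch of the arithmetic. It assumes every match is a 50/50 coin flip with no draws, which is a simplification and not something the original analysis claims. The function names and the figure of 100 competing guessers are hypothetical, chosen only for illustration.

```python
from fractions import Fraction


def chance_of_perfect_run(n_matches: int, p_correct: Fraction = Fraction(1, 2)) -> Fraction:
    """Probability of calling n_matches in a row correctly by pure guessing.

    Assumes independent matches and a fixed per-match hit rate (50/50 by default).
    """
    return p_correct ** n_matches


def chance_any_perfect(n_guessers: int, n_matches: int) -> float:
    """Probability that at least one of n_guessers independent coin-flippers is perfect."""
    p = float(chance_of_perfect_run(n_matches))
    return 1.0 - (1.0 - p) ** n_guessers


print(chance_of_perfect_run(8))                 # 1/256
print(round(chance_any_perfect(100, 8), 3))     # chance that some guesser out of 100 goes 8 for 8
```

Under these assumptions a single guesser goes 8 for 8 only once in 256 tries, yet with 100 independent guessers the odds that somebody ends up perfect rise to roughly one in three. A perfect record alone does not tell us whether we are looking at skill or at the lucky survivor.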

__Why we Infer__

Even though results intoxicate us, we humans aren’t satisfied by results alone. We want to understand why. As BAK continue, “Inference creates a mathematical model of the data-generation process to formalize understanding or test a hypothesis about how the system behaves.” Inference is about understanding the why, and it builds upon existing frameworks of knowledge. This leads us to listen to “experts”, to engage in pattern recognition, to capture advanced analytical data, and to model relationships. Founding our knowledge on intuition gives us comfort in results, even if the predictive value is insignificant or, worse, the product of overfitting. Per BAK, “statistical methods have a long-standing focus on inference, which is achieved through the creation and fitting of a project-specific probability model.”

Inference is colored by human bias. We’ve progressively gotten better at identifying bias, but that hasn’t erased its impact on fields as disparate as political punditry and investment management. We often lose sight of it through the rigid application of statistics and our faith in numerical validation, such as “significant p-values”. The problems lie not with the tools, but with how they are applied to infer “truth”. As researchers Ronald Wasserstein and Nicole Lazar state, p-values do not measure the probability of truth; they indicate how incompatible data are with the models used.

__Validation__

If we throw various statistical models and regressions at a problem, it is relatively easy to find seemingly significant results. But this doesn’t yield *valid* prediction, let alone meaningful inference. We see this every day in markets, an unwieldy dataset that is particularly noisy and reflexive. The news headlines are written after the fact, often fitted to the narrative that feels right. Markets are weighing machines that factor in an infinite number of outcomes with varying probabilities. We as humans want to infer cause and effect, but any single outcome can be the result of an outlier probability. Drawing inference solely from outcomes is like reading a Choose Your Own Adventure book by following a single plot arc and removing the rest of the pages from sight. Trying to infer loose causal connections leads to anomalous understandings. Behavioral finance advocates are making careers out of analyzing such events.
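The point about throwing many models at a problem can be illustrated with a small simulation. This is a sketch under stated assumptions, not a description of any real market data: the target and all 200 candidate predictors are pure Gaussian noise, the sample size of 30 is arbitrary, and the cutoff of 2.048 is the standard two-sided 5% critical value of the t-distribution with 28 degrees of freedom (so it only applies to `n_obs=30`). All function names are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def count_spurious_hits(n_obs=30, n_predictors=200, t_crit=2.048, seed=0):
    """Count noise predictors whose correlation with a noise target passes a 5% t-test.

    t_crit=2.048 is the two-sided 5% cutoff for n_obs - 2 = 28 degrees of freedom.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_predictors):
        predictor = [rng.gauss(0, 1) for _ in range(n_obs)]
        r = pearson_r(predictor, target)
        # Standard t-statistic for testing whether a correlation differs from zero.
        t_stat = r * math.sqrt((n_obs - 2) / (1 - r * r))
        if abs(t_stat) > t_crit:
            hits += 1
    return hits


hits = count_spurious_hits()
print(f"{hits} of 200 pure-noise predictors look 'significant' at the 5% level")
```

By construction none of these predictors has any relationship with the target, yet on average about 5% of them clear the bar. Test enough candidates and a “significant” result is close to guaranteed.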

Validity must focus on more pervasive phenomena to control for probabilistic outcomes. We need to understand this before applying tools to infer causal connections. World Cup matches are a great example of this. History has shown us that any one game can be swung by a single fluke play. But that doesn’t provide a significant basis for inferring that results are random. We see this through the totality of results. There are rarely “Cinderella” teams that win the tournament. Only 8 teams have won the World Cup since 1930.
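The gap between a single fluke and the totality of results can be sketched with a toy knockout tournament. Everything here is an assumption made for illustration: 16 hypothetical teams with strengths 1 to 16, a random draw, and a win probability proportional to strength (a Bradley-Terry style rule). None of it is calibrated to real World Cup data.

```python
import random

N_TEAMS = 16
TOP_CUTOFF = 13  # strengths 13..16 are the four strongest hypothetical teams


def play(a, b, rng):
    """Return the winner; team a wins with probability a / (a + b) (assumed rule)."""
    return a if rng.random() < a / (a + b) else b


def simulate(n_tournaments=5000, seed=1):
    """Return (share of matches won by the weaker side, share of titles won by the top four)."""
    rng = random.Random(seed)
    strengths = [float(s) for s in range(1, N_TEAMS + 1)]
    upsets = matches = top4_titles = 0
    for _ in range(n_tournaments):
        field = strengths[:]
        rng.shuffle(field)  # random bracket
        while len(field) > 1:
            next_round = []
            for i in range(0, len(field), 2):
                a, b = field[i], field[i + 1]
                winner = play(a, b, rng)
                matches += 1
                if winner == min(a, b):  # strengths are unique, so this is the underdog
                    upsets += 1
                next_round.append(winner)
            field = next_round
        if field[0] >= TOP_CUTOFF:
            top4_titles += 1
    return upsets / matches, top4_titles / n_tournaments


upset_rate, top4_share = simulate()
print(f"matches won by the weaker side: {upset_rate:.1%}")
print(f"titles won by the four strongest teams: {top4_share:.1%}")
```

In this toy setup upsets in individual matches are common, but the title still goes to the strongest quarter of the field far more often than the 25% a purely random process would give. Noisy single outcomes and a structured aggregate can coexist.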

__Method to Madness__

If we understand that inference is colored by the inherent range of probabilities (e.g., a single play versus a tournament result), then we can focus our efforts on addressing the validity of the approach, and take steps to avoid bias. Further examination of these datasets helps inform which tools are best equipped to provide meaningful output. Certain statistical tools are best at allowing humans to draw inference, while others are better at driving predictive results, and different tools are better equipped to handle different forms of data. As BAK put it, “ML methods are particularly helpful when one is dealing with ‘wide data’, where the number of input variables exceeds the number of subjects, in contrast to ‘long data’, where the number of subjects is greater than that of input variables.” The statistical methods can be grouped across a gradient, with traditional statistical methods on one side and more exotic machine learning methods on the other.

As we move through this gradient, the ability to infer *while controlling for bias* becomes hazier, while predictive power *may* increase. This is like playing with fire, because the allure of increasing predictive power is countered by a potential loss of validity. This is a trade-off that academia is still grappling with.
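BAK’s remark about “wide data” hints at why validity is at stake. The sketch below is a hypothetical illustration, not BAK’s method: it fits an unregularized linear model (no intercept) to 8 subjects with 8 input variables, where both inputs and target are pure noise. With as many inputs as subjects, which is the edge of the “wide” regime, the fit is exact in-sample even though there is nothing to learn. The helper names and sizes are assumptions.

```python
import random


def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back-substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def mse(X, y, coef):
    """Mean squared error of the linear predictions X @ coef against y."""
    preds = [sum(c * v for c, v in zip(coef, row)) for row in X]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)


def noise_data(n_rows, n_cols, rng):
    """Gaussian inputs and a Gaussian target that is unrelated to them."""
    X = [[rng.gauss(0, 1) for _ in range(n_cols)] for _ in range(n_rows)]
    y = [rng.gauss(0, 1) for _ in range(n_rows)]
    return X, y


rng = random.Random(42)
n_subjects = n_inputs = 8
X_train, y_train = noise_data(n_subjects, n_inputs, rng)
X_test, y_test = noise_data(200, n_inputs, rng)

coef = solve(X_train, y_train)  # interpolates the training targets exactly
train_mse = mse(X_train, y_train, coef)
test_mse = mse(X_test, y_test, coef)
print(f"in-sample MSE:     {train_mse:.2e}")
print(f"out-of-sample MSE: {test_mse:.2f}")
```

The in-sample error is zero up to rounding, while the error on fresh data is no better than ignoring the inputs altogether. That is the allure and the loss of validity in miniature, and it is one reason methods aimed at wide data typically lean on regularization and out-of-sample checks.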

A few things strike us as fundamentally important:

- Expertise is often born through deep research, which usually yields anchored beliefs. Throwing new-fangled tools at experts can be a recipe for unreliable results if done in a slipshod fashion.
- Similarly, traditional research must be able to adapt to an evolving landscape of data science, whose results are on the one hand breathtaking in fields such as image recognition and natural-language processing, and on the other a mixed bag in more generalized fields, including investment management.
- There will always be a trade-off between inference and prediction. Bias shouldn’t be forgotten in this jousting match.