
When data is misleading: Survivorship Bias

September 09th, 2020   •   Ryan Lee   •   Min Read: 4

The logical error of only considering the information that is seen.

 

TL;DR  Survivorship bias is people’s tendency to focus on the data, groups, and things that are present because they survived a selection process. It has caused really smart people throughout history to make flawed decisions, and it continues to do so. You need to learn how to spot it in order to prevent it from wreaking havoc in your product and/or business.

 

Young students might be forgiven for believing that dropping out of college to pursue big ideas is a key to success. After all, Steve Jobs, Bill Gates, and Mark Zuckerberg all did so. But this perception hides the many more people who left college and failed in their business endeavors. By focusing on famous tech moguls and ignoring unsuccessful dropouts, students are exhibiting what is known as “survivorship bias.”

This tendency to draw conclusions from the things that survived, while ignoring those that didn’t, creeps into many aspects of everyday life, including data science and artificial intelligence (AI). And biased interpretations of data have major consequences. They can lead to inaccurate and overly optimistic predictions and conclusions, prompting companies to take futile actions. Knowing how to spot and avoid survivorship bias is thus vital for anyone looking to take advantage of data and smart algorithms.

 

Two ways survivorship bias is shown

Survivorship bias affects decision-making in two main ways: inferring a norm and inferring causality. Inferring a norm is the tendency to believe that the things that survived a process, and whose existence can be proved or directly observed in the present, are the only ones that ever existed. For instance, the medieval monuments we see today are primarily made of stone. This might prompt tourists to believe that most buildings in the past were made of stone, even though wood was an equally important construction material.

Inferring causality is the tendency to believe that anything that survived a process was shaped by it. People might assume, for instance, that working at Microsoft makes someone a great startup founder, given that 50% of the startups that became unicorns (valued at $1B or more) have founders from Microsoft (these are made-up numbers). And while this could be true, the claim cannot be made without also counting the non-unicorn startups with founders from Microsoft.
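To make the Microsoft example concrete, here is a minimal sketch in Python. Like the 50% figure above, every count in it is made up, and the function name is hypothetical. It contrasts two different questions: what share of unicorns have ex-Microsoft founders, and what share of startups with ex-Microsoft founders became unicorns.

```python
def unicorn_rates(unicorns_from_ms, unicorns_total, startups_from_ms):
    """Return two rates that are easy to confuse.

    share_of_unicorns: fraction of all unicorns with ex-Microsoft founders
                       (what the survivors show).
    success_rate:      fraction of ex-Microsoft startups that became unicorns
                       (what a causal claim actually needs).
    """
    share_of_unicorns = unicorns_from_ms / unicorns_total
    success_rate = unicorns_from_ms / startups_from_ms
    return share_of_unicorns, success_rate


# Hypothetical counts, for illustration only.
share, rate = unicorn_rates(
    unicorns_from_ms=50,
    unicorns_total=100,
    startups_from_ms=5000,
)
print(f"Share of unicorns with ex-Microsoft founders: {share:.0%}")
print(f"Ex-Microsoft startups that became unicorns:   {rate:.0%}")
```

With these invented counts, half of the unicorns trace back to Microsoft, yet only a small fraction of Microsoft-founder startups succeed. The second number only becomes visible once the failed startups are counted.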

 

Survivorship bias in the business world  

Biased interpretations of events and data come in many shapes and forms. Take, for example, a company that launched a data analytics tool. After one month, it discovered that marketers are the only user group that opts for paid plans and creates sophisticated analyses. Managers might conclude that the tool resonates with marketing professionals and empowers their work. This reasoning, however, is flawed as the company is only considering active users.

To avoid the survivorship bias trap, managers also need to look at the people who gave up on the tool. A large number of churned marketers, for instance, would disprove the hypothesis that the software is particularly suited to this user group. Also, users might be skillful at creating various analyses despite the flawed design of the tool, not because of it. Knowing customers’ skill levels before onboarding would allow the managers to better judge the impact of their product.
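A rough sketch of that check, using invented sign-up records (the segment names, counts, and function names are all assumptions for illustration): it compares each segment’s share of the active users with its retention rate once churned users are counted too.

```python
from collections import defaultdict

# Hypothetical sign-up records: (user segment, still active after one month).
signups = [
    ("marketer", True), ("marketer", True),
    ("marketer", False), ("marketer", False), ("marketer", False),
    ("marketer", False), ("marketer", False), ("marketer", False),
    ("analyst", True), ("analyst", False),
]


def share_of_active(records):
    """Each segment's share of the active users only (the survivor view)."""
    survivors = [segment for segment, is_active in records if is_active]
    return {s: survivors.count(s) / len(survivors) for s in sorted(set(survivors))}


def retention_by_segment(records):
    """Retention rate per segment, with churned users in the denominator."""
    totals = defaultdict(int)
    active = defaultdict(int)
    for segment, is_active in records:
        totals[segment] += 1
        active[segment] += int(is_active)
    return {segment: active[segment] / totals[segment] for segment in totals}


print("Share of active users:", share_of_active(signups))
print("Retention by segment: ", retention_by_segment(signups))
```

In this toy data, marketers make up most of the active users but also churn far more often than they stay, which is exactly the pattern that looking at active users alone would hide.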

Another likely scenario of survivorship bias in action is a startup building a machine learning model for churn prediction. Its engineers have looked at active customers to identify factors useful for the forecast and are now developing the new tool. What they overlooked, however, is that their training data leaves out the customers that already churned, so the model never sees an example of the very outcome it is supposed to predict. By only considering surviving customers, the software will produce biased insights that won’t be effective at churn prevention.
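One quick sanity check before training such a model is to count the labels in the data set. The sketch below assumes a hypothetical list of customer records with a boolean "churned" field; the field names and values are invented.

```python
# Hypothetical customer records; "churned" is the label a churn model must predict.
customers = [
    {"id": 1, "logins_last_month": 12, "churned": False},
    {"id": 2, "logins_last_month": 9, "churned": False},
    {"id": 3, "logins_last_month": 1, "churned": True},
    {"id": 4, "logins_last_month": 0, "churned": True},
    {"id": 5, "logins_last_month": 7, "churned": False},
]


def label_counts(rows):
    """Count how many rows carry each value of the 'churned' label."""
    counts = {True: 0, False: 0}
    for row in rows:
        counts[row["churned"]] += 1
    return counts


# The survivor view: only customers who are still active.
active_only = [row for row in customers if not row["churned"]]

print("Active customers only:", label_counts(active_only))
print("Full customer history:", label_counts(customers))
```

If the count of churned examples is zero, as it is for the active-only subset here, there is nothing for a model to learn churn from, no matter which algorithm is used.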

Biased actions can have even more serious consequences. Say, for instance, that a software development company is building a new fraud prevention algorithm for a bank. Engineers could decide to build a machine learning model to protect against the fraudsters that breached the bank’s previous solution. But accounting only for ‘survivors’ would mean that all the other attackers the old system blocked are now ignored and might find it easier to trick the new algorithm.

 

How to avoid survivorship bias?

Survivorship bias affects individuals and companies in various ways. Typically, it leads to overly optimistic conclusions because the ‘surviving’ data, which is resilient by definition, dominates the outcome, while the observations that dropped out along the way are ignored. As a result, predictions shaped by survivorship bias aren’t representative of real-life environments.

Preventing survivorship bias, therefore, requires constant vigilance. It’s important to be selective with data sources and to take into consideration observations that no longer exist or cannot be (easily) seen. And before every analysis, think about the data that’s not present but should be, as that’ll help ensure that your individual reasoning, as well as your machine learning models, is as objective as possible.
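As a small illustration of both points, the sketch below uses an invented cohort of customers with satisfaction scores (the data, field layout, and the `coverage` helper are assumptions, not an established method). It first measures how much of the original cohort an analysis actually covers, then compares the survivors-only average with the full-cohort average.

```python
from statistics import mean

# Hypothetical survey data: (customer id, satisfaction score 1-10, still a customer).
cohort = [
    ("c1", 9, True),
    ("c2", 8, True),
    ("c3", 7, True),
    ("c4", 3, False),
    ("c5", 2, False),
    ("c6", 1, False),
]


def coverage(analysis_ids, cohort_ids):
    """Fraction of the original cohort that is present in the analysis set."""
    cohort_ids = set(cohort_ids)
    return len(cohort_ids & set(analysis_ids)) / len(cohort_ids)


all_ids = [cid for cid, _, _ in cohort]
survivor_ids = [cid for cid, _, active in cohort if active]

# Average score seen when only current customers are analysed...
survivor_mean = mean(score for _, score, active in cohort if active)
# ...versus the average over everyone who was ever in the cohort.
full_mean = mean(score for _, score, _ in cohort)

print(f"Cohort coverage of survivor-only analysis: {coverage(survivor_ids, all_ids):.0%}")
print(f"Mean score, survivors only: {survivor_mean:.1f}")
print(f"Mean score, full cohort:    {full_mean:.1f}")
```

A low coverage figure is a warning sign in itself: it says a large part of the original population is missing from the analysis, and the gap between the two averages shows how much rosier the survivor view can look.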

 

Working on AI projects the right way

AI and machine learning technologies enable companies to retain a competitive edge by providing forecasting, automation, and optimization capabilities. But getting the most out of your data requires accounting for survivorship bias. Being aware of this issue is the first step to avoiding it. Then, actively looking for signs of the bias will help ensure that your data sets reflect reality and aren’t irrationally optimistic. And with these foundations in place, you’ll be ready to get the most out of AI-driven efforts.

 

Other interesting real-life Survivorship Bias examples 

In 1987, a study was published stating that cats falling 5 stories or more had fewer injuries than those falling from lower heights, supposedly because they reach terminal velocity and relax their muscles. This was based on fall injury records of cats provided by veterinarians. Of course, this excluded all the cats that died after falling from 5 stories or more and so never made it into the veterinarians’ records.

During World War II, it was proposed that the US military add more armor to the areas of its aircraft that showed the most bullet damage after returning from a mission. Of course, this excluded data from all the airplanes that did not make it back. The correct decision would be to add more armor to the areas with less bullet damage, since planes hit in those areas were the ones that failed to return.
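The aircraft reasoning can be sketched with made-up hit counts. Note the big assumption: in reality, the hits on lost planes could not be observed at all, so `lost_hits` below is purely hypothetical and exists only to show why the survivor data points the wrong way.

```python
# Hypothetical hit counts per aircraft section. Only returned_hits would have
# been observable; lost_hits is invented to illustrate the hidden half of the data.
returned_hits = {"wings": 40, "fuselage": 30, "engine": 5}
lost_hits = {"wings": 5, "fuselage": 5, "engine": 25}

# Survivor view: the section with the most visible damage on returning planes.
most_damage_seen = max(returned_hits, key=returned_hits.get)

# Full view: of all planes hit in a section, what fraction never came back?
loss_rate = {
    section: lost_hits[section] / (lost_hits[section] + returned_hits[section])
    for section in returned_hits
}
most_dangerous = max(loss_rate, key=loss_rate.get)

print("Most damage seen on returning planes:", most_damage_seen)
print("Highest loss rate when hit:          ", most_dangerous)
```

In this toy data the section with the least visible damage on returning planes is also the one where a hit is most often fatal, which is the case for armoring it.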


 

Photo by Jari Hytönen on Unsplash