When data is misleading: Selection Bias

November 01st, 2020   •   Ryan Lee   •   Min Read: 3

In our previous post, we covered survivorship bias, a tendency to focus on things that have survived and overlook those that didn’t. It misleads data scientists and distorts their reasoning. Far from being an isolated problem, however, survivorship bias is one form of a broader problem: selection bias. The latter covers the various cases in which a data set is not representative of the population it is meant to describe.

The primary source of these problems is human involvement in collecting data and feeding it to models. To ensure that your machine learning (ML) models don’t get corrupted, it’s critical to be aware of these biases and take corrective action.

Different forms of selection bias

Selection bias takes many different forms. Coverage bias occurs when data is not selected in a representative fashion. Take, for instance, an ML model trained to predict future sales using data collected solely from customers who bought the product. By ignoring feedback from people who opted for a competing product, the company allows coverage bias to creep into its algorithm.

Non-response bias, also known as participation bias, is another example of selection bias. It occurs because of participation gaps in the data-collection process. Returning to the example above, suppose the company decides to conduct phone surveys both with consumers who purchased its product and with those who went to competitors. If those who opted for a competing product are 70% less likely to complete the survey, they end up underrepresented in the sample.
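One common way to compensate for a participation gap like this is post-stratification weighting: each respondent is reweighted so that the groups count in proportion to their share of the target population. The sketch below is only an illustration. The scores, the group sizes, and the 50/50 population split are all made-up assumptions, and the function names are hypothetical.

```python
from collections import Counter


def naive_mean(scores):
    """Plain average that ignores who did and did not respond."""
    return sum(scores) / len(scores)


def post_stratified_mean(scores, groups, population_share):
    """Average in which each group counts according to its population share.

    population_share maps a group name to its (assumed known) share of the
    target population; the shares should sum to 1.
    """
    n = len(scores)
    sample_share = {g: c / n for g, c in Counter(groups).items()}
    # A respondent from an underrepresented group gets a weight above 1.
    weights = [population_share[g] / sample_share[g] for g in groups]
    weighted_total = sum(w * s for w, s in zip(weights, scores))
    return weighted_total / sum(weights)


# Hypothetical survey: 10 buyers answered, but only 3 competitor customers did.
groups = ["buyer"] * 10 + ["competitor"] * 3
scores = [8] * 10 + [4] * 3  # made-up satisfaction scores
population_share = {"buyer": 0.5, "competitor": 0.5}  # assumed, not measured

print("naive mean:", naive_mean(scores))
print("weighted mean:", post_stratified_mean(scores, groups, population_share))
```

The weighted figure sits between the two group averages instead of being pulled toward the buyers, who answered more often. Weighting only helps if the people who did respond resemble the non-respondents in their group, which is itself an assumption worth checking.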

Sampling bias is a challenging problem as well. It occurs when the sample isn’t diverse or random enough. Suppose a cinema runs an email survey to measure customer satisfaction. Instead of feeding the algorithm feedback from a diverse group of people, the person in charge simply selects the first 300 responses. This approach is compromised because the people most enthusiastic about the cinema may voice their opinion faster than typical visitors.
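The difference between taking the first 300 responses and drawing 300 at random can be sketched in a few lines. The data below is synthetic and deliberately exaggerated: it assumes 1,000 responses in arrival order, with the earliest 300 all coming from enthusiastic visitors. The function names are hypothetical.

```python
import random


def first_n(responses, n):
    """Take the earliest n responses (the biased shortcut)."""
    return responses[:n]


def simple_random_sample(responses, n, seed=0):
    """Draw n responses uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(responses, n)


def mean(values):
    return sum(values) / len(values)


# Synthetic responses in arrival order: fans (score 5) answer first,
# typical visitors (score 3) answer later.
responses = [5] * 300 + [3] * 700

print("all responses:", mean(responses))
print("first 300:", mean(first_n(responses, 300)))
print("random 300:", mean(simple_random_sample(responses, 300)))
```

In this toy setup the first 300 responses overstate satisfaction, while the random draw lands close to the average of all responses. Random sampling of the responses does not fix the separate problem of who chose to respond in the first place.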

Data scientists should also look out for reporting bias. It occurs when the frequency of events in a data set does not reflect reality because mostly extreme cases get recorded. Online movie or book reviews are often prone to reporting bias because it is mostly people who either love or hate the product who care enough to write a review.

Furthermore, selection bias extends beyond the realm of data science. The Caveman effect is a case in point. It refers to our tendency to imagine prehistoric people living in caves because of the many paintings, fire pits, and burial sites found in these locations. That image isn’t entirely correct, though. Humans also painted on trees and animal skins and built structures in open fields. But unlike caves, open environments were unforgiving, and rain, wind, and other forces of nature destroyed what humans made there. The only evidence we associate with prehistoric people is thus what survived in caves, causing selection bias in our understanding of past events.

How to avoid selection bias?

Avoiding selection bias is critical to building reliable ML models. From a tactical perspective, there are several things you can do to prevent selection bias (and other biases) from distorting your data sets. In the pre-processing stage, it’s recommended that you validate the integrity of the data source and the methods of measurement. Algorithms in this stage can suppress protected attributes, change class labels, or perform other actions to ensure the input data is balanced.
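As a rough illustration of two such pre-processing steps, the sketch below drops protected attributes from the records and computes per-record reweighing weights, a known technique that assigns each (group, label) combination the weight expected under independence divided by the observed frequency. The article does not prescribe this particular method. The records, field names, and function names are hypothetical, and every group is assumed to contain examples of every label.

```python
from collections import Counter


def suppress(rows, protected):
    """Return copies of the records without the protected attributes."""
    return [{k: v for k, v in row.items() if k not in protected} for row in rows]


def reweighing_weights(groups, labels):
    """Weight each record by P(group) * P(label) / P(group, label)."""
    n = len(labels)
    group_count = Counter(groups)
    label_count = Counter(labels)
    pair_count = Counter(zip(groups, labels))
    return [
        (group_count[g] * label_count[y]) / (n * pair_count[(g, y)])
        for g, y in zip(groups, labels)
    ]


def weighted_positive_rate(groups, labels, weights, group):
    """Weighted share of positive labels within one group."""
    total = sum(w for g, w in zip(groups, weights) if g == group)
    positive = sum(
        w for g, y, w in zip(groups, labels, weights) if g == group and y == 1
    )
    return positive / total


# Hypothetical training set: group "a" is mostly labeled 1, group "b" mostly 0.
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
weights = reweighing_weights(groups, labels)

print(weighted_positive_rate(groups, labels, weights, "a"))
print(weighted_positive_rate(groups, labels, weights, "b"))
print(suppress([{"age": 41, "gender": "f", "visits": 7}], protected={"gender"}))
```

After reweighing, the weighted positive rate is the same in both groups, so a learner that honors sample weights no longer sees the label as correlated with group membership. Suppressing a protected attribute alone is usually not enough, because other features can act as proxies for it.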

During the in-processing stage, data scientists can opt for several mitigation strategies. They can, for instance, integrate a fairness penalty into the loss function. Or they can take advantage of generative adversarial networks (GANs) to achieve fair classification. And in post-processing, methods such as a Bayes optimal equalized odds predictor can be of value in mitigating biases.
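To make the idea of a fairness penalty concrete, here is a minimal sketch that adds a demographic parity term to an ordinary log loss. The choice of demographic parity as the penalty, the weighting factor lam, and all function names are assumptions made for illustration; other fairness criteria would lead to a different penalty term.

```python
import math


def log_loss(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def demographic_parity_gap(p_pred, groups):
    """Largest difference in mean predicted probability between groups."""
    sums, counts = {}, {}
    for p, g in zip(p_pred, groups):
        sums[g] = sums.get(g, 0.0) + p
        counts[g] = counts.get(g, 0) + 1
    means = [sums[g] / counts[g] for g in sums]
    return max(means) - min(means)


def penalized_loss(y_true, p_pred, groups, lam=1.0):
    """Accuracy term plus lam times the fairness penalty."""
    return log_loss(y_true, p_pred) + lam * demographic_parity_gap(p_pred, groups)


# Hypothetical predictions that favor group "a" over group "b".
y_true = [1, 1, 0, 0]
p_pred = [0.9, 0.7, 0.3, 0.1]
groups = ["a", "a", "b", "b"]

print("plain loss:", log_loss(y_true, p_pred))
print("penalized loss:", penalized_loss(y_true, p_pred, groups, lam=2.0))
```

A larger lam pushes the optimizer toward predictions with similar averages across groups, at some cost in raw accuracy. In a real training loop the penalty would need to be differentiable with respect to the model parameters, which this plain-Python sketch does not address.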

Taking action on a strategic level is equally important. Companies should produce educational material and run training that raises awareness of biases among employees and other stakeholders. It’s also vital to identify a fairness definition for each use case and to write clear instructions for data labeling.

Detecting and mitigating biases should be a continuous process. There are a number of open-source libraries that help with these efforts, and they can become an integral part of development workflows. On top of that, hiring external experts is another option for detecting biases your team might have overlooked.
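A recurring check of this kind can be as simple as comparing error rates across groups every time a model is retrained. The sketch below measures equalized odds gaps, meaning the spread in true positive rate and false positive rate between groups. It is written from scratch and is not the API of any particular library; the function names and data are hypothetical, and each group is assumed to contain both positive and negative examples.

```python
def rates(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)


def equalized_odds_gaps(y_true, y_pred, groups):
    """Return (TPR gap, FPR gap): max minus min of each rate across groups."""
    tprs, fprs = [], []
    for g in sorted(set(groups)):
        idx = [i for i, value in enumerate(groups) if value == g]
        tpr, fpr = rates([y_true[i] for i in idx], [y_pred[i] for i in idx])
        tprs.append(tpr)
        fprs.append(fpr)
    return max(tprs) - min(tprs), max(fprs) - min(fprs)


# Hypothetical evaluation set: the model is perfect on group "a" only.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

tpr_gap, fpr_gap = equalized_odds_gaps(y_true, y_pred, groups)
print("TPR gap:", tpr_gap, "FPR gap:", fpr_gap)
```

A team could fail a build or raise an alert when either gap exceeds a threshold it has agreed on for the application; the right threshold depends on the fairness definition chosen for that use case.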

Preventing wrong conclusions 

Selection bias can distort the performance and accuracy of your ML models. No matter how large your data set is, it remains only a snapshot of reality and needs to be refined and guarded against different biases. Failure to account for these problems can lead to wrong conclusions that cost time and money. Eliminating biases is thus critical in your bid to harness the power of algorithms and gain a competitive edge.


Photo by Franci Strümpfer from FreeImages