“Correlation” and “causation” get casually thrown around in discussions of data analysis, but the two terms often elude decision-makers who are otherwise comfortable with statistics and data science. Misunderstanding them can lead to incorrect conclusions and actions.
Correlation is a pattern observed among variables of interest. For instance, given data on a person’s height and the total number of mobile phones worldwide by year, we may observe some relationship between the two variables. Correlation is merely the observation that the two variables are associated; it says nothing about whether height influences the total number of mobile phones in the world. It is important to understand the difference between correlation and causation so that we can draw the proper conclusion about the relationship between variables. Experiments, data, and careful analysis are necessary to avoid flawed plans (e.g. plans to increase people’s height in order to grow the mobile phone market).
In a fictional scenario, let’s say that some data scientists at Kurvv wanted to use data to uncover hidden ways to increase revenue for a customer. Kurvv collected every kind of data it could find, and the data set included the height of its employees, weekly sales, the amount spent on advertising, and the number of website visitors.
Let us examine some data to understand the concept of correlation further. Here are 10 data points for the heights of employees:
[5’10’’, 6’0’’, 5’11’’, 5’11’’, 6’0’’, 6’1’’, 6’2’’, 6’3’’, 6’4’’, 6’2’’]
And here is weekly sales data for a 10-week period:
[$1220, $1390, $1720, $1960, $2472, $2680, $2969, $3244, $3423, $3495]
If we pair each height with one week’s sales figure and plot the pairs, they show the following type of relationship:
<Figure 1. Positive correlation between Number of weekly sales and height>
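The strength of the relationship between the height and weekly sales lists above can also be summarized as a single number, the Pearson correlation coefficient, which ranges from -1 to 1. The following is a minimal sketch in Python, not part of the original analysis. It assumes that the i-th height is paired with the i-th week of sales, and the helper name pearson is chosen here for illustration.

```python
from math import sqrt

# Employee heights from the article, written as (feet, inches).
heights = [(5, 10), (6, 0), (5, 11), (5, 11), (6, 0),
           (6, 1), (6, 2), (6, 3), (6, 4), (6, 2)]
# Weekly sales in dollars for the 10-week period.
weekly_sales = [1220, 1390, 1720, 1960, 2472, 2680, 2969, 3244, 3423, 3495]

# Convert heights to inches so they are plain numbers.
heights_in = [feet * 12 + inches for feet, inches in heights]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Assumption: the i-th height is paired with the i-th week's sales.
r = pearson(heights_in, weekly_sales)
print(f"r = {r:.2f}")
```

With this pairing the coefficient comes out at roughly 0.9, a strong positive correlation, even though nobody would argue that employee height drives sales.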
Correlation comes in three types:
Positive correlation is when the data shows that an increase in variable A is accompanied by an increase in variable B. The data used for our example in figure 1 shows a positive correlation between height and weekly sales in dollars at Kurvv.
Negative correlation is when the data shows that an increase in variable A is accompanied by a decrease in variable B or vice versa. The data in figure 2 shows a negative correlation between weekly sales and amount spent on advertising in dollars.
<Figure 2. Negative correlation between weekly sales and advertising costs>
No correlation is when the data shows that an increase or decrease in variable A is accompanied by no consistent change in variable B. The data in figure 3 shows no correlation between amount spent on advertising and number of website visitors over the last 10 weeks.
<Figure 3. No correlation between advertisement costs and number of website visitors>
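The three types can be told apart by the sign and size of the correlation coefficient. The sketch below is only an illustration: the small number lists are made up for the example (they are not Kurvv data), and the 0.3 cutoff in the hypothetical describe helper is an arbitrary assumption rather than a standard rule.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def describe(r, threshold=0.3):
    """Label a coefficient; the threshold is an illustrative assumption."""
    if r >= threshold:
        return "positive correlation"
    if r <= -threshold:
        return "negative correlation"
    return "no correlation"


# Made-up toy data, used only to show the three cases.
a = [1, 2, 3, 4, 5]
rises_with_a = [2, 4, 6, 8, 10]   # B goes up as A goes up
falls_with_a = [10, 8, 6, 4, 2]   # B goes down as A goes up
unrelated = [2, 1, 3, 1, 2]       # B shows no consistent trend

for name, b in [("rises", rises_with_a),
                ("falls", falls_with_a),
                ("unrelated", unrelated)]:
    print(name, "->", describe(pearson(a, b)))
```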
Causation, on the other hand, deals with cause and effect: one event is the reason that another event follows. This matters from a business perspective because knowing whether an event such as a new app update caused sales to increase can be an important part of business decisions. The big takeaway is that in causation, the two events happen together because one causes the other to follow in sequence. More deeply, the existence of one is what causes the other to occur.
In our example at Kurvv, we would be interested in cases where certain decisions led to an increase in revenue or progress toward other business goals. For example, we could find data that informs us that the more time a customer’s sales team spent talking to customers, the more revenue increased. These insights are very important to businesses and data scientists alike.
Other relevant examples of situations that would require causality to be analyzed include:
Event A | Event B
Increase in links to your website | Higher ranking in search engine results
Pricing updated | Increase in total users
Increased marketing spending | Increase in web page traffic
Knowing the difference between correlation and causation is important for any decision-maker. Figuring out whether events are simply correlated or whether a causal relationship exists can be accomplished via experiments and by gathering more data (a separate topic for another time). But for a decision-maker presented with data analysis, asking “Is that correlation or causation?” is a step in the right direction.
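To see why an experiment helps, here is a small simulation sketch. Everything in it is an assumption made for illustration: the data is randomly generated, not real, and the scenario (a hidden "demand" factor that drives both advertising spend and sales, with advertising having no effect of its own) is hypothetical. When advertising follows demand, the two variables look strongly correlated; when advertising is assigned by coin flip, as in an experiment, the correlation largely disappears.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


random.seed(0)  # fixed seed so the simulated numbers are repeatable
n = 2000

# Hidden factor (hypothetical): market demand in each period.
demand = [random.gauss(0, 1) for _ in range(n)]

# Observational data: managers advertise more when demand is high,
# and sales depend ONLY on demand (advertising has no effect here).
ad_observed = [d + random.gauss(0, 0.5) for d in demand]
sales_observed = [d + random.gauss(0, 0.5) for d in demand]

# Experiment: advertising is switched on or off by coin flip,
# so it can no longer track demand.
ad_random = [random.choice([0, 1]) for _ in range(n)]
sales_experiment = [d + random.gauss(0, 0.5) for d in demand]

r_obs = pearson(ad_observed, sales_observed)
r_exp = pearson(ad_random, sales_experiment)
print(f"observational r = {r_obs:.2f}")
print(f"experimental  r = {r_exp:.2f}")
```

In this toy setup the observational data alone would suggest that advertising lifts sales, while the randomized version shows close to no relationship, which matches how the data was built.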
A site by Tyler Vigen, who published findings on different variables that he found to be correlated (some of them are really hilarious): here
An older article (from 2013) that discusses this topic using homicide and ice cream sales data: here