Contributor: Lynn Ellis. Lesson ID: 13759
Outliers can disproportionately affect your analysis of your data. Can you just remove them? What happens if you do? In this lesson, we will explore the answers to those questions.
Today, we are going to be data detectives!
We have to be good detectives because our detective work will tell us whether we can eliminate an outlier or not.
(If you need a refresher on what an outlier is, check out our lesson under the Additional Resources in the right-hand sidebar.)
Outliers can cause problems for evaluating data because they can influence the mean of the data set and give misleading information.
For example, nine students in a statistics class take all of the change out of their pockets and put it on their desks. Here are the amounts of money (in cents) that each student had:
50, 67, 0, 97, 76, 87, 65, 85, 75
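When we calculate the upper and lower fences, we start from the quartiles: Q1 is 57.5 and Q3 is 86, so the interquartile range is 28.5 and 1.5 times that is 42.75. That means anything above 128.75 or below 14.75 is an outlier. So zero is an outlier in our data set.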
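To make the fence calculation concrete, here is a minimal Python sketch. It assumes the quartile convention that leaves the median out of both halves when there is an odd number of values, which is the convention that gives the upper fence of 128.75 quoted above. The function name `fences` and the variable names are ours, chosen for illustration.

```python
from statistics import median

def fences(values):
    """Return (lower_fence, upper_fence) using the 1.5 * IQR rule.

    Assumption: quartiles are the medians of the lower and upper halves,
    with the overall median left out when the count is odd.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]    # values before the middle position
    upper_half = ordered[-half:]   # values after the middle position
    q1 = median(lower_half)
    q3 = median(upper_half)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

cents = [50, 67, 0, 97, 76, 87, 65, 85, 75]
lower_fence, upper_fence = fences(cents)

# Any value outside the fences is flagged as an outlier.
outliers = [c for c in cents if c < lower_fence or c > upper_fence]
print(lower_fence, upper_fence, outliers)  # 14.75 128.75 [0]
```

Other quartile conventions can give slightly different fences, so a value close to a fence might be flagged under one convention and not under another. The zero here is far enough below the lower fence that it is flagged either way.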
If I take the mean of the data set with the zero in it, I get a mean of 66.89 cents. However, if I take the mean without the zero in the data set, I get a mean of 75.25.
That one outlier has had a significant influence: it pulls the mean down by more than 8 cents.
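If you want to check those two means yourself, a short Python sketch like the one below will do it. The variable names are ours, and dropping the zero here is only to see its effect on the mean, not a decision that the data point should be removed.

```python
from statistics import mean

cents = [50, 67, 0, 97, 76, 87, 65, 85, 75]

# Same data with the zero left out, for comparison only.
without_zero = [c for c in cents if c != 0]

mean_with = mean(cents)            # 602 / 9
mean_without = mean(without_zero)  # 602 / 8

print(round(mean_with, 2), round(mean_without, 2))  # 66.89 75.25
```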
So, can we just remove the outlier? The answer is, it depends. This is where we have to start being detectives.
A good detective wants to answer specific questions. For us as data detectives, those questions are:
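Was the data point recorded incorrectly? If so, we should remove that data point.
Was the data point measured or counted incorrectly? If so, we should remove that data point.
Is the data point from a member of the population we are studying? If it is, we should not remove it.
Is the data point due to natural variability? If it is, we should not remove it.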
Let's apply these questions to our example above.
Was it recorded incorrectly? It's possible that the student had 50 cents in his or her pocket, and it got recorded without the 5. But there are people who do not carry change in their pockets, so that is an assumption that we cannot make without further information.
Was it counted incorrectly? Since the number is a zero, it is doubtful that someone counted incorrectly. Again, we can't assume a mistake here without further information.
Is it from a member of the population? If the person with no change in his or her pockets was not a member of the statistics class, then they would not be a member of the population we are looking at.
For instance, if a visitor came to class that day and had no change but everyone in the class did, that outlier would not be from a member of the population. In that case, we would want to remove the outlier from the data set.
If the person with no money in his or her pocket was part of the class, we must keep the outlier.
Is it due to natural variability? In this case, it likely is. Enough people do not carry money in their pockets that we can see the zero as a naturally occurring variation in the data. For this reason, we would not want to remove the data point.
As you can see, removing an outlier is a matter of being a detective and making an informed judgment.
Make some informed judgments about these scenarios to check your understanding.
If you understand those, you are ready to move on to the Got It? section to practice your skills.