Beginner’s Guide to Averages in Data Analysis

As data scientists, we sometimes have to use very complex techniques, involving very clever statistics. But have you ever heard of Occam’s Razor? To paraphrase, it means:

“The simplest solution is often the best.”

You will run into this many times in your day-to-day. 

For us, the simplest way to describe performance is typically the average: the easiest way to describe a population in terms of its middle value. There are a few types of average, and even though it is such a core statistical metric, they are often misunderstood.

This post aims to clarify what the different types of average are and when to use each one.

We’re going to explain averages in the sense of “finite sets of numbers”, where we have an explicit dataset of numeric values and we can calculate the summary statistics.

Our example set of data

We will have 20 numbers as our example dataset.

X = {2,2,2,3,3,3,3,4,4,4,5,6,6,7,8,9,9,10,10,10}.

Mean

The first average we’re going to learn is the one that you’re probably most familiar with, the one that people generally mean when they mention “the average” – the mean. More formally, we are talking about the arithmetic mean.

In a dataset, the mean is calculated by summing up all the values in the set, and dividing by how many observations there are.

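In fancy mathematical terms, if X is our set of n values $x_1, \dots, x_n$, we use $\bar{x}$ (“x-bar”) to represent the mean. Σ is used to denote the sum, running from the index at the bottom of the sign to the index at the top of the sign:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$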

If we fill in the numbers from our example into the above formula, we arrive at the result:

$$\bar{x} = \frac{2 + 2 + 2 + 3 + \dots + 10}{20} = \frac{110}{20} = 5.5$$

Because the mean is calculated using every value in the set, if we have an extremely large or small number, it will skew the mean – bias it in a way that is probably less helpful.

If we extend our set of 20 numbers by adding in a 21st value of 100, the arithmetic mean would then become 10:

$$\bar{x} = \frac{110 + 100}{21} = \frac{210}{21} = 10$$

This is almost double the previous mean of 5.5, just by adding one value!

This is very important to bear in mind when you’re analysing distributions of data: the arithmetic mean is a great start to summarising data, but it is heavily influenced by outliers (extreme values).
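As a quick check of the numbers above, here is a minimal Python sketch using only the standard library. The variable names (`X`, `X_with_outlier` and so on) are our own labels for illustration, not part of any particular tool.

```python
from statistics import mean

# The 20-value example dataset from this post
X = [2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10, 10]

# Arithmetic mean by hand: sum of the values divided by how many there are
mean_by_hand = sum(X) / len(X)

# The same calculation using the standard library
mean_stdlib = mean(X)

# Add a single extreme value (100) and recompute
X_with_outlier = X + [100]
mean_with_outlier = sum(X_with_outlier) / len(X_with_outlier)

print(mean_by_hand)       # 5.5
print(mean_stdlib)        # 5.5
print(mean_with_outlier)  # 10.0
```

One extra value out of 21 is enough to move the mean from 5.5 to 10, which is why it is worth checking for outliers before relying on the mean alone.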

Median

The median is the middle number in our set when it’s sorted from smallest to largest. We use the median because it’s less susceptible to outliers than the mean. Rather than calculating with all the values, we’re just picking the one in the middle. This can give us a better representation of the “true middle” of the population.

Luckily, our set is already in ascending order! We have 20 numbers, so the median falls between the 10th and 11th numbers: 4 and 5.

Because we have an even number of values in our set, we take the mean of the two middle ones.

The mean of 4 and 5 is 4.5, so that’s our median!
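The same steps can be sketched in Python. This is just an illustration using the standard library, and the variable names are our own. We also reuse the outlier of 100 from the mean section to see how little the median moves.

```python
from statistics import median

# The 20-value example dataset from this post
X = [2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10, 10]

# Sort first, in case the data is not already in ascending order
ordered = sorted(X)
n = len(ordered)

# Even number of values: take the two middle ones
# (the 10th and 11th values, which are indices 9 and 10 in Python)
middle_low = ordered[n // 2 - 1]
middle_high = ordered[n // 2]
median_by_hand = (middle_low + middle_high) / 2

# The same calculation using the standard library
median_stdlib = median(X)

# With the outlier there are 21 values, so the median is the single middle one
median_with_outlier = median(X + [100])

print(median_by_hand)       # 4.5
print(median_stdlib)        # 4.5
print(median_with_outlier)  # 5
```

Adding 100 nearly doubled the mean, but it only nudges the median from 4.5 to 5.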

Mode

This one is a pretty simple one.

Given a set of values, the mode is the one that appears most often.

To work this out, you just count up how many times each value occurs in the set, and pick the one that has occurred the most.

We can show the results of this in a table for our example:

Value    Count
2        3
3        4
4        3
5        1
6        2
7        1
8        1
9        2
10       3
Total    20

As we can see, the total number of values is 20, which is a good confirmation that we are using the complete dataset.

The value 3 occurs the most in the dataset, appearing 4 times.

Therefore, the mode is 3.
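If you would rather not count by hand, the tally can be sketched with `collections.Counter` from the Python standard library. The variable names here are our own, chosen for illustration.

```python
from collections import Counter

# The 20-value example dataset from this post
X = [2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10, 10]

# Count how many times each value occurs
counts = Counter(X)

# Print the frequency table, smallest value first
for value, count in sorted(counts.items()):
    print(value, count)

# Sanity check: the counts should add up to the size of the dataset
total = sum(counts.values())
print(total)  # 20

# The mode is the value with the highest count
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)  # 3 4
```

In this dataset there is a single most common value. If two or more values were tied for the highest count, you would need to decide how to report them; `statistics.multimode` (Python 3.8+) returns all of the tied values.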

Geometric Mean

Whereas the arithmetic mean uses the sum of all the values to calculate the average value of a finite set, the geometric mean uses their product.

The geometric mean is typically used for things like growth rates and interest rates: quantities that grow by a factor each period. You’ll find it used often in business and finance applications.

$$\left( \prod_{i=1}^{n} x_i \right)^{\frac{1}{n}}$$

The big Pi sign (Π) in the brackets means “the product of”, similar to the capital Sigma (Σ) above that denoted the “sum of”.

You can see how it looks similar to the arithmetic mean: it uses all the values in the set, and has a scaling factor of 1/n, but in this case it’s multiplying all the values together then taking the n-th root (raising to the power of 1/n).

If we plug the values in and calculate, we arrive at:

$$\left( 2 \times 2 \times 2 \times 3 \times \dots \times 10 \right)^{\frac{1}{20}} \approx 4.75$$

Technically the geometric mean of X is not precisely 4.75, but we’ve rounded to two decimal places.
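Here is a small Python sketch of the same calculation, assuming Python 3.8 or later (which is when `math.prod` and `statistics.geometric_mean` were added). The variable names are our own.

```python
import math
from statistics import geometric_mean

# The 20-value example dataset from this post
X = [2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10, 10]
n = len(X)

# Geometric mean by hand: multiply all the values together,
# then take the n-th root (raise to the power of 1/n)
product = math.prod(X)
gm_by_hand = product ** (1 / n)

# The same calculation using the standard library
gm_stdlib = geometric_mean(X)

print(round(gm_by_hand, 2))  # 4.75
print(round(gm_stdlib, 2))   # 4.75
```

Note that the geometric mean only makes sense for positive values, since a zero or negative number in the product breaks the n-th root.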


This is probably the most confusing and least intuitive average, but its use is generally limited to very specific situations so you won’t need to think about it all that often.


We just include it here as it is still technically an average!
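To make the growth-rate use case a little more concrete, here is a rough sketch with made-up numbers. The growth factors below are purely hypothetical (a value that grows 10%, then 20%, then shrinks 10%), and the variable names are our own.

```python
import math
from statistics import geometric_mean, mean

# Hypothetical growth factors for three periods (made up for illustration):
# +10%, +20%, then -10%
growth_factors = [1.10, 1.20, 0.90]
periods = len(growth_factors)

# Overall growth across all periods is the product of the factors
total_growth = math.prod(growth_factors)

# The geometric mean is the single per-period factor that,
# applied every period, gives the same overall growth
average_factor = geometric_mean(growth_factors)

# The arithmetic mean of the factors, for comparison
naive_factor = mean(growth_factors)

print(total_growth)               # overall growth over the three periods
print(average_factor ** periods)  # matches total_growth (up to rounding error)
print(naive_factor ** periods)    # larger than total_growth
```

Compounding the geometric mean reproduces the overall growth, whereas compounding the arithmetic mean overstates it. That is the main reason the geometric mean is preferred for quantities that multiply from period to period.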

Thanks!

Hopefully this brief overview has helped you gain an intuitive understanding of averages in data analysis. They are super helpful, and using the right statistic in the right situation can be incredibly enlightening.


We’ll be moving on soon to looking at a few more complex but just as helpful statistics such as standard deviation, variance, and others.

See you there!