The Three Most Common Statistical Tests You Should Deeply Understand
If, like me, you are not a fan of code formatting in LinkedIn articles, you can also read this article on Medium.
Hypothesis testing is one of the most fundamental elements of inferential statistics. In modern languages like Python and R, these tests are easy to conduct, often with a single line of code. But it never fails to puzzle me how few people use them or understand how they work. In this article I want to use an example to walk through three common hypothesis tests and how they work under the hood, as well as how to run them in R and Python and how to interpret the results.
The general principles and process of hypothesis testing
Hypothesis testing exists because it is almost never the case that we can observe an entire population when trying to make a conclusion or inference about it. Almost always, we are trying to make that inference on the basis of a sample of data from that population.
Given that we only ever have a sample, we can never be 100% certain about the inference we want to make. We can be 90%, 95%, 99%, 99.999% certain, but never 100%.
Hypothesis testing is essentially about calculating how certain we can be about an inference based on our sample. The most common process for calculating this has several steps:
Assume the inference is not true on the population — this is called the null hypothesis
Calculate the statistic of the inference on the sample
Understand the expected distribution of the sampling error around that statistic
Use that distribution to work out the maximum probability of observing your sample statistic if the null hypothesis were true
Use a chosen ‘likelihood cutoff’, known as alpha, to make a binary decision on whether to accept or reject the null hypothesis. The most commonly used value of alpha is 0.05. That is, we usually reject a null hypothesis if the maximum probability of our sample statistic occurring under it is less than 1 in 20.
The salespeople data set
To illustrate some common hypothesis tests in this article I will use the salespeople dataset which can be obtained here. Let’s download it in R and take a quick look at the first few rows.
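A minimal sketch of how this might look in R is below; the download URL is an assumption based on the dataset that accompanies the author's handbook, so substitute the link from the article if it differs.

```r
# load the salespeople data from its hosted csv file (assumed URL) and
# inspect the first few rows
url <- "http://peopleanalytics-regression-book.org/data/salespeople.csv"
salespeople <- read.csv(url)
head(salespeople)
```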
We see four columns of data:
promoted — a binary value indicating if the salesperson was promoted or not in the recent promotion round
sales — the recent sales made by the salesperson in thousands of dollars
customer_rate — the recent average rating by customers of the salesperson on a scale of 1 to 5
performance — the most recent performance rating of the salesperson where a rating of 1 is the lowest and 4 is the highest.
Example 1 — Welch’s t-test
Welch’s t-test is a hypothesis test for determining if two populations have different means. There are a number of varieties of this test, but we will look at the two sample version and we will ask if high performing salespeople generate higher sales than low performing salespeople in the population.
We start by assuming our null hypothesis which is that the difference in mean sales between high performers and low performers in the population is zero or less. Now we calculate our difference in means statistic for our sample.
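As a rough sketch in R, assuming the salespeople data frame loaded earlier, where a performance rating of 4 marks high performers and 1 marks low performers:

```r
# sales vectors for the top and bottom performance groups
sales_high <- subset(salespeople, performance == 4)$sales
sales_low  <- subset(salespeople, performance == 1)$sales

# difference in mean sales between the two groups, ignoring missing values
mean(sales_high, na.rm = TRUE) - mean(sales_low, na.rm = TRUE)
# around 155 (thousands of dollars)
```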
So we see that in our sample, high performers generate around $155k more in sales than low performers.
Now, we are assuming that sales is a random variable, that is, that the sales of one salesperson are independent of those of another. Therefore we expect the difference in mean sales between the two groups to also be a random variable. So we expect the true population difference to lie on a t-distribution centered around our sample statistic; the t-distribution is used because we are estimating a normal distribution from a sample. To get the precise t-distribution we need the degrees of freedom, which can be determined from the Welch-Satterthwaite equation (100.98 in this case). We also need the standard deviation of the mean difference, known as the standard error, which we can calculate to be 33.48. See here for more details on these calculations.
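As a sketch of where those two numbers come from, assuming the sales_high and sales_low vectors from the previous snippet:

```r
# sample sizes and variances for each group
n1 <- sum(!is.na(sales_high)); n2 <- sum(!is.na(sales_low))
v1 <- var(sales_high, na.rm = TRUE); v2 <- var(sales_low, na.rm = TRUE)

# standard error of the difference in means
se <- sqrt(v1/n1 + v2/n2)

# Welch-Satterthwaite approximation of the degrees of freedom
deg_f <- (v1/n1 + v2/n2)^2 / ((v1/n1)^2/(n1 - 1) + (v2/n2)^2/(n2 - 1))
```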
Knowing these parameters, we can create a graph of the t-distribution around our sample statistic.
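One way to sketch such a graph in base R, using approximate values from the calculations above:

```r
diff_means <- 155     # sample difference in mean sales (approximate)
se <- 33.48           # standard error of the difference
deg_f <- 100.98       # Welch-Satterthwaite degrees of freedom

# t-distribution of the population difference, centered on the sample statistic
curve(dt((x - diff_means)/se, df = deg_f)/se,
      from = diff_means - 6*se, to = diff_means + 6*se,
      xlab = "Difference in mean sales", ylab = "Density")
abline(v = 0, col = "red")   # boundary of the null hypothesis (difference of zero)
```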
We can now see the expected probability distribution for our true population statistic. We can also mark the position on this distribution that represents a difference of zero, the boundary of our null hypothesis. By taking the area under this distribution to the left of that red line, we calculate the maximum probability of our sample statistic occurring if the null hypothesis were true. Usually this is done by working out the number of standard errors needed to get to the red line, known as the t-statistic.
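In this case, using the difference of around 155 and the standard error of 33.48 from above, a rough sketch of the calculation is:

```r
# t-statistic: number of standard errors between the sample statistic and zero
t_statistic <- (155 - 0)/33.48
t_statistic
# approximately 4.63
```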
So our red line is 4.63 standard errors away from the sample statistic. We can use some built-in functions in R to calculate the associated area under the curve for this t-statistic on a t-distribution with 100.98 degrees of freedom. This represents the maximum probability of our sample statistic occurring under the null hypothesis, and is known as the p-value of the hypothesis test.
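A minimal sketch, using R's built-in pt function for the cumulative t-distribution:

```r
# area under the t-distribution to the left of the red line, 4.63 standard
# errors below the sample statistic; this is the p-value
pt(-4.63, df = 100.98)
# approximately 0.000005
```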
So we determine that the maximum probability of our sample statistic occurring under the null hypothesis is 0.000005, much less than even a very stringent alpha. In most cases this would be considered far too unlikely for the null hypothesis to stand, so we reject it in favour of the alternative hypothesis: that high performing salespeople generate higher sales than low performing salespeople.
To run this two sample t-test in R, you use the t.test function with an alternative hypothesis of "greater". In the output below you’ll see the various statistics that we discussed above.
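A minimal sketch, assuming the sales_high and sales_low vectors created earlier (t.test performs Welch's unequal-variance test by default):

```r
# Welch's two-sample t-test with a one-sided alternative hypothesis
t.test(sales_high, sales_low, alternative = "greater")
```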
To run this two sample t-test in Python you can use the ttest_ind function from scipy.stats (version 1.6.0 or later, which adds the alternative argument), passing equal_var=False to get Welch's version.
Example 2 — Correlation test
Another common hypothesis test is a test of whether two numeric variables have a non-zero correlation.
Let’s ask if there is a non-zero correlation between sales and customer_rate in our salespeople data set. As usual we assume the null hypothesis — that there is a zero correlation between these variables. We then calculate the sample correlation:
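A minimal sketch, assuming the salespeople data frame from earlier:

```r
# sample correlation between sales and customer rating, using complete pairs
r <- cor(salespeople$sales, salespeople$customer_rate, use = "complete.obs")
r
```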
Again, we expect the true population correlation to lie in a distribution around this sample statistic. A simple correlation like this is expected to follow a t-distribution with n - 2 degrees of freedom (348 in this case), and the standard error, which for a sample correlation r can be estimated as sqrt((1 - r^2)/(n - 2)), is approximately 0.05. As before we can graph this and position our null hypothesis red line:
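As a rough sketch of where those numbers come from, assuming the correlation r calculated above:

```r
# number of complete pairs of observations
n <- sum(complete.cases(salespeople$sales, salespeople$customer_rate))

deg_f <- n - 2                 # degrees of freedom (348 in this case)
se <- sqrt((1 - r^2)/deg_f)    # standard error (approximately 0.05)

# number of standard errors between the sample correlation and zero
r/se
```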
We see that the red line lies more than six standard errors away from the observed statistic, and we can thus calculate the p-value, which we again expect to be extremely small. So we can again reject the null hypothesis.
To run this in R:
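A minimal sketch using R's built-in cor.test function:

```r
# test for non-zero correlation between sales and customer rating
cor.test(salespeople$sales, salespeople$customer_rate)
```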
To run this in Python, you can use the pearsonr function from scipy.stats, after dropping any missing values.
Example 3 — Chi-square test of difference in proportion
The previous two examples dealt with numeric variables, but data scientists also frequently have to work with categorical variables. A common question is whether there is a difference in proportion across the different categories of such a variable. A chi-square test is a hypothesis test designed for this purpose.
Let’s ask the question: is there a difference in the proportion of salespeople who are promoted between the different performance categories?
Again, we assume the null hypothesis, that the proportion of salespeople who are promoted is the same across all the performance categories.
Let’s look at the proportion of salespeople who were promoted in each performance category by creating a contingency table or cross table for performance and promotion.
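A minimal sketch in R, assuming the salespeople data frame from earlier:

```r
# observed contingency table of performance category against promotion
contingency <- table(salespeople$performance, salespeople$promoted)
contingency
```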
Now let’s assume that there was perfect equality across the categories. We do this by calculating the overall proportion of promoted salespeople and then applying this proportion to the number of salespeople in each category. This would give us the following expected theoretical contingency table:
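One way to sketch this calculation, using the row and column totals of the observed table above:

```r
# expected counts if promotion were spread evenly across performance categories
expected <- outer(rowSums(contingency), colSums(contingency))/sum(contingency)
expected
```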
For each entry of the observed and expected contingency tables we then calculate (observed - expected)^2 / expected, and sum the results over all entries to form a statistic known as the chi-square statistic.
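A minimal sketch of that calculation, using the observed and expected tables from above:

```r
# chi-square statistic: sum over all cells of (observed - expected)^2 / expected
sum((contingency - expected)^2/expected)
# 25.895, as noted below
```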
In this case the chi-square statistic is calculated to be 25.895.
As with our t-statistic earlier, the chi-square statistic has an expected distribution which is dependent on the degrees of freedom. The degrees of freedom are calculated by subtracting one from the number of rows and the number of columns of the contingency table and multiplying them together — in this case the degrees of freedom is 3.
So, as before, we can graph our chi-square distribution with 3 degrees of freedom, mark where our chi-square statistic falls in that distribution and calculate the area under the distribution curve to the right of that point to find the associated p-value.
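A minimal sketch of that p-value calculation using R's built-in pchisq function:

```r
# upper-tail area of a chi-square distribution with 3 degrees of freedom
pchisq(25.895, df = 3, lower.tail = FALSE)
# roughly 1e-05
```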
Again, we can see that this area is extremely small, indicating that we should reject the null hypothesis and conclude in favour of the alternative hypothesis: that there is a difference in promotion rates between the performance categories.
To run this in R after calculating your contingency table:
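A minimal sketch, assuming the contingency table created earlier:

```r
# chi-square test of independence on the observed contingency table
chisq.test(contingency)
```

For tables larger than 2 x 2, chisq.test does not apply a continuity correction, so its statistic matches the manual calculation above.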
To run this in Python, you can use the chi2_contingency function from scipy.stats on the observed contingency table; the first three entries in the result represent the chi-square statistic, the p-value and the degrees of freedom respectively.
I hope that you found these explanations and demonstrations useful. If you are interested in diving deeper into some of the underlying method and calculations of these tests, or to learn about other hypothesis tests, please visit my recent Handbook of Regression Modeling, where Chapter 3 focuses on foundational statistics.