Most games consist of players opposing each other, competing for the win. We play competitive games to prove, by way of winning, that we are the better player, and therefore smarter, more skilled, etc. But does winning actually prove that we are the better player?
We know intuitively that winning one game might be a fluke, a stroke of luck. So we often play several games, and the winner of the majority of them is declared the winner of the whole “tournament” (best of 3, best of 5, etc.).
Does that make sense, mathematically? After all, if we suspect that chance is involved in the outcome of one game, chance should be involved as well in a series of 3, 5, or 7.
In scientific experimentation, we measure the statistical significance of a result by determining how unlikely it is that it was obtained purely by chance, rather than as a consequence of the phenomenon we are trying to measure and prove. What happens if we apply this method to games and tournaments?
The experiment
Let’s take a symmetrical 2-player game that doesn’t allow a draw. The two players will play a total of N games, and we will count how many games each player won.
The “null hypothesis” H0 is that the players have the same strength. Or equivalently, that game outcomes are random. Since the game is symmetrical, each player has a 1/2 chance of winning each game.
Our “alternative hypothesis” H1 is that player A is stronger than player B, and that the outcome is not random. I give names to my players to have a so-called “one-tailed” test: it’s easier to think in terms of “A being stronger than B” rather than “either A or B is stronger than the other”, and in the end it cuts some numbers in half. But it’s a detail, really.
What we’ll do now is check how likely a certain result was, assuming the null hypothesis. If it was quite likely, we can give the benefit of the doubt: the tournament wasn’t much different from a bunch of coin tosses. If the result is very unlikely under H0, then there is reason to believe H1, that is, that player A actually plays better than pure chance would allow.
Let’s call X the random variable denoting the number of games won by player A, if it was pure chance. The probability of player A winning c games is a standard binomial distribution:
P[X = c] = C(N,c) * p^c * (1-p)^(N-c)
C(n,k) is the binomial coefficient (the number of ways to choose k elements among n), and is equal to n!/(k! * (n-k)!). With p = 1/2, this simplifies to:
P[X = c] = 1/2^N * C(N,c)
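As a quick sanity check, this distribution can be computed directly. Here is a minimal Python sketch (the function name is mine, not from any library):

```python
from math import comb

def prob_exact_wins(n, c, p=0.5):
    """Probability that player A wins exactly c of n independent games,
    each won with probability p (the binomial pmf)."""
    return comb(n, c) * p**c * (1 - p)**(n - c)

# With p = 1/2 this reduces to C(N, c) / 2^N:
print(prob_exact_wins(3, 2))  # C(3,2) / 2^3 = 3/8 = 0.375
```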
We are not really interested in the probability of winning an exact number of games, since a player is declared winner when they win at least a certain number of games (usually a majority). Let’s compute the probability of player A winning c or more of the N coin tosses:
P[X >= c] = 1/2^N * sum(C(N,k), k: c -> N)
This is simply the sum of the probability to get c wins, c+1 wins, c+2 wins, etc. until N.
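The tail sum translates directly to code (again a sketch, with a made-up function name):

```python
from math import comb

def prob_at_least(n, c):
    """P[X >= c] under H0: the fraction of the 2^n possible
    outcome sequences in which A wins c games or more."""
    return sum(comb(n, k) for k in range(c, n + 1)) / 2**n

print(prob_at_least(3, 2))  # (3 + 1) / 8 = 0.5
```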
The result
So let’s check how “good” a player who won a best of 3 tournament is. N = 3, c = 2. Probability: 50%. Yes: someone who wins a best of 3 tournament is exactly as impressive as someone who guesses a coin toss correctly.
Best of 5 (that is, at least 3 wins out of 5 games): 50% as well. Best of 9: 50%. Best of 99 games, take a guess… 50%.
If a game is purely random, the probability of winning at least half the games is exactly 50% (for an odd number of games). Therefore, if you see someone win a “best of X” tournament, you have no statistical reason to believe that the outcome is anything other than chance. Or at least, you can have a serious doubt.
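You can verify this for any odd N with a few lines of Python (a sketch; the helper name is mine):

```python
from math import comb

def prob_wins_majority(n):
    """P[A wins at least (n+1)/2 of n fair games], for odd n."""
    c = n // 2 + 1  # smallest majority
    return sum(comb(n, k) for k in range(c, n + 1)) / 2**n

for n in (3, 5, 9, 99):
    print(n, prob_wins_majority(n))  # 0.5 every time, by symmetry
```

It has to come out to exactly 0.5: for odd n there is no tie, and by symmetry the outcomes where A wins the majority mirror those where B does.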
Careful about the semantics, though: we are not proving or disproving anything. We are not even measuring the probability that the game is actually random. We’re only saying “if the game were random, player A winning best of X would have a 50% chance of happening”.
The improvement
So winning at least half the games is not statistically significant. What would be? Two thirds of the games? Three fourths? All the games? Surely if a player wins all the games, they have to be the best player?
Scientists often use 5%, or even 1%, as the highest accepted probability that the measured outcome is consistent with pure randomness.
We saw that winning 2 games out of 3 was 50% likely. Winning all 3 games out of 3 by chance has a 12.5% probability of happening. That’s still much higher than 1% or 5%: it is not statistically significant.
Winning 4 games out of 5 by chance is 18.75% probable. 5 out of 5: 3.125%. That’s already much better; it’s significant at the 5% level.
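These numbers are easy to verify with the same tail sum as before (a quick sketch):

```python
from math import comb

def prob_at_least(n, c):
    # P[X >= c] for n fair coin flips
    return sum(comb(n, k) for k in range(c, n + 1)) / 2**n

print(prob_at_least(3, 3))  # 1/8  = 0.125   -> not significant
print(prob_at_least(5, 4))  # 6/32 = 0.1875  -> not significant
print(prob_at_least(5, 5))  # 1/32 = 0.03125 -> significant at 5%
```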
What you can say when player A wins 5 games out of 5 is: “if hypothesis H0 were true, this outcome would have been quite unlikely”. In other words, the game is probably not random, and player A is probably the better player.
Here’s a table of the minimum number of games that one must win for the outcome to be statistically significant (for various values of N and several levels of significance):
| N   | 5% | 1% | 0.1% |
|-----|----|----|------|
| 3   |    |    |      |
| 5   | 5  |    |      |
| 7   | 7  | 7  |      |
| 9   | 8  | 9  |      |
| 10  | 9  | 10 | 10   |
| 20  | 15 | 16 | 18   |
| 50  | 32 | 34 | 37   |
| 100 | 59 | 63 | 66   |
Empty cells mean that no amount of winning will be significant for that total number of games.
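The table can be reproduced by searching, for each N, for the smallest winning count whose tail probability falls below the significance level. A sketch (`min_significant_wins` is my own name for the helper):

```python
from math import comb

def prob_at_least(n, c):
    # P[X >= c] for n fair coin flips
    return sum(comb(n, k) for k in range(c, n + 1)) / 2**n

def min_significant_wins(n, alpha):
    """Smallest c such that P[X >= c] <= alpha under H0,
    or None when even winning all n games is not enough."""
    for c in range(n // 2 + 1, n + 1):
        if prob_at_least(n, c) <= alpha:
            return c
    return None

for n in (3, 5, 7, 9, 10, 20, 50, 100):
    row = [min_significant_wins(n, a) for a in (0.05, 0.01, 0.001)]
    print(n, row)  # e.g. 20 [15, 16, 18]
```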
So next time a player claims they are better than you at a game, ask them to prove it by winning, for instance, 10 games out of 10, or at least 16 games out of 20. Anything less than that should be declared a draw, for lack of statistical significance in the data :)