What does elo measure?

A definition of elo rating. An analysis of a chess game of mine. An inaugural entry in an amateur chess diary, if you will.

James Carzon https://jamescarzon.github.io/ (Carnegie Mellon University)http://www.stat.cmu.edu/
08-16-2021

It is imperative that you establish your chess skill level as soon as possible. There is simply too much at stake to live your 21st century life in tactical darkness. If you seriously think that you can walk around inside this world with nothing but the minimal knowledge of how the pieces move and still hold your head high at night, then you’ve already lost. Mark my words, you will be the one left sitting on the curb outside Pizza Hut with your confidence shattered and your carpal tunnels throbbing while a confederacy of twelve-year-olds will be standing around you with their hands making knight-jump-shapes on their acne-ridden foreheads. There is no coming back from this disaster.

As a kid, I did not believe myself to be mathematically competent, and I assumed that good play of chess required some sort of talent for calculation. When I started playing chess in college, I marvelled at my opponents’ good fortune. I thought to myself as I lost time and again, I’d done nothing wrong! I traded pieces for pieces, I didn’t walk into check, and I kept my queen next to my king! How could I have lost?

The circumstances of the COVID-19 pandemic kindled a change in my attitude about the game. I found it to be an opportune activity to share remotely with my friends and a skill in which I was certainly capable of improving. The practice of learning to play better chess (as opposed to the practice itself of playing chess) is very satisfying now, because I no longer embarrass myself and have a renewed admiration for my ability to cling to mediocrity in a broad range of skills in which most people either find great success or total disdain.

This post is a reflection of my current state in the hobby of online chess. As an educational exercise and for curious readers, I want to document my thought process qualitatively a little. Additionally, in the interest of learning more about how my progress is measured, I investigate the “elo rating.” This investigation I share with you now.

Elo as proxy

It would be convenient to know your genuine world ranking in chess like how you know your height, but you can’t practically determine such a thing. The only way to decide how good people really are is to make everyone play against everyone else infinitely many times and then line up in such a way that no one is standing behind someone against whom they win more than half of the time. But such a tournament is impractical since we would not have anyone to wash our clothes or to clean after the horses on Mackinac Island, and so we will never know how good we really are.

But it is rather obvious that we may guess how good we are at the game, and to guess with improving certainty with more and more play, without ever actually measuring skill. To have a good guess, let’s just order everyone that you’ve played against according to your win percentage against them. You can stand in front of the people against whom you normally win, but you must stand behind those who beat you. How good is the person standing in front of you? And behind? Average their true rankings to approximate your own.

Of course, this solution is no good since your opponents don’t know their true rankings either. Furthermore, it may be the case that two players against whom you consider yourself equally ranked do not consider themselves equally ranked, so this program of ordering everyone may be poorly defined.

In order to make up numbers to represent skill level, the elo rating system depends on the simple premise that a player’s true ability is not perfectly represented in any one game. Sometimes they play better than average and other times they play worse. Then, for example, if they are supposedly at rating \(X\) and they are winning frequently against players of ratings greater than \(X\), then this is a sign that their rating is underestimated and should be increased. Usually, a player’s performance is assumed to respect a curve that looks like the following graph.

This curve is of an “extreme value distribution,” also known as the Gumbel distribution. That is, it is a curve which shows the theoretical distribution of values of the maximum values of data samples which, in turn, are samples from some other distribution. A classic use case might be the following. Suppose a river runs through your property and you would like to know if the river is likely to rise enough to spill over your garden. You may know how high the river ran every day for the last ten years, but the only information that is relevant to you is the maximum height this year. Taking the maximum heights observed in each of the previous ten years as your data sample, you can compute their average and their standard deviation to find a Gumbel distribution which describes the probabilities of different ranges of river heights.

It is probably not true that anyone’s chess performance respects the Gumbel distribution. One could draw lots of other curves which would be just as believable descriptions of performance. However, the Gumbel distribution is a convenient choice for the purposes of a few calculations.

If I tell you that GM Autobus’ play is distributed like Gumbel random variable \(A\), with mean play at rating \(2550\), and FM Bindowcleaner’s play is distributed like \(B\), with mean play at \(2150\), then we should have the expectation that Autobus will beat Bindowcleaner fairly frequently. For the sake of making up some point of reference to explain how big this difference in rating is in terms of winning percentage, we decide as convention that a \(+400\) rating difference translates to probable winning odds of \(10:1\). So Autobus should be able to score 10 out of 11 points in an 11 game match against Bindowcleaner. In notation, write

\[\frac{Pr(A\text{ beats }B)}{Pr(B\text{ beats }A)} = 10^{\frac{R_A-R_B}{400}}.\]

On the left is denoted the odds of a random variable pretending to be Autobus playing at a higher level than a random variable pretending to be Bindowcleaner. To say that “\(A\) beats \(B\)” is to say that \(A\) takes a number value higher than \(B\). On the right is an expression of those odds in terms of assumed rating; if the difference \(R_A-R_B\) is equal to \(+400\), then the righthand expression evaluates to \(10\). If \(R_A-R_B=+800\), then the righthand expression evaluates to \(100\). This makes sense. If Stockfish 14, the chess playing machine, has an assumed rating of \(2950\), then it should beat Autobus with odds 10 to 1. If it then faced off against Bindowcleaner, then it could do the following. It could say, “Hey GM Autobus! Why don’t you play 101 games for me against Bindowcleaner? If you win, then we’ll just say that I would have also won that game since I generally play better than you do. If you lose, however, then I will play the rematch against Bindowcleaner myself. Since I play better than you in 10 out of 11 games on average, I expect to be able to win 10 out of 11 of those 11 games that you are expected to lose to Bindowcleaner, thus preserving my legacy.” Ignoring the ethical difficulty of convincing a human to play for a computer, much less 101 games, the takeaway is that this equation describes the scenario which we have sought to describe.

Let us algebraically extract some new information. Since it is vanishingly unlikely for \(A\) to equal \(B\) from a probability perspective, we will simply observe that either \(A\) beats \(B\) or \(B\) beats \(A\) in our scenario. Ignore draws. Then

\[Pr(A\text{ beats }B)+Pr(B\text{ beats }A)=100\%=1,\]

so

\[Pr(B\text{ beats }A)=1-Pr(A\text{ beats }B).\]

Combining this with the first equation, we solve for \(Pr(A\text{ beats }B)\) to obtain

\[Pr(A\text{ beats }B) = \frac{1}{1+10^{\frac{R_B-R_A}{400}}}.\]

Conveniently, this expression is related to our assumption that one’s chess performance follows a sort of curve like the Gumbel curve. The Gumbel curve is the graph of a function like

\[f(x) = \frac{1}{\beta}e^{-\left(\frac{x-\mu}{\beta}+e^{\frac{x-\mu}{\beta}}\right)},\]

where the number \(\mu\) determines where the curve is centered and the number \(\beta\) determines the horizontal scale. Let us take it as a fact now that if two random variables \(A\) and \(B\) follow Gumbel distributions with the same scale number \(\beta\), then the random variable \(B-A\) is a random variable with a logistic distribution, meaning that it has a distribution defined by the function

\[g(x) = \frac{e^{-\frac{x-\mu*}{\beta}}}{\beta\left(1+e^{-\frac{x-\mu*}{\beta}}\right)^2},\]

where \(\mu*\) is the center of the curve showing the distribution of \(B-A\). If we perform a clever calculus calculation, then we can obtain

\[Pr(B-A\le x) = \int_{-\infty}^{x}f(y)dy = \frac{1}{1+e^{-\frac{x-\mu*}{\beta}}}.\]

This function, this particular probability, is a function which looks similar to our expression for \(Pr(A\text{ beats }B)\).

In particular, if \(\mu_*=R_A-R_B\) and \(\beta=\frac{400}{\log 10}\), then

\[P(A\text{ beats }B) = P(B-A\leq 0) = \frac{1}{1+e^{\left[-\frac{R_B-R_A}{\frac{400}{\log 10}}\right]}} = \frac{1}{1+10^{-\frac{R_B-R_A}{400}}}.\]

This calculation is repeated from Arthur Berg’s explanation in Chance (Berg 2020). So it seems that this probability model for chess play is somehow justified by the convenience of this calculation. If we pretend that one’s performance follows this Gumbel curve, then we can say that the odds seen in games are structured in the way described concretely above: a \(+400\) rating difference translates to expected \(10:1\) odds of winning. To conclude, Elo doesn’t measure chess ability or ranking, but it is a somewhat sophisticated guess as to one’s relative winning odds against another player.

Don’t trot out the queen

Understanding the technical definition of chess rating has not been practically helpful for playing better chess in my experience, but I have learned a few things about how to adjust my play to opponents of different ratings. For example, I recognized early on in my career as an amateur that low rated players are forever begging to trick someone into falling for the “Scholar’s Mate.” Something like this is what they have in mind:

It’s a quick and annoying loss for the poor soul who doesn’t see it coming. However, in order to achieve this trick, you have to breach a generally true rule: you have to develop our queen piece early. This is typically a bad idea because it is easy for your opponent to simultaneously get their pieces into the center of the board while also attacking your queen. As you run the queen away over and over again, your other pieces don’t move and you end up falling behind. For this reason, it’s better to move your other pieces out before the queen.

I think that it is at or around the 1000 elo rating in online chess play that everyone is smart enough not to try using the Scholar’s Mate as an old, reliable tool. Similar patterns do emerge sometimes. As an example that I may pleasantly share, I submit for your consideration the following game that I played against an opponent by the handle “PunisherFevzican9.”

The initial move 1. … c5 is the Sicilian Defense. Up to 4. Bc4, this game is a common position, but with 4. … f6 the database on Lichess.org is exhausted. This pawn push is considered a mistake because it allows me to break into the center, but it was not obvious to either of us in that moment. I just happened to play the best move next.

We trade pieces in the center on d4 until 7. Qxd4. Now my queen is developed and easily be attacked, and so comes 7. … d5. Unfortunately for my opponent, this gives me the option to attack the vacated f7 square with a bishop-queen battery, almost like how the Scholar’s Mate looks, except now my queen can’t be dislodged. All the pawns are too far advanced. I play 8. Qd5, eying f7. Then 8. … Qe7 defends that square, so I bring my knight to b5, now getting ready to fork the king and rook or maybe just help in the attack. The black queen now runs around trying to find something dangerous to do but can’t figure anything out. With 13. Qxe7+, I trade queens, thinking that I have a strong enough attack without the queens on the board to checkmate soon.

At this point I spend some time looking for a possible checkmate pattern. I saw that the king had two squares to move to: c7 and e8. If I could hold the king in place, then that would make it hard for black to develop the bishop on c8 that’s stuck behind two pawns and, thus, the rooks stuck inside their own caves.

With 14. Rd1, I enable one more piece to participate. Trading a knight and bishop, black makes some room, but this is a distraction from my own plan. As black plays 16. … Ne7, the king is once again limited to the squares c7 and e8, and my bishops are perfectly set up to close these off. 17. Bf7 covers e8. Black simply looks at the bishop with 17. … Rf8. Then 18. Bb6 is mate.

I did not play an excellent game. Many moments of this attack were refutable. For example, if after 17. Bf7 the black king moved to c7, I would have been a bit stuck. However, this was a fast game, played with 5 minutes plus 3 second increments on each move, so my opponent didn’t necessarily have a chance to calculate my plan at each turn. The lesson that I learned from this game is that an aggressive attack can work, even against players who are familiar with some tricks, if the attacking plan is specific and coordinated with the right mistakes made by your opponent. So don’t just trot out the queen! Let your attacks be sound and games be exciting.

Berg, Arthur. 2020. “Statistical Analysis of the Elo Rating System in Chess.” Chance Magazine, September. https://chance.amstat.org/2020/09/chess/.

References