Tournament Analysis, using the WTC as an example
I want to give you all an example of how an event can be analysed and processed by Tournament Analysis. I will use the WTC in Herford for that, for several reasons. First, many of you tried or did your own analysis and started discussions about it, at least on the HE board. Second, it was a really big tourney and should still be one of the five biggest even at the end of the year. And as we have real data (the results of the games played and the final ranking) and it is a team tourney, we can show almost everything that complicates the analysis.
So how do we analyse a tourney?
If the tourney and its results are on www.tourneykeeper.net, www.warscore.net, www.tabletopturniere.de, ranking.wfb-pol.org, the9thagerankings.com/#/events or ecksen.ddns.net/eto/#/, then we copy and paste them into a prepared file where macros process them into the format we work with. If they are somewhere else (and we find them), we have to enter them in the right format by hand.
But what is the format we mostly work with? It is [(Number of participants) - (Ranking reached)] : [(Number of participants) - 1]
And now I will explain it in plain English. At a small tourney (6 participants) with Place 1 SA, Place 2 VC, Place 3 SE, Place 4 ID, Place 5 KoE and Place 6 SA, we award the armies the following points:
Place 1 SA gets (6-1) : (6-1) = 1
Place 2 VC gets (6-2) : (6-1) = 0,8
Place 3 SE gets (6-3) : (6-1) = 0,6
Place 4 ID gets (6-4) : (6-1) = 0,4
Place 5 KoE gets (6-5) : (6-1) = 0,2
Place 6 SA gets (6-6) : (6-1) = 0
So as SA has both 1 and 0 points, its mean would be 0,5 points.
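The scoring above can be sketched in a few lines of Python. This is only an illustration of the formula, not the macro file we actually use; the function names ranking_score and mean_score_per_army are made up for this example.

```python
from collections import defaultdict


def ranking_score(participants: int, rank: int) -> float:
    """(participants - rank) / (participants - 1): 1 for first place, 0 for last."""
    if participants < 2:
        raise ValueError("need at least two participants")
    if not 1 <= rank <= participants:
        raise ValueError("rank must be between 1 and the number of participants")
    return (participants - rank) / (participants - 1)


def mean_score_per_army(placings: list[str]) -> dict[str, float]:
    """placings[i] is the army that finished in place i + 1."""
    n = len(placings)
    scores = defaultdict(list)
    for place, army in enumerate(placings, start=1):
        scores[army].append(ranking_score(n, place))
    return {army: sum(vals) / len(vals) for army, vals in scores.items()}


# The 6-player example from the text: SA finishes first and last.
example = ["SA", "VC", "SE", "ID", "KoE", "SA"]
means = mean_score_per_army(example)
print(means["SA"])  # mean of 1 and 0
```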
What does each number mean? Which average shows an army to be overperforming and which shows it to be underperforming? Well, between 0,45 and 0,55 we consider armies to be balanced. Under 0,45 they underperform and over 0,55 they overperform.
So does ID underperform in my example? Well, I am sure we agree that one single army placement is a bit too few to say how it really is. How far the true value is assumed to be from the mean, with a certain certainty (in our case 68 %), is measured with the certainty interval. It is calculated like that (here enter formula from arwaker).
With the certainty interval we can see if the corridor of the assumed true value is inside the corridor of 0,45 - 0,55. It also explains why we don’t go for, say, 0,475 - 0,525. How does the certainty interval explain that? Well, we need a really large number of tourney ranking results per army to get a certainty interval which is under 0,05. Everything above would be broader than the corridor between 0,475 and 0,525 and so would allow us no interpretation.
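To make the corridor check concrete, here is a rough Python sketch. Be aware that the interval formula used here is an assumption: it takes the mean plus/minus one standard error of the mean, which corresponds to roughly 68 % under a normal approximation. It is a stand-in and not necessarily the formula from arwaker. The names certainty_interval and verdict, and the label "inconclusive", are made up for this example.

```python
import math
import statistics

BALANCED_LOW, BALANCED_HIGH = 0.45, 0.55  # balance corridor from the text


def certainty_interval(scores, z=1.0):
    """Return (mean, low, high) as mean +/- z standard errors.

    ASSUMPTION: standard error of the mean with a normal approximation;
    z=1.0 gives roughly 68 %. This is a stand-in, not the team's formula.
    """
    if len(scores) < 2:
        raise ValueError("need at least two results to estimate a spread")
    mean = statistics.mean(scores)
    std_err = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - z * std_err, mean + z * std_err


def verdict(low, high):
    """Compare an interval with the balance corridor."""
    if low > BALANCED_HIGH:
        return "overperforming"
    if high < BALANCED_LOW:
        return "underperforming"
    if low >= BALANCED_LOW and high <= BALANCED_HIGH:
        return "balanced"
    return "inconclusive"  # interval overlaps a corridor border


# SA from the small example: one first place (1.0) and one last place (0.0).
sa_mean, sa_low, sa_high = certainty_interval([1.0, 0.0])
print(verdict(sa_low, sa_high))
```

With only two results the interval is far wider than the corridor, which is exactly why a handful of placements allows no interpretation.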
That is one kind of analysis, but there are others, too. The more rounds are played at a tourney, the more likely it is that an army shows its real strength. The more players are at a tourney, the more likely it is that the pairing process brings players of equal skill to play each other. So naturally we have an additional calculation which takes the size and the length of a tourney into account. Later we compare both to see if tourney size and duration really have an influence, and if so, when and which one. (Of course that only has a chance to have an effect if we have different tourneys with different sizes and durations.)
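One possible way to take size and length into account is a weighted mean. The sketch below is purely hypothetical: the weighting (participants times rounds) is an assumption chosen for illustration and not our actual calculation, and the sample numbers are invented.

```python
def weighted_mean_score(results):
    """results: iterable of (score, participants, rounds) tuples.

    ASSUMPTION: weight = participants * rounds. This weighting is only
    an illustration; the real calculation is not spelled out in the text.
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for score, participants, rounds in results:
        weight = participants * rounds
        weighted_sum += score * weight
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no results given")
    return weighted_sum / total_weight


# Invented sample: a win at a small 3-round event, a 0.4 at a big 5-round event.
sample = [(1.0, 6, 3), (0.4, 60, 5)]
unweighted = sum(score for score, _, _ in sample) / len(sample)
weighted = weighted_mean_score(sample)
print(unweighted, weighted)
```

Comparing the weighted and the unweighted number for the same army is then the check whether size and duration change the picture.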
We also used to do some complex calculations to see what would happen if every country had the same number of results (= we treated them as equal in one analysis). But as that, and an analysis which tried to capture the competitiveness of the different countries’ scenes, proved to influence the certainty interval in a way which made results uninterpretable, we dropped both.
The way we calculate the performance based on ranking reached and the number of participants naturally produces bigger differences between place 1 and place 2 depending on the number of participants. That is different if we look at the actual games played. Those games always have a 20-0 / 0-20 matrix, so a 20-0 always brings 100 % or 1 and a 19-1 always brings 95 % or 0,95. In that sort of analysis a totally balanced army would get an average of 10 points from its games. Here the balance corridor is between 9.5 and 10.5 points on average. That kind of analysis can in theory produce more precise results than the ranking-based one, but we mostly get data for it from big team tourneys and very little from smaller single tourneys. Later in this article I will go into more detail on how the difference between single and team tourneys influences our analysis.
What does that mean?
Well, if we have the rankings reached by an army in a single tourney and the results of the matches it played, an analysis based on ranking and an analysis of the games actually played will naturally provide different numbers. But here again, between 0,45 and 0,55, or between 9.5 and 10.5 points achieved on average (equalling 45 % - 55 %), we consider armies to be balanced. The upside of having the results of games is that we can try to analyse which army performs how versus which opponent and so can identify Rock-Paper-Scissors situations between armies. (But that again needs a great many results of games of those armies vs. each other in our database.)
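A small Python sketch of the games-based view: converting a result on the 20-point matrix into a share, and building an army-vs-army table of average points. The game results in the sample are invented for illustration, and the names game_share and matchup_table are hypothetical.

```python
from collections import defaultdict

MAX_POINTS = 20  # every game is scored on the 20-0 / 0-20 matrix


def game_share(points_for: int) -> float:
    """Share of the available points: 20-0 gives 1.0, 10-10 gives 0.5."""
    if not 0 <= points_for <= MAX_POINTS:
        raise ValueError("points must be between 0 and 20")
    return points_for / MAX_POINTS


def matchup_table(games):
    """games: iterable of (army_a, army_b, points_a); army_b gets the rest.

    Returns {(army, opponent): (average points, number of games)}.
    """
    totals = defaultdict(lambda: [0, 0])  # (army, opponent) -> [points, games]
    for army_a, army_b, points_a in games:
        totals[(army_a, army_b)][0] += points_a
        totals[(army_a, army_b)][1] += 1
        totals[(army_b, army_a)][0] += MAX_POINTS - points_a
        totals[(army_b, army_a)][1] += 1
    return {pair: (pts / n, n) for pair, (pts, n) in totals.items()}


# Invented sample games, only to show the shape of the data.
sample_games = [("DE", "HE", 14), ("DE", "HE", 12), ("HE", "VS", 9)]
table = matchup_table(sample_games)
print(table[("DE", "HE")])  # average points of DE against HE, and game count
```

The game count in each cell matters as much as the average: a cell with only one or two games says very little about a real Rock-Paper-Scissors situation.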
How / Where do we get those results of actual games played?
We get the results of games actually played either from www.tourneykeeper.net, www.warscore.net, friendly TOs who send them to us, or friendly players who use the report-your-games threads in their army’s board. And most importantly, we get them from everybody who is actively sending them to us through Jim Morrs Fluxx Card App.
Which kind of tourneys do we analyse?
We analyse all tourneys. But we analyse the different kinds of tourneys separately. So single tourney data is thrown together with single tourney data and not mixed with team tourney data. Later we compare our results. You may ask yourself if that isn’t comparing apples with oranges, but to stay in the metaphor, we can still find out this way if the fruits share some resemblances. That means if both show the same result (= an army overperforming), then it is much more likely to be real than if only one of the tourney kinds or none shows it.
Are team tourney results relevant for single tourneys?
At team tourneys one can try to influence which opponent one plays, and often also which scenario and on which table. Of course the opponent can try to do that, too. Also, some teams take armies for specific roles. So one army can be taken and designed as a counter to, for example, monster Mash. If the team is able in every game to pair that list against the list it is designed to counter, it should naturally score more than an allcomer list, even if the allcomer list performs better against a mixed field. An army can also be taken as a block build, which just has the job of playing between 7-13 and 10-10 and, if paired correctly, of blocking the opponent’s scorer armies. A scorer army is designed and meant to crush the opponent really hard and to get many points, but it usually has some weaknesses which it has to be paired around.
All that can suggest that the role of an army and the quality of the dude doing the pairing have more influence than the actual army strength. But there exist different theories, too. One is that a more successful team has a better hand at picking the armies which fulfil the needed roles than a less successful team. That would mean that the dude who is better at pairing is also better at selecting the armies, and so better armies for the needed purpose land in his team. Another theory is that better armies are not as dependent on the pairing as worse ones and so land with higher percentages in higher placing teams than others.
Why do we still collect and analyse them (at least so far)?
Diversity, sample size and comparison are the keywords.
Well, our players like to play different kinds of tourneys. So I believe we should do our best to make each kind of tourney as balanced and as fun as we can. That doesn’t necessarily mean that steps to improve team tournaments automatically have to influence single events and vice versa. Such things could easily be done through the tournament rules and be specified for team / single events.
Why sample size?
Well, to be honest, most of the time our sample size is far smaller than we would like to see. So team tourneys provide additional data. Data which can’t be thrown together with singles without thought, but at least additional data which can be used as a control and as a comparison.
Did I mention our sample sizes are smaller than we would like? Well, the way armies perform in different kinds of tourneys can also tell you something. If, for example, an army literally ruled in single tourneys but was crap in team tourneys, that would help to identify where the balance problems with that army are (meta, too expensive/specific counters to play in allcomer lists, …)
So let us use all that on the WTC
A look at armies vs. armies seems to show DL, DE, UD and VS as overperforming and EoS+HE as underperforming.
What our normal analysis seems to show:
That would mean the corridors of DE and VS are mostly above the balanced corridor, while the corridor of DL would mostly stay within the balanced corridor. The UD corridor would be mostly below the balanced corridor, but as it lies between 0,28 and 0,51, it would still be possible that the real value is within our balanced corridor of 0,45 to 0,55.
What participation shows
180 participants divided among 16 armies would make 11,25 participants per army if every army were taken as often as the others. Of course one can’t take an army 0,25 times, so in reality that would come down to 11-12 players per army. The real participation was spread the following way:
SA and VC 18, SE 16, DH 14, DL 13, ID and OG 12, HE 11, EoS and OK 9, DE and WdG 8, BH and UD 6.
There is one problem with those numbers: participation only tells us how popular / available certain armies are for team tourneys. It says exactly 0 about their strength. So SA, VC, DH and DL were taken more often than would have happened if all armies had been equally popular. ID, OG and HE were taken about as often as expected, and OK, DE, BH and UD were really unpopular.
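The comparison against an even spread is simple arithmetic, sketched here in Python. The dictionary holds only the armies and counts listed above; the variable names are made up for this example.

```python
TOTAL_PLAYERS = 180
ARMY_COUNT = 16

# Participation per army, as listed in the text above.
picks = {
    "SA": 18, "VC": 18, "SE": 16, "DH": 14, "DL": 13,
    "ID": 12, "OG": 12, "HE": 11, "EoS": 9, "OK": 9,
    "DE": 8, "WdG": 8, "BH": 6, "UD": 6,
}

# Players per army if every army were equally popular.
expected = TOTAL_PLAYERS / ARMY_COUNT

# Positive = taken more often than an even spread, negative = less often.
deviation = {army: count - expected for army, count in picks.items()}

for army, diff in sorted(deviation.items(), key=lambda item: -item[1]):
    print(f"{army}: {picks[army]} players ({diff:+.2f} vs. even spread)")
```

Again, this only measures popularity. A big positive or negative deviation says nothing about how strong the army is.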
Is WTC alone relevant for balancing?
As shown above, the certainty bars are in most cases so big that our answer to questions about most armies would have to be: “It is more likely/much more likely to be balanced than to be op/up.” But even that answer would only have a 68 % chance to be true, or in other words a 32 % chance to be false. To not sound as if we were guessing, we would still need much more data, and to make our calls with 95 % or 99 % certainty we would need even more data than even this one tourney, as big as it is, can provide. And even that would help only if team tourneys produce relevant data, which one can either assume with good reasons or deny with good reasons.
Why even the data from such a big tourney should be taken with a grain of salt
One example of why that is so I take from an explanation by one of our brains behind my neverending execution of predefined tasks:
|Faction||Average VP/game||+/- 3 sigma certainty|
So the results allow us the following statement: If the tournament were repeated on exactly the same basis, DE would be somewhere between 8.9 and 13.5, while HE would be between 6.4 and 10.6 (both with 99.7% certainty).
This means that DE>HE has a higher probability than HE>DE. But both are possible. In fact there have been far too few games to give a clear statement. As long as the error bars are larger than the differences between the armies, a clear statement about balancing cannot be given from the statistical point of view.
In plain words, to be able to make a really, really sure statement that HE are weaker than DE, the average points for DE minus 3 sigma would have to be higher than the HE average points plus 3 sigma.
In even plainer words, DE being between 8,9 and 13,5 and HE being between 6,4 and 8,8 would be an example of HE being weaker than DE with 99,7 % certainty.
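The rule just described (the whole 3 sigma range of one army has to lie below the whole 3 sigma range of the other) can be written as a one-line check. The function name clearly_weaker is made up for this sketch; the intervals are the ones quoted above.

```python
def clearly_weaker(weaker, stronger):
    """True if the (low, high) range of `weaker` lies entirely below `stronger`."""
    weaker_low, weaker_high = weaker
    stronger_low, stronger_high = stronger
    return weaker_high < stronger_low


de_range = (8.9, 13.5)           # DE average +/- 3 sigma, from the text
he_range_actual = (6.4, 10.6)    # HE as actually observed
he_range_example = (6.4, 8.8)    # the hypothetical HE range from the text

print(clearly_weaker(he_range_actual, de_range))   # ranges overlap
print(clearly_weaker(he_range_example, de_range))  # ranges are separated
```

With the observed ranges the check fails because 10.6 is above 8.9, so no clear statement about HE vs. DE is possible from this tourney alone.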
What Team tourneys can do for balance
They can be used as comparisons and they can show extreme lists/ concepts.
Thank you for staying awake during that huge wall of text or for at least scrolling down to here.