Data Analysis Report


    • Pellegrim wrote:

If the data show O&G at bottom tiers, I fear we can hardly draw any valuable conclusions.

Pull in team tourney info that tracks individual results, isolate the top 15% of results, and you get some useful info.
      Yes! Let's open a whole NEW can of worms!!! :muaha:
      “You can never know everything, and part of what you know is always wrong. Perhaps even the most important part. A portion of wisdom lies in knowing that. A portion of courage lies in going on anyways.” -Lan Mandragoran, EotW


      Dovie’andi se tovya sagain.
    • I was wondering the same thing. In a few weeks' time no one will be playing 1.3 any more.

Also, I'm starting to question the current DA strategy: it seems to take the DA team so long to gather enough data that the analysis is only of academic interest by the time it arrives. 1.3 has been out for almost 12 months now and there still does not seem to be a report that tells us something meaningful.
    • Hoffa wrote:



Also, I'm starting to question the current DA strategy: it seems to take the DA team so long to gather enough data that the analysis is only of academic interest by the time it arrives.
      Yarp, we need to encourage TOs to submit data to the project.
      Feel free to mention it at any tournaments you attend. Or if you know anyone who does attend tournaments.
    • Thanks for posting the data!

      Few quick questions for clarification:

      • Does the "raw" data exist at a more granular level than what is included in the files? It seems that this captures overall tournament results, but you could do some really interesting data mining if you had the details behind those results (e.g., how each army generally stacks up against other armies). That said, even if the data exists, I understand it may not be sufficient to be highly credible.
      • It looks like the analysis was done on a "points scored out of total possible tournament points" basis. Has a similar analysis been performed on a strictly win/loss basis? (In other words, I'd be interested to see how much the margin of victory moves the needle.)


      Anyhow, appreciate you sharing the details. Clearly a lot went into the data gathering and analysis to get this done.
    • Hoffa wrote:

I was wondering the same thing. In a few weeks' time no one will be playing 1.3 any more.

Also, I'm starting to question the current DA strategy: it seems to take the DA team so long to gather enough data that the analysis is only of academic interest by the time it arrives. 1.3 has been out for almost 12 months now and there still does not seem to be a report that tells us something meaningful.
The main purpose was to serve as one source of inspiration for the BLT and Design Teams.
Since I received so many requests to publish my information, I made some graphs and fed the lions.
I would highly appreciate any help with writing explanatory text.
    • MatRat wrote:

      Thanks for posting the data!

      Few quick questions for clarification:

      • Does the "raw" data exist at a more granular level than what is included in the files? It seems that this captures overall tournament results, but you could do some really interesting data mining if you had the details behind those results (e.g., how each army generally stacks up against other armies). That said, even if the data exists, I understand it may not be sufficient to be highly credible.
      • It looks like the analysis was done on a "points scored out of total possible tournament points" basis. Has a similar analysis been performed on a strictly win/loss basis? (In other words, I'd be interested to see how much the margin of victory moves the needle.)


      Anyhow, appreciate you sharing the details. Clearly a lot went into the data gathering and analysis to get this done.
This is all we have; there is no more granularity.
The analysis is based on achieved placement; we have no information about tournament points, pairings, specific game results, etc.
And of course we have no idea who played, so there might also be systematic errors in the data: for example, if some faction is played by better players, it will seem to overperform.

This error bar discussion always comes up. I tried every common error bar size and settled on this one. Anyone with basic statistics knowledge can rescale the error bars themselves. For everyone else, it is enough to say that in 1 out of 6 cases the real performance level is above the error bar, in 1 out of 6 cases below it, and in 4 out of 6 cases inside it. Warhammer players have a good sense of how likely specific d6 rolls are.
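      To make the 1-in-6 reading concrete, here is a minimal Python sketch of how a ±1 SE (~68%) bar is computed; the sample size, mean, and spread are invented illustrations, not the project's data:

      ```python
      import math

      # Hypothetical inputs: n recorded games for one faction and the
      # sample statistics of its normalised scores (illustrative only).
      n = 110                 # number of recorded games
      mean_score = 0.52       # mean fraction of available points
      sd = 0.25               # sample standard deviation of the scores

      se = sd / math.sqrt(n)  # standard error of the mean

      # A ~68% interval is mean +/- 1 SE: roughly a 1-in-6 chance the
      # true level lies above it, 1-in-6 below, 4-in-6 inside.
      print(f"68% interval: {mean_score - se:.3f} .. {mean_score + se:.3f}")
      ```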

What should strike the community is the crazy country weighting. We should get a lot more results from other countries, but we do not. Maybe this will wake up the tournament organizers.
    • Is the weighting only there to even out country contributions, or does it consider other factors too? (And if so, which?)

      Was the assumption of normality tested in the data?

I have to agree with @Nicreap: using end placement points (which include sportsmanship, painting, etc.) distorts the data and obscures any signal related to balance. I'm fully cognizant that this is a data collection problem, but it's a pretty severe one.

      Edit:
As for reading the last graph with the '68% confidence intervals': assuming normality, 68% confidence intervals are ~1 SE (standard error). To be significantly different (at about the p < 0.05 level), two 1-SE error bars must be separated by a gap of approximately 50% of their width (if their SEs are the same). (With different SE widths, I presume you take the average of half of each of them, but I won't swear to that without looking it up.)

The 95% confidence interval is ~2 SE, or double the displayed error bars, so you could productively double them. Note that 95% confidence intervals which overlap can still be significantly different (but if they don't overlap, the difference is definitely significant). The amount of overlap possible while still retaining a significant difference is approximately 25% of the length of the error bars. (Note that this depends on sample sizes; the 25% approximation is for equal sample sizes in the two groups.) So if you doubled all the confidence intervals, you'd be looking at approximately 95% confidence intervals, and could judge significance roughly from that.

      This may be useful to some people who need pictures: scienceblogs.com/cognitivedail…earchers-dont-understa-1/

      (Note that this is something that people who work with statistics daily don't intuitively understand, which is why best practice is to conduct explicit statistical tests and report p-values. But I understand that time is a limited commodity here).
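      Putting the rules of thumb above into practice: a minimal sketch of the explicit two-sample z-test, under the same normality assumption; the means and standard errors are invented for illustration:

      ```python
      import math

      def significantly_different(m1, se1, m2, se2, z=1.96):
          """Two-sample z-test on two means with known standard errors.
          True means significant at roughly the p < 0.05 level."""
          se_diff = math.sqrt(se1**2 + se2**2)  # SE of the difference
          return abs(m1 - m2) > z * se_diff

      # Invented faction means and standard errors, for illustration:
      print(significantly_different(0.54, 0.02, 0.48, 0.02))  # True
      print(significantly_different(0.52, 0.03, 0.49, 0.03))  # False
      ```

      This is also why overlapping bars can still differ significantly: the SE of a difference grows as the root-sum-of-squares of the two SEs, not their sum.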

    • Squirrelloid wrote:

      Was the assumption of normality tested in the data?
      Normality of what? The distribution of results for each faction?

      I'd say this is obviously not normal, and shouldn't be expected to be.
      The total distribution must be flat, and there is no way to get that from summing 16 sensible Gaussians, centred in vaguely the same place.
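      A quick simulation illustrates the point; the 16 centres and spreads are invented, but any similar choice gives the same picture:

      ```python
      import numpy as np

      rng = np.random.default_rng(0)

      # 16 Gaussians centred in vaguely the same place (invented
      # parameters, purely to illustrate the shape of the mixture).
      centres = rng.normal(0.5, 0.03, size=16)
      samples = np.concatenate(
          [rng.normal(c, 0.1, size=10_000) for c in centres]
      )

      # A flat (uniform) total would put roughly equal counts in every
      # bin; the mixture instead piles up around 0.5.
      counts, _ = np.histogram(samples, bins=10, range=(0.0, 1.0))
      print(counts)
      ```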


I have to agree with @Nicreap: using end placement points (which include sportsmanship, painting, etc.) distorts the data and obscures any signal related to balance. I'm fully cognizant that this is a data collection problem, but it's a pretty severe one.
Yes. To be fair though, this is typically less of an issue outside the US. We could easily re-analyse without US tournaments and see if it makes a difference. It wouldn't be conclusive of course, but it might be interesting.
Or you could track down the results, find out which tourneys had soft scores that made a difference, and remove just those tournaments.



      Edit:



The 95% confidence interval is ~2 SE, or double the displayed error bars, so you could productively double them. Note that 95% confidence intervals which overlap can still be significantly different (but if they don't overlap, the difference is definitely significant). The amount of overlap possible while still retaining a significant difference is approximately 25% of the length of the error bars. (Note that this depends on sample sizes; the 25% approximation is for equal sample sizes in the two groups.) So if you doubled all the confidence intervals, you'd be looking at approximately 95% confidence intervals, and could judge significance roughly from that.
The 95% interval being double the 68% interval depends on normality, no?

And all of this depends on comparing two armies. I'm not sure this should be done, particularly as there is some correlation in the data.
Instead, I would argue it is better to examine each army's departure from 0.5. At 95% I would expect them all to be within the error bars. At 68% some of them aren't.
So pick your confidence level according to the result you want
      :P
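      For concreteness, a minimal sketch of the departure-from-0.5 check I have in mind; the win and game counts are hypothetical, not the project's:

      ```python
      import math

      def null_band(games, z):
          """Interval around 0.5 that a balanced army's observed win
          rate should fall into; z=1.0 ~ 68%, z=1.96 ~ 95%."""
          se = math.sqrt(0.5 * 0.5 / games)  # binomial SE under p = 0.5
          return 0.5 - z * se, 0.5 + z * se

      # Hypothetical counts for one army (not the project's data):
      wins, games = 70, 110
      rate = wins / games
      for z, label in [(1.0, "68%"), (1.96, "95%")]:
          lo, hi = null_band(games, z)
          print(f"{label} band: {lo:.3f}..{hi:.3f}, observed {rate:.3f}")
      ```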
    • Looking at the raw data, I am not sure whether we can draw any conclusions from it at all. For example, ID has only 110 reported results. There is a pretty high probability that certain match-ups are not included in the data at all.

How many games are played at a standard tournament? Assuming 30 players with 5 games each would mean 75 matches per tournament. That would imply that only about 50 tournaments submitted data.
      I guess it is reasonable to assume that the tournament organizers submitting data are not randomly distributed, which would imply that a lot of the data is submitted by organizers with overlapping player groups.
If all tournaments had zero overlap in attending players, again assuming 30 players per tournament, that would mean 1500 players contributed to the data set. Realistically, at least half of the players attending tournaments for which we have data attended at least one other tournament contributing to the data set. Now we end up with maybe 1125 players contributing to the data. (I think that is a very optimistic guess.) That would mean we have roughly 30 more or less independent results for ID matches; that is not even two matches per possible match-up.
      That is nowhere close to enough to draw any conclusions from.
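      For anyone who wants to check the arithmetic, here it is step by step; every input is an assumption from the text above:

      ```python
      # Back-of-envelope check of the estimate above; every number is
      # an assumption from the post, not measured data.
      players = 30                                  # per tournament
      rounds = 5                                    # games per player
      games_per_tournament = players * rounds // 2  # 75 matches
      total_games = 3800                            # "~38xx" reported
      tournaments = total_games // games_per_tournament   # ~50

      naive_players = tournaments * players         # 1500 if no overlap
      # If half of them attended one other contributing tournament,
      # they are double-counted once: 1500 - 375 = 1125 unique players.
      unique_players = naive_players - (naive_players // 2) // 2
      print(tournaments, naive_players, unique_players)  # 50 1500 1125
      ```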

      I hope I did not mess up anywhere. It is still early and I only had one cup of coffee so far. :D
    • I think that overall everyone agrees that this data analysis could potentially contribute to improving the game, and that if it's possible to make useful analyses of the available data, it should be done.

I think the discussion about selection bias among the contributing players is important and should be considered. I do believe, however, that if figure 12 shows all armies overlapping using a 95% CI, we would have a fairly solid argument that the armies are reasonably balanced.

      In my opinion this figure should be the only output published, and it should be extensively explained. I personally don't think that p-value testing in this case would contribute much, but it's a disputed area in academia and honestly not suitable for a miniature wargame forum.

From what I can understand about the available data, no further inferences about the results can be made, and honestly only army effectiveness comparisons would be of interest for improving the game (although it's fun to see where people play, etc.).

If there are some web-developer enthusiasts out there who share the desire to improve the game through data analysis, I would suggest improving the available data by building an online platform on which organizers could manage their tournaments while they are running. This would make it easier for them to handle the results while contributing the data for further analysis by the 9th Age forum. It would also reduce the need for gathering data and allow adjusting for individual players, head-to-head army comparisons, etc.

An automatically updated figure 12 (with 95% confidence intervals) could thus be available on the 9th Age webpage.
But obviously this would require some work (by someone, not me).

And with the current data, I think the data management team would play their role best by just making this figure 12, describing what is used for adjustment, how the results are gathered, etc., and disregarding all the rest, as it contributes little or nothing to improving the game. I also suffer from an insatiable urge to present as many analyses as possible, but in my experience people just lose interest without understanding the core conclusions.
    • I like the distribution plots too, and the popularity plots.
      When I did my own analysis, these 2 (and the fig 12 type plot) were the 3 metrics I used.

      Some of the discussions during the 2.0 update were about the abilities of different armies to perform at the top tables relative to the bottom tables.
      So the distribution plots are very interesting to look at.
    • Arrahed wrote:

....Realistically, at least half of the players attending tournaments for which we have data attended at least one other tournament contributing to the data set. Now we end up with maybe 1125 players contributing to the data. (I think that is a very optimistic guess.) That would mean we have roughly 30 more or less independent results for ID matches; that is not even two matches per possible match-up.

      That is nowhere close to enough to draw any conclusions from.

      I hope I did not mess up anywhere. It is still early and I only had one cup of coffee so far. :D
I don't follow your logic; how do you get to 30 independent results? What is that, anyway? Assuming that armies are matched against each other randomly, it should be possible to ascertain each army's likelihood of victory against a random opponent. Obviously we wouldn't know much about each army against each specific other army, as some army A could theoretically always lose against army B and always win against army C, resulting in a victory rate of 0.5, which seems balanced.
    • DanT wrote:

      I like the distribution plots too, and the popularity plots.
      When I did my own analysis, these 2 (and the fig 12 type plot) were the 3 metrics I used.

      Some of the discussions during the 2.0 update were about the abilities of different armies to perform at the top tables relative to the bottom tables.
      So the distribution plots are very interesting to look at.
OK. Maybe there is something about the tournament dynamics that I don't understand. Are people not matched randomly? Do they not proceed to the "top" tables depending on whether they won their earlier battles?
My assumption is that if only 1125 different players contributed data to the overall 38xx submitted games, most of the recorded data is based on games played by the same players. And since player skill is a significant factor in the performance of an army, we need many more games played by different players to 'even that out'.

      My calculation:
1125 unique players out of ~4000 recorded games --> ~30% of the games are played by different players.
110 recorded ID games; 30% of those is roughly 30 games.
    • Producing any kind of in-depth data analysis for a game such as this must require an insane amount of work, so thanks to the team behind the submitted report.

However, the lack of any rationale or justification for why specific forms of analysis were used, or of what the intended analysis is meant to show, reduces the value of the report, which is therefore likely to be lost on most of us.

People reading this will have varying levels of statistical knowledge, and those of us who have had some training in statistical modelling come from varying disciplines with different opinions on how such analysis should be attempted. I think some written text aimed at someone with no prior knowledge of data analysis would be helpful in following the team's thought process here.

Drawing meaningful conclusions from game results is, in my opinion, almost impossible: you are never comparing like for like, and variation in player skill, list composition, scenario, and the randomness of dice rolls means that any attempt to infer the relative strength of an AB is unlikely to succeed. Even with large sample sizes, the amount of variation within just these four variables is likely to preclude any clear trends or statistical significance.

Superficial conclusions from some of the data may still be useful, but basing future decisions about the direction of ABs and rules changes on these kinds of analysis is likely to be intrinsically flawed.

      Hopefully this only forms a very small part of the decision making process.
    • I'm not sure how all the graphs are generated, but the army placement graph suggests that all armies are roughly equally likely to end up in the number 1-7 spots?

As for armies under- or over-performing, I think it is best to compare army scores to a fixed number rather than to one another. If there is a gradient, the lowest army may end up being (statistically) worse than the highest-scoring army; however, this does not mean it is significantly below the average of all armies.

That said, a graph could be made with the average score indicated as a thick line and a grey area above and below it indicating the two-standard-deviation range (or whatever we accept as within boundaries). All army averages should fall within this grey area (the median would be better here than the average, to avoid outlier interference). Armies outside the area are too strong or too weak; armies within it are considered on point. One army has to be the weakest and another the strongest, and minor changes to these armies could shift the meta enough to end up with all armies within one standard deviation.
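      A rough sketch of the kind of figure I mean, with invented per-army averages standing in for the real ones:

      ```python
      import numpy as np

      # Invented per-army average scores (fractions of available
      # points); the real values would come from the report's data.
      scores = {"ID": 0.47, "O&G": 0.44, "DL": 0.55, "EoS": 0.51,
                "VC": 0.50, "WDG": 0.53, "HE": 0.49, "SA": 0.52}

      vals = np.array(list(scores.values()))
      centre = np.median(vals)          # median resists outliers
      band = 2 * vals.std(ddof=1)       # grey area: +/- 2 SD

      for army, s in scores.items():
          verdict = "on point" if abs(s - centre) <= band else "outlier"
          print(f"{army:3s} {s:.2f} {verdict}")
      ```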
    • DanT wrote:

      I like the distribution plots too, and the popularity plots.
      When I did my own analysis, these 2 (and the fig 12 type plot) were the 3 metrics I used.

      Some of the discussions during the 2.0 update were about the abilities of different armies to perform at the top tables relative to the bottom tables.
      So the distribution plots are very interesting to look at.
Perhaps the number of times an army ends up in the top 5 compared to the number of times it ends up in the last 5?

Alternatively, one could generate a distribution plot with bins at 10% intervals. This would give a spread similar to the placement graph, but less detailed (as we use 10% intervals). Still, it should show how armies place across the board.
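      Something like this, say, where the placements and field sizes are made up:

      ```python
      import numpy as np

      def placement_spread(placements, field_sizes, bins=10):
          """Counts of finishing positions as fractions of field size,
          in 10% bins (0.0 = winner, 1.0 = last place)."""
          p = np.asarray(placements, dtype=float)
          f = np.asarray(field_sizes, dtype=float)
          frac = (p - 1) / (f - 1)
          counts, _ = np.histogram(frac, bins=bins, range=(0.0, 1.0))
          return counts

      # Made-up placements for one army across six tournaments:
      print(placement_spread([1, 4, 12, 25, 7, 18],
                             [30, 30, 40, 30, 20, 36]))
      ```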

      EDIT:
As for top tables versus lower tables: it would be good to know the number of times each player appears in the results. For example, if one excellent player wins a lot with Army X and goes to many tournaments, the skill of the player (not the strength of the army) will influence the top results.

Then again, this already happens in all of the analyses. And unless we can 'average' each player's results, we have to rely on greater numbers to even these out for us.
    • Arrahed wrote:

My assumption is that if only 1125 different players contributed data to the overall 38xx submitted games, most of the recorded data is based on games played by the same players. And since player skill is a significant factor in the performance of an army, we need many more games played by different players to 'even that out'.

      My calculation:
1125 unique players out of ~4000 recorded games --> ~30% of the games are played by different players.
110 recorded ID games; 30% of those is roughly 30 games.
Yes, well, there is something there. But assuming that individual players' skills do not influence their AB choice, losing/disregarding the data from the remaining 70% seems like a problematic approach. With 110 ID games, that gives a standard error of roughly 4.8%, assuming a binomial distribution. This means that a 95% CI extends about 9.3 percentage points on either side, and thus only very skewed armies would stand out. However, these are in reality the limitations we have to live with, and I still think it's a meaningful (and the most objective) approach for monitoring game balance. And it would improve over time. Another way is to have people report games through the community, allowing adjustment for individual players.
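      A short sketch of that calculation, for anyone who wants to reproduce the numbers:

      ```python
      import math

      # Reproducing the ~4.8% standard error quoted above: a binomial
      # win/loss model with p = 0.5 and the 110 recorded ID games.
      n, p = 110, 0.5
      se = math.sqrt(p * (1 - p) / n)      # ~0.048
      half_width_95 = 1.96 * se            # ~0.093

      print(f"SE = {se:.3f}")
      print(f"95% CI: 0.5 +/- {half_width_95:.3f}")
      ```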