05-23-2021, 06:13 PM   #10
xXOpkillerXx
Re: Poll: Which global skill rating system is best?

Alright, I'll try and make a structured statement.

First of all, there is a concern many people have pointed out, which is that any system would have outliers. While that is true, not all outliers are the same, and that difference should very much be considered. In all cases we should try to minimize the number of outliers, but it can be very difficult to compare counts across different types of outliers. At that point, a bit of subjectivity is involved and necessary.


Let's look at what types of outliers the two kinds of systems generate:

1. Weighted avg outliers:
These are essentially any and all outliers that come from the fact that our difficulty judgement is inherently flawed, mixed with inevitable imbalance in players' skillsets. The two points in this can be further explained:

1.1. Difficulty
We (FFR) use a single number to represent chart difficulty. Obviously, this has a relatively high and non-negligible degree of subjectivity. Other games like Etterna have attempted to fix this flaw by splitting difficulty into distinct skills, in effect forcing axioms for what defines difficulty at its core. This method can generally help distinguish files that are well balanced from ones that focus on 1 or 2 specific skillsets throughout. However, we simply don't do that, either because it has its own flaws or for various reasons unrelated to this topic. So, we have one single number representing the difficulty of each file, be it balanced or not.

1.2. Players skillsets
It's no surprise that each player has their own best and worst skills. Just like the files, some players' skillsets are well balanced, while others' are more specialized. How to compare the skill of two players can be argued, but my stance is that this statement should hold:

Player A's skillset: 3/3 for jacks, 2/3 for jumpstream, 1/3 for one-handed trills
Player B's skillset: 2/3 for jacks, 2/3 for jumpstream, 2/3 for one-handed trills
(The skills are just an example, but the numbers are important)

Player A and B are equal.

This has subjectivity in it, and I invite anyone to explain why they think player A should be considered the better player in this case. I personally believe we shouldn't favor specific skill proficiency over general proficiency. Anyone who agrees with this statement should make sure their preferred system respects it.

1.3. The outliers
Well, in a weighted system, where a non-random sample X of files is used to output a single number representing global skill rating, the above statement can never hold. For any score x1 in X, there will always be a score x2 that is favored over it, or vice versa. This means that any weighted system (with X of fixed size!) will, by definition, generate unfairness by favoring players with specific skillsets at any given level. When X is of variable size, it becomes -Incredibly- difficult to properly formalize the model, and therefore a lot of guessing is introduced. That is what the hyperparameters in WirryWoo's model are. By tweaking them, we adjust X's shape depending on a player's scores, but we can no longer tell what is favored (skillset specificity vs a varied skillset) nor to what degree. In my opinion, this is sub-optimal.

Again, this mostly revolves around the player comparison statement.
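To make this concrete, here is a minimal sketch assuming a made-up decaying-weight scheme (the decay factor and the 0-3 per-skill scores are illustrative only, not FFR's or WirryWoo's actual formulas). It shows the specialist player A from 1.2 coming out ahead of the equally skilled all-rounder B under any weighting, while a simple average treats them as equal:

```python
# Hypothetical example: weighted vs unweighted averaging of per-skill scores.
# The decay factor and the 0-3 skill scale are assumptions for illustration.

def weighted_rating(scores, decay=0.9):
    """Average of scores sorted best-first, the i-th weighted by decay**i."""
    ordered = sorted(scores, reverse=True)
    weights = [decay ** i for i in range(len(ordered))]
    return sum(w * s for w, s in zip(weights, ordered)) / sum(weights)

def simple_rating(scores):
    """Plain unweighted average."""
    return sum(scores) / len(scores)

player_a = [3, 2, 1]  # specialist: 3/3 jacks, 2/3 jumpstream, 1/3 trills
player_b = [2, 2, 2]  # all-rounder: 2/3 in every skill

# The unweighted average says A and B are equal (both 2.0)...
print(simple_rating(player_a), simple_rating(player_b))

# ...but the weighted average favors the specialist (~2.07 vs 2.0).
print(weighted_rating(player_a), weighted_rating(player_b))
```

For any strictly decreasing weights, the specialist's weighted average lands above their plain average while the all-rounder's stays put, which is exactly the built-in bias described above.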



2. Simple avg outliers:
A simple average system also generates outliers. These are much more straightforward. In fact, such a system implies an important statement about skill rating:

Any player that has a rating representative of their actual skill level has optimally filled their top X scores.

This means that if X is of size 50, a player needs 50 scores of their caliber to be properly ranked. Any player whose top 50 falls short of that will have a rating lower than their true skill level.
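As a sketch of that requirement (assuming each score is already a single per-chart rating number, and treating unfilled slots as zeros; both are assumptions for illustration):

```python
# Hypothetical top-X rating: mean of the best X per-chart scores,
# with empty slots counting as zero. X and the score scale are assumptions.

def top_x_rating(scores, x=50):
    top = sorted(scores, reverse=True)[:x]
    top += [0.0] * (x - len(top))  # unfilled slots drag the rating down
    return sum(top) / x

# A player whose "true" level is 80 but who has only 25 such scores
# is rated at half their level until they fill their top 50:
print(top_x_rating([80.0] * 25))  # 40.0
print(top_x_rating([80.0] * 50))  # 80.0
```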

The main downside to this is pretty simple too:
If, over time, too many players don't optimally fill their top X, then the rankings will be flawed. These players are essentially the outliers of this type of system.

My primary (subjective) counter to this downside is that I absolutely cannot understand why we should think it's too much to ask from players who want to be ranked. Playing 50, or even 100, songs in your difficulty range should Not be troublesome; if you want to be properly ranked but can't be bothered to fulfill this pretty simple requirement, do you even really care to begin with? Saying that an unweighted system "favors active players" is quite the overstatement in my opinion. You don't need to be that active a player to fulfill the requirement.


3. Comparison of outliers
So we have defined the kinds of outliers each system will inevitably have. My main concern with saying "outliers are outliers" is that the two kinds are actually drastically different conceptually.

The weighted models' outliers are unfair. Some players will always be favored no matter how the weights are arranged. In a variable-size X setting, the outliers may be reduced, but only by an undefined amount, and they become hard to model.
The unweighted model's outliers are fair. Any player can easily stop being an outlier by getting some more scores in their difficulty range.

Now obviously the number of outliers in the two cases will differ. Naturally, at the very beginning of a transition to an unweighted system, there would be many more of them. This means a stabilization period would follow, during which players get more scores at their own pace to more optimally fill their top X. There will always be players who won't do it, and retired players may well never come back to adjust their scores. However, any change to the skill rating computation will require Some adjustment from players to get a more optimal result, so keeping retired players' rankings as-is is just not a possibility (although some systems may yield closer results, the point remains).

3.1. My take on the outliers
At the end of the day, I personally favor fairness over count when it comes to these outliers. That being said, I would totally be ok with moving back to a weighted system if, after an arbitrarily long stabilization period with an unweighted system, there is still not enough effort from the players to make their top X reflect their actual skill level. That would be quite sad, but FFR does have its periods of low activity, and too little of it would indeed mean a weighted system is required. I don't think we have too little currently, but that's mostly subjective and debatable.

4. Common arguments
Here are some arguments people usually make which I'd like to address:

4.1 Rewarding outstanding scores
There is this thought that a weighted system better rewards the rare great scores players get every once in a while. While that is definitely true, it doesn't mean an unweighted system doesn't reward them; it just does so to a lesser degree, in order to respect the important statement made in 1.2! A great score is still rewarded as the number 1 score in the top X. A player with the same average skill as you will be ranked lower because of that new score you got. If they're not ranked lower despite that sick score, it means they're better than you on average, that is all.
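Quantitatively, one outstanding score still moves an unweighted top-X rating by a well-defined amount: it displaces the worst score currently in the top X, so the mean shifts by (new - displaced) / X. A small sketch with made-up numbers:

```python
# How much one great score moves an unweighted top-X rating.
# Values are hypothetical; the point is the (new - displaced) / X formula.

X = 50
displaced = 70.0   # the worst score currently in the top 50
new_score = 95.0   # the rare outstanding score

gain = (new_score - displaced) / X
print(gain)  # 0.5 rating points
```

So the reward is smaller than under steep weighting, but it is never zero.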

4.2 What about the top players who won't have an optimal top X?
Yes, if Myuka doesn't play more and a top 50 unweighted system is implemented, they will have a skill rating far from representative. To be honest, I couldn't care less. There are countless players from other rhythm games who we know could take top spots on FFR. Even though they haven't played a single game here, the fact that we Know roughly where they'd place applies equally to our current top players who might never "fix" or "fill" their ranked scores. Yes, it looks funny to see Myuka ranked 100th or whatever, but really that's a small argument to justify unfairness in a system's outliers. Does this mean we reward activity? No, not really. It means we enforce a (relatively small) minimum of activity over a player's whole "FFR career" in order to have a representative skill rating. Rewarding activity would be done with seasons, where the same concepts are applied to definite, repeating timeframes and stats are reset each iteration.


5. Conclusion
I hope this post clarifies why I believe an unweighted top X (of size 50 or 100) is preferable in our case. I am very aware of the flaws of such a system, but I definitely think they are significantly "better" flaws than a weighted system's flaws.

Last edited by xXOpkillerXx; 05-23-2021 at 08:27 PM..