Old 05-23-2021, 10:58 PM   #13
WirryWoo
Default Re: Poll: Which global skill rating system is best ?

Quote:
Originally Posted by xXOpkillerXx View Post
We (FFR) use a single number to represent chart difficulty. Obviously, this has a relatively high and non-negligible degree of subjectivity. Other games like Etterna have attempted to fix this flaw by splitting the difficulty into distinct skills, kind of forcing axioms for what defines difficulty at its core. This method can generally help distinguish between files that are well balanced vs the ones that focus on 1 or 2 specific skillsets throughout. However, we simply don't do that, either because it has its own flaws, or for various reasons unrelated to this topic. So, we have one single number representing the difficulty of each file, be it balanced or not.
This is fine. Despite potential areas of improvement with how difficulties are determined, we can assume for the sake of conversation that these values are accurate for each file in game.

Quote:
Originally Posted by xXOpkillerXx View Post
It's no surprise that each player has their own best and worst skills. Just like the files, some players' skillsets are well balanced, while others' are more specific. Comparison of skill between two players can be argued, but my stance is that this statement should hold:

Player A's skillset: 3/3 for jacks, 2/3 for jumpstream, 1/3 for one-handed trills
Player B's skillset: 2/3 for jacks, 2/3 for jumpstream, 2/3 for one-handed trills
(The skills are just an example, but the numbers are important)

Player A and B are equal.

This has subjectivity in it, and I invite anyone to explain why they think player A should be considered the better player in this case. I personally believe that we shouldn't favor specific skill proficiency over general proficiency. Any person that agrees with this statement should make sure their preferred system respects it.

...

Well, in a weighted system, where a non-random sample X of files is used to output a single number representing global skill rating, the above statement can never hold. For any score x1 in X, there will always be a score x2 that is either favored or vice versa. This means that any weighted system (with X of set size !), by definition, will generate unfairness by favoring players with specific skillsets at any given level. When X is of variable size, it becomes -Incredibly- difficult to properly formalize the model, and therefore a lot of guessing is introduced. That is what WirryWoo's model's hyperparameters are. By tweaking these, we adjust X's shape depending on a player's scores, but we can no longer tell what is favored (skillset specificity vs varied skillset) nor to what degree it is. In my opinion, this is sub-optimal.

Again, this mostly revolves around the player comparison statement.
Your proposed experimental design translates to comparing Player A under weighted skill ratings and Player B under unweighted skill ratings, with the assumption that Players A and B hold a very similar skillset. From my understanding, this comparison is inconclusive for determining why an unweighted setting is better designed than the weighted variant. If you truly want to design an experiment aimed at comparing the two approaches (weighted vs unweighted), ideally you'd want to keep all other variables as constant as possible. Specifically, the experiment would have to be something closer to this (weighted hypothesis vs. unweighted hypothesis):

Player A's skillset (under the weighted hypothesis): 3/3 for jacks, 2/3 for jumpstream, 1/3 for one-handed trills
Player A's skillset (under the unweighted hypothesis): 2/3 for jacks, 2/3 for jumpstream, 2/3 for one-handed trills
(constants: Player A and their skill set)

The only conclusion you can draw from this experiment is that Player A is clearly more rewarded in the weighted setting for being able to score well on files demanding jack patterns. And yes, this is a valid consequence that cannot be controlled in the weighted setting, due to a) the nature of high scores being able to exploit a player's strengths and weaknesses, and b) by definition of weighted, no matter what weight assignment you make, there will be no way to fully avoid giving this "reward". Our current skill rating system does this so drastically that it's hard to judge the weighted approach fairly from it alone, and quite frankly, I agree that the current weight assignments need a full revamp to design a better system. Going to put an asterisk here because I will refer back to this point (*).
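To make this concrete, here's a quick Python sketch with made-up numbers and a made-up steep weight curve (this is not FFR's actual formula and not my model, just the shape of the argument): a steep weighting lets a specialist's few standout scores carry their whole rating, while a plain average treats them basically the same as a balanced player.

Code:
# Illustration only: hypothetical scores and an arbitrary steep weight curve,
# not FFR's actual formula and not my model.

def rating(scores, weights):
    """Weighted average over a player's best scores, highest first."""
    top = sorted(scores, reverse=True)[:len(weights)]
    used = weights[:len(top)]
    return sum(w * s for w, s in zip(used, top)) / sum(used)

specialist = [95, 94, 93, 92, 70, 69, 68, 67, 50, 49]  # strong jacks inflate the top
balanced   = [72] * 10                                 # no standout skill

steep   = [2 ** -i for i in range(10)]  # weighted setting: top scores dominate
uniform = [1.0] * 10                    # unweighted setting: plain average

print(rating(specialist, steep), rating(specialist, uniform))  # ~92.7 vs 74.7
print(rating(balanced, steep), rating(balanced, uniform))      # 72.0 vs 72.0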

Furthermore, if you want to compare multiple players under a weighted setting, you'd have to design the experiment as follows (again, keeping everything else constant):

Player A's skillset: 3/3 for jacks, 2/3 for jumpstream, 1/3 for one-handed trills
Player B's skillset: 3/3 for one-handed trills, 2/3 for jacks, 1/3 for jumpstream
Player C's skillset: 3/3 for jumpstream, 2/3 for one-handed trills, 1/3 for jacks
(constants: weighted weight assignments)

From the experiment above, you clearly see that Player A gets rewarded for jacks, B gets rewarded for one-handed trills, and C gets rewarded for jumpstream in the weighted setting. Although each player is rewarded skill rating for different reasons, the songs made available to each player are, for the most part (**), constant (i.e. each player has the same opportunity to try and perform well on each song). So each player's individual performance per song is consistently factored into the overall skill rating computation in the weighted setting. The unweighted setting does exactly this too, just under a more conservative (uniform) set of coefficients.

(**) The only exceptions are songs unlocked via skill tokens and event tokens, but both the weighted and unweighted settings deal with this issue in the same way. These songs also represent a small percentage of the total options provided to each user, so their impact in both settings won't be too drastic. Therefore, these minor cases don't affect the comparison between weighted and unweighted.

In response to "By tweaking these, we adjust X's shape depending on a player's scores": isn't it necessary to tweak in accordance with the data provided? A few examples showcasing the difference between the weighted and unweighted settings:

Player A: https://www.flashflashrevolution.com...me=Chloe_edz15 (Weighted: 0 (flagged as inconclusive), Unweighted: 7.83)
Player B: https://www.flashflashrevolution.com...ername=Soure97 (Weighted: 93.25, Unweighted: 74.67)
Player C: https://www.flashflashrevolution.com...=Guilhermeziat (Weighted: 87.7044, Unweighted: 52.17)
(there are more examples)

It's clear that a mere Top 100 average is not sufficient for players who half-ass their Top 100 and barely meet the requirements of being ranked (playing >100 songs). It's fine to enforce a minimum number of games required to be ranked on the leaderboards, but from the three examples above, there need to be better measures than just relying on the Top 100 average. How do you mathematically define whether someone's Top 100 is not reliable? You call this weighted approach "justifying their laziness"; I call it "trying to extract the best information possible from limited data".
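I won't reproduce my actual model here, but as a rough illustration of the kind of check I mean (the thresholds and the spread rule below are made up for the example), you can flag a Top 100 as unreliable when it's underfilled or wildly lopsided:

Code:
# Rough illustration only; the thresholds and the spread rule are hypothetical,
# not the actual logic in my model.

def top100_is_reliable(scores, min_games=100, max_relative_spread=0.5):
    """Flag a Top 100 as unreliable if it is underfilled or wildly lopsided."""
    if len(scores) < min_games:
        return False                       # not enough games to be ranked at all
    top = sorted(scores, reverse=True)[:100]
    if top[0] <= 0:
        return False                       # nothing meaningful to rate
    spread = (top[0] - top[-1]) / top[0]   # how far the tail lags behind the best score
    return spread <= max_relative_spread   # a huge gap suggests a half-filled Top 100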

In terms of "not knowing what is favored (skillset specificity vs varied skillset) nor to what degree it is", this is where hyperparameters are created to set these rules. I created this alpha hyperparameter to simplify a lot of question marks that no one in the community has collectively been able to address. In this case, do we define skill as being a jack of all trades and a master of none, or as being a one-trick pony who successfully maximizes their skill rating? I don't know... This alpha controls how conservative we want the system to be, since the answer to the previous question is highly community-dependent and cannot be easily determined from the scores given to me. It's only sub-optimal because there is no objective criterion measuring the best way to define skill; that's practically impossible. The best I can do is give the community control to define that, however the hell they want... However optimal or not this approach is, it's the best we can do in an attempt to design a robust tentative model catered to FFR, until rhythm game skill determination and stepfile difficulty measurements are fully standardized across the entire rhythm game community (good luck getting that lmfao).
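To be clear about what a knob like alpha does conceptually (the sketch below is not my actual parameterization, just the idea of a single hyperparameter sliding between "reward the peak" and "plain average"):

Code:
# Conceptual sketch only: one knob interpolating between a top-heavy weighting
# and a near-uniform one. Not my model's actual parameterization.

def rank_weights(n, alpha):
    """Geometric weights over ranks 1..n, normalized to sum to 1."""
    w = [alpha ** i for i in range(n)]
    total = sum(w)
    return [x / total for x in w]

print(rank_weights(100, 0.80)[:5])   # top-heavy: rewards skillset specificity
print(rank_weights(100, 0.999)[:5])  # nearly uniform: behaves like the unweighted average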

Quote:
Originally Posted by xXOpkillerXx View Post
A simple average system also generates outliers. These are much more straightforward. In fact, such a system implies an important statement about skill rating:

Any player that has a rating representative of their actual skill level has optimally filled their top X scores.

This means that if X is of size 50, then a player should have 50 scores of their caliber to be properly ranked. Any player whose top 50 is not that will have their rating be lower than their true skill level.

The main downside to this is pretty simple too:
If, over time, too many players don't optimally fill their top X, then the rankings will be flawed. These are essentially the outliers of this type of system.

My primary argument (subjective) to support this downside is that I absolutely cannot understand why we should think that it's too much to ask from players who want to be ranked. Playing 50, or even 100 songs in your difficulty range should Not be troublesome; if you want to be properly ranked but cannot be bothered to fulfill this pretty simple requirement, do you even really care to begin with? Saying that an unweighted system "favors active players" is quite the overstatement in my opinion. You don't need to be that active of a player to fulfill the requirement.
It's perfectly fine to enforce a minimum requirement in both settings (it's probably better in both cases, because it's ridiculous to assign a skill rating to someone with a single song played). This is less of a problem to me than what I wrote previously, but one of the main drawbacks I can see with the unweighted system is that it is forced to impose this minimum requirement on the players in order to work at all. Because of this forced requirement, you are requiring everyone who hasn't played 50 to 100 songs to play (ideally seriously) in order to be considered ranked and to improve the representation of the unweighted rankings. So there is a huge reliance on the players to do their part in making the unweighted system work. This isn't realistic in practice, and this is why I call the unweighted system much more favorable to "active players". The ones who are committed to contributing to the high scores will be the ones who make the unweighted setting work.

The weighted system I designed is a lot more lenient about the minimum requirement (we are free to choose this requirement independently of the model's development). You can choose any reasonable minimum requirement for each player to satisfy, and regardless of whether that requirement is met, the model attempts to find the best representation of skill using the weighted setting. Those who don't meet the minimum requirement will simply be excluded from the high scores via a defined conditional filter (e.g. don't show a username in the high scores if they haven't played 50 or 100 songs).
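In other words, the minimum requirement in the weighted setting is just a display filter on top of the model, something like this (field names made up for the example):

Code:
# Sketch of the "conditional filter" idea: rate everyone, display only those who
# meet whatever minimum we pick. Field names here are made up for illustration.

MIN_GAMES = 100  # freely chosen, independent of how the rating itself is computed

def visible_leaderboard(players):
    ranked = [p for p in players if p["games_played"] >= MIN_GAMES]
    return sorted(ranked, key=lambda p: p["skill_rating"], reverse=True)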

Because the game is set to scale as new stepcharts are continually added, there are also more opportunities to fluke your scores. Here is what would be required to re-tune the skill ratings in each case:

Unweighted:
• Determine new n for Top n average (otherwise we can hypothetically get 100 fluke scores)
• Ask every player acquainted with the old unweighted system to update their ranks and play seriously (do this, or sacrifice accuracy of skill rating representation in high scores, your call.)

Weighted:
• Determine new n for Top n average (otherwise we can hypothetically get 100 fluke scores)
• Change head and alpha in my code and let the algorithm do its magic, independent of player involvement.

Which one is better suited for scalability reasons?

In terms of design, seasons ratings should be catered to community members who are actively involved in playing the game; skill ratings should be catered to all historic performances done over the course of FFR's lifetime. Many people from Etterna hop on FFR occasionally to play a few songs just to get onto FFR leaderboards, so shouldn't their skill be acknowledged under the definition of "skill rating"? If you disagree with this, maybe you should consider hopping on the "make seasons ranking more official than skill rating in FFR" train (I'm more indifferent about that one). The name says "skill rating", so shouldn't the metric we design focus only on the player's skills given the data we have, regardless of how limited it is?

Quote:
Originally Posted by xXOpkillerXx View Post
3. Comparison of outliers
So we have defined the kind of outliers that each system will inevitably have. The main concern I have with saying that "outliers are outliers" is that they're actually drastically different conceptually.

The weighted models' outliers are unfair. Some players will always be favored no matter how weights are arranged. In a variable X size setting, the outliers may be reduced, but only by an undefined amount, and they become hard to model.
The unweighted model's outliers are fair. Any player can easily stop being an outlier by getting some more scores in their difficulty range.

Now obviously the amount of outliers in both cases will differ. Naturally, at the very beginning of a transition to an unweighted system, there would be many more of them. This means that a stabilization period would follow, during which the players will get more scores at their own pace to more optimally fill their top X. There will always be players who will not do it, and retired players may definitely not come back to adjust their scores for this. However, any change to the skill rating computation will require Some adjustment from the players to get a more optimal result, so keeping retired players' rankings as is is just not a possibility (although some systems may yield closer results, the point remains).
Based on what I discussed above:
"The weighted models' outliers are unfair." False. It's only unfair if you purposely assign different weighting mechanisms to two different players to further exploit their strengths and hide their weaknesses (like the experiment you proposed in your initial post)
"The unweighted model's outliers are fair." Player dependent. This is within control of the player to make the weighted mechanism work and therefore, to classify themselves as a "fair" or "unfair" outlier. I personally have a strong preference for a robust model independent of the player's involvement and the quality of the data given to me.

Quote:
Originally Posted by xXOpkillerXx View Post
3.1. My take on the outliers
At the end of the day, I personally favor fairness over count when it comes to these outliers. That being said, I would totally be ok with moving back to a weighted system if, after an arbitrarily long stabilization period with an unweighted system, there is still not enough effort from the players to make their top X reflect their actual skill level. That would be quite sad, but FFR does have its periods of low activity, and too little of it would indeed mean a weighted system is required. I don't think we have too little currently, but that's mostly subjective and debatable.
This is your reliance on the players speaking here. You need the players to do their part to make the unweighted setting work. No need for this in the weighted setting.

Quote:
Originally Posted by xXOpkillerXx View Post
4. Common arguments
Here are some arguments people usually make which I'd like to address:

4.1 Rewarding outstanding scores
There is this thought that a weighted system better rewards rare great scores players get every once in a while. While that is definitely true, it doesn't mean that unweighted doesn't reward it; it just does so to a lesser degree to respect the important statement made in 1.2! A great score is still rewarded as the top 1 score in the top X. A player with the same average skill as you will be ranked lower due to that new score you got. If they're not ranked lower despite that sick score you got, that means they're better than you on average, that is all.

4.2 What about the top players who wont have an optimal top X ?
Yes, if Myuka doesn't play more and a top 50 unweighted is implemented, they will have a skill rating far from representative. To be honest, I couldn't care less. There are countless players from all other rhythm games who we know could be in top spots on FFR. Granted they haven't played a single game, the fact that we Know they'd place around a certain spot is also applicable to our current top players who might never "fix" or "fill" their ranked scores. Yes, it looks funny to see Myuka be ranked 100th or whatever, but really that's a small argument to back unfairness in system outliers. Does this mean we reward activity? No, not really. It means we enforce a (relatively small) minimum of activity over a player's whole "FFR career" in order to have a representative skill rating. Rewarding activity would be done with seasons, where the same concepts are applied to definite, repeating timeframes where stats are reset each iteration.


5. Conclusion
I hope this post clarifies why I believe an unweighted top X (of size 50 or 100) is preferable in our case. I am very aware of the flaws of such a system, but I definitely think they are significantly "better" flaws than a weighted system's flaws.
4.1: Agreed. At the end of the day, both weighted and unweighted systems reward players for outstanding scores that land in their Top 100. It's just a matter of how much reward you want to assign. Should your #1 feel equally as rewarding as your #100 based on the weights given, or more? I think for most people, the answer to that question is "more". Let's design a system that honors the player's hard work accordingly.

4.2: This is a valid argument for the question "Should seasons rating be more official than skill ratings?" Seeing Myuka ranked 100th would be very frustrating from the player-experience side. Every now and then, you'll see a high D7 player post "I just beat Myuka's skill rating lmfaoooo!!" on the forums. Is this the sort of dynamic you want skill ratings to create on FFR? Yeah... I don't think so.

A few more thoughts:

Quote:
Originally Posted by WirryWoo View Post
• If Top X songs are set to define skill rating (after many discussions on what X should be) and if skill rating is consistently used as a comparative tool to measure performance between two files demanding different skills and requirements, then the X-th and (X+1)th songs should hold very similar weights and (X+1)th song is weighted at 0 by definition of Top X.
This is also well aligned with the point I made in 4.1. Specifically, since your 101st score receives weight 0, your 100th score should receive weight ~0. You probably don't even notice when you get a score that barely makes it into your Top 100, so the weighting should also reflect how "significant" those scores are to you as a player (chances are, you don't care as much about your #100 because you know you can improve it if you play more, so it should receive the least weighting in terms of skill rating). This aligns with my definition of a "well-designed" system, where the proposed weights are most reflective of the player's experience while maintaining the representation of the player's skills. I recently scored high teens on do i smile? (my current #2) and I was fucking proud because I worked hard to get that. I also scored low teens on LeaF Style Super Shredder 3 (my current #97), but I didn't care because I knew I could do better if I played more. I even had to look up what some of my #50-100 songs are, because I don't care about them as much as I care about my top scores. I didn't even remember scoring my #97 lmfao. Intuitively, those weights need to be reflective of the player's overall experience while accurately reflecting their skill set.
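One simple scheme that satisfies this boundary condition (shown purely as an example, not as the final weighting) is a linear taper over the Top 100, where the weight on #100 is already tiny and the drop to the zero-weighted #101 is negligible:

Code:
# Example only: a linear taper over the Top 100 that makes the weight on #100
# nearly 0, so there's no cliff between #100 (counted) and #101 (weight 0).

N = 100
raw = [N - i for i in range(N)]          # 100, 99, ..., 1 for ranks #1 down to #100
total = sum(raw)
w = [x / total for x in raw]             # normalize so the weights sum to 1

print(round(w[0], 5), round(w[-1], 5))   # ~0.0198 for #1, ~0.0002 for #100, 0 for #101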

Quote:
Originally Posted by WirryWoo View Post
• Although the weighted mechanism rewards you for "biases" encoded in the performance of files in your Top N more than the unweighted mechanism does, a regulated (key word here) solution should still capture many benefits that the unweighted solution provides: specifically and most importantly, a stronger representation of lower-ranked files in your Top X is needed to determine a user's skill rating.
Back to (*): this is where the word regulated plays the biggest role in my statement here. The current system highly favors the Top 10ish songs. That is not regulated, because you get rewarded wayyy tooo much for scoring your #1 and pretty much nothing for scoring your #15. So what does "regulated" mean here? All it means is that we need to control the weights appropriately, so that each song gets a meaningful piece of the skill rating metric while the weights stay reflective of the player's experience. Controlling the weights also includes dealing with outliers like people not completing their Top 100 and people half-assing their Top 100. This is why I proposed a linear progression of the weightings: although my satisfaction between my #20 and my #15 will be different from someone else's satisfaction between their #20 and #15, a linear progression distributes the weights as consistently as possible without implying that #15 is dramatically more important than #20 (unlike our current system, which says your #1 is >140 times more important than your #15 lmfao).
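To show what I mean by the difference in scale (the geometric curve below is made up purely to illustrate a steep scheme; it is not the current FFR formula, whose >140x figure is the one quoted above):

Code:
# Hypothetical contrast between a steep geometric decay and the linear taper I'm
# proposing. The 0.7 decay is made up; it is NOT the current FFR weighting.

N = 100
linear = [N - i for i in range(N)]       # 100, 99, ..., 1
steep  = [0.7 ** i for i in range(N)]    # arbitrary steep decay for illustration

print(linear[0] / linear[14])   # ~1.16: under a linear taper, #1 barely outweighs #15
print(steep[0] / steep[14])     # ~147: under a steep decay, #1 dwarfs #15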

Last but not least, I see a partial analogy (similar, but not exactly the same) between the weighted vs. unweighted conversation and chess's Elo system. When measuring someone's chess rating, are you judged by your win percentage, where each win counts as a +1 and each loss counts as a 0, or are you judged by how consistently you beat players better than you? The reasons aren't exactly the same as why I'm advocating for weighted, but there is a reason the first few games of chess are weighted the most and contribute the most to your Elo rating. If you lose games you are expected to win, you need to win a number of similar-level games to prove that you deserve a higher rating. Your skill is not judged by win percentage. Are there any settings you can think of where your skill is determined by your win percentage? I can't think of many myself.
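For reference, the standard Elo update looks like this (the usual textbook formula, nothing FFR-specific); many implementations also use a larger K-factor for new or provisional players, which is why your first rated games move your rating the most:

Code:
# Standard Elo update: expected score depends on the rating gap; the actual
# result shifts your rating by K * (result - expected). Not FFR-specific.

def elo_update(r_a, r_b, result_a, k=20):
    """result_a is 1 for a win, 0.5 for a draw, 0 for a loss."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    return r_a + k * (result_a - expected_a)

print(elo_update(1500, 1700, 1))  # upset win over a stronger player: ~+15.2
print(elo_update(1500, 1300, 1))  # expected win over a weaker player: ~+4.8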

Quote:
Originally Posted by trumaestro View Post
Spitballing: how about a bit of both sides?

Equal weights for top X scores. Decreasing weights to next Y scores.

I'm not math-y enough to work out whether that addresses any of the issues here, but it seems to me that combining sides here could help mitigate the downsides of each.
Cool suggestion, but I personally don't think this would be reflective of the player's overall experience or an accurate measure of skill. Specifically, would you feel equally accomplished scoring your #15 vs. your #1? Does it take the same amount of skill to successfully perform your #1 vs. your #15? These are the questions that need to be considered for this hybrid setting. In my opinion, I don't think this will work, for the reasons discussed previously.
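For completeness, here's roughly what I understand the suggestion to mean (X and Y are arbitrary here, just so we're debating the same thing): flat weights for the top X scores, then a decreasing taper for the next Y.

Code:
# Sketch of trumaestro's suggestion as I read it: equal weights for the top X,
# then linearly decreasing weights for the next Y. X and Y are arbitrary here.

X, Y = 25, 75
weights = [1.0] * X + [(Y - i) / Y for i in range(Y)]
total = sum(weights)
weights = [w / total for w in weights]      # normalize to sum to 1

print(weights[0] == weights[X - 1])         # True: your #1 counts exactly like your #25
print(round(weights[X] / weights[-1]))      # 75: the taper from #26 down to #100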