Posted by xXOpkillerXx, 05-24-2021, 07:55 AM (post #14)
Re: Poll: Which global skill rating system is best?

Let me go through this point by point.

1. The definition of skill with examples

Quote:
Your proposed experimental design translates to comparing Player A under weighted skill ratings and Player B under unweighted skill ratings with the assumption that Player A and B hold a very similar skillset.
Sadly, none of this is true. I thought I had made the example simple enough to be understood by everyone, but I guess I failed to do that. Firstly, the experiment was 100% independent of the rating system. It solely compared a "skill-specific" player to a "generalist" player, and made the claim that both should be rated equally. So not only did you misinterpret this, you also gave an example which almost exactly demonstrates what is wrong with weighted:

Quote:
Furthermore, if you want to compare multiple players under a weighted setting, you'd have to design the following experiment as follows (again, keeping everything else constant):

Player A's skillset: 3/3 for jacks, 2/3 for jumpstream, 1/3 for one-handed trills
Player B's skillset: 3/3 for one-handed trills, 2/3 for jacks, 1/3 for jumpstream
Player C's skillset: 3/3 for jumpstream, 2/3 for one-handed trills, 1/3 for jacks
(constants: weighted weight assignments)
In the example you give, there are only skill-specific players. Add to that the generalist:

Quote:
Player D's skillset: 2/3 for jumpstream, 2/3 for one-handed trills, 2/3 for jacks
Suddenly, the skill rating ordering of those 4 players becomes the following (in a weighted system):

A ~= B ~= C > D

Whereas in an unweighted setting, this is what it'd look like:

A ~= B ~= C ~= D

This is mathematically unavoidable, and is the very definition of what I call unfair.
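
To make this concrete, here is a minimal sketch of the comparison above. The geometric decay is an assumption purely for illustration (it is not FFR's actual weighting formula); any strictly decreasing weighting produces the same ordering.

Code:
# Hypothetical setup: 10 scores per skill, each valued by the player's
# strength in that skill (3, 2, or 1). The 0.9 decay rate is an assumption.

def weighted_rating(scores, decay=0.9):
    """Sort scores best-first and average with geometrically decaying weights."""
    best_first = sorted(scores, reverse=True)
    weights = [decay ** i for i in range(len(best_first))]
    return sum(w * s for w, s in zip(weights, best_first)) / sum(weights)

def unweighted_rating(scores):
    """Plain average, no weights."""
    return sum(scores) / len(scores)

players = {
    "A": [3] * 10 + [2] * 10 + [1] * 10,  # jacks 3/3, jumpstream 2/3, trills 1/3
    "B": [3] * 10 + [2] * 10 + [1] * 10,  # same shape, different skills
    "C": [3] * 10 + [2] * 10 + [1] * 10,
    "D": [2] * 30,                        # generalist: 2/3 in everything
}

for name, scores in players.items():
    print(name, round(weighted_rating(scores), 3), unweighted_rating(scores))
# Weighted: A ~= B ~= C (~2.6) > D (2.0). Unweighted: all four equal at 2.0.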


The following paragraph then asks how skill should be defined:
Quote:
In terms of "not knowing what is favored (skillset specificity vs varied skillset) nor to what degree it is", this is where hyperparameters are created to set these rules. I created this alpha hyperparameter to simplify a lot of question marks that no one else have collectively been able to address within the community. In this case, do we define skill as being a jack of all trades and a master of none, or being a one trick pony due to successfully being able to maximize your skill ratings? I don't know... This alpha controls how conservative we want this system to be since the answer to the previous question is highly community dependent and cannot be easily determined by the scores given to me. It's only sub-optimal because there is no objective criterion measuring the best way to define skill; it's practically impossible. The best I can do is give the community control to define that, however the hell they want... Despite however optimal or not this approach is, it's the best that we can at least do in an attempt to designing a robust tentative model catered to FFR until rhythm game skill determination and stepfile difficulty measurements are fully standardized across the entire rhythm games community (good luck getting that lmfao).
Well, in my opinion, it should be defined as the equality I described above (for an optimally filled top X).


2. Scaling of systems over time

2.1. Top X size
Say we put X at 50. A player's skill rating would become biased toward a single skill (in any system) once they have achieved at least 25 scores focused on that skill.

Now say a player's level should be based mostly on files within ±5 levels of their skill level (assuming enough files exist in that range). We also know that file difficulties range between 1 and ~120 (for simplicity).

This means that every new file has a (5+5) / 120 = 8.333% chance of being in your range.

FFR releases files at a rate of ~4 files per week, plus an additional ~80 files for events yearly. Per year, that's around 4 * 52 + 80 = 288 files, so let's round this up to 300 to account for events I might be forgetting (a higher number favors your argument). This means that every year, about 300 * 8.333% = 25 files fall in your specific range (assuming you stay at the same level).

To reach the necessary 25 skill-specific files, you would need a minimum of roughly one year of content if all files were specifically biased toward your strong skill. If we consider the fact that there are various skills, let's say 5 (a bit fewer than Etterna's), that means it takes about 5 years before there is a high probability that some players have enough files to generate a biased skill rating despite it being a top 50.
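
As a quick back-of-the-envelope check of those numbers (all constants come from the paragraphs above; the ±5 window and the 1-120 scale are my simplifying assumptions, not real FFR data):

Code:
# Sanity check of the release-rate argument, in Python.
window = 5 + 5           # +-5 levels around your skill level
difficulty_range = 120   # simplified 1-120 difficulty scale
p_in_range = window / difficulty_range       # ~0.0833

files_per_year = 4 * 52 + 80                 # = 288, rounded up to 300
in_range_per_year = 300 * p_in_range         # ~25 files per year

num_skills = 5
needed = 25              # majority of a top 50
# Worst case: only 1 in num_skills of those files targets your strong skill.
years = needed / (in_range_per_year / num_skills)
print(p_in_range, in_range_per_year, years)  # 0.0833..., 25.0, 5.0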

However, we all know that there aren't that many extremely biased files after nearly 20 years of content. This simply shows that files sit on a wide spectrum in terms of which skill they test. That being said, in theory we should scale any system's top X size every 5 years or so, but in practice it's probably closer to every 50+ years. Why 50+? Because I dare you to find a handful of players with relatively optimal top 50 scores where at least 25 scores clearly focus on a single skill.

2.2. Effect of new files on short term skill rating evolution
Although you very much consider the long-term effects of new content, you don't really address the short-term effects. In a weighted system, where bias is significantly greater than in an unweighted system, every single new file that is biased enough towards one skill will create more unfairness in its specific difficulty range.

In order to have some fairness, you'd need enough of these biased files in each specific skill for anyone to fill an optimal 25 scores with any random combination of these files in their difficulty range. Mathematically, this means you need:

25 (majority of 50) * 5 (number of skills) * 12 (minimum number of distinct difficulty ranges in a 1-120 system) = 1500 skill specific files
(assuming perfect distribution between skills and difficulty ranges, which is even more unrealistic)
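
A quick sanity check of that estimate, and of how long it would take even in the absolute best case (all constants are the assumptions stated above):

Code:
majority = 25            # majority of a top 50
num_skills = 5
difficulty_buckets = 12  # 120 levels split into 10-wide (+-5) windows
print(majority * num_skills * difficulty_buckets)  # 1500 skill-specific files
# Even if EVERY release were skill-specific and perfectly distributed,
# at ~300 files/year this floor alone takes 1500 / 300 = 5 years.
print(1500 / 300)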

This number will take far longer to achieve than the 50+ years needed to make top X size a serious concern (when X >= 50).


2.3. Scaling conclusion
It honestly doesn't seem adequate to focus too much on scaling issues, as both systems would be fine for a very long time. My problem with weighted, however, is that it will forever be unfair.



NOTE: The following arguments present a new idea that is independent of either system.


3. On rewarding top scores, regardless of rating system

I am very aware of the fact that many of you can't accept seeing players with a few great scores ranked too low due to an unoptimal top X. I agree that this is subjective and that every player has the right to assign as much importance as they want to that flaw. For that reason, I will propose a slight change to FFR's design to hopefully fix it.

Do keep in mind that although I suggest this new idea to complement an unweighted skill rating system, I also believe it should be implemented even if a weighted system is chosen.

3.1. The suggestion
Some of you may or may not have noticed that, on a player's leaderboard page, their Top 5 unweighted average and their Top 100 unweighted average can already be seen. This is essentially the first step toward what I think is a great improvement.

A Top 5 metric fully embraces skillset bias and fluke scores, as these are inevitable over time for a non-negligible number of players. Not only does it suffer no scaling issues, it also takes into account all players, retired or not, since the very beginning of FFR. This metric basically reflects the current weighted top 15, but removes the unnecessary weights and simplifies the process.

A Top 50 (or Top 100) metric would do everything I've been arguing for: maximized fairness and simple handling of outliers.
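
Here is a minimal sketch of the proposed two-metric computation, assuming plain unweighted means over a player's best N scores; the score scale and the random sample are made up for illustration:

Code:
import random

def top_n_average(scores, n):
    """Plain (unweighted) mean of a player's n best scores."""
    best = sorted(scores, reverse=True)[:n]
    return sum(best) / len(best)

# Hypothetical score history: 300 plays, centered at 60 with spread 15
# (made-up units, purely for the example).
random.seed(0)
scores = [random.gauss(60, 15) for _ in range(300)]

peak_rating = top_n_average(scores, 5)       # the "Top 5" metric
general_rating = top_n_average(scores, 100)  # the "Top 100" metric
print(round(peak_rating, 2), round(general_rating, 2))

Design-wise, the only moving parts are the two values of N; there are no weights to tune, which is exactly what keeps it simple to explain.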


3.2. User friendliness
In all honesty, I despise the argument "But players might prefer a single number to represent their rating". First of all, we do have a single number that represents one's solo level, and it's called just that: Level. There is absolutely no reason to enforce a unique metric for player comparison, because I could just as easily say "But some players might prefer having different options to compete for", which is equally valid and equally subjective.

There is also the recurring issue of "how do I compute my skill rating?", something that comes up pretty often in Discord or multiplayer. Some experienced player may then take the time to explain the weights and so on, and it still takes quite a while to compute manually anyway (if you want to see the effect of a potential change). This issue should definitely be less apparent with my proposition. We can expect people to ask "Why are there 2 ratings and what do they mean?", but it clearly should not take longer to explain two simple (no weights) averages; I'd say it should even be faster to explain, tbh.

Also, as a quick note, they could definitely have cooler names such as Peak Rating & General Rating, or something like that.

3.3. Appearance on the website and game
I think both metrics should have their own leaderboards, and that everywhere the current "Skill Rating" is listed should be split into 2 cells, Top 5 and Top 100. This involves a bit more development, but the changes are pretty minor afaik, as there is nothing drastically new to implement.


3.4. On concerns of single metric systems
This new idea should definitely address the following concern:

Quote:
4.2: This is a valid argument for the question "Should seasons rating be more official than skill ratings?" Seeing Myuka ranked as 100 would be very frustrating from the player's experience. Every now and then, you'll see a high D7 player post "I just beat Myuka's skill rating lmfaoooo!!" on the forums. Is this sort of the dynamic you want skill ratings to be on FFR? Yeah... I don't think so.
For the 2 following points, this approach doesn't fully address them but does help:

Quote:
4.1: Agreed. At the end of the day, both weighted and unweighted systems reward players for outstanding scores that lie in your Top 100. It's just a matter on how much you want to assign that reward. Should your #1 feel equally as rewarding as your #100 based on the weights given, or more? I think for most people, the answer to that question is more. Let's design a system that honors that for the player's hard work.
Quote:
This is also well aligned to the point I made about 4.1. Specifically, since your 101th score receives weight 0, your 100th score should receive weight ~0. You probably don't even know it if you receive a score that barely made it in the Top 100, so the weighting should also reflect how "significant" those scores are to you as a player (chances are, you probably don't care as much about your #100 because you know you can improve your #100 if you play more, so it should receive the least weighting in terms of skill rating). This aligns to my definition of a "well-designed" system, where the proposed weights are most reflective of the player's experience while maintaining the representation of the player's skills. I recently scored high teens on do i smile? (my current #2) and I was fucking proud because I worked hard to get that. I also scored low teens on LeaF Style Super Shredder 3 (my current #97), but I didn't care because I knew I can do better if I played more. I even had to look up what some of my 50-100 songs are because I don't care about them just as much as I cared about my top scores. I didn't even remember scoring my #97 lmfao. Intuitively, those weights need to be reflective of the player's overall experience while accurately reflecting their skill set.
I personally don't think it's okay to assume that the effort a player puts into each of their scores will, on average, follow a linear curve. However, I do agree that it can be great to reward top scores. Therefore, the 2-metric system would not have the decay you suggest, but it would still give significant value to your top 5 scores.
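
For contrast, here is a small sketch of what the quoted decaying-weight suggestion looks like next to the flat two-metric approach; the linear weights and the toy score history are my assumptions, purely for illustration:

Code:
def linear_decay_rating(scores, n=100):
    """One number: weight n for your #1, decaying linearly to ~0 at your #n."""
    best = sorted(scores, reverse=True)[:n]
    weights = [n - i for i in range(len(best))]
    return sum(w * s for w, s in zip(weights, best)) / sum(weights)

def flat_top_n(scores, n):
    """The two-metric alternative: a plain mean, no decay."""
    best = sorted(scores, reverse=True)[:n]
    return sum(best) / len(best)

scores = list(range(1, 151))  # toy history: 150 plays scored 1..150
print(round(linear_decay_rating(scores), 2))           # 117.0, pulled toward the top
print(flat_top_n(scores, 5), flat_top_n(scores, 100))  # 148.0 and 100.5

The single decayed number lands somewhere between the two flat metrics, so it hides whether a rating comes from a few peaks or from consistency; keeping the two averages separate preserves that information.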



4. Conclusion
I hope we can move forward with this new idea I brought forth. I think I had definitely undervalued the importance of top scores that many people mentioned. This quote is right in implying that the answer is "no" for many players:

Quote:
Specifically, would you feel equally accomplished scoring your #15 vs. your #1?
That being said, many people seem to agree with me that there needs to be some rating that is as robust as possible regarding skillset fairness. The proposed 2-metric system does its best to cater to those players, while still allowing for peak performance competition.