Poll: Which global skill rating system is best?

**xXOpkillerXx** · 05-22-2021, 08:51 AM

Hello FFR. Give your thoughts. Votes without explanation don't help much, keep that in mind.

PS: The weighted average choice includes any weighting scheme you can think of.

**WirryWoo** · 05-22-2021, 10:11 AM

To start some conversation, I can highlight some thoughts.

Primary reasons why I prefer weighted averages:

• If Top X songs are set to define skill rating (after many discussions on what X should be) and if skill rating is consistently used as a comparative tool to measure performance between two files demanding different skills and requirements, then the X-th and (X+1)-th songs should hold very similar weights, and the (X+1)-th song is weighted at 0 by definition of Top X.

• Although a weighted mechanism rewards you for "biases" encoded in the performance of files in your Top X more than the unweighted mechanism, a regulated (key word here) solution should still capture many benefits that the unweighted solution provides: specifically and most importantly, a stronger representation of lower ranked files in your Top X is needed to determine a user's skill rating.

• If we want to reward users for activity, why shouldn't the season's ratings be used there? Whether skill rating vs. seasons rating is assigned as the more "official" metric can be reserved as another conversation. Point is, there is a solution aimed to reward players who consistently play the game.

I get it. Our current weighted system does not do it well, but this doesn't necessarily translate to "any weighted solution cannot do that". It's a tradeoff between "improving representation of lower ranked files" and "rewarding performance for songs subjectively more challenging than what your current skill rating suggests", and in my opinion, that should be respected.
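To make the first bullet above concrete, here is a minimal sketch of a weighting scheme where the last song inside the Top X carries a weight close to the 0 that the first song outside it receives. The linear decay and the names `linear_decay_weights` and `weighted_rating` are assumptions made for illustration only; they are not the weights used in the notebook or in FFR's current system.

```python
def linear_decay_weights(x):
    """Hypothetical weights x, x-1, ..., 1 for ranks 1..x, normalized to sum to 1."""
    total = x * (x + 1) / 2
    return [(x - i) / total for i in range(x)]


def weighted_rating(scores, x):
    """Weighted Top X rating. Missing scores implicitly count as 0 (an assumption)."""
    top = sorted(scores, reverse=True)[:x]
    weights = linear_decay_weights(x)
    return sum(w * s for w, s in zip(weights, top))


weights = linear_decay_weights(100)
print(f"weighted, rank #1:   {weights[0]:.4f}")
print(f"weighted, rank #100: {weights[-1]:.4f}")
print(f"simple average, rank #100: {1 / 100:.4f}")
print("either system, rank #101: 0 (outside Top 100)")
```

Under this sketch the drop from rank #100 to rank #101 is tiny, whereas a simple Top 100 average drops from a full share of 0.01 straight to 0 at the cutoff.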

I've written a first iteration of what a weighted setting would look like. Attached is a Colab notebook for reference. There's a pandas dataframe containing the new projected rankings, the username, their projected weighted skill rating, and their current rank in game.

For fun: If you want to determine your projected skill rating under a weighted mechanism I designed, scroll up to the "Determine your skill rating" section, replace my username with yours, then scroll all the way to the top, and click on the play buttons for the first seven cells.


**Zageron** · 05-22-2021, 01:22 PM

As a victim of weighted averages, I stand in solidarity with Simple average of top X equivs.

**Matthia** · 05-22-2021, 01:23 PM

I prefer any system that will boost me to #1

**Gradiant** · 05-22-2021, 01:56 PM

Simple average deals better with the issue of fluke scores on poorly rated files

**xXOpkillerXx** · 05-23-2021, 08:31 AM

More arguments from the weighted side please

**gold stinger** · 05-23-2021, 02:56 PM

huh.

**FlynnMac** · 05-23-2021, 04:23 PM

I guess I'll give my take on this

So I chose weighted mainly from my experience with other rhythm games that have a more successful weighted system. It doesn't put your top play far ahead of the rest, which limits the impact of outliers, and it also uses a large number of files in order to give the most accurate rating possible. While simple average gives the average of all your top X ratings, weighted can still let your best plays have more of an impact than plays you aren't as happy with. Wirry's system felt accurate to me because its weights were better than FFR's current weights. There are a lot of high level players who had lower ranks than they should have that got bumped up, and a lot of lower level players that got their ranks bumped down (me included). The ratings really felt like they defined who had better ranks, compared to an average rating system that could still have its outliers. Outliers will not be fixed either way, but with the right weights, they could be handled better than a simple average could do it.

**Zlyice** · 05-23-2021, 04:48 PM

There are two main reasons I'm in favor of a weighted average. One, as Flynn mentioned, is that a weighted average does a better job of giving resolution to a player's top level of play. The current system does give a pretty strong weight to a player's top score, but I see this as more of an issue with the current weights as opposed to a weighted average in general. WirryWoo's calculation earlier in the thread seems pretty reasonable to me personally.

Secondly, an unweighted skill rating is only going to be as representative as the full scope of scores going into the calculation. If we're considering, for example, an unweighted average of 100 songs, this would require a player to play enough things for these 100 scores to be reasonably representative of their level of skill, which could take a considerable amount of time. There's a lot of potential for an unweighted average to disproportionately rank more active players ahead of players who play a bit less but are ultimately a little more skilled.

**xXOpkillerXx** · 05-23-2021, 06:13 PM

Alright, I'll try and make a structured statement.

First of all, there is a concern that many people pointed out, which is that any system would have outliers. While that is true, not all outliers are the same, and that should very much be considered. In all cases we should try to minimize the number of outliers there are, but it can be very difficult to compare counts for different types of outliers. At that point, a bit of subjectivity is involved and necessary.

Let's look at what types of outliers the two kinds of system generate:

1. Weighted avg outliers:
These are essentially any and all outliers that come from the fact that our difficulty judgement is inherently flawed, mixed with inevitable imbalance in players' skillsets. The two points in this can be further explained:

1.1. Difficulty
We (FFR) use a single number to represent chart difficulty. Obviously, this has a relatively high and non-negligible degree of subjectivity. Other games like Etterna have attempted to fix this flaw by splitting the difficulty into distinct skills, kind of forcing axioms for what defines difficulty at its core. This method can generally help distinguish between files that are well balanced vs the ones that focus on 1 or 2 specific skillsets throughout. However, we simply don't do that, either because it has its own flaws, or for various reasons unrelated to this topic. So, we have one single number representing the difficulty of each file, be it balanced or not.

1.2. Players skillsets
It's no surprise that each player has their own best and worst skills. Just like the files, some players' skillsets are well balanced, while others' are more specific. Comparison of skill between two players can be argued, but my stance is that this statement should hold:

Player A's skillset: 3/3 for jacks, 2/3 for jumpstream, 1/3 for one-handed trills
Player B's skillset: 2/3 for jacks, 2/3 for jumpstream, 2/3 for one-handed trills
(The skills are just an example, but the numbers are important)

Player A and B are equal.

This has subjectivity in it, and I invite anyone to explain why they think player A should be considered the better player in this case. I personally believe that we shouldn't favor specific skill proficiency over general proficiency. Any person that agrees with this statement should make sure their preferred system respects it.

1.3. The outliers
Well, in a weighted system, where a non-random sample X of files is used to output a single number representing global skill rating, the above statement can never hold. For any score x1 in X, there will always be a score x2 that is either favored or vice versa. This means that any weighted system (with X of set size !), by definition, will generate unfairness by favoring players with specific skillsets at any given level. When X is of variable size, it becomes -Incredibly- difficult to properly formalize the model, and therefore a lot of guessing is introduced. That is what WirryWoo's model's hyperparameters are. By tweaking these, we adjust X's shape depending on a player's scores, but we can no longer tell what is favored (skillset specificity vs varied skillset) nor to what degree it is. In my opinion, this is sub-optimal.
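As a quick illustration of the Player A vs Player B statement, here is a minimal sketch that rates both skillsets with a simple average and with a rank-weighted average. The weights `[3, 2, 1]` and the function names are hypothetical, picked only to show the direction of the effect; any strictly decreasing weights would behave the same way here.

```python
def simple_average(scores):
    """Unweighted mean of all scores."""
    return sum(scores) / len(scores)


def weighted_average(scores, weights):
    """Rank-weighted mean: the best score gets the first (largest) weight."""
    ranked = sorted(scores, reverse=True)
    return sum(w * s for w, s in zip(weights, ranked)) / sum(weights)


# Skill levels out of 3 for jacks, jumpstream, one-handed trills (from the example above).
player_a = [3, 2, 1]
player_b = [2, 2, 2]
weights = [3, 2, 1]  # hypothetical decreasing weights

print("simple:  ", simple_average(player_a), simple_average(player_b))
print("weighted:", round(weighted_average(player_a, weights), 3),
      round(weighted_average(player_b, weights), 3))
```

The simple average rates both players the same, while the weighted version puts Player A ahead because the 3/3 in jacks counts for more than the 1/3 in one-handed trills costs.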

Again, this mostly revolves around the player comparison statement.

2. Simple avg outliers:
A simple average system also generates outliers. These are much more straightforward. In fact, such a system implies an important statement about skill rating:

Any player that has a rating representative of their actual skill level has optimally filled their top X scores.

This means that if X is of size 50, then a player should have 50 scores of their caliber to be properly ranked. Any player whose top 50 is not that will have their rating be lower than their true skill level.

The main downside to this is pretty simple too:
If, over time, too many players don't optimally fill their top X, then the rankings will be flawed. These are essentially the outliers of this type of system.
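A rough sketch of how an unfilled top X drags a simple average down. The function name `top_x_average`, the made-up score values, and the choice to count empty slots as 0 are all assumptions for illustration; the thread does not specify how missing scores would be handled.

```python
def top_x_average(scores, x):
    """Simple average of the best x scores; empty slots count as 0 (an assumption)."""
    top = sorted(scores, reverse=True)[:x]
    return sum(top) / x


filled = [80] * 50                      # 50 scores at the player's real level
partly_filled = [80] * 20 + [40] * 30   # 20 real scores, 30 casual ones
underplayed = [80] * 20                 # only 20 scores in total

for name, scores in [("filled", filled),
                     ("partly filled", partly_filled),
                     ("underplayed", underplayed)]:
    print(name, top_x_average(scores, 50))
```

All three hypothetical players can score 80 on files of their caliber, but only the one with a filled top 50 is rated at 80.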

My primary argument (subjective) to support this downside is that I absolutely cannot understand why we should think that it's too much to ask from players who want to be ranked. Playing 50, or even 100 songs in your difficulty range should Not be troublesome; if you want to be properly ranked but cannot be bothered to fulfill this pretty simple requirement, do you even really care to begin with? Saying that an unweighted system "favors active players" is quite the overstatement in my opinion. You don't need to be that active of a player to fulfill the requirement.

3. Comparison of outliers
So we have defined the kind of outliers that each system will inevitably have. The main concern I have with saying that "outliers are outliers" is that they're actually drastically different conceptually.

The weighted models' outliers are unfair. Some players will always be favored no matter how weights are arranged. In a variable X size setting, the outliers may be reduced, but only by an undefined amount, and they become hard to model.
The unweighted model's outliers are fair. Any player can easily stop being an outlier by getting some more scores in their difficulty range.

Now obviously the amount of outliers in both cases will differ. Naturally, at the very beginning of a transition to an unweighted system, there would be many more of them. This means that a stabilization period would follow, during which the players will get more scores at their own pace to more optimally fill their top X. There will always be players who will not do it, and retired players may definitely not come back to adjust their scores for this. However, any change to the skill rating computation will require Some adjustment from the players to get a more optimal result, so keeping retired players' rankings as is is just not a possibility (although some systems may yield closer results, the point remains).

3.1. My take on the outliers
At the end of the day, I personally favor fairness over count when it comes to these outliers. That being said, I would totally be ok with moving back to a weighted system if, after an arbitrarily long stabilization period with an unweighted system, there is still not enough effort from the players to make their top X reflect their actual skill level. That would be quite sad, but FFR does have its periods of low activity, and too little of it would indeed mean a weighted system is required. I don't think we have too little currently, but that's mostly subjective and debatable.

4. Common arguments
Here are some arguments people usually make which I'd like to address:

4.1 Rewarding outstanding scores
There is this thought that a weighted system better rewards rare great scores players get every once in a while. While that is definitely true, it doesn't mean that unweighted doesn't reward it; it just does so to a lesser degree to respect the important statement made in 1.2! A great score is still rewarded as the top 1 score in the top X. A player with the same average skill as you will be ranked lower due to that new score you got. If they're not ranked lower despite that sick score you got, that means they're better than you on average, that is all.

4.2 What about the top players who won't have an optimal top X?
Yes, if Myuka doesn't play more and a top 50 unweighted is implemented, they will have a skill rating far from representative. To be honest, I couldn't care less. There are countless players from all other rhythm games who we know could be in top spots on FFR. Granted they haven't played a single game, the fact that we Know they'd place around a certain spot is also applicable to our current top players who might never "fix" or "fill" their ranked scores. Yes, it looks funny to see Myuka be ranked 100th or whatever, but really that's a small argument to back unfairness in system outliers. Does this mean we reward activity? No, not really. That means we enforce a (relatively small) minimum of activity over a player's whole "FFR career" in order to have a representative skill rating. Rewarding activity would be done with seasons, where the same concepts are applied to definite, repeating timeframes where stats are reset each iteration.

5. Conclusion
I hope this post clarifies why I believe an unweighted top X (of size 50 or 100) is preferable in our case. I am very aware of the flaws of such a system, but I definitely think they are significantly "better" flaws than a weighted system's flaws.

**trumaestro** · 05-23-2021, 08:43 PM

Spitballing: how about a bit of both sides?

Equal weights for top X scores. Decreasing weights to next Y scores.

I'm not math-y enough to work out whether that addresses any of the issues here, but it seems to me that combining sides here could help mitigate the downsides of each.

**xXOpkillerXx** · 05-23-2021, 08:56 PM

Quote:

Originally Posted by trumaestro

Spitballing: how about a bit of both sides?

Equal weights for top X scores. Decreasing weights to next Y scores.

I'm not math-y enough to work out whether that addresses any of the issues here, but it seems to me that combining sides here could help mitigate the downsides of each.

Not a bad take tbh. I'm not entirely sure yet what to think of it, but here are my quick thoughts.

A top X (in any system) should require significantly more scores than the current system in order to minimize the chance of skillset bias. In other words, the more files are taken into account (at equal weights, and to some extent obviously), the lower the probability of having one or two skills being overly representative of one's skill level. In extensive discussion on discord, the size of X has been mostly agreed to be between 30 and 100. I personally would be fine with 50 (as I suggest with seasons ratings too), but I'm also ok with 100 given the fact that it's not limited in time.

That being said, if you agree with my statements in the previous post, there should be at the very least a top 30 files with equal weights, after which it would start decaying until either 50 or 100 probably.
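For reference, the hybrid idea (equal weights for the top 30, then decay until 50) could be sketched like this. The linear shape of the decay and the names `hybrid_weights` and `hybrid_rating` are assumptions; nobody in the thread has proposed a specific decay curve.

```python
def hybrid_weights(flat, total):
    """Weight 1.0 for the first `flat` ranks, then a linear decay (assumed shape)
    over the remaining ranks, so the first rank past `total` would reach 0."""
    tail = total - flat
    weights = [1.0] * flat
    weights += [(tail - i) / (tail + 1) for i in range(tail)]
    return weights


def hybrid_rating(scores, flat, total):
    """Weighted average of the best `total` scores; missing scores count as 0."""
    top = sorted(scores, reverse=True)[:total]
    weights = hybrid_weights(flat, total)
    return sum(w * s for w, s in zip(weights, top)) / sum(weights)


weights = hybrid_weights(30, 50)
print("ranks 1-30:", weights[0])
print("rank 31:   ", round(weights[30], 3))
print("rank 50:   ", round(weights[-1], 3))
```

With these numbers the top 30 scores are treated exactly like a simple average, and ranks 31 to 50 fade out gradually instead of being cut off at once.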

Again, I'm not sure what I think of it, but I'm probably more ok with it than not. Would be nice to hear from the people who were against unweighted.

**WirryWoo** · 05-23-2021, 10:58 PM

Quote:

Originally Posted by xXOpkillerXx

We (FFR) use a single number to represent chart difficulty. Obviously, this has a relatively high and non-negligible degree of subjectivity. Other games like Etterna have attempted to fix this flaw by splitting the difficulty into distinct skills, kind of forcing axioms for what defines difficulty at its core. This method can generally help distinguish between files that are well balanced vs the ones that focus on 1 or 2 specific skillsets throughout. However, we simply don't do that, either because it has its own flaws, or for various reasons unrelated to this topic. So, we have one single number representing the difficulty of each file, be it balanced or not.

This is fine. Despite potential areas of improvement with how difficulties are determined, we can assume for the sake of conversation that these values are accurate for each file in game.

Quote:

Originally Posted by xXOpkillerXx

It's no surprise that each player has their own best and worst skills. Just like the files, some players' skillsets are well balanced, while others' are more specific. Comparison of skill between two players can be argued, but my stance is that this statement should hold:

Player A's skillset: 3/3 for jacks, 2/3 for jumpstream, 1/3 for one-handed trills
Player B's skillset: 2/3 for jacks, 2/3 for jumpstream, 2/3 for one-handed trills
(The skills are just an example, but the numbers are important)

Player A and B are equal.

This has subjectivity in it, and I invite anyone to explain why they think player A should be considered the better player in this case. I personally believe that we shouldn't favor specific skill proficiency over general proficiency. Any person that agrees with this statement should make sure their preferred system respects it.

...

Well, in a weighted system, where a non-random sample X of files is used to output a single number representing global skill rating, the above statement can never hold. For any score x1 in X, there will always be a score x2 that is either favored or vice versa. This means that any weighted system (with X of set size !), by definition, will generate unfairness by favoring players with specific skillsets at any given level. When X is of variable size, it becomes -Incredibly- difficult to properly formalize the model, and therefore a lot of guessing is introduced. That is what WirryWoo's model's hyperparameters are. By tweaking these, we adjust X's shape depending on a player's scores, but we can no longer tell what is favored (skillset specificity vs varied skillset) nor to what degree it is. In my opinion, this is sub-optimal.

Again, this mostly revolves around the player comparison statement.

Your proposed experimental design translates to comparing Player A under weighted skill ratings and Player B under unweighted skill ratings with the assumption that Player A and B hold a very similar skillset. From my understanding, this comparison is inconclusive in determining why an unweighted setting is better designed than the weighted variant. If you truly want to design an experiment aimed at comparing the two approaches (weighted vs unweighted), ideally, you'd want to keep all other variables as constant as possible. Specifically, the experiment would have to be something closer to this: (weighted hypothesis vs. unweighted hypothesis)

Player A's skillset: 3/3 for jacks, 2/3 for jumpstream, 1/3 for one-handed trills
Player A's skillset: 2/3 for jacks, 2/3 for jumpstream, 2/3 for one-handed trills
(constants: Player A and their skill set)

The only deduction you can conclude from this experiment is that Player A is clearly more rewarded for having the ability to score well on files demanding jack-patterns in the weighted setting. And yes, this is a valid consequence that cannot be controlled in the weighted setting due to a) the nature of high scores being able to exploit a player's strengths and weaknesses, and b) by definition of weighted, no matter what weight assignment you make, there will not be any way to fully resolve giving this "reward". Our current skill rating system does this too drastically to really see the flaws of the weighted system, and quite frankly, I agree that the current weight assignments need a full revamp to design a better system. Going to put an asterisk here because I will refer back to this point (*).

Furthermore, if you want to compare multiple players under a weighted setting, you'd have to design the experiment as follows (again, keeping everything else constant):

Player A's skillset: 3/3 for jacks, 2/3 for jumpstream, 1/3 for one-handed trills
Player B's skillset: 3/3 for one-handed trills, 2/3 for jacks, 1/3 for jumpstream
Player C's skillset: 3/3 for jumpstream, 2/3 for one-handed trills, 1/3 for jacks
(constants: weighted weight assignments)

From the experiment above, you clearly see that Player A gets rewarded from jacks, B gets rewarded from one-handed trills, and C gets rewarded from jumpstreams in the weighted setting. Although each player is rewarded skill rating for different reasons, the songs made available to each player are for the most part(**), constant (i.e. each player has the same opportunity to try and perform well on each song). So whatever the player's individual performance is per song will be consistently factored into the overall skill rating computation in the weighted setting. The unweighted setting does exactly this too under a more conservative set of coefficients.

(**) The only exceptions are songs obtained via skill tokens and event tokens, but both weighted and unweighted settings deal with this issue similarly. These songs also represent a small percentage of the total options provided for each user, so their impact in both settings won't be too drastic. Therefore, these minor cases are invariant to the conversation of comparing between weighted vs. unweighted.
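A small sketch of the three-player experiment above, holding the weight assignment constant. The weights `[3, 2, 1]` and the name `weighted_rating` are hypothetical; the point is only that a rank-based weighting looks at sorted scores and not at which skill produced them.

```python
def weighted_rating(skills, weights):
    """Rank-weighted mean of a player's skill values, best value first."""
    ranked = sorted(skills.values(), reverse=True)
    return sum(w * s for w, s in zip(weights, ranked)) / sum(weights)


players = {
    "A": {"jacks": 3, "jumpstream": 2, "one-handed trills": 1},
    "B": {"one-handed trills": 3, "jacks": 2, "jumpstream": 1},
    "C": {"jumpstream": 3, "one-handed trills": 2, "jacks": 1},
}
weights = [3, 2, 1]  # hypothetical, identical for every player

ratings = {name: weighted_rating(skills, weights) for name, skills in players.items()}
for name, rating in ratings.items():
    print(name, round(rating, 3))
```

All three players end up with the same rating: each is rewarded for a different strength, but by the same amount.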

In response to "By tweaking these, we adjust X's shape depending on a player's scores", isn't it necessary to tweak in accordance with the data provided? A few examples to showcase the power of the weighted vs unweighted setting are provided:

Player A: https://www.flashflashrevolution.com...me=Chloe_edz15 (Weighted: 0 (flagged as inconclusive), Unweighted: 7.83)
Player B: https://www.flashflashrevolution.com...ername=Soure97 (Weighted: 93.25, Unweighted: 74.67)
Player C: https://www.flashflashrevolution.com...=Guilhermeziat (Weighted: 87.7044, Unweighted: 52.17)
(there are more examples)

It's clear that a mere Top 100 average is not sufficient for players who half-ass their top 100 and barely meet the requirements of being ranked (plays >100 songs). It's fine to enforce a minimum number of games required to be ranked on the leaderboards, but from the three examples above, there need to be better measures than just relying on Top 100 Average. How do you mathematically define if their Top 100 is not reliable? You call this weighted approach "justifying their laziness", whereas I call it "trying to extract the best information possible from limited data".

In terms of "not knowing what is favored (skillset specificity vs varied skillset) nor to what degree it is", this is where hyperparameters are created to set these rules. I created this alpha hyperparameter to simplify a lot of question marks that no one else has collectively been able to address within the community. In this case, do we define skill as being a jack of all trades and a master of none, or being a one trick pony due to successfully being able to maximize your skill ratings? I don't know... This alpha controls how conservative we want this system to be since the answer to the previous question is highly community dependent and cannot be easily determined by the scores given to me. It's only sub-optimal because there is no objective criterion measuring the best way to define skill; it's practically impossible. The best I can do is give the community control to define that, however the hell they want... However optimal or not this approach is, it's the best that we can at least do in an attempt at designing a robust tentative model catered to FFR until rhythm game skill determination and stepfile difficulty measurements are fully standardized across the entire rhythm games community (good luck getting that lmfao).
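To illustrate what a single "how conservative" knob can look like, here is a sketch using a hypothetical `decay` parameter with geometric weights. This is not the alpha from the notebook, whose definition is not given in this thread and may work quite differently; it only shows how one parameter can slide between an unweighted average and a top-heavy one.

```python
def decayed_rating(scores, decay):
    """Rank-weighted mean with weights decay**0, decay**1, ... (hypothetical scheme).
    decay = 1.0 reduces to a simple average; smaller values favor the top scores."""
    ranked = sorted(scores, reverse=True)
    weights = [decay ** i for i in range(len(ranked))]
    return sum(w * s for w, s in zip(weights, ranked)) / sum(weights)


specialist = [3, 2, 1]   # one trick pony
generalist = [2, 2, 2]   # jack of all trades

for decay in (1.0, 0.9, 0.5):
    print(decay,
          round(decayed_rating(specialist, decay), 3),
          round(decayed_rating(generalist, decay), 3))
```

At `decay = 1.0` the two skillsets tie, and the gap in favor of the specialist grows as `decay` shrinks, so the choice of value is exactly the community-dependent question described above.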

Quote:

Originally Posted by xXOpkillerXx

A simple average system also generates outliers. These are much more straightforward. In fact, such a system implies an important statement about skill rating:

Any player that has a rating representative of their actual skill level has optimally filled their top X scores.

This means that if X is of size 50, then a player should have 50 scores of their caliber to be properly ranked. Any player whose top 50 is not that will have their rating be lower than their true skill level.

The main downside to this is pretty simple too:
If, over time, too many players don't optimally fill their top X, then the rankings will be flawed. These are essentially the outliers of this type of system.

My primary argument (subjective) to support this downside is that I absolutely cannot understand why we should think that it's too much to ask from players who want to be ranked. Playing 50, or even 100 songs in your difficulty range should Not be troublesome; if you want to be properly ranked but cannot be bothered to fulfill this pretty simple requirement, do you even really care to begin with? Saying that an unweighted system "favors active players" is quite the overstatement in my opinion. You don't need to be that active of a player to fulfill the requirement.

It's perfectly fine to enforce a minimum requirement in both settings (it probably is better in both cases because it's ridiculous to assign a skill rating based on one song played). This is less of a problem to me than what I wrote previously, but one of the main drawbacks I can see with the unweighted system is that it is forced to have this minimum requirement from the players to make the unweighted system work. Because of this forced requirement, you are requiring everyone who hasn't played 50 to 100 songs to play (ideally seriously) in order to be considered ranked and improve the representation of the unweighted rankings. So there is a huge reliance on the players to play their part in making the unweighted system work. This isn't realistic in practice and this is why I call the unweighted system much more favorable to "active players". The ones who are committed to contributing to the high scores will be the ones who make the unweighted setting work.

The weighted system I designed is a lot more lenient in terms of requiring a minimum requirement (we are freely able to choose this requirement independent of the model's development). You can choose any reasonable minimum requirement for each player to satisfy and regardless of whether that requirement is met or not, the model attempts to find the best representation of skill using the weighted setting. Those who don't meet the minimum requirement will simply be excluded from the high scores via a defined conditional filter. (e.g. don't show username in high scores if they don't play 50 or 100 songs)
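The conditional filter could be as simple as the sketch below. The function name `leaderboard`, the usernames, the ratings, and the play counts are all made up for illustration; the rating itself is assumed to be computed elsewhere, independently of the filter.

```python
def leaderboard(ratings, play_counts, min_plays=50):
    """Rank players by rating, hiding anyone below the minimum play count."""
    eligible = {
        user: rating
        for user, rating in ratings.items()
        if play_counts.get(user, 0) >= min_plays
    }
    return sorted(eligible.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: every player has a rating, but not everyone is shown.
ratings = {"player_one": 93.2, "player_two": 87.7, "player_three": 95.0}
play_counts = {"player_one": 240, "player_two": 130, "player_three": 12}

for rank, (user, rating) in enumerate(leaderboard(ratings, play_counts), start=1):
    print(rank, user, rating)
```

Here `player_three` still has a rating but stays off the high scores until the minimum is met, and changing `min_plays` requires no change to how ratings are computed.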

Because the game is set to scale as new stepcharts are continually being added, there are also more opportunities to fluke your scores. Here is what would be required in order to address this and re-tweak the skill ratings for both unweighted and weighted cases:

Unweighted:
• Determine new n for Top n average (otherwise we can hypothetically get 100 fluke scores)
• Ask every player acquainted with the old unweighted system to update their ranks and play seriously (do this, or sacrifice accuracy of skill rating representation in high scores, your call.)

Weighted:
• Determine new n for Top n average (otherwise we can hypothetically get 100 fluke scores)
• Change head and alpha in my code, let the algorithm do its magic independent of player's involvement.

Which one is better suited for scalability reasons?

In terms of design, seasons ratings should be catered to community members who are actively involved in playing the game; skill ratings should be catered to all historic performances done over the course of FFR's age. Many people from Etterna hop on FFR occasionally to play a few songs just to get into FFR leaderboards, so shouldn't their skill be acknowledged under the definition of "skill rating"? If you disagree with this, maybe you should consider hopping on the "make seasons ranking more official than skill rating in FFR" train (this I'm more indifferent of). The name suggests "skill rating", so shouldn't the metric we design only focus on the player's skills given the data (regardless how limited it is) we have?

Quote:

Originally Posted by xXOpkillerXx

3. Comparison of outliers
So we have defined the kind of outliers that each system will inevitably have. The main concern I have with saying that "outliers are outliers" is that they're actually drastically different conceptually.

The weighted models' outliers are unfair. Some players will always be favored no matter how weights are arranged. In a variable X size setting, the outliers may be reduced, but only by an undefined amount, and they become hard to model.
The unweighted model's outliers are fair. Any player can easily stop being an outlier by getting some more scores in their difficulty range.

Now obviously the amount of outliers in both cases will differ. Naturally, at the very beginning of a transition to an unweighted system, there would be many more of them. This means that a stabilization period would follow, during which the players will get more scores at their own pace to more optimally fill their top X. There will always be players who will not do it, and retired players may definitely not come back to adjust their scores for this. However, any change to the skill rating computation will require Some adjustment from the players to get a more optimal result, so keeping retired players' rankings as is is just not a possibility (although some systems may yield closer results, the point remains).

Based on what I discussed above:
"The weighted models' outliers are unfair." False. It's only unfair if you purposely assign different weighting mechanisms to two different players to further exploit their strengths and hide their weaknesses (like the experiment you proposed in your initial post)
"The unweighted model's outliers are fair." Player dependent. This is within control of the player to make the weighted mechanism work and therefore, to classify themselves as a "fair" or "unfair" outlier. I personally have a strong preference for a robust model independent of the player's involvement and the quality of the data given to me.

Quote:

Originally Posted by xXOpkillerXx

3.1. My take on the outliers
At the end of the day, I personally favor fairness over count when it comes to these outliers. That being said, I would totally be ok with moving back to a weighted system if, after an arbitrarily long stabilization period with an unweighted system, there is still not enough effort from the players to make their top X reflect their actual skill level. That would be quite sad, but FFR does have its periods of low activity, and too little of it would indeed mean a weighted system is required. I don't think we have too little currently, but that's mostly subjective and debatable.

This is your reliance on the players speaking here. You need the players to do their part to make the unweighted setting work. No need for this in the weighted setting.

Quote:

Originally Posted by xXOpkillerXx

4. Common arguments
Here are some arguments people usually make which I'd like to address:

4.1 Rewarding outstanding scores
There is this thought that a weighted system better rewards rare great scores players get every once in a while. While that is definitely true, it doesn't mean that unweighted doesn't reward it; it just does so to a lesser degree to respect the important statement made in 1.2! A great score is still rewarded as the top 1 score in the top X. A player with the same average skill as you will be ranked lower due to that new score you got. If they're not ranked lower despite that sick score you got, that means they're better than you on average, that is all.

4.2 What about the top players who won't have an optimal top X?
Yes, if Myuka doesn't play more and a top 50 unweighted is implemented, they will have a skill rating far from representative. To be honest, I couldn't care less. There are countless players from all other rhythm games who we know could be in top spots on FFR. Granted they haven't played a single game, the fact that we Know they'd place around a certain spot is also applicable to our current top players who might never "fix" or "fill" their ranked scores. Yes, it looks funny to see Myuka be ranked 100th or whatever, but really that's a small argument to back unfairness in system outliers. Does this mean we reward activity? No, not really. That means we enforce a (relatively small) minimum of activity over a player's whole "FFR career" in order to have a representative skill rating. Rewarding activity would be done with seasons, where the same concepts are applied to definite, repeating timeframes where stats are reset each iteration.

5. Conclusion
I hope this post clarifies why I believe an unweighted top X (of size 50 or 100) is preferable in our case. I am very aware of the flaws of such a system, but I definitely think they are significantly "better" flaws than a weighted system's flaws.

4.1: Agreed. At the end of the day, both weighted and unweighted systems reward players for outstanding scores that lie in your Top 100. It's just a matter on how much you want to assign that reward. Should your #1 feel equally as rewarding as your #100 based on the weights given, or more? I think for most people, the answer to that question is more. Let's design a system that honors that for the player's hard work.

4.2: This is a valid argument for the question "Should seasons rating be more official than skill ratings?" Seeing Myuka ranked as 100 would be very frustrating from the player's experience. Every now and then, you'll see a high D7 player post "I just beat Myuka's skill rating lmfaoooo!!" on the forums. Is this sort of the dynamic you want skill ratings to be on FFR? Yeah... I don't think so.

A few more thoughts:

Quote:

Originally Posted by WirryWoo

• If Top X songs are set to define skill rating (after many discussions on what X should be) and if skill rating is consistently used as a comparative tool to measure performance between two files demanding different skills and requirements, then the X-th and (X+1)th songs should hold very similar weights and (X+1)th song is weighted at 0 by definition of Top X.

This is also well aligned with the point I made about 4.1. Specifically, since your 101st score receives weight 0, your 100th score should receive weight ~0. You probably don't even know it if you receive a score that barely made it in the Top 100, so the weighting should also reflect how "significant" those scores are to you as a player (chances are, you probably don't care as much about your #100 because you know you can improve your #100 if you play more, so it should receive the least weighting in terms of skill rating). This aligns to my definition of a "well-designed" system, where the proposed weights are most reflective of the player's experience while maintaining the representation of the player's skills. I recently scored high teens on do i smile? (my current #2) and I was fucking proud because I worked hard to get that. I also scored low teens on LeaF Style Super Shredder 3 (my current #97), but I didn't care because I knew I could do better if I played more. I even had to look up what some of my 50-100 songs are because I don't care about them as much as I cared about my top scores. I didn't even remember scoring my #97 lmfao. Intuitively, those weights need to be reflective of the player's overall experience while accurately reflecting their skill set.

Quote:

Originally Posted by WirryWoo

• Although weighted mechanism rewards you for "biases" encoded in the performance of files in your Top N more than the unweighted mechanism, a regulated (key word here) solution should still capture many benefits that the unweighted solution provides: specifically and most importantly, a stronger representation of lower ranked files in your Top X is needed to determine a user's skill rating.

Back to (*), this is where the word regulated plays the biggest role in my statement here. The current system highly favors the Top 10ish songs. That is not regulated because you get rewarded wayyy tooo much for scoring your #1 and pretty much nothing from scoring your #15. So what does "regulated" here mean? All it means is that, we need to control the weights appropriately so each song has a representable piece in contributing to the skill ratings metric while maintaining how reflective it is to the player's experience. Controlling the weights includes dealing with any outliers like people not completing their Top 100 and people half assing Top 100. This is why I proposed a linear progression of the weightings because although my satisfaction between my #20 and #15 will be different than someone else's satisfaction between their #20 and #15, the linear progression distributes the weights as consistently as possible without introducing how #15 is significantly much more important than #20 (like our current system which indicates that #1 is >140 times more important than your #15 lmfao)
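To make the "linear progression" concrete, here is a minimal Python sketch. It assumes weights proportional to (X - rank + 1) over a Top 100, which is only one possible linear scheme and not necessarily what the Colab notebook does; `linear_weights` and `weighted_rating` are made-up names for illustration.

```python
def linear_weights(top_x: int = 100) -> list[float]:
    """Weights proportional to (top_x - rank + 1), normalized to sum to 1.

    Assumption: this is one illustrative linear scheme, not the notebook's.
    Rank top_x gets the smallest weight, and rank top_x + 1 would get 0.
    """
    raw = [top_x - rank + 1 for rank in range(1, top_x + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_rating(scores: list[float], top_x: int = 100) -> float:
    """Linearly weighted rating over the best top_x scores.

    Unfilled slots simply contribute nothing (zip stops at the shorter list).
    """
    best = sorted(scores, reverse=True)[:top_x]
    return sum(w * s for w, s in zip(linear_weights(top_x), best))


weights = linear_weights(100)
ratio_1_to_15 = weights[0] / weights[14]    # 100 / 86
ratio_15_to_20 = weights[14] / weights[19]  # 86 / 81
print(round(ratio_1_to_15, 3), round(ratio_15_to_20, 3))
```

Under this particular scheme, #1 counts only slightly more than #15, and #15 only slightly more than #20, instead of the >140x gap between #1 and #15 mentioned above.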

Last but not least, I see a partial analogy between the weighted vs. unweighted conversation and chess's Elo system; it is similar but not exactly the same. When measuring someone's chess rating, are you judged by your win percentage, where each win counts as a +1 and each loss counts as a 0, or are you judged by how consistently you beat players better than you? These aren't exactly the reasons why I'm advocating for weighted, but there is a reason why the first few games of chess are weighted the most and contribute most to your Elo rating. If you lose games you are expected to win, you need to win a number of similar level games to prove that you deserve a higher rating. Your skill is not judged by win percentage. Are there any settings you can think of where your skill is determined by your win percentage? I can't think of many myself.
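For anyone unfamiliar with Elo, the standard update rule can be sketched in a few lines of Python. The K-factor of 20 and the example ratings (1500 vs. 1300) are arbitrary values picked for illustration.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expected score (between 0 and 1) against an opponent."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400))


def updated_rating(rating: float, opponent: float, result: float, k: float = 20) -> float:
    """New rating after one game; result is 1.0 for a win, 0.0 for a loss."""
    return rating + k * (result - expected_score(rating, opponent))


# A 1500 player facing a 1300 player (illustrative numbers only).
loss_penalty = 1500 - updated_rating(1500, 1300, 0.0)
win_gain = updated_rating(1500, 1300, 1.0) - 1500
print(round(loss_penalty, 2), round(win_gain, 2))
```

In this example, losing a game you were expected to win costs more than three times what a win against the same opponent gains, which is the "you need to win a number of similar level games" effect described above.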

Quote:

Originally Posted by trumaestro

Spitballing: how about a bit of both sides?

Equal weights for top X scores. Decreasing weights to next Y scores.

I'm not math-y enough to work out whether that addresses any of the issues here, but it seems to me that combining sides here could help mitigate the downsides of each.

Cool suggestion, but I personally don't think this would be reflective of the player's overall experience and an accurate measure of skill. Specifically, would you feel equally accomplished scoring your #15 vs. your #1? Does it take the same amount of skill to successfully perform between your #1 vs. #15? These are the questions that would need to be thought of when thinking about this hybrid setting. In my opinion, I don't think this will work for previous reasons discussed.

**xXOpkillerXx** · 05-24-2021, 07:55 AM

Let me go through this point by point.

1. The definition of skill with examples

Quote:

Your proposed experimental design translates to comparing Player A under weighted skill ratings and Player B under unweighted skill ratings with the assumption that Player A and B holds a very similar skillset.

Sadly, this is anything but true. I thought I made the example simple enough to be understood by everyone but I guess I failed to do that. Firstly, the experiment was 100% independent of rating system. It solely compared a "skill-specific" player to a "generalist" player, and made the claim that both should be rated equally. So not only did you poorly interpret this, you also gave a bad example which almost exactly demonstrates what is wrong with weighted:

Quote:

Furthermore, if you want to compare multiple players under a weighted setting, you'd have to design the following experiment as follows (again, keeping everything else constant):

Player A's skillset: 3/3 for jacks, 2/3 for jumpstream, 1/3 for one-handed trills
Player B's skillset: 3/3 for one-handed trills, 2/3 for jacks, 1/3 for jumpstream
Player C's skillset: 3/3 for jumpstream, 2/3 for one-handed trills, 1/3 for jacks
(constants: weighted weight assignments)

In this example you give, there are only skill-specific players. Add to that the generalist:

Quote:

Player D's skillset: 2/3 for jumpstream, 2/3 for one-handed trills, 2/3 for jacks

Suddenly, the skill rating ordering of those 4 players becomes the following (in a weighted system):

A ~= B ~= C > D

Whereas in an unweighted setting, this is what it'd look like:

A ~= B ~= C ~= D

This is mathematically unavoidable, and is the very definition of what I call unfair.
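A toy Python check of that ordering, treating each player's three skill values as if they were their three ranked scores. That is a big simplification, and the (3, 2, 1) weights are an arbitrary decreasing scheme chosen only for illustration.

```python
from fractions import Fraction as F

# Skill values taken from the example above.
players = {
    "A": {"jacks": F(3, 3), "jumpstream": F(2, 3), "one-handed trills": F(1, 3)},
    "B": {"one-handed trills": F(3, 3), "jacks": F(2, 3), "jumpstream": F(1, 3)},
    "C": {"jumpstream": F(3, 3), "one-handed trills": F(2, 3), "jacks": F(1, 3)},
    "D": {"jumpstream": F(2, 3), "one-handed trills": F(2, 3), "jacks": F(2, 3)},
}


def unweighted(skills):
    """Plain average of all values."""
    values = list(skills.values())
    return sum(values) / len(values)


def weighted(skills, weights=(3, 2, 1)):
    """Best value gets the largest weight (illustrative weights, an assumption)."""
    ordered = sorted(skills.values(), reverse=True)
    return sum(w * s for w, s in zip(weights, ordered)) / sum(weights)


for name, skills in players.items():
    print(name, unweighted(skills), weighted(skills))
```

With any strictly decreasing weights, A, B and C tie with each other and land above D, while the plain average puts all four at 2/3.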

The following paragraph then asks how skill should be defined:

Quote:

In terms of "not knowing what is favored (skillset specificity vs varied skillset) nor to what degree it is", this is where hyperparameters are created to set these rules. I created this alpha hyperparameter to simplify a lot of question marks that no one else have collectively been able to address within the community. In this case, do we define skill as being a jack of all trades and a master of none, or being a one trick pony due to successfully being able to maximize your skill ratings? I don't know... This alpha controls how conservative we want this system to be since the answer to the previous question is highly community dependent and cannot be easily determined by the scores given to me. It's only sub-optimal because there is no objective criterion measuring the best way to define skill; it's practically impossible. The best I can do is give the community control to define that, however the hell they want... Despite however optimal or not this approach is, it's the best that we can at least do in an attempt to designing a robust tentative model catered to FFR until rhythm game skill determination and stepfile difficulty measurements are fully standardized across the entire rhythm games community (good luck getting that lmfao).

Well, it should be defined (imo) as the equality I mentioned above (for optimally filled top X).

2. Scaling of systems over time

2.1. Top X size
Say we put X at 50. A player would have their skill level biased toward a single skill (in any system) once they have achieved at least 25 scores (half of the 50) on files focused on that skill.

Now say a player's level should be based mostly on files that are ±5 levels around their skill level (assuming enough files provided in that range). We also know that files difficulty ranges between 1 and ~120 (for simplicity).

This means that every new file has a (5+5) / 120 = 8.333% chance of being in your range.

FFR releases files at a rate of ~4 files per week + an additional 80 files ish for events yearly. Per year, that's around 4 * 52 + 80 = 288 files, so let's make this 300 to account for events I might be forgetting (a higher number favors your argument). This means that every year, there are about 300 * 8.333% = 25 files in your specific range (assuming you stay at the same level).

To reach the necessary 25 skill-specific files, you would need a minimum of 1-ish year of content if all files were specifically biased toward your strong skill. If we consider the fact that there are various skills, let's say 5 (a bit less than Etterna), that means every 5 years there is a high probability that some players will have enough files to still generate a biased skill rating despite it being a top 50.
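The back-of-the-envelope numbers in 2.1 can be re-run with a short Python sketch, shown for both the raw yearly count (288) and the rounded-up one (300). The function names are made up for illustration, and every input is one of the rough estimates stated above.

```python
def files_in_range_per_year(files_per_year: float, window: int = 5, level_span: int = 120) -> float:
    """Expected number of new files per year within +/- window levels of a player."""
    chance_in_range = (2 * window) / level_span  # (5 + 5) / 120
    return files_per_year * chance_in_range


def years_to_fill(needed: int, in_range_per_year: float, n_skills: int = 1) -> float:
    """Years of content needed for `needed` in-range files of one given skill,
    assuming files are spread evenly over n_skills skills."""
    return needed * n_skills / in_range_per_year


raw_count = 4 * 52 + 80  # weekly releases plus yearly event files
for files_per_year in (raw_count, 300):
    per_year = files_in_range_per_year(files_per_year)
    one_skill_only = years_to_fill(25, per_year)
    five_skills = years_to_fill(25, per_year, n_skills=5)
    print(files_per_year, round(per_year, 1), round(one_skill_only, 2), round(five_skills, 2))
```

Either count gives roughly one year in the "every file matches your skill" case and roughly five years once files are split across 5 skills.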

However, we all know that there aren't that many extremely biased files after nearly 20 years of content. This simply shows that files are on a wide spectrum regarding what skill they test. That being said, we can see that in theory, we Should scale any system's top X size every 5 years or so, but that in practice, it's probably something closer to 50+ years. Why 50+ ? Because I dare you to find a handful of players with relatively optimal top 50 scores where at least 25 scores are clearly focused around a single skill.

2.2. Effect of new files on short term skill rating evolution
Although you seem to very much consider long term effects of new content, you don't really address the short term effects. In a weighted system where bias is significantly greater than in an unweighted system, every single new file that is biased enough towards one skill will create more unfairness in its specific difficulty range.

In order to have some fairness, you'd need enough of these biased files in each specific skill for anyone to fill optimal 25 scores with any random combination of these files in their difficulty range. Mathematically, this means you need:

25 (majority of 50) * 5 (number of skills) * 12 (minimum number of distinct difficulty ranges in a 1-120 system) = 1500 skill specific files
(assuming perfect distribution between skills and difficulty ranges, which is even more unrealistic)
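The same count as a small Python helper, so the inputs can be changed. `skill_specific_files_needed` is a made-up name, and "half of top X" follows the 25-of-50 threshold used above.

```python
import math


def skill_specific_files_needed(top_x: int = 50, n_skills: int = 5,
                                level_span: int = 120, range_width: int = 10) -> int:
    """Files needed so every skill has enough biased files in every difficulty range."""
    threshold = top_x // 2                            # 25 for a top 50, as in the post
    n_ranges = math.ceil(level_span / range_width)    # 12 ranges of +/-5 levels
    return threshold * n_skills * n_ranges


print(skill_specific_files_needed())            # top 50
print(skill_specific_files_needed(top_x=100))   # top 100
```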

This number will take Far longer to achieve than the 50+ years needed to make top X size a serious concern (when X >= 50).

2.3. Scaling conclusion
It honestly doesn't seem adequate to focus too much on scaling issues, as both systems would be fine for a very long time. My problem with weighted, however, is that it will forever be unfair.

NOTE: The following arguments present a new idea that is independent of either system.

3. On rewarding top scores, irrespective of rating system
I am very aware of the fact that many of you can't accept seeing players with a few great scores being ranked too low due to a suboptimal top X. I agree that this is subjective and that every player has the right to assign as much importance as they want to that flaw. For that reason, I will propose a slight change in FFR's design to hopefully fix that.

Do keep in mind that although I suggest this new idea to complement an unweighted skill rating system, I also believe it should be implemented even if a weighted system is chosen.

3.1. The suggestion
Some of you may or may not have noticed that, in a player's leaderboard page, their Top 5 unweighted average and their Top 100 unweighted averages can already be seen. This is essentially the first step of what I think is a great step forward.

A Top 5 metric fully embraces skillset bias and fluke scores, as these are inevitable over time for a non-negligible number of players. Not only does it suffer no scaling issues, it also takes into account All players, retired or not, since the very beginning of FFR. This metric basically reflects the current weighted top 15, but removes the unnecessary weights and simplifies the process.

A Top 50 (or Top 100) metric would do everything I've been arguing for, which is maximized fairness and simplicity of outliers.
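A minimal Python sketch of the two metrics. It assumes unfilled slots count as 0 (which is how an unfilled top X drags the average down in this discussion), and the two score lists are invented purely for illustration.

```python
def top_n_average(scores: list[float], n: int) -> float:
    """Unweighted average of the best n scores; unfilled slots count as 0 (assumption)."""
    best = sorted(scores, reverse=True)[:n]
    return sum(best) / n


# Hypothetical players: one with a few great scores and little else,
# one with a completely filled, very even top 100.
few_great_scores = [110, 108, 105, 104, 100] + [60] * 20
filled_top_100 = [80] * 100

for name, scores in (("few great scores", few_great_scores), ("filled top 100", filled_top_100)):
    print(name, top_n_average(scores, 5), top_n_average(scores, 100))
```

The first player leads on the Top 5 metric and trails badly on the Top 100 metric, and the second player is the reverse, which is the split the two leaderboards are meant to show.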

3.2. User friendliness
In all honesty, I despise the argument "But players might prefer a single number to represent their rating". First of all, we do have a unique number that represents one's solo level and it's called just that: Level. There is absolutely no reason to enforce a unique metric for player comparison, because I could just as easily say "But some players might prefer having different options to compete for", which is equally valid and subjective.

There is also the current issue of "how do I compute my skill rating ?", something that can be seen pretty often in either discord or multiplayer. Then, some experienced player may decide to take their time to explain weights and stuff, and eventually it takes quite some time to compute manually anyway (if you want to see the effect of a potential change). This issue should definitely be less apparent with the proposition I make. We can expect people to ask "Why are there 2 ratings and what do they mean ?", but clearly it should not take longer to explain two simple (no weights) averages; I'd say it should even be shorter to explain tbh.

Also, as a quick note, they could definitely have cooler names such as Peak Rating & General Rating, or something like that.

3.3. Appearance on the website and game
I think both metrics should have their respective leaderboard, and that everywhere that the current "Skill Rating" is listed should be split in 2 cells of Top 5 and Top 100. This involves a bit more development, but pretty minor changes afaik, as there is nothing drastically new to implement.

3.4. On concerns of single metric systems
This new idea should definitely address the following concern:

Quote:

4.2: This is a valid argument for the question "Should seasons rating be more official than skill ratings?" Seeing Myuka ranked as 100 would be very frustrating from the player's experience. Every now and then, you'll see a high D7 player post "I just beat Myuka's skill rating lmfaoooo!!" on the forums. Is this sort of the dynamic you want skill ratings to be on FFR? Yeah... I don't think so.

For the 2 following points, this approach doesn't fully address them but does help:

Quote:

4.1: Agreed. At the end of the day, both weighted and unweighted systems reward players for outstanding scores that lie in your Top 100. It's just a matter on how much you want to assign that reward. Should your #1 feel equally as rewarding as your #100 based on the weights given, or more? I think for most people, the answer to that question is more. Let's design a system that honors that for the player's hard work.

Quote:

This is also well aligned with the point I made about 4.1. Specifically, since your 101st score receives weight 0, your 100th score should receive weight ~0. You probably don't even know it if you receive a score that barely made it in the Top 100, so the weighting should also reflect how "significant" those scores are to you as a player (chances are, you probably don't care as much about your #100 because you know you can improve your #100 if you play more, so it should receive the least weighting in terms of skill rating). This aligns to my definition of a "well-designed" system, where the proposed weights are most reflective of the player's experience while maintaining the representation of the player's skills. I recently scored high teens on do i smile? (my current #2) and I was fucking proud because I worked hard to get that. I also scored low teens on LeaF Style Super Shredder 3 (my current #97), but I didn't care because I knew I could do better if I played more. I even had to look up what some of my 50-100 songs are because I don't care about them as much as I cared about my top scores. I didn't even remember scoring my #97 lmfao. Intuitively, those weights need to be reflective of the player's overall experience while accurately reflecting their skill set.

I personally don't think it's ok to assume that a player's scores will on average match a linear curve when it comes to effort put into each of the scores. However I do agree that it can be great to reward the top scores. Therefore, the 2-metric system would not have that decay you suggest, but it would still give significant value to your top 5 scores.

4. Conclusion
I hope we can move forward with this new idea I brought forth. I think I had definitely undervalued the importance of top scores that many people mentioned. This quote is right in implying that the answer is "no" for many players.

Quote:

Specifically, would you feel equally accomplished scoring your #15 vs. your #1?

That being said, many people seem to agree with me that there needs to be some rating that is as robust as can be regarding skillset fairness. The proposed 2-metric system does its best to cater to these players, while still allowing for peak performance competition.

**katanaeyegaming** · 05-24-2021, 08:06 AM

My opinion on this is simple.

Neither they both suck

**xXOpkillerXx** · 05-24-2021, 08:11 AM

I'd like to quickly entertain the idea that in a 2-metric system, something like a "linearly weighted top 10" (trying to keep it a low number to still represent peak performance) could be interesting too. The general rating would remain a simple average of top 100.
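As a rough Python sketch of what a "linearly weighted top 10" could look like: the weights 10, 9, ..., 1 are an assumption (any linear ramp would fit the description), and the ten scores are invented for illustration.

```python
def linear_peak_rating(scores: list[float], n: int = 10) -> float:
    """Weighted average of the best n scores with weights n, n-1, ..., 1 (assumed ramp)."""
    best = sorted(scores, reverse=True)[:n]
    weights = range(n, 0, -1)
    return sum(w * s for w, s in zip(weights, best)) / sum(weights)


def plain_average(scores: list[float], n: int = 10) -> float:
    """Unweighted average of the best n scores, for comparison."""
    best = sorted(scores, reverse=True)[:n]
    return sum(best) / n


example_scores = [100, 98, 96, 94, 92, 90, 88, 86, 84, 82]  # hypothetical top 10
print(linear_peak_rating(example_scores), plain_average(example_scores))
```

For this made-up list the linear ramp pulls the peak rating toward the best scores (94 vs. a plain average of 91) without letting #1 dominate.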

**Gradiant** · 05-24-2021, 12:27 PM

Quote:

Originally Posted by WirryWoo

It's perfectly fine to enforce a minimum requirement in both settings (it probably is better in both cases because it's ridiculous to assign a skill rating to having one song played). This is less of a problem to me than what I wrote previously, but one of the main drawbacks I can see with the unweighted system is that it is forced to have this minimum requirement from the players to make the unweighted system work. Because of this forced requirement, you are requiring everyone who hasn't played 50 to 100 songs, to play (ideally seriously) in order to be considered ranked and improve the representation of the unweighted rankings. So there is a huge reliance on the players to play their part in making the unweighted system work. This isn't realistic in practice and this is why I call the unweighted system much more favorable to "active players". The ones who are committed to contributing to the high scores will be the ones who make the unweighted setting work.

The weighted system I designed is a lot more lenient in terms of requiring a minimum requirement (we are freely able to choose this requirement independent of the model's development). You can choose any reasonable minimum requirement for each player to satisfy and regardless if that requirement is met or not, the model attempts to find the best representation of skill using the weighted setting. Those who don't meet the minimum requirement will simply be excluded from the high scores via a defined conditional filter. (e.g. don't show username in high scores if they don't play 50 or 100 songs)

Confused with the difference here. With both of these systems, players aren't going to be listed if they haven't hit whatever minimum is in place.

Also in general, don't really like the 'people are going to have to play' argument against average system. What is the whole point of this game anyway but to play files to get scores they think are good? I mentioned this in the discord when op brought it up, but the token requirement for coactive is to AAA 50 different files in a day. So playing 50 different files just to best of ability not even AAA'ing shouldn't take longer than a day either. Don't think this is too much to ask for at all for the benefit of being on a leaderboard. And if they don't care enough about playing the game, then they're not listed in the leaderboard like the bolded part of that 2nd part in the quote.

Also thinking of games like MOBAs or stuff like StarCraft where you go through placement matches before being ranked; in those games a match could go anywhere from like 30min to an hour, compared to an ffr file being like 2 minutes or so. The time required for the placement matches would be similar to hitting whatever minimum number of files played to be on ffr's leaderboard.

**WirryWoo** · 05-24-2021, 01:32 PM

Quote:

Originally Posted by xXOpkillerXx

Sadly, this is anything but true. I thought I made the example simple enough to be understood by everyone but I guess I failed to do that. Firstly, the experiment was 100% independent of rating system. It solely compared a "skill-specific" player to a "generalist" player, and made the claim that both should be rated equally. So not only did you poorly interpret this, you also gave a bad example which almost exactly demonstrates what is wrong with weighted:

The experiments provided were confusing because 1/3 + 2/3 + 3/3 != 1 and 2/3 + 2/3 + 2/3 != 1 for the experiments. Are these the weightings assigned from the system or some amount of skill that each player hold for each pattern? I assumed they were weightings because we're having the conversation about if a weighted system is better or not than the unweighted setting. If that's true, what I initially said applies.

If it's the latter, this goes to my other point about comparing skills. Specifically:

Quote:

Originally Posted by WirryWoo

In terms of "not knowing what is favored (skillset specificity vs varied skillset) nor to what degree it is", this is where hyperparameters are created to set these rules. I created this alpha hyperparameter to simplify a lot of question marks that no one else have collectively been able to address within the community. In this case, do we define skill as being a jack of all trades and a master of none, or being a one trick pony due to successfully being able to maximize your skill ratings? I don't know... This alpha controls how conservative we want this system to be since the answer to the previous question is highly community dependent and cannot be easily determined by the scores given to me. It's only sub-optimal because there is no objective criterion measuring the best way to define skill; it's practically impossible. The best I can do is give the community control to define that, however the hell they want... Despite however optimal or not this approach is, it's the best that we can at least do in an attempt to designing a robust tentative model catered to FFR until rhythm game skill determination and stepfile difficulty measurements are fully standardized across the entire rhythm games community (good luck getting that lmfao).

So what's the answer here? We cannot compare skills unless we have a solid definition of the word "skill". Specifically, who is more skilled than the other. You responded:

Quote:

Originally Posted by xXOpkillerXx

Suddenly, the skill rating ordering of those 4 players becomes the following (in a weighted system):

A ~= B ~= C > D

Whereas in an unweighted setting, this is what it'd look like:

A ~= B ~= C ~= D

This is mathematically unavoidable, and is the very definition of what I call unfair.

I agree with what is stated here, but this is only one dimension of the current problem. We can easily hyperfocus on this definition and make sure that this equality condition is being met. However, in the unweighted setting, there are a number of examples you need to "sacrifice" in order to fully make A ~= B ~= C ~= D work, including:

Quote:

Originally Posted by WirryWoo

Player A: https://www.flashflashrevolution.com...me=Chloe_edz15 (Weighted: 0 (flagged as inconclusive), Unweighted: 7.83)
Player B: https://www.flashflashrevolution.com...ername=Soure97 (Weighted: 93.25, Unweighted: 74.67)
Player C: https://www.flashflashrevolution.com...=Guilhermeziat (Weighted: 87.7044, Unweighted: 52.17)
(there are more examples)

So when you make these sacrifices, do you really get A ~= B ~= C ~= D? This goes back to my other point about the unweighted system's flaw:

Quote:

Originally Posted by WirryWoo

This is less of a problem to me than what I wrote previously, but one of the main drawbacks I can see with the unweighted system is that it is forced to have this minimum requirement from the players to make the unweighted system work. Because of this forced requirement, you are requiring everyone who hasn't played 50 to 100 songs, to play (ideally seriously) in order to be considered ranked and improve the representation of the unweighted rankings. So there is a huge reliance on the players to play their part in making the unweighted system work. This isn't realistic in practice and this is why I call the unweighted system much more favorable to "active players". The ones who are committed to contributing to the high scores will be the ones who make the unweighted setting work.

And realistically, encouraging many players to fix their ranks will not happen unless there is a strong incentive for them to do so.

So the following things will most likely have to happen if we move forward with unweighted:
• We change the definition of what "skill rating" means because now, we'll have a poorer definition of skill (examples provided above). People can easily say:

Quote:

Originally Posted by WirryWoo

Seeing Myuka ranked as 100 would be very frustrating from the player's experience. Every now and then, you'll see a high D7 player post "I just beat Myuka's skill rating lmfaoooo!!" on the forums. Is this sort of the dynamic you want skill ratings to be on FFR? Yeah... I don't think so.

• We need to design a mechanism to toggle between when Top 5 vs. Top 100 averages are more accurate, all based on sentiments like "this person is bsing, let's rely on Top 5". Most cases, they are obvious, but in terms of model design, they are subjective.
• We need to create more and higher incentives for players to play their high scores as optimally as possible.

But my counter to each point (in respective order)
• Having a skill rating not truly reflective of skill is paradoxical. It will only be as reflective as the most active players. We can do better...
• Why rely on two models when we can rely on one? Introducing any external assessment to choose if unweighted Top 5 or unweighted Top 100 makes more sense, introduces someone else's bias into the system. Do we want that to define skill ratings? I don't. I'd rather rely on my scores to do that determination for me.
• Why bother relying on the inactive players when we have the scores to help us define the skill ratings? We don't need them, and it's likely that they don't give any shits about us too lol.

It's clear that there are many issues with unweighted. The biggest reason for this is that the unweighted mechanism is located on one extreme end of all possible solutions (let's call this the 'black' solution). The 'white' solution would be the case where your skill rating is extremely weighted and is defined by the performance of your #1 score. Our current system is set using a "very very light grey" solution, and clearly, it's easy to see the issues of that weighted system; the current system understandably emphasizes the many flaws of a "white" solution. Specifically, they're all of your arguments against weighted, and they're for the most part, valid. There are also many flaws of a "black" solution (e.g. reasons I posted).

The best solution is one that calls for a tradeoff between a "white" and "black" solution where the pros of each extremes are emphasized and the cons of each extremes are minimized. This is why it's obvious that our solution demands a "darker grey" solution since we clearly see the issues of the "nearly white" solution (e.g. our current system).

I've been advocating for this "grey" solution numerous times:

Quote:

Originally Posted by WirryWoo

I get it. Our current weighted system does not do it well, but this doesn't necessarily translate to "any weighted solution cannot do that". It's a tradeoff between "improving representation of lower ranked files" and "rewarding performance for songs subjectively more challenging than what your current skill rating suggests", and in my opinion, that should be respected.

Quote:

Originally Posted by WirryWoo

Back to (*), this is where the word regulated plays the biggest role in my statement here. The current system highly favors the Top 10ish songs. That is not regulated because you get rewarded wayyy tooo much for scoring your #1 and pretty much nothing from scoring your #15. So what does "regulated" here mean? All it means is that, we need to control the weights appropriately so each song has a representable piece in contributing to the skill ratings metric while maintaining how reflective it is to the player's experience. Controlling the weights includes dealing with any outliers like people not completing their Top 100 and people half assing Top 100. This is why I proposed a linear progression of the weightings because although my satisfaction between my #20 and #15 will be different than someone else's satisfaction between their #20 and #15, the linear progression distributes the weights as consistently as possible without introducing how #15 is significantly much more important than #20 (like our current system which indicates that #1 is >140 times more important than your #15 lmfao)

For some reason, the way you seem to view this is a binary option between black and white. This assumes that every two weighted options will equally perform when compared against each other. Basically any weighted option I propose will yield no change compared to our current weighted system. Do you agree with this? I sure hope not. The Colab notebook provided showed a ton of shifting and recalculations of skill ratings, so some things are changing...

Here are a few examples:

Format: Username (Current, Regulated Weighted, Unweighted)

RadiantVibe (97.53, 92.2631, 90.91)
Andrew WCY (96.67, 94.0343, 93.01)

CammyGoesRawr (93.80, 88.0389, 86.59)
Hakulyte (93.25, 90.5301, 89.72)

Currently, both RadiantVibe and CammyGoesRawr are rewarded for their top scores much more than Andrew WCY and Hakulyte. You see this under "current".

Both regulated weighted and unweighted settings agree that Andrew WCY > RadiantVibe and Hakulyte > CammyGoesRawr. This is the regulated solution acknowledging the pros of the unweighted solution and factoring that into the overall calculations.
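The pairwise comparison can be checked mechanically with a few lines of Python, using only the four rows listed above. `higher_rated` is a made-up helper name.

```python
SYSTEMS = ("current", "regulated weighted", "unweighted")

# Username: (Current, Regulated Weighted, Unweighted), copied from the list above.
ratings = {
    "RadiantVibe": (97.53, 92.2631, 90.91),
    "Andrew WCY": (96.67, 94.0343, 93.01),
    "CammyGoesRawr": (93.80, 88.0389, 86.59),
    "Hakulyte": (93.25, 90.5301, 89.72),
}


def higher_rated(player_a: str, player_b: str) -> dict[str, str]:
    """For each system, return which of the two players has the higher rating."""
    return {
        system: player_a if ratings[player_a][i] > ratings[player_b][i] else player_b
        for i, system in enumerate(SYSTEMS)
    }


for pair in (("RadiantVibe", "Andrew WCY"), ("CammyGoesRawr", "Hakulyte")):
    print(pair, higher_rated(*pair))
```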

I'm happy to chat through more examples if two users want to do a comparative analysis...

However, the unweighted solution fails on the counterexamples provided above:

Quote:

Originally Posted by WirryWoo

Player A: https://www.flashflashrevolution.com...me=Chloe_edz15 (Weighted: 0 (flagged as inconclusive), Unweighted: 7.83)
Player B: https://www.flashflashrevolution.com...ername=Soure97 (Weighted: 93.25, Unweighted: 74.67)
Player C: https://www.flashflashrevolution.com...=Guilhermeziat (Weighted: 87.7044, Unweighted: 52.17)
(there are more examples)

This is the regulated solution acknowledging the pros of the weighted solution. Do you now see how valuable it is to look at the "greys" rather than hyperfocusing on "black" and "white" as the only options?

For Zageron's case:

Zageron (61.80, 54.517, 43.31)

Clearly, the weighted solution suggests that Zageron can score at the competency of someone who can AAA around a difficulty ~54, and lo and behold... he did (Rat Twist)! Was that a fluke run? I don't know. All the model did was capture the relevant signals seen in his Top 100 high scores.

If you still think ~54 is too generous for Zageron, this is where alpha comes into play. We can tweak alpha to make this algorithm more or less conservative. I give the developers and the community the power to define these standards in accordance with what they think is best for all players moving forward. As an individual, it's not my right to define this on behalf of the community. That's the value of having alpha and head in the model.

The closer alpha is to 0, the more conservative the weighted model is, and therefore the more a player's top 15 represents their skill rating. The closer alpha is to 1, the less conservative the weighted model is, and therefore the more their top 100 represents their skill rating. Alpha is a tradeoff parameter between the overall value of the top 15 vs. the top 100 (in machine learning speak, think regularization). Alpha has to be set to one standard value for all players, unless we devise a new algorithm that estimates alpha for each player as a function of their high scores.
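
As a rough illustration of that tradeoff only: the sketch below blends a top-15 average with a top-100 average. This is an assumed stand-in, not the notebook's formula; the blending rule, the function name, and the way `head` is used here are hypothetical.

```python
def blended_rating(equivs, alpha, head=15, top_n=100):
    """Blend the average of the best `head` scores with the best `top_n` scores.

    Hypothetical rule: alpha = 0 uses only the top `head` average,
    alpha = 1 uses only the top `top_n` average.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    ranked = sorted(equivs, reverse=True)[:top_n]
    if not ranked:
        return 0.0
    head_scores = ranked[:head]
    head_avg = sum(head_scores) / len(head_scores)
    full_avg = sum(ranked) / len(ranked)
    return (1.0 - alpha) * head_avg + alpha * full_avg


# Made-up scores 1..100, only to show how the rating moves with alpha.
scores = [float(x) for x in range(1, 101)]
for alpha in (0.0, 0.5, 1.0):
    print(alpha, blended_rating(scores, alpha))
```

Because alpha is a single shared setting, changing it moves every player's rating by the same rule, which is why it is something for the community to agree on rather than something tuned per player.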

Quote:

Originally Posted by xXOpkillerXx

2.1. Top X size
Say we put X at 50. A player would have their skill level biased toward a single skill (in any system) once they have achieved at least 25 scores

Now say a player's level should be based mostly on files that are ±5 levels around their skill level (assuming enough files provided in that range). We also know that files difficulty ranges between 1 and ~120 (for simplicity).

This means that every new file has a (5+5) / 120 = 8.333% chance of being in your range.

FFR releases files at a rate of ~4 files per week + an additional 80 files ish for events yearly. Per year, that's around 4 * 52 + 80 = 288 files, so lets make this 300 to account for events I might be forgetting (a higher number favors your argument). This means that every year, there are about 288 * 8.333% = 24 files in your specific range (assuming you stay at the same level).

To reach the necessary 25 skill-specific files, you would need a minimum of 1-ish year of content if all files were specifically biased toward your strong skill. If we consider the fact that there are various skills, lets say 5 (a bit less than etterna), that means every 5 years there is a high probability that some players will have enough files to still generate a biased skill rating despite it being a top 50.

However, we all know that there aren't that many extremely biased files after nearly 20 years of content. This simply shows that files are on a wide spectrum regarding what skill they test. That being said, we can see that in theory, we Should scale any system's top X size every 5 years or so, but that in practice, it's probably something closer to 50+ years. Why 50+ ? Because I dare you to find a handful of players with relatively optimal top 50 scores where at least 25 scores are clearly focused around a single skill.

"This means that every new file has a (5+5) / 120 = 8.333% chance of being in your range." This quote makes the false assumption that every new file is equally likely to be rated anywhere between difficulty 1 and 120, which is honestly pretty silly and untrue. It also assumes that every batch has the same representation of difficulties, but just by looking at Official Tournaments, it's clear that there's a bias toward harder songs. (I also think a lot of stepartists have a self-interest in stepping harder songs in general, but this is just my personal sentiment, which maybe many people share.) Therefore this computation is inaccurate. Realistically, it depends on the player's skill and on the contributing stepartists' submissions in the batch, so your extrapolation to 50+ years is unreliable.

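For reference, the quoted estimate can be reproduced in a few lines, which also shows how strongly it depends on the assumed share of files that land in a player's range. The release figures come from the quoted post; the alternative shares (2% and 20%) are made up purely to show the sensitivity and are not measured from FFR data.

```python
def files_in_range_per_year(files_per_year, share_in_range):
    """Expected number of new files per year that land in a player's range."""
    return files_per_year * share_in_range


def years_to_collect(target_files, files_per_year, share_in_range):
    """Years of releases needed to accumulate `target_files` in-range files."""
    return target_files / files_in_range_per_year(files_per_year, share_in_range)


FILES_PER_YEAR = 4 * 52 + 80        # 288, from the quoted post
UNIFORM_SHARE = (5 + 5) / 120       # the quoted uniform-difficulty assumption

print(files_in_range_per_year(FILES_PER_YEAR, UNIFORM_SHARE))
print(years_to_collect(25, FILES_PER_YEAR, UNIFORM_SHARE))

# Hypothetical shares, chosen only to show how sensitive the result is.
for share in (0.02, UNIFORM_SHARE, 0.20):
    years = years_to_collect(25, FILES_PER_YEAR, share)
    print(f"share={share:.3f}: {years:.2f} years to reach 25 files")
```

With the uniform assumption this gives about 24 files per year and roughly one year to reach 25, matching the quote; any skew in released difficulties changes the answer in proportion.
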
Instead, let's look at past data, where we know for a fact that songs were released in some sequential order. Before 2004, when files were held in the Legacy engine, there were a good number of opportunities to score well on songs requiring trilling: One Minute Waltz, Flight of the Bumblebee, and Runny Mornings (SGX Mix), debatably Molto Vivace. Players who excel at trilling will perform relatively well on these files and will benefit highly from any system imposed. You can consider their performance as "flukes" (similar to Zag's performance on Club and AIM Anthem). Due to the smaller number of files in the engine, a well-designed skill rating system back then would have required something like Top 10, regardless of whether it was weighted or unweighted.

Fast forward to today, we get La Campanella, Giselle, MAX Forever, and other trilly songs that I can't think of off the top of my head. But I'm very confident that there are at least 10 songs currently in the engine that emphasize trilling. This is equivalent to Zageron having more files similar to Club and AIM Anthem, and therefore more opportunities to fluke. Top 10 will easily be filled with trilly songs, so there is a need to scale out. This time span is less than 20 years, at least for trilling files. For other patterns, the length will vary depending on the previously mentioned variables (stepartist song submissions, batches, events, etc.).

Regardless of how long it takes for the need for scalability to arise, the main point is that at some point in the future we will need to revamp the system. Maybe Top 100 would be too small a sample size, so we would need Top 150, or Top 200, etc. in order to maintain the accuracy of skill ratings.

The issue with the unweighted setting is that it will be difficult to re-track the players who joined FFR in 2002 and then stopped playing the game. Your solution (which I personally characterize as "hacky") is to filter out these players so that their ratings don't get factored into the overall high scores. This goes back to my thoughts about the minimum requirement:

Quote:

Originally Posted by WirryWoo

It's perfectly fine to enforce a minimum requirement in both settings (it probably is better in both cases because it's ridiculous to assign a skill rating to having one song played). This is less of a problem to me than what I wrote previously, but one of the main drawbacks I can see with the unweighted system is that it is forced to have this minimum requirement from the players to make the unweighted system work. Because of this forced requirement, you are requiring everyone who hasn't played 50 to 100 songs, to play (ideally seriously) in order to be considered ranked and improve the representation of the unweighted rankings. So there is a huge reliance on the players to play their part in making the unweighted system work. This isn't realistic in practice and this is why I call the unweighted system much more favorable to "active players". The ones who are committed to contributing to the high scores will be the ones who make the unweighted setting work.

The weighted system I designed is a lot more lenient about a minimum requirement (we are free to choose this requirement independently of the model's development). You can choose any reasonable minimum requirement for each player to satisfy, and regardless of whether that requirement is met, the model attempts to find the best representation of skill using the weighted setting. Those who don't meet the minimum requirement will simply be excluded from the high scores via a defined conditional filter (e.g. don't show a username in the high scores if they haven't played 50 or 100 songs).
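
A minimal sketch of that separation, assuming ratings have already been computed for every player: the filter only decides who is displayed and never touches the rating itself. The `Player` fields, the function name, the threshold, and the sample players are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Player:
    name: str
    rating: float        # already computed by the rating model
    songs_played: int


def leaderboard(players, min_songs=50):
    """Return players who meet the minimum requirement, best rating first.

    The rating of an excluded player is left untouched; they are only hidden.
    """
    eligible = [p for p in players if p.songs_played >= min_songs]
    return sorted(eligible, key=lambda p: p.rating, reverse=True)


# Made-up players for illustration only.
players = [
    Player("veteran", 88.4, 1200),
    Player("newcomer", 91.0, 12),
    Player("regular", 73.9, 140),
]
for rank, p in enumerate(leaderboard(players), start=1):
    print(rank, p.name, p.rating)
```

Changing `min_songs` changes who appears on the board, but no rating has to be recomputed, which is the independence being argued for.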

When you filter these players out, this also changes the definition of "skill rating". Imagine if Usain Bolt did not participate in the Summer Olympics this year but attended four years ago. Has his skill rating changed? Maybe, maybe not. Point is, he should still have roughly the skills to perform at the Olympic level if he were to attend. Your suggestion is to mark him as "no skill due to not participating", whereas my solution respects his skills given his past performance and attempts to acknowledge them despite not seeing his performance this summer. Which one better measures skill?

Quote:

Originally Posted by xXOpkillerXx

Although you seem to very much consider long term effects of new content, you dont really address the short term effects. In a weighted system where bias is significantly greater than in an unweighted system, every single new file that is biased enough towards one skill will create more unfairness in its specific difficulty range.

In order to have some fairness, you'd need enough of these biased files in each specific skill for anyone to fill optimal 25 scores with any random combination of these files in their difficulty range. Mathematically, this means you need:

25 (majority of 50) * 5 (number of skills) * 12 (minimum number of distinct difficulty ranges in a 1-120 system) = 1500 skill specific files
(assuming perfect distribution between skills and difficulty ranges, which is even more unrealistic)

This number will take Far longer to achieve than the 50+ years needed to make top X size a serious concern (when X >= 50).

The short-term effects are addressed because the regulated system's goal is to account for the pros and cons of both the weighted and unweighted settings. The results above speak for themselves.

Quote:

Originally Posted by xXOpkillerXx

2.3. Scaling conclusion
It honestly doesn't seem adequate to focus too much on scaling issues, as both system would be fine for a very long time. My problem with weighted however, is that it will forever be unfair.

This is under your definition of "fair". There are multiple definitions to consider, as mentioned previously, but you're so hyperfocused on one that you try to patch the faults of the unweighted setting by filtering inactive players, relying on their involvement, etc.

My definition of fair incorporates the fairness criteria offered by both the weighted and unweighted settings and trades off between the two to generalize that definition of "fairness", without needing to rely on any factors other than the scores given to me. In my opinion, this defines skill rating.

It's also important to focus on building any scalable solution possible. The earlier, the better. Otherwise, we'll have this conversation again when the hacky solution fails.

Quote:

Originally Posted by xXOpkillerXx

3. On rewarding top scores, irrelevant of rating system
I am very aware of the fact that many of you can't accept seeing players with a few great scores being ranked too low due to unoptimal top X. I agree that this is subjective and that every player has their right to assign as much importance they want to that flaw. For that reason, I will propose a slight change in FFR's design to hopefully fix that.

Do keep in mind that although I suggest this new idea to complement an unweighted skill rating system, I also believe it should be implemented even if a weighted system is chosen.

3.1. The suggestion
Some of you may or may not have noticed that, in a player's leaderboard page, their Top 5 unweighted average and their Top 100 unweighted averages can already be seen. This is essentially the first step of what I think is a great step forward.

A Top 5 metric fully embraces skillset bias and fluke scores, as these are inevitable over time for a non-negligible number of players. Not only does it suffer no scaling issues, it also takes into account All players, retired or not, since the very beginning of FFR. This metric basically reflects the current weighted top 15, but removes the unnecessary weights and simplifies the process.

A Top 50 (or Top 100) metric would do everything I've been arguing for, which is maximized fairness and simplicity of outliers.

My biggest issue with this is that you are now injecting your own personal bias into the skill rating. Specifically, you're now making the conscious decision of answering the question "when should I rely on "Top 5 vs. Top 50/100"?" This is a choice that you make, not the model.

This is like conditionally choosing whether a chess player's Elo rating or their win percentage is more definitive of their skills. I disagree with this completely, because skill is measured by your performance and scores, not by someone else's conscious decision to choose between two different metrics.

Quote:

Originally Posted by xXOpkillerXx

3.2. User friendliness
In all honesty, I despise the argument "But players might prefer a single number to represent their rating". First of all, we do have a unique number that represents one's solo level and it's called just that: Level. There is absolutely no reason to enforce a unique metric for player comparison, because I could just as easily say "But some players might prefer having different options to compete for", which is equally valid and subjective.

Isn't this what we're arguing about though? Specifically how is Level computed? Do we want the weighted vs. unweighted settings to perform the computation of Level?

""But some players might prefer having different options to compete for", which is equally valid and subjective." Choosing between two different options as the "official rating" is subjective from a modeling standpoint, so in a conversation about tracking skill rating it's not valid: now you are comparing apples to oranges. This is a paradoxical statement. Do we want skill to be defined subjectively by someone's brief look at your level ranks, or objectively by the scores that you produce? I personally prefer the latter.

Quote:

Originally Posted by xXOpkillerXx

There is also the current issue of "how do I compute my skill rating ?", something that can be seen pretty often in either discord or multiplayer. Then, some experienced player may decide to take their time to explain weights and stuff, and eventually it takes quite some time to compute manually anyway (if you want to see the effect of a potential change). This issue should definitely be less apparent with the proposition I make. We can expect people to ask "Why are there 2 ratings and what do they mean ?", but clearly it should not take longer to explain two simple (no weights) averages; I'd say it should even be shorter to explain tbh.

I agree that it's easier to explain the unweighted setting no matter how you engineer the weighted configuration. But do we care more about being transparent or about being accurate in defining skill? This is a tradeoff we need to make, because the simpler the approach is, the more susceptible the model becomes to performing poorly on the outliers defined previously. The code I wrote is really simple to explain as well (not as simple as unweighted, but still relatively easy to understand). It can easily be done with pictures.

Quote:

Originally Posted by xXOpkillerXx

3.3. Appearance on the website and game
I think both metrics should have their respective leaderboard, and that everywhere that the current "Skill Rating" is listed should be split in 2 cells of Top 5 and Top 100. This involves a bit more development, but pretty minor changes afaik, as there is nothing drastically new to implement.

Terrible design idea in my opinion. A bystander will simply think that you might as well have a Top 10, 100, 1000, etc. Do you see any high scores that contain two different scoring metrics? No, neither have I.

A two-rating system for defining skill rating will not address the "I just beat Myuka's skill rating lmfaoooo!!" issue. The next natural question for someone new to the game is "Which one is better?" Shouldn't the rating system be one centralized system that easily allows users to make valid comparisons with the people surrounding them in the high scores? How does ranking even work in this case? lol

Quote:

Originally Posted by xXOpkillerXx

I personally don't think it's ok to assume that a player's scores will on average match a linear curve when it comes to effort put into each of the scores. However I do agree that it can be great to reward the top scores. Therefore, the 2-metric system would not have that decay you suggest, but it would still give significant value to your top 5 scores.

I only created a linear progression because I see each song having a representative contribution (with respect to its placement in your high scores) to the skill rating, without encoding any additional bias when comparing two songs. Specifically, the delta between your #1 and #2 weight percentages is the same as the delta between your #51 and #52 weight percentages.

Quote:

Originally Posted by Gradiant

Confused with the difference here. With both of these systems, players aren't going to be listed if they haven't hit whatever minimum is in place.

Also in general, don't really like the 'people are going to have to play' argument against average system. What is the whole point of this game anyway but to play files to get scores they think are good? I mentioned this in the discord when op brought it up, but the token requirement for coactive is to AAA 50 different files in a day. So playing 50 different files just to best of ability not even AAA'ing shouldn't take longer than a day either. Don't think this is too much to ask for at all for the benefit of being on a leaderboard. And if they don't care enough about playing the game, then they're not listed in the leaderboard like the bolded part of that 2nd part in the quote.

Also thinking of games like moba's or stuff like starcraft where you go through placement matches before being ranked; those games a match could go anywhere from like 30min to an hour, compared to an ffr file being like 2 minutes or so. The times required for the placement matches would be similar to hitting whatever minimum number of files played to be on ffr's leaderboard.

The difference is that the unweighted setting is going to depend heavily on these filters to work. The weighted setting is completely independent of whatever filters are assigned. You can freely choose how you want to filter the scoreboard without relying on the weighted model.

You're right that it's not a huge ask. I get that. What I'm saying is that when both systems call for a revamp after more files get added into the engine, you will need to rely on everyone who previously met that requirement to make that unweighted system continually work over time. For the weighted setting, you don't need their involvement at all because the data is already there, so why not try to make the best value out of that information? The unweighted setting is too restrictive to make value of the information provided from inactive players, and filtering them out from the scoreboard is only a hacky way of hiding the deficiencies present in the unweighted setting.

I agree with needing a minimum requirement, similar to ranked queues in League, Starcraft, etc., in order to qualify for the high scores. I disagree with how dependent the unweighted system is on user activity (i.e. players' involvement in making the unweighted configuration more reliable and representative) to measure skill rating. You don't need this in the weighted setting.

**Gradiant** · 05-24-2021, 02:06 PM

Quote:

Originally Posted by WirryWoo

What I'm saying is that when both systems call for a revamp after more files get added into the engine, you will need to rely on everyone who previously met that requirement to make that unweighted system continually work over time.

How so? A player wouldn't really need to play any new files after their top 50/100/whatever were filled, unless they felt the new file would be a promising one to add to their top scores. The only people anything would be asked of really would be newer players, which boils back down to the minimum number of files played being a basic requirement.

**WirryWoo** · 05-24-2021, 02:44 PM

Quote:

Originally Posted by Gradiant

How so? A player wouldn't really need to play any new files after their top 50/100/whatever were filled, unless they felt the new file would be a promising one to add to their top scores. The only people anything would be asked of really would be newer players, which boils back down to the minimum number of files played being a basic requirement.

See below:

Quote:

Originally Posted by WirryWoo

Instead, let's look at past data, where we know for a fact that songs were released in some sequential order. Before 2004, when files were held in the Legacy engine, there were a good number of opportunities to score well on songs requiring trilling: One Minute Waltz, Flight of the Bumblebee, and Runny Mornings (SGX Mix), debatably Molto Vivace. Players who excel at trilling will perform relatively well on these files and will benefit highly from any system imposed. You can consider their performance as "flukes" (similar to Zag's performance on Club and AIM Anthem). Due to the smaller number of files in the engine, a well-designed skill rating system back then would have required something like Top 10, regardless of whether it was weighted or unweighted.

Fast forward to today, we get La Campanella, Giselle, MAX Forever, and other trilly songs that I can't think of off the top of my head. But I'm very confident that there are at least 10 songs currently in the engine that emphasize trilling. This is equivalent to Zageron having more files similar to Club and AIM Anthem, and therefore more opportunities to fluke. Top 10 will easily be filled with trilly songs, so there is a need to scale out. This time span is less than 20 years, at least for trilling files. For other patterns, the length will vary depending on the previously mentioned variables (stepartist song submissions, batches, events, etc.).

Regardless of how long it takes for the need for scalability to arise, the main point is that at some point in the future we will need to revamp the system. Maybe Top 100 would be too small a sample size, so we would need Top 150, or Top 200, etc. in order to maintain the accuracy of skill ratings.

The issue with the unweighted setting is that it will be difficult to re-track the players who joined FFR in 2002 and then stopped playing the game. Your solution (which I personally characterize as "hacky") is to filter out these players so that their ratings don't get factored into the overall high scores. This goes back to my thoughts about the minimum requirement:

When you filter these players out, this also changes the definition of "skill rating". Imagine if Usain Bolt did not participate in the Summer Olympics this year but attended four years ago. Has his skill rating changed? Maybe, maybe not. Point is, he should still have roughly the skills to perform at the Olympic level if he were to attend. Your suggestion is to mark him as "no skill due to not participating", whereas my solution respects his skills given his past performance and attempts to acknowledge them despite not seeing his performance this summer. Which one better measures skill?

View Poll Results: Which system is best ?
Simple average of top X equivs	13	32.50%
Weighted average of top X equivs	27	67.50%
Voters: 40. You may not vote on this poll

**Zageron** · 05-22-2021, 01:22 PM

As a victim of weighted averages, I stand in solidarity with Simple average of top X equivs.

**Matthia** · 05-22-2021, 01:23 PM

I prefer any system that will boost me to #1

**xXOpkillerXx** · 05-23-2021, 08:31 AM

More arguments from the weighted side please

**FlynnMac** · 05-23-2021, 04:23 PM

I guess I'll give my take on this.

I chose weighted mainly from my experience with other rhythm games that have a more successful weighted system. It doesn't highlight your top play a large amount ahead of the rest, which keeps outliers from dominating, and it also uses a large number of files in order to give the most accurate rating possible. While a simple average gives the average of all your top X ratings, a weighted one can still let your best plays have more of an impact than plays you aren't as happy with.

Wirry's system felt accurate to me because its weights were better than FFR's current weights. There are a lot of high-level players who had lower ranks than they should have and got bumped up, and a lot of lower-level players who got their ranks bumped down (me included). The ratings really felt like they defined who deserved the better ranks, compared to an average rating system that could still have its outliers. Outliers will not be fixed either way, but with the right weights, they could be handled better than a simple average could do it.

**WirryWoo** · 05-22-2021, 10:11 AM

For fun: If you want to determine your projected skill rating under the weighted mechanism I designed, scroll up to the "Determine your skill rating" section, replace my username with yours, then scroll all the way to the top, and click on the play buttons for the first seven cells.

**Gradiant** · 05-22-2021, 01:56 PM

Simple average better deals with issue of fluke scores on poorly rated files

**Zlyice** · 05-23-2021, 04:48 PM

There are two main reasons I'm in favor of a weighted average. One, as Flynn mentioned, is that a weighted average does a better job of giving resolution to a player's top level of play. The current system does give a pretty strong weight to a player's top score, but I see this more of an issue with the current weights as opposed to a weighted average in general. WirryWoo's calculation earlier in the thread seems pretty reasonable to me personally.

Secondly, an unweighted skill rating is only going to be as representative as the full scope of scores going into the calculation. If we're considering, for example, an unweighted average of 100 songs, this would require a player to play enough things for these 100 scores to be reasonably representative of their level of skill, which could take a considerable amount of time. There's a lot of potential for an unweighted average to disproportionately rank more active players ahead of players who play a bit less but are ultimately a little more skilled.

05-23-2021, 06:13 PM	#10
xXOpkillerXx ✘ Forever OP✘ Join Date: Dec 2008 Location: Canada,Quebec Age: 29 Posts: 4,171	Re: Poll: Which global skill rating system is best ? Alright, I'll try and make a structured statement. First of all, there is a concern that many people pointed out, which is that any system would have outliers. While that is true, not all outliers are the same, and that should very much be considered. In all cases should we try to minimize the amount of outliers there are, but it can be very difficult to compare counts for different types of outliers. At that point, a bit of subjectivity is invovled and necessary. Lets look at what types of outliers the two kinds of system generate: 1. Weighted avg outliers: These are essentially any and all outliers that come from the fact that our difficulty judgement is inherently flawed, mixed with inevitable imbalance in players' skillsets. The two points in this can be further explained: 1.1. Difficulty We (FFR) use a single number to represent chart difficulty. Obviously, this has a relatively high and non-negligible degree of subjectivity. Other games like Etterna have attempted to fix this flaw by splitting the difficulty in distinct skills, kind of forcing axioms for what defines difficulty at its core. This method can generally help distinguish between files that are well balanced vs the ones that focus on 1 or 2 specific skillsets throughout. However, we simply dont do that, either because it has its own flaws, or for various reasons unrelated to this topic. So, we have one single number representing the difficulty of each file, be it balanced or not. 1.2. Players skillsets It's no surprise that each player has their own best and worst skills. Just like the files, some players' skillset are well balanced, while others' are more specific. 
Comparison of skill between two players can be argued, but my stance is that this statement should hold:
Player A's skillset: 3/3 for jacks, 2/3 for jumpstream, 1/3 for one-handed trills
Player B's skillset: 2/3 for jacks, 2/3 for jumpstream, 2/3 for one-handed trills
(The skills are just an example, but the numbers are important)
Player A and B are equal.
This has subjectivity in it, and I invite anyone to explain why they think player A should be considered the better player in this case. I personally believe that we shouldn't favor specific skill proficiency over general proficiency. Anyone who agrees with this statement should make sure their preferred system respects it.
1.3. The outliers
Well, in a weighted system, where a non-random sample X of files is used to output a single number representing global skill rating, the above statement can never hold. For any score x1 in X, there will always be a score x2 that is favored over it, or vice versa. This means that any weighted system (with X of fixed size!), by definition, will generate unfairness by favoring players with specific skillsets at any given level. When X is of variable size, it becomes -Incredibly- difficult to properly formalize the model, and therefore a lot of guessing is introduced. That is what WirryWoo's model's hyperparameters are. By tweaking these, we adjust X's shape depending on a player's scores, but we can no longer tell what is favored (skillset specificity vs varied skillset) nor to what degree. In my opinion, this is sub-optimal. Again, this mostly revolves around the player comparison statement.
2. Simple avg outliers:
A simple average system also generates outliers. These are much more straightforward. In fact, such a system implies an important statement about skill rating:
Any player that has a rating representative of their actual skill level has optimally filled their top X scores.
This means that if X is of size 50, then a player should have 50 scores of their caliber to be properly ranked. Any player whose top 50 does not look like that will have a rating lower than their true skill level. The main downside to this is pretty simple too: if, over time, too many players don't optimally fill their top X, then the rankings will be flawed. These are essentially the outliers of this type of system.
My primary (subjective) argument for accepting this downside is that I absolutely cannot understand why we should think that it's too much to ask from players who want to be ranked. Playing 50, or even 100 songs in your difficulty range should Not be troublesome; if you want to be properly ranked but cannot be bothered to fulfill this pretty simple requirement, do you even really care to begin with? Saying that an unweighted system "favors active players" is quite the overstatement in my opinion. You don't need to be that active of a player to fulfill the requirement.
3. Comparison of outliers
So we have defined the kind of outliers that each system will inevitably have. The main concern I have with saying that "outliers are outliers" is that they're actually drastically different conceptually.
The weighted models' outliers are unfair. Some players will always be favored no matter how weights are arranged. In a variable X size setting, the outliers may be reduced, but only by an undefined amount, and they become hard to model.
The unweighted model's outliers are fair. Any player can easily stop being an outlier by getting some more scores in their difficulty range.
Now obviously the number of outliers in both cases will differ. Naturally, at the very beginning of a transition to an unweighted system, there would be many more of them. This means that a stabilization period would follow, during which the players will get more scores at their own pace to more optimally fill their top X.
There will always be players who will not do it, and retired players may well never come back to adjust their scores for this. However, any change to the skill rating computation will require Some adjustment from the players to get a more optimal result, so keeping retired players' rankings as is is just not a possibility (although some systems may yield closer results, the point remains).
3.1. My take on the outliers
At the end of the day, I personally favor fairness over count when it comes to these outliers. That being said, I would totally be ok with moving back to a weighted system if, after an arbitrarily long stabilization period with an unweighted system, there is still not enough effort from the players to make their top X reflect their actual skill level. That would be quite sad, but FFR does have its periods of low activity, and too little of it would indeed mean a weighted system is required. I don't think we have too little currently, but that's mostly subjective and debatable.
4. Common arguments
Here are some arguments people usually make which I'd like to address:
4.1. Rewarding outstanding scores
There is this thought that a weighted system better rewards the rare great scores players get every once in a while. While that is definitely true, it doesn't mean that unweighted doesn't reward them; it just does so to a lesser degree to respect the important statement made in 1.2! A great score is still rewarded as the top 1 score in the top X. A player with the same average skill as you will be ranked lower due to that new score you got. If they're not ranked lower despite that sick score you got, that means they're better than you on average, that is all.
4.2. What about the top players who won't have an optimal top X?
Yes, if Myuka doesn't play more and a top 50 unweighted is implemented, they will have a skill rating far from representative. To be honest, I couldn't care less.
There are countless players from all other rhythm games who we know could be in top spots on FFR. Granted they haven't played a single game, the fact that we Know they'd place around a certain spot is also applicable to our current top players who might never "fix" or "fill" their ranked scores. Yes, it looks funny to see Myuka be ranked 100th or whatever, but really that's a small argument to back unfairness in system outliers.
Does this mean we reward activity? No, not really. It means we enforce a (relatively small) minimum of activity over a player's whole "FFR career" in order to have a representative skill rating. Rewarding activity would be done with seasons, where the same concepts are applied to definite, repeating timeframes where stats are reset each iteration.
5. Conclusion
I hope this post clarifies why I believe an unweighted top X (of size 50 or 100) is preferable in our case. I am very aware of the flaws of such a system, but I definitely think they are significantly "better" flaws than a weighted system's flaws.
Last edited by xXOpkillerXx; 05-23-2021 at 08:27 PM.
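The Player A / Player B comparison from sections 1.2 and 1.3 can be sketched in a few lines. This is only an illustration under stated assumptions: the three skill levels are treated as if they were three scores, and the decreasing weights `[3, 2, 1]` are made up for the example rather than taken from any proposed model.

```python
def simple_average(scores):
    """Unweighted mean of all scores."""
    return sum(scores) / len(scores)

def rank_weighted_average(scores, weights):
    """Weighted mean where the best score gets the first (largest) weight."""
    ranked = sorted(scores, reverse=True)
    return sum(w * s for w, s in zip(weights, ranked)) / sum(weights)

player_a = [3, 2, 1]  # jacks, jumpstream, one-handed trills (out of 3)
player_b = [2, 2, 2]
weights = [3, 2, 1]   # hypothetical decreasing weights

a_simple = simple_average(player_a)
b_simple = simple_average(player_b)
a_weighted = rank_weighted_average(player_a, weights)
b_weighted = rank_weighted_average(player_b, weights)
print(a_simple, b_simple, a_weighted, b_weighted)
```

Under the simple average the two players tie, while these particular decreasing weights put Player A ahead, which is the kind of bias toward specialized skillsets the post describes.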

05-23-2021, 08:43 PM	#11
trumaestro I don't get no respect Join Date: Jun 2006 Age: 32 Posts: 1,332	Re: Poll: Which global skill rating system is best ?
Spitballing: how about a bit of both sides? Equal weights for the top X scores. Decreasing weights for the next Y scores.
I'm not math-y enough to work out whether that addresses any of the issues here, but it seems to me that combining sides here could help mitigate the downsides of each.
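One way to read this suggestion, as a rough sketch: the post does not say how the weights should decrease, so the linear taper below is an assumption, and `hybrid_weights` and `hybrid_rating` are hypothetical names.

```python
def hybrid_weights(x, y):
    """Weight 1.0 for the top x ranks, then a linear taper over the next y ranks."""
    flat = [1.0] * x
    # Assumed shape: y/(y+1), (y-1)/(y+1), ..., 1/(y+1); ranks past x+y get no weight.
    taper = [(y - i) / (y + 1) for i in range(y)]
    return flat + taper

def hybrid_rating(scores, x, y):
    """Weighted mean of the best x+y scores using the hybrid weights."""
    ranked = sorted(scores, reverse=True)[: x + y]
    if not ranked:
        return 0.0
    weights = hybrid_weights(x, y)[: len(ranked)]
    return sum(w * s for w, s in zip(weights, ranked)) / sum(weights)

print(hybrid_weights(2, 4))
print(hybrid_rating([80.0, 78.0, 75.0, 70.0, 66.0, 60.0, 40.0], x=2, y=4))
```

The flat part behaves like an unweighted top X, while the taper softens the cliff between the X-th and (X+1)-th score that WirryWoo raised at the start of the thread.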

05-24-2021, 08:06 AM	#15
katanaeyegaming #FearTheWyvern Join Date: Aug 2019 Age: 20 Posts: 342	Re: Poll: Which global skill rating system is best ?
My opinion on this is simple: neither. They both suck.
__________________
It is known that wherever you may go we will follow you #Fearthewyvern

05-24-2021, 08:11 AM	#16
xXOpkillerXx ✘ Forever OP✘ Join Date: Dec 2008 Location: Canada,Quebec Age: 29 Posts: 4,171	Re: Poll: Which global skill rating system is best ?
I'd like to quickly entertain the idea that in a 2-metric system, something like a "linearly weighted top 10" (trying to keep it a low number to still represent peak performance) could be interesting too. The general rating would remain a simple average of top 100.
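A minimal sketch of what such a 2-metric system might look like. The exact linear weights (10 for the best score down to 1 for the tenth) and the handling of players with fewer scores than the window are assumptions, and the function names are hypothetical.

```python
def linear_weighted_top(scores, n=10):
    """Peak metric: best score weighted n, next n-1, ..., down to 1."""
    ranked = sorted(scores, reverse=True)[:n]
    if not ranked:
        return 0.0
    weights = range(n, n - len(ranked), -1)
    return sum(w * s for w, s in zip(weights, ranked)) / sum(weights)

def simple_top_average(scores, n=100):
    """General metric: unweighted mean of the best n scores."""
    top = sorted(scores, reverse=True)[:n]
    return sum(top) / len(top) if top else 0.0

# Made-up score list: one score at each level from 1 to 100.
example_scores = [float(level) for level in range(1, 101)]
peak = linear_weighted_top(example_scores)
general = simple_top_average(example_scores)
print(peak, general)
```

Keeping the two numbers separate lets the peak metric reward standout scores without changing the unweighted general rating.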
