Comments:
For sampling, keep in mind that users need to get a high % of top-model responses, or they will just leave. So it's not surprising that a few top models get sampled a lot more than the bottom ones; same for open-source vs. closed-source - the latter still scores higher, so it should be sampled more often to keep users engaged. Without the user engagement, the whole concept does not work.
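A toy sketch of what that could look like (my own illustration, not LMArena's actual sampling code; the model names, ratings, and temperature are all made up): weight pair selection by a softmax over current ratings, so the top models come up far more often.

```python
# Hypothetical engagement-weighted sampling: higher-rated models get
# drawn into battles more often. All numbers here are invented.
import math
import random

ratings = {"model_a": 1300, "model_b": 1250, "model_c": 1100, "model_d": 1000}

def sample_pair(ratings, temperature=100.0):
    """Sample two distinct models with probability ~ exp(rating / temperature)."""
    names = list(ratings)
    weights = [math.exp(ratings[n] / temperature) for n in names]
    first = random.choices(names, weights=weights, k=1)[0]
    rest = [n for n in names if n != first]
    rest_weights = [math.exp(ratings[n] / temperature) for n in rest]
    second = random.choices(rest, weights=rest_weights, k=1)[0]
    return first, second

print(sample_pair(ratings))
```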
Common Andreessen & Horowitz L
"US actors" are not mainly known for their outstanding level of "epistemic trust" at this point in time. 😆 So I am not very surprised that tilting the playing field is one step toward a temporary "Hollywood glamour" type of victory. All this might not matter very much a year from now, if the Arena's credibility hits rock bottom or human opinions are no longer needed in the process of evaluating LLMs. Manipulated voting is another aspect of the whole process; bots similar to those on social networks might be a great way to achieve great results. E.g. every time someone logs in to a social network, they also vote for a specific model (or update whatever stats make money for the provider), just by clicking the OK button.
Love the quality of thinking; content and production values are developing nicely. Holding the big microphone - not sure about that. Superb channel, thank you. 🙏👍
Come on though, the LMSYS arena scores roughly correlate with the benchmark scores, from SimpleBench to ARC Prize. It's not some wildly out-of-line scoring system. Yes, we can improve benchmarking, but no, this is not some conspiracy where large tech companies are gaming the system with poor models that are outcompeted by better models from relatively unknown sources.
Captain obvious being obvious
Another "non-profit" doing a 180 into full profit... the associated scandals and biased results already seemed to be leading up to it.
Which lens? Nice impressionist rendering.
It’s shocking to learn that results could have been faked just to climb the Chatbot Arena rankings. With billions of dollars at stake and such significant implications, this situation feels a lot like the wild early days of the AI boom.
Gemini merely displays a wider range of possible answers.
<3
It is not a problem; anyone using these LLMs in anger can script their own private benchmarks to test LLMs in the domain that they care about.
Claude, DeepSeek R1, Gemini, o3, o4 will do 99% of this work for you!
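Something like this is really all it takes. A minimal sketch, assuming you wire in your own provider's client; the model_call stub and the test cases below are placeholders:

```python
# A tiny private benchmark harness. Replace model_call with a real API
# call and the cases with prompts/answers from your own domain.
cases = [
    {"prompt": "Convert 2h 45m to minutes.", "expect": "165"},
    {"prompt": "Which HTTP status code means 'Too Many Requests'?", "expect": "429"},
]

def model_call(prompt: str) -> str:
    # Placeholder "model" so the sketch runs; swap in your LLM client here.
    canned = {"Convert 2h 45m to minutes.": "2h 45m is 165 minutes."}
    return canned.get(prompt, "I don't know.")

def run_benchmark(cases) -> float:
    hits = 0
    for case in cases:
        reply = model_call(case["prompt"])
        # Crude substring grading; use stricter parsing (or an LLM judge)
        # for free-form answers.
        hits += case["expect"] in reply
    return hits / len(cases)

print(f"accuracy: {run_benchmark(cases):.0%}")  # 50% with the canned stub
```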
One of the best ML (AI) channels based on substance and depth.
Fortunately, there is an objective way to test the ability of chatbots: letting them generate computer program code to solve certain problems with algorithms (DeepMind uses this for its AlphaEvolve AI system).
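A toy sketch of how that grading works (my own example; real harnesses of this kind also sandbox and time-limit execution): run the generated code against unit tests and count passes.

```python
# Toy objective grading: execute model-generated code in a scratch
# namespace and score it against unit tests. candidate_source is
# hard-coded here; in practice it would come from the model's reply.
candidate_source = """
def fizzbuzz(n):
    if n % 15 == 0: return "FizzBuzz"
    if n % 3 == 0: return "Fizz"
    if n % 5 == 0: return "Buzz"
    return str(n)
"""

tests = [(3, "Fizz"), (5, "Buzz"), (15, "FizzBuzz"), (7, "7")]

namespace = {}
exec(candidate_source, namespace)  # only do this in a sandbox for untrusted code
fizzbuzz = namespace["fizzbuzz"]

passed = sum(fizzbuzz(arg) == want for arg, want in tests)
print(f"passed {passed}/{len(tests)} unit tests")
```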
Cutthroat race for AI dominance? What do you mean exactly? This looks like an interesting set of words arranged with some specific weirdness.
Regarding your arena: do you think it is smart to blindly compare votes without annotating the aim of the voter? Are two votes perfectly equal in this arena? Nah, not so sure.
When you compare chess players, you compare people who share the exact same goal: to win. For this thing, the arena, I have my doubts.
Totally aligns with my experience of LM Arena scores: some sort of guide, but not a lot of meaning in the real world. It feels like there is a lot of overfitting to benchmarks, which then shows up as big improvements, but this is not being reflected in real-world usage. Are the models better than 12 months ago? For sure. But I personally don't see the massive improvements in the underlying models.
Wow, a Pareto distribution, what a surprise.
Constantly moving the goalposts, except here the goalposts are the metrics.
The moneybrained are capable of destroying anything of value.
Classic instance of xkcd 2899 😆
We have a serious problem.
You are missing the point here.
The business model of the free service is that the free users are the product. In this particular case, the users are providing training data to tune LLMs for user preference. There isn't anything wrong with this approach in principle; the free users receive a free service (getting their answers) and the entertainment of the leaderboard. Win-win. The fact that you cannot use this leaderboard for science is an insignificant side effect, important only to science professionals.
Are you intentionally adding those cellphone connection interference noises? Just about every other transition has them
Elo also does not work well in more complex video games where "skill" is not as deterministic, due to individual playstyle preferences that, by themselves, have some rock-paper-scissors advantages over one another. Elo does not account for this at all. Consider "scissors" players in a population where "rock" players are more common than "paper" players. Scissors has an innate disadvantage against rock and an advantage against paper, even assuming equal skill; this is just how those players prefer to engage with the game. Elo will make scissors players, who lose to the rock strategy, seem less skilled simply for being a stochastic minority against a preference choice that outscales them. Chatbot Arena very clearly has the exact same problem.
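A quick simulation backs this up (the win probabilities and the 5:2:3 population mix are invented purely for illustration): every player is equally skilled by construction, yet the minority style that loses to the most common style ends up rated lowest.

```python
# Non-transitive matchups break Elo: equally skilled players, but the
# style mix (rock most common) drags scissors players' ratings down.
import random

random.seed(0)

# P(row style beats column style) between equally skilled players.
P_WIN = {
    ("rock", "scissors"): 0.6, ("scissors", "rock"): 0.4,
    ("scissors", "paper"): 0.6, ("paper", "scissors"): 0.4,
    ("paper", "rock"): 0.6, ("rock", "paper"): 0.4,
}

players = ["rock"] * 5 + ["paper"] * 2 + ["scissors"] * 3
ratings = [1000.0] * len(players)
K = 16

def expected(ra, rb):
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))  # standard Elo expectation

for _ in range(50_000):
    a, b = random.sample(range(len(players)), 2)
    sa, sb = players[a], players[b]
    p = 0.5 if sa == sb else P_WIN[(sa, sb)]
    score_a = 1.0 if random.random() < p else 0.0
    ea = expected(ratings[a], ratings[b])  # pre-match ratings for both updates
    ratings[a] += K * (score_a - ea)
    ratings[b] += K * ((1.0 - score_a) - (1.0 - ea))

for style in ("rock", "paper", "scissors"):
    avg = sum(r for r, s in zip(ratings, players) if s == style) / players.count(style)
    print(f"{style}: {avg:.0f}")  # scissors lands lowest despite equal skill
```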
Go easy on David! He is in hospital, not (yet) post-economic 😮
Very revealing. Thank you.
Why do you have a pineapple in your logo?
First person to quote David Shapiro loses!!
;-/
The obvious solution is to ignore Chatbot Arena. Even if it were fair and well calibrated, it's kind of a dumb metric...
Lol, the David Shapiro effect; what about the Kook effect? Solid video!
Clearly far more research is needed into model developer alignment
It's been fairly apparent, to those of us who pay attention, that long-standing benchmarks are becoming worthless. Benchmarks have a shelf life now: make a new one, run the latest models through it (plus any that appear shortly afterwards), then throw it away and make another. It's the only way to get valuable information out of benchmarks.
Followed you for years, forgot to tell you how absolutely awesome you guys at MLST are. Thank you! 🙏 you are the best! Should have told you years ago...
I like these in-depth videos even more than interviews / conversations.
Fine-tuning for the arena is not "hacking" the arena.
My dude, the prompts are the same not because people write the same, but because they are botting their own platform to sell the fake content to Meta and others.
Hyped, like we knew it would be, to make it more than it was and to make a few people rich. Do you still believe in the AI paradigm that changes the planet? Still believe the doomers? Still believe the AI CEOs?
Props to Zuck for coming clean
Great breakdown and awesome visuals! ❤
Tbh, so much hype and so many snake oil salesmen in this field from its beginning. 😂
Let's not throw Zuck under the bus. This is exactly what needs to be done: hack the competitions, show their weaknesses. It forces better benchmarks.
Why is he standing outside?
"Scientists" with no integrity/ethics: lying, scamming, faking out common people to get more money. When the bubble bursts, it will all be over.