Comments:
For sampling, keep in mind that users need to get a high % of top-model responses, or they will just leave. So it's not surprising that a few top models get sampled a lot more than the bottom ones; same for open-source vs. closed-source - the latter still scores higher, so it should be sampled more often to keep users engaged. Without the user engagement, the whole concept does not work.
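A toy sketch of what that could look like (my own illustration, not LMArena's actual sampling code; the model names, ratings, and temperature are all made up): weight pair selection by a softmax over current ratings, so the top models come up far more often.

```python
# Hypothetical engagement-weighted sampling: higher-rated models get
# drawn into battles more often. All numbers here are invented.
import math
import random

ratings = {"model_a": 1300, "model_b": 1250, "model_c": 1100, "model_d": 1000}

def sample_pair(ratings, temperature=100.0):
    """Sample two distinct models with probability ~ exp(rating / temperature)."""
    names = list(ratings)
    weights = [math.exp(ratings[n] / temperature) for n in names]
    first = random.choices(names, weights=weights, k=1)[0]
    rest = [n for n in names if n != first]
    rest_weights = [math.exp(ratings[n] / temperature) for n in rest]
    second = random.choices(rest, weights=rest_weights, k=1)[0]
    return first, second

print(sample_pair(ratings))
```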
Common Andreessen & Horowitz L
"US actors" are not mainly known for their outstanding level of "epistemic trust" at this point in time. 😆 So I am not very surprised that tilting the playing field is one step toward a temporary "Hollywood glamour" type of victory. All this might not matter very much a year from now, if the Arena's credibility hits rock bottom or human opinions are no longer needed in the process of evaluating LLMs. Manipulated voting is another aspect of the whole process; bots similar to those on social networks might be a great way to achieve great results. E.g. every time someone logs in to a social network, they also vote for a specific model (or update whatever stats make money for the provider), just by clicking the OK button.
Love the quality of thinking; content and production values are developing nicely. Holding the big microphone - not sure about that. Superb channel, thank you. 🙏👍
Come on though, the LMSYS arena scores roughly correlate with the benchmark scores, from SimpleBench to ARC Prize. It's not some wildly out-of-line scoring system. Yes, we can improve benchmarking, but no, this is not some conspiracy where large tech companies are gaming the system with poor models that are outcompeted by better models from relatively unknown sources.
Captain obvious being obvious
Another "non-profit" doing a 180 into full profit... the associated scandals and biased results already seemed to be leading up to it.
Which lens? Nice impressionist rendering.
It’s shocking to learn that results could have been faked just to climb the Chatbot Arena rankings. With billions of dollars at stake and such significant implications, this situation feels a lot like the wild early days of the AI boom.
Gemini merely displays a wider range of possible answers.
<3
It is not a problem; anyone using these LLMs in anger can script their own private benchmarks to test LLMs in the domain that they care about.
Claude, DeepSeek R1, Gemini, o3, o4 will do 99% of this work for you!
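Something like this is really all it takes. A minimal sketch, assuming you wire in your own provider's client; the model_call stub and the test cases below are placeholders:

```python
# A tiny private benchmark harness. Replace model_call with a real API
# call and the cases with prompts/answers from your own domain.
cases = [
    {"prompt": "Convert 2h 45m to minutes.", "expect": "165"},
    {"prompt": "Which HTTP status code means 'Too Many Requests'?", "expect": "429"},
]

def model_call(prompt: str) -> str:
    # Placeholder "model" so the sketch runs; swap in your LLM client here.
    canned = {"Convert 2h 45m to minutes.": "2h 45m is 165 minutes."}
    return canned.get(prompt, "I don't know.")

def run_benchmark(cases) -> float:
    hits = 0
    for case in cases:
        reply = model_call(case["prompt"])
        # Crude substring grading; use stricter parsing (or an LLM judge)
        # for free-form answers.
        hits += case["expect"] in reply
    return hits / len(cases)

print(f"accuracy: {run_benchmark(cases):.0%}")  # 50% with the canned stub
```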
One of the best ML (AI) channels based on substance and depth.
Fortunately, there is an objective way to test the ability of chatbots: letting them generate computer program code to solve certain problems with algorithms (DeepMind uses this for its AlphaEvolve AI system).
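A toy sketch of how that grading works (my own example; real harnesses of this kind also sandbox and time-limit execution): run the generated code against unit tests and count passes.

```python
# Toy objective grading: execute model-generated code in a scratch
# namespace and score it against unit tests. candidate_source is
# hard-coded here; in practice it would come from the model's reply.
candidate_source = """
def fizzbuzz(n):
    if n % 15 == 0: return "FizzBuzz"
    if n % 3 == 0: return "Fizz"
    if n % 5 == 0: return "Buzz"
    return str(n)
"""

tests = [(3, "Fizz"), (5, "Buzz"), (15, "FizzBuzz"), (7, "7")]

namespace = {}
exec(candidate_source, namespace)  # only do this in a sandbox for untrusted code
fizzbuzz = namespace["fizzbuzz"]

passed = sum(fizzbuzz(arg) == want for arg, want in tests)
print(f"passed {passed}/{len(tests)} unit tests")
```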
Cutthroat race for AI dominance? What do you mean exactly? This looks like an interesting set of words arranged with some specific weirdness.
Regarding your arena: do you think it is smart to blindly compare votes without annotating the aim of the voter? Are two votes perfectly equal in this arena? Nah, not so sure.
When you compare chess players, you compare people who share the exact same goal: to win. For this thing, the arena, I have my doubts.
Totally aligns with my experience of LM Arena scores: some sort of guide, but not a lot of meaning in the real world. It feels like there is a lot of overfitting to benchmarks, which then shows up as big improvements, but this is not being reflected in real-world usage. Are the models better than 12 months ago? For sure. But I personally don't see the massive improvements in the underlying models.
Wow, a Pareto distribution, what a surprise.
Constantly moving the goalposts, except here the goalposts are the metrics.
The moneybrained are capable of destroying anything of value.
Classic instance of xkcd 2899 😆
We have a serious problem.
You are missing the point here.
The business model of the free service is that the free users are the product. In this particular case, the users are providing training data to tune LLMs for user preference. There isn't anything wrong with this approach in principle; the free users receive a free service (getting their answers) and the entertainment of the leaderboard. Win-win. The fact that you cannot use this leaderboard for science is an insignificant side effect, important only to science professionals.
Are you intentionally adding those cellphone connection interference noises? Just about every other transition has them
Elo also does not work well in more complex video games where "skill" is not as deterministic, due to individual playstyle preferences that, by themselves, have some rock-paper-scissors advantages over one another. Elo does not account for this at all. Consider "scissors" players in a population where "rock" players are more common than "paper" players. Scissors has an innate disadvantage against rock and an advantage against paper, even assuming equal skill; this is just how those players prefer to engage with the game. Elo will make scissors players, who lose to the rock strategy, seem less skilled simply for being a stochastic minority against a preference choice that outscales them. Chatbot Arena very clearly has the exact same problem.
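A quick simulation backs this up (the win probabilities and the 5:2:3 population mix are invented purely for illustration): every player is equally skilled by construction, yet the minority style that loses to the most common style ends up rated lowest.

```python
# Non-transitive matchups break Elo: equally skilled players, but the
# style mix (rock most common) drags scissors players' ratings down.
import random

random.seed(0)

# P(row style beats column style) between equally skilled players.
P_WIN = {
    ("rock", "scissors"): 0.6, ("scissors", "rock"): 0.4,
    ("scissors", "paper"): 0.6, ("paper", "scissors"): 0.4,
    ("paper", "rock"): 0.6, ("rock", "paper"): 0.4,
}

players = ["rock"] * 5 + ["paper"] * 2 + ["scissors"] * 3
ratings = [1000.0] * len(players)
K = 16

def expected(ra, rb):
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))  # standard Elo expectation

for _ in range(50_000):
    a, b = random.sample(range(len(players)), 2)
    sa, sb = players[a], players[b]
    p = 0.5 if sa == sb else P_WIN[(sa, sb)]
    score_a = 1.0 if random.random() < p else 0.0
    ea = expected(ratings[a], ratings[b])  # pre-match ratings for both updates
    ratings[a] += K * (score_a - ea)
    ratings[b] += K * ((1.0 - score_a) - (1.0 - ea))

for style in ("rock", "paper", "scissors"):
    avg = sum(r for r, s in zip(ratings, players) if s == style) / players.count(style)
    print(f"{style}: {avg:.0f}")  # scissors lands lowest despite equal skill
```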
Go easy on David! He is in hospital, not (yet) post-economic 😮
Very revealing. Thank you.
Why do you have a pineapple in your logo?
First person to quote David Shapiro loses!!
;-/
The obvious solution is to ignore Chatbot Arena. Even if it were fair and well calibrated, it's kind of a dumb metric...
Lol, the David Shapiro effect; what about the Kook effect? Solid video!
Clearly far more research is needed into model developer alignment
It's been fairly apparent, to those of us who pay attention, that long-standing benchmarks are becoming worthless. Benchmarks have a shelf life now: make a new one, run the latest models through it (plus any that appear shortly afterwards), then throw it away and make another. It's the only way to get valuable information out of benchmarks.
Followed you for years, forgot to tell you how absolutely awesome you guys at MLST are. Thank you! 🙏 you are the best! Should have told you years ago...
I like these in-depth videos even more than interviews / conversations.
Fine-tuning for the arena is not "hacking" the arena.
My dude, the prompts are the same not because people write the same, but because they are botting their own platform to sell the fake content to Meta and others.
Hyped, like we knew it would be, to make it more than it was and to make a few people rich. Do you still believe in the AI paradigm that changes the planet? Still believe the doomers? Still believe the AI CEOs?
Props to Zuck for coming clean
Great breakdown and awesome visuals! ❤
Tbh, so much hype and so many snake oil salesmen in this field from its beginning. 😂
Let's not throw Zuck under the bus. This is exactly what needs to be done: hack the competitions, show their weaknesses. It forces better benchmarks.
Why is he standing outside?
"Scientists" with no integrity/ethics: lying, scamming, faking out common people to get more money. When the bubble bursts, it will all be over.