> The model, called VibeThinker-3B, scored 94.3 on AIME 2026 — the American Invitational Mathematics Examination, one of the most demanding standardized math competitions in the world. That figure places it alongside DeepSeek V3.2, a model with 671 billion parameters
Overfitting. No need to argue about anything, I think?
The rest of the article seems to be echoing people's misunderstanding of pretty elementary stuff.
That's the obvious answer, yes. But if they are doing it, why should anyone assume the competition isn't doing it?
If it is possible to cheat on the benchmarks used to judge AI performance, how can the general population be certain that any of the AI "innovation" is genuine? Is there true development here worth the many-billion-dollar investments, or are we seeing an industry-wide case of them doing a Theranos by faking the results and hoping they can do real innovation before anyone finds out?