Meta recently unveiled its Llama 4 series of AI models, drawing attention for outperforming GPT-4o and Gemini 2.0 Pro on Chatbot Arena (formerly LMSYS). The spotlight was on Llama 4 Maverick, an MoE (Mixture of Experts) model that activates only 17 billion of its 400 billion total parameters, which are spread across 128 experts. According to Meta, Maverick achieved an impressive Elo score of 1,417 on the Chatbot Arena leaderboard.
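For readers unfamiliar with the architecture, the sketch below shows how a sparse MoE layer keeps most of its parameters inactive on any given token: a router scores all experts and only the top-scoring ones run. This is a minimal, hypothetical illustration, not Meta’s implementation; the class name, dimensions, and routing details are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class SparseMoELayer(nn.Module):
    """Toy top-k MoE layer: a router picks a few experts per token,
    so only a small fraction of the layer's parameters is active at once."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048,
                 n_experts: int = 128, top_k: int = 1):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # routing scores, one per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); choose the top_k experts for each token
        weights, idx = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in idx[:, k].unique().tolist():      # run only the selected experts
                mask = idx[:, k] == e
                out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out

# Usage: 4 tokens pass through a layer with 8 experts, but only 1 expert runs per token.
layer = SparseMoELayer(d_model=16, d_ff=32, n_experts=8, top_k=1)
print(layer(torch.randn(4, 16)).shape)  # torch.Size([4, 16])
```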
However, the AI community was quick to question the results, as this relatively smaller MoE model appeared to outperform heavyweight LLMs like GPT-4.5 and Grok 3. Curious about the claims, many began testing the model independently, only to find that its real-world performance, especially in coding tasks, didn’t align with Meta’s benchmark figures.
A major revelation surfaced on 1Point3Acres, a forum well known among the Chinese community in North America. A user claiming to be a former Meta employee alleged that Meta’s leadership had manipulated benchmark results. According to the post, later translated into English on Reddit, the company reportedly included “test sets of various benchmarks in the post-training process” to artificially boost Llama 4’s benchmark scores and meet internal performance targets.
The former Meta employee considered the alleged benchmark manipulation unethical and decided to resign. They also requested that their name be removed from the Llama 4 technical report. Additionally, the user suggested that the recent departure of Joelle Pineau, Meta’s Head of AI Research, was directly tied to the controversy surrounding the Llama 4 benchmark results.
In light of the escalating allegations, Ahmad Al-Dahle, head of Meta’s Generative AI division, responded with a post on X. He strongly denied the accusations, stating that Llama 4 was not post-trained on any benchmark test sets. Al-Dahle wrote:
We’ve also heard claims that we trained on test sets — that’s simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations.
Al-Dahle acknowledged that Llama 4’s performance may vary across different platforms, attributing it to ongoing implementation adjustments. He urged the AI community to be patient, asking for a few days to allow the deployment to get properly “dialed in.”
LMSYS Addresses Allegations of Llama 4 Benchmark Manipulation
In response to growing concerns from the AI community, LMSYS, the team behind the Chatbot Arena leaderboard, released a statement aimed at promoting greater transparency. They clarified that the model submitted by Meta was labeled “Llama-4-Maverick-03-26-Experimental,” a custom variant specifically tuned for human preference.
LMSYS acknowledged that factors like response tone and style played a significant role in user rankings, potentially giving this variant an unfair edge. They also admitted that Meta did not clearly disclose this customization. Furthermore, LMSYS stated, “Meta’s interpretation of our policy did not match what we expect from model providers.”
To be fair, Meta did note in its official Llama 4 blog that “an experimental chat version” achieved a score of 1,417 on Chatbot Arena but stopped short of providing any further details or context.
In a move to enhance transparency, LMSYS has since added the Hugging Face version of Llama 4 Maverick to Chatbot Arena. Additionally, it released over 2,000 head-to-head battle results for public review, including the prompts, model outputs, and user preferences.
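To make the leaderboard mechanics concrete, here is a minimal sketch of how pairwise battle outcomes like those in the released data can be turned into Elo-style ratings. LMSYS’s actual pipeline uses a more sophisticated statistical model, and the battle records and label values below are hypothetical.

```python
def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One online Elo update from a single head-to-head battle.
    score_a is 1.0 if model A won, 0.0 if it lost, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    return r_a + k * (score_a - expected_a), r_b - k * (score_a - expected_a)

# Hypothetical battle records shaped like the released data: (model_a, model_b, winner).
battles = [
    ("llama-4-maverick-experimental", "model-x", "model_a"),
    ("llama-4-maverick-experimental", "model-y", "model_b"),
    ("model-x", "model-y", "tie"),
]

ratings: dict[str, float] = {}
for a, b, winner in battles:
    r_a, r_b = ratings.get(a, 1000.0), ratings.get(b, 1000.0)
    score_a = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
    ratings[a], ratings[b] = update_elo(r_a, r_b, score_a)

print(ratings)  # higher ratings mean a model won its matchups more often than expected
```

Because the rating depends only on which response users prefer, anything that sways preferences, including tone and formatting, feeds directly into the score.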
A review of the battle data shows users consistently favoring Llama 4’s responses even when they were inaccurate or overly verbose. This raises broader concerns about the reliability of community-driven benchmarks like Chatbot Arena and the influence of response style on perceived performance.
Not the First Time Meta Has Been Accused of Gaming Benchmarks
This isn’t the first time Meta has faced accusations of gaming benchmarks through data contamination, specifically by including benchmark datasets in the training corpus. In February of this year, Susan Zhang, a former Meta AI researcher now at Google DeepMind, shared a revealing study in response to a post by Yann LeCun, Meta AI’s Chief Scientist.
The study revealed that more than 50% of test samples from major benchmarks were found in the pretraining data for Meta’s Llama 1. Specifically, datasets like Big Bench Hard, HumanEval, HellaSwag, MMLU, PiQA, and TriviaQA showed significant levels of contamination across both corpora.
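As a rough illustration of how such contamination analyses work, the sketch below flags a benchmark sample when a large fraction of its word n-grams appears verbatim in a training document. The n-gram size, threshold, and helper names are assumptions for illustration, not the methodology of the study Zhang cited.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercased word n-grams, a common unit for overlap-based contamination checks."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_sample: str, training_doc: str,
                    n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a test sample if a large share of its n-grams appears verbatim in a training document."""
    sample_grams = ngrams(test_sample, n)
    if not sample_grams:
        return False
    overlap = len(sample_grams & ngrams(training_doc, n))
    return overlap / len(sample_grams) >= threshold

# Toy usage: a benchmark question copied into the training corpus gets flagged.
doc = "The quick brown fox jumps over the lazy dog while the cat watches from the fence nearby"
sample = "quick brown fox jumps over the lazy dog while the cat watches"
print(is_contaminated(sample, doc))  # True
```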
In light of the recent Llama 4 benchmark manipulation allegations, Susan Zhang took a sarcastic jab at Meta, suggesting the company should at least cite its “previous work” from Llama 1 for this “unique approach.” Her comment implies that benchmark manipulation isn’t just a mishap but a deliberate strategy by Meta under Mark Zuckerberg’s leadership to artificially inflate model performance.