Meta's new flagship AI language model Llama 4 arrived suddenly over the weekend, with the parent company of Facebook, Instagram, WhatsApp and Quest VR (among other services and products) revealing not one, not two, but three versions, all upgraded to be more powerful and performant using the popular Mixture-of-Experts (MoE) architecture and a new training method involving fixed hyperparameters, known as MetaP.
In addition, all three are equipped with massive context windows: the amount of information that an AI language model can handle in a single input/output exchange with a user or tool.
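For readers unfamiliar with the Mixture-of-Experts approach: an MoE layer routes each token through only a small subset of specialized "expert" sub-networks rather than through the full model, so only a fraction of the total parameters is active per token. The minimal PyTorch sketch below illustrates the general top-k routing idea; the class, dimensions, and expert count are illustrative assumptions for demonstration, not Meta's actual Llama 4 implementation.

```python
# Generic sketch of top-k Mixture-of-Experts routing.
# Not Llama 4's actual architecture; all sizes here are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts)
        # Each expert is an ordinary feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):
        # x: (num_tokens, d_model)
        weights = F.softmax(self.router(x), dim=-1)       # (tokens, n_experts)
        top_w, top_i = weights.topk(self.top_k, dim=-1)   # keep only the top-k experts per token
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)   # renormalize the kept weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, k] == e                   # tokens whose k-th pick is expert e
                if mask.any():
                    out[mask] += top_w[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

Because only the selected experts run for each token, an MoE model can carry a very large total parameter count while keeping per-token compute closer to that of a much smaller dense model, which is the trade-off Meta is leaning on here.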
But following the surprise announcement and public release on Saturday of two of those models for download and use, the lower-parameter Llama 4 Scout and the mid-tier Llama 4 Maverick, the response from the AI community on social media has been less than adoring.
Llama 4 sparks confusion and criticism among AI users
An unverified post on the North American Chinese-language community forum 1point3acres made its way over to the r/LocalLlama subreddit on Reddit, purporting to be from a researcher at Meta's GenAI group who claimed that the model performed poorly on third-party benchmarks internally and that company leadership "suggested blending test sets from various benchmarks during the post-training process, aiming to meet the targets across various metrics and produce a 'presentable' result."
But other users found reasons to doubt the benchmarks regardless.
Referencing the 10-million-token context window Meta boasted for Llama 4 Scout, AI PhD and author Andriy Burkov wrote on X, in part, that: "The declared 10M context is virtual because no model was trained on prompts longer than 256k tokens. This means that if you send more than 256k tokens to it, you will get low-quality output most of the time."
Also on the r/LocalLlama subreddit, user Dr_Karminski wrote that "I'm incredibly disappointed with Llama-4," and demonstrated its poor performance compared to DeepSeek's non-reasoning V3 model on coding tasks such as simulating balls bouncing around a heptagon.
Nathan Lambert, a former Meta researcher and now a Senior Research Scientist at AI2 (the Allen Institute for Artificial Intelligence), took to his Interconnects Substack blog on Monday to point out that a benchmark comparison Meta posted to its own Llama download site, pitting Llama 4 Maverick against other models on cost-to-performance using the third-party head-to-head comparison tool LMArena ELO (aka Chatbot Arena), actually used a different version of Llama 4 Maverick than the one the company had made publicly available: one "optimized for conversationality."
As Lambert wrote: “Sneaky. The results below are fake, and it is a major slight to Meta’s community to not release the model they used to create their major marketing push. We’ve seen many open models that come around to maximize on ChatBotArena while destroying the model’s performance on important skills like math or code.”
Lambert went on to note that while this particular model on the Arena was "tanking the technical reputation of the release because its character is juvenile," including lots of emojis and frivolous emotive conversation, "The actual model on other hosting providers is quite smart and has a reasonable tone!"
In response to the torrent of criticism and accusations of benchmark cooking, Meta's VP and Head of GenAI, Ahmad Al-Dahle, took to X to state:
"We're glad to start getting Llama 4 in all your hands. We're already hearing lots of great results people are getting with these models.

That said, we're also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were ready, we expect it'll take several days for all the public implementations to get dialed in. We'll keep working through our bug fixes and onboarding partners.

We've also heard claims that we trained on test sets; that's simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations.

We believe the Llama 4 models are a significant advancement and we're looking forward to working with the community to unlock their value."
Yet even that response was met with many complaints of poor performance and calls for further information, such as more technical documentation outlining the Llama 4 models and their training processes, as well as additional questions about why this release, compared to all prior Llama releases, was so riddled with issues.
It also comes on the heels of Meta's VP of Research, Joelle Pineau, who worked in the adjacent Meta Foundational Artificial Intelligence Research (FAIR) organization, announcing her departure from the company on LinkedIn last week with "nothing but admiration and deep gratitude for each of my managers." Pineau, it should be noted, also promoted the release of the Llama 4 model family this weekend.
Llama 4 continues to spread to other inference providers with mixed results, but it's safe to say the initial release of the model family has not been a slam dunk with the AI community.
And the upcoming Meta LlamaCon on April 29, the first celebration and gathering for third-party developers of the model family, will likely have much fodder for discussion. We'll be tracking it all, so stay tuned.