Goodbye, Llama? Meta launches new proprietary AI model Muse Spark, its first since Superintelligence Labs' formation

Meta has been one of the most interesting companies of the generative AI era. It initially gained a large and loyal following of users for the release of its mostly open source Llama family of large language models (LLMs) beginning in early 2023, but came to a screeching halt last year after Llama 4 debuted to mixed reviews and, ultimately, admissions of gaming benchmarks.

That bumpy rollout of Llama 4 apparently spurred Meta founder and CEO Mark Zuckerberg to completely overhaul Meta's AI operations in the summer of 2025, forming a new internal division, Meta Superintelligence Labs (MSL), which he recruited 29-year-old former Scale AI co-founder and CEO Alexandr Wang to lead as Chief AI Officer.

Now, today, Meta is showing us the fruits of that effort: Muse Spark, a new proprietary model that Wang says (posting on rival social network X, used more often by the machine learning community) is "the most powerful model that meta has released," and has "support for tool-use, visual chain of thought, & multi-agent orchestration." He also says it will be the start of a new Muse family of models, raising questions about what will become of Meta's popular lineup and ongoing development of the Llama family.

It arrives not as a generic chatbot, but as the foundation for what Wang calls "personal superintelligence": an AI that doesn't just process text but "sees and understands the world around you" to act as a digital extension of the self, echoing Zuckerberg's public manifesto for a vision of personal superintelligence published in summer 2025.

However, it is proprietary only, confined for now to the Meta AI app and website, as well as a "private API preview to select users," according to Meta's blog post announcing it. The move is likely to rankle the literally billions of users of Llama models and the thousands of developers who relied upon them (some of whom are active participants in rival social network Reddit's r/LocalLLaMA subreddit). In addition, no pricing information for the model has yet been announced.

It's unclear if Meta has ended development on the Llama family entirely. When asked directly by VentureBeat, a Meta spokesperson said in an email: "Our current Llama models will continue to be available as open source," which doesn't address the question of development of future Llama models.

Visual chain-of-thought

At its core, Muse Spark is a natively multimodal reasoning model. Unlike earlier iterations that "stitched" vision and text together, Muse Spark was rebuilt from the ground up to integrate visual information across its internal logic. This architectural shift enables "visual chain of thought," allowing the model to annotate dynamic environments: identifying the components of a complex espresso machine, for example, or correcting a user's yoga form via side-by-side video analysis.

The most significant technical leap, however, is a new "Contemplating" mode. This feature orchestrates multiple sub-agents to reason in parallel, allowing Meta to compete with extreme reasoning models like Google's Gemini Deep Think and OpenAI's GPT-5.4 Pro.

In benchmarks, this mode achieved 58% on "Humanity's Last Exam" and 38% on "FrontierScience Research," figures that Meta claims validate its new scaling trajectory.

Perhaps more impressive for the company's bottom line is the model's efficiency. Meta reports that Muse Spark achieves its reasoning capabilities using over an order of magnitude less compute than Llama 4 Maverick, its previous mid-size flagship. This efficiency is driven by a process called "thought compression." During reinforcement learning, the model is penalized for excessive "thinking time," forcing it to solve complex problems with fewer reasoning tokens without sacrificing accuracy.
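Meta has not published the exact reward used for "thought compression," so the following is only a minimal sketch of the general idea described above: a correctness reward reduced by a penalty on reasoning tokens beyond some budget. The function name, the token budget, and the penalty rate are all hypothetical values chosen for illustration.

```python
def length_penalized_reward(
    is_correct: bool,
    reasoning_tokens: int,
    token_budget: int = 2000,          # hypothetical budget, not from Meta
    penalty_per_token: float = 0.0001  # hypothetical penalty rate
) -> float:
    """Toy RL reward: full credit for a correct answer, minus a
    penalty for every reasoning token spent beyond the budget."""
    base_reward = 1.0 if is_correct else 0.0
    overage = max(0, reasoning_tokens - token_budget)
    return base_reward - penalty_per_token * overage


# Two correct answers: the shorter chain of thought earns more reward.
concise = length_penalized_reward(True, reasoning_tokens=1500)
verbose = length_penalized_reward(True, reasoning_tokens=6000)
print(f"concise={concise:.2f} verbose={verbose:.2f}")
```

Under a reward shaped like this, a policy that reaches the same answer in fewer reasoning tokens is preferred, which is the behavior the article attributes to the training process.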

Benchmarks reveal a return-to-form

The launch of Muse Spark is framed as a statistical "quantum leap," ending Meta's year-long absence from the absolute frontier of AI performance.

By reconciling Meta's official internal data with independent auditing from third-party LLM tracking firm Artificial Analysis, a clear picture emerges: Muse Spark is not just a marginal improvement over the Llama series; it is a fundamental re-entry into the "Top 5" global models.

According to the Artificial Analysis Intelligence Index v4.0, Muse Spark achieved a score of 52. For context, Meta's previous flagship, Llama 4 Maverick, debuted in 2025 with an Index score of just 18.

By nearly tripling that score, Muse Spark now sits within striking distance of the industry's most elite systems, trailing only Gemini 3.1 Pro Preview (57), GPT-5.4 (57), and Claude Opus 4.6 (53).

Meta's official benchmarks suggest that Muse Spark is particularly dominant in multimodal reasoning, specifically where visual figures and logic intersect.

CharXiv Reasoning: In "figure understanding," Muse Spark achieved a score of 86.4, outperforming Claude Opus 4.6 (65.3), Gemini 3.1 Pro (80.2), and GPT-5.4 (82.8).
MMMU Pro: Official reports place the model at 80.4, while Artificial Analysis's independent audit measured it at 80.5%. This makes it the second-most capable vision model on the market, surpassed only by Gemini 3.1 Pro Preview (83.9% official; 82.4% independent).
Visual Factuality (SimpleVQA): Muse Spark scored 71.3, placing it ahead of GPT-5.4 (61.1) and Grok 4.2 (57.4), though it narrowly trails Gemini 3.1 Pro (72.4).

These scores validate Meta's focus on "visual chain of thought," enabling the model not just to recognize objects, but to reason through complex spatial problems and dynamic annotations.

The "Thinking" gear of Muse Spark was put to the test against specialized benchmarks designed to break non-reasoning models.

Humanity's Last Exam (HLE): On this multidisciplinary evaluation, Meta reports a score of 42.8 (No Tools) and 50.4 (With Tools). Independent audits by Artificial Analysis tracked the model at 39.9%, trailing Gemini 3.1 Pro Preview (44.7%) and GPT-5.4 (41.6%).
GPQA Diamond (PhD Level Reasoning): Muse Spark achieved an impressive 89.5, surpassing Grok 4.2 (88.5) but trailing the specialized "max reasoning" outputs of Opus 4.6 (92.7) and Gemini 3.1 Pro (94.3).
ARC AGI 2: This remains a notable weak point. On these abstract reasoning puzzles, Muse Spark scored 42.5, far behind Gemini 3.1 Pro (76.5) and GPT-5.4 (76.1).
CritPT (Physics Research): Independent auditing found Muse Spark achieved the fifth highest score at 11%. That puts it ahead of Gemini 3 Flash (9%) and Claude 4.6 Sonnet (3%).

One of the most striking results from the official data is Muse Spark's performance in the health sector, likely a result of Meta's collaboration with over 1,000 physicians.

HealthBench Hard: Muse Spark achieved 42.8, a large lead over Claude Opus 4.6 (14.8) and Gemini 3.1 Pro (20.6), and a narrower edge over GPT-5.4 (40.1).
MedXpertQA (Multimodal): It scored 78.4, comfortably ahead of Opus 4.6 (64.8) and Grok 4.2 (65.8), though it still trails Gemini 3.1 Pro's top-tier score of 81.3.

Agentic Systems and Efficiency: The "Thought Compression" Effect

While Muse Spark excels at reasoning, its "agentic" performance (executing real-world work tasks) presents a more nuanced picture.

SWE-Bench Verified: Muse Spark scored 77.4, trailing Claude Opus 4.6 (80.8) and Gemini 3.1 Pro (80.6).
GDPval-AA Elo: Meta's official score of 1444 differs slightly from Artificial Analysis's recorded 1427. In both cases, Muse Spark trails GPT-5.4 (1672) and Opus 4.6 (1606), suggesting that while the model "thinks" well, it is still refining its ability to "act" in long-horizon software and office workflows.
Token Efficiency: This is where Muse Spark distinguishes itself. To run the Intelligence Index, it used 58 million output tokens. In contrast, Claude Opus 4.6 required 157 million tokens and GPT-5.4 required 120 million. This supports Meta's claim of "thought compression": delivering frontier-class intelligence while using less than half the "thinking time" of its closest competitors.
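The "less than half" claim can be checked directly from the token counts and Index scores quoted in this article. The short script below is only an illustrative calculation; the "tokens per index point" figure is a derived metric introduced here for comparison, not one reported by Meta or Artificial Analysis.

```python
# Figures quoted in the article for the Intelligence Index run.
OUTPUT_TOKENS_MILLIONS = {"Muse Spark": 58, "Claude Opus 4.6": 157, "GPT-5.4": 120}
INDEX_SCORES = {"Muse Spark": 52, "Claude Opus 4.6": 53, "GPT-5.4": 57}


def token_share(model: str, baseline: str) -> float:
    """Fraction of the baseline's output tokens that `model` used."""
    return OUTPUT_TOKENS_MILLIONS[model] / OUTPUT_TOKENS_MILLIONS[baseline]


def tokens_per_index_point(model: str) -> float:
    """Millions of output tokens per Index point (illustrative metric)."""
    return OUTPUT_TOKENS_MILLIONS[model] / INDEX_SCORES[model]


for rival in ("Claude Opus 4.6", "GPT-5.4"):
    share = token_share("Muse Spark", rival)
    print(f"Muse Spark used {share:.0%} of {rival}'s output tokens")

for name in OUTPUT_TOKENS_MILLIONS:
    print(f"{name}: {tokens_per_index_point(name):.2f}M tokens per index point")
```

Both ratios come out below 50%, which is consistent with the claim for these two competitors.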

Benchmark	Llama 4 Maverick (2025)	Muse Spark (Official)	Gemini 3.1 Pro (Official)
Intelligence Index Score	18	52	57
MMMU Pro	—	80.4	83.9
CharXiv Reasoning	—	86.4	80.2
HealthBench Hard	—	42.8	20.6
License	Open-Weights	Proprietary	Proprietary

With Muse Spark, Meta has transitioned from being the "LAMP stack for AI" to a direct challenger for the title of "Personal Superintelligence." While agentic workflows remain a hurdle, its strength in vision, health, and token efficiency places Meta back at the center of the frontier race.

Personal wellness and Instagram shopping

Meta is immediately deploying Muse Spark to power specialized experiences across its app family.

Shopping Mode: A new feature that leverages Meta's vast creator ecosystem. The AI picks up on brands, styling choices, and content across Instagram and Threads to provide personalized recommendations, effectively turning every post into a shoppable interaction.
Health Reasoning: In a move toward medical utility, Meta collaborated with over 1,000 physicians to curate training data. Muse Spark can now analyze nutritional content from photos of food or provide "health scores" tailored to a user's circumstances, such as a pescatarian diet combined with high cholesterol.
Interactive UI: The model can generate web-based minigames or tutorials on the fly. For example, a user can prompt the AI to turn a photo into a playable Sudoku game or a highlights-based tutorial for home appliances.

Evaluation awareness

While Muse Spark demonstrates strong refusal behaviors regarding biological and chemical weapons, its safety profile includes a startling new discovery. Third-party testing by Apollo Research found that the model possesses a high degree of "evaluation awareness."

The model frequently recognized when it was being tested in "alignment traps" and reasoned that it should behave honestly specifically because it was under evaluation.

While Meta concluded this was not a "blocking concern" for release, the finding suggests that frontier models are becoming increasingly "conscious" of the testing environment, potentially rendering traditional safety benchmarks less reliable as models learn to "game" the exam.

What happens to Llama?

In February 2023, Meta released Llama 1 to demonstrate that smaller, compute-optimal models could match larger counterparts like GPT-3 in efficiency. Although access was initially restricted to researchers, the model weights were leaked via 4chan on March 3, 2023, an event that inadvertently democratized high-tier research and catalyzed a global movement for running models on consumer-grade hardware.

This shift was solidified in July 2023 with the release of Llama 2, which introduced a commercial license that permitted self-hosting for most organizations. This approach saw rapid adoption, with the Llama family exceeding 100 million downloads and supporting over 1,000 commercial applications by the third quarter of 2023.

Through 2024 and 2025, Meta scaled the Llama family to establish it as the essential infrastructure for global enterprise AI, frequently referred to as the LAMP stack for AI. Following the launch of Llama 3 in April 2024 and the landmark Llama 3.1 405B in July, Meta achieved performance parity with the world's leading proprietary systems.

The subsequent release of Llama 4 in April 2025 introduced a Mixture-of-Experts architecture, allowing for massive parameter scaling while maintaining fast inference speeds. By early 2026, the Llama ecosystem reached a staggering scale, totaling 1.2 billion downloads and averaging roughly one million downloads per day.
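To see why a Mixture-of-Experts design can add parameters without slowing inference proportionally, consider the toy routing sketch below. It is a generic top-k router written for illustration only; the number of experts, the value of k, and the scalar "experts" are assumptions and do not describe Llama 4's actual configuration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(token, experts, router_logits, k=2):
    """Evaluate only the selected experts; the others are never called."""
    return sum(w * experts[i](token) for i, w in route_top_k(router_logits, k))


# Toy setup: 8 hypothetical experts, each just scales its input.
experts = [lambda x, scale=scale: scale * x for scale in range(1, 9)]
router_logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]

print(route_top_k(router_logits, k=2))
print(moe_forward(1.0, experts, router_logits, k=2))
```

Here only 2 of the 8 experts run for the token, so per-token compute scales with k rather than with the total number of experts, even though all 8 contribute to the model's parameter count.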

This widespread adoption provided businesses with significant economic sovereignty, as self-hosting Llama models offered an 88% cost reduction compared to using proprietary API providers.

As of April 2026, Meta is no longer the undisputed leader of the open-weight movement, which has become a highly contested, multi-polar landscape characterized by the rise of international competitors.

While the United States accounts for 35% of global Llama deployments, Chinese models from labs like Alibaba and DeepSeek began accounting for 41% of downloads on platforms like Hugging Face by late 2025. Throughout early 2026, new entrants such as Zhipu AI's GLM-5 and Alibaba's Qwen 3.6 Plus have outpaced Llama 4 Maverick on general knowledge and coding benchmarks.

In response to this global pressure, Meta's Muse Spark arrives with hefty expectations and an open source legacy that will be tough to live up to.

Proprietary only (for now)

The launch marks a controversial departure from Meta AI's "open science" roots. While the Llama series was famously accessible to developers, Muse Spark is launching as a proprietary model.

Wang addressed the shift on X, stating: "9 months ago we rebuilt our ai stack from scratch. New infrastructure, new architecture, new data pipelines… This is step one. Bigger models are already in development with plans to open-source future versions."

However, the developer community remains skeptical. Some see this as a necessary pivot after the Llama 4 series failed to achieve expected developer traction; others view it as Meta "closing the gates" now that it has a competitive reasoning model.

Wang himself acknowledged the transition's difficulty, noting there are "certainly rough edges we'll polish over time."

For the three billion people using Meta's apps, the change will be felt almost immediately. The AI they interact with is no longer just a library of information, but an agent with a $27 billion brain and a mandate to understand their world as intimately as they do.
