Attention ISN'T all you need?! New Qwen3 variant Brumby-14B-Base leverages Power Retention technique

Technology

Last updated: November 4, 2025 9:54 pm
Editorial Board | Published November 4, 2025

When the transformer architecture was introduced in 2017 in the now-seminal Google paper "Attention Is All You Need," it became an instant cornerstone of modern artificial intelligence.

Every major large language model (LLM) — from OpenAI's GPT series to Anthropic's Claude, Google's Gemini, and Meta's Llama — has been built on some variation of its central mechanism: attention, the mathematical operation that lets a model look back across its entire input and decide which information matters most.

Eight years later, the same mechanism that defined AI's golden age is showing its limits. Attention is powerful, but it is also expensive — its computational and memory costs scale quadratically with context length, creating an increasingly unsustainable bottleneck for both research and industry. As models aim to reason across documents, codebases, or video streams spanning hours or days, attention becomes the architecture's Achilles' heel.

On October 28, 2025, the little-known AI startup Manifest AI released a radical alternative. Its new model, Brumby-14B-Base, is a retrained variant of Qwen3-14B-Base, one of the leading open-source transformer models.

But while many variants of Qwen have been trained already, Brumby-14B-Base is novel in that it abandons attention altogether.

Instead, Brumby replaces those layers with a novel mechanism called Power Retention — a recurrent, hardware-efficient architecture that stores and updates information over arbitrarily long contexts without the quadratic memory growth of attention.

Trained at a stated cost of just $4,000, the 14-billion-parameter Brumby model performs on par with established transformer models such as Qwen3-14B and GLM-4.5-Air, achieving near-state-of-the-art accuracy on a range of reasoning and comprehension benchmarks.

From Attention to Retention: The Architectural Shift

The core of Manifest AI's innovation lies in what it calls the Power Retention layer.

In a standard transformer, every token computes a set of queries (Q), keys (K), and values (V), then performs a matrix operation that measures the similarity between each token and every other token — essentially a full pairwise comparison across the sequence.

That is what gives attention its flexibility, but also what makes it so costly: processing a sequence twice as long takes roughly four times the compute and memory.
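
For readers who want to see where the quadratic cost comes from, here is a minimal sketch of standard scaled dot-product attention in PyTorch (a generic single-head illustration, not Manifest AI's or Qwen3's actual implementation):

```python
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d) -- a single head, no masking, for clarity
    scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)  # (batch, seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)                  # every token compared with every other token
    return weights @ v                                       # (batch, seq_len, d)

# The (seq_len x seq_len) score matrix is the quadratic bottleneck:
# doubling seq_len roughly quadruples both compute and memory.
q = k = v = torch.randn(1, 1024, 64)
out = scaled_dot_product_attention(q, k, v)
```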

Power Retention keeps the same inputs (Q, K, V), but replaces the global similarity operation with a recurrent state update.

Each layer maintains a memory matrix S, which is updated at every time step based on the incoming key, value, and a learned gating signal.

The result looks more like an RNN (recurrent neural network) than a transformer: instead of recomputing attention over the entire context, the model repeatedly compresses past information into a fixed-size latent state.

This means the computational cost of Power Retention does not grow with context length. Whether the model is processing 1,000 or 1,000,000 tokens, the per-token cost stays constant.

That property alone — constant-time per-token computation — marks a profound departure from transformer behavior.

At the same time, Power Retention preserves the expressive power that made attention successful. Because the recurrence involves tensor powers of the input (hence the name "power retention"), it can represent higher-order dependencies between past and present tokens.

The result is an architecture that can theoretically retain long-term dependencies indefinitely, while remaining as efficient as an RNN and as expressive as a transformer.
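
To make the contrast with the attention sketch above concrete, here is a toy gated recurrence in the same style. It only illustrates the general idea described here — the fixed-size state, the gated update, and the degree-2 "power" feature map are assumptions for the sketch, not Manifest AI's published formulation or kernels:

```python
import torch

def power_feature(k):
    # Toy stand-in for a tensor-power feature map: the degree-2 outer product
    # of the key with itself, flattened. (An assumption for illustration only.)
    return torch.einsum("bd,be->bde", k, k).flatten(1)          # (batch, d*d)

def retention_step(S, k, v, gate):
    # S: (batch, d*d, d) fixed-size memory; k, v: (batch, d); gate: (batch, 1)
    return gate.unsqueeze(-1) * S + torch.einsum("bf,bd->bfd", power_feature(k), v)

def retention_read(S, q):
    # Per-token read cost depends only on the state size, not on how many
    # tokens have already been absorbed into S.
    return torch.einsum("bf,bfd->bd", power_feature(q), S)      # (batch, d)

batch, d = 1, 16
S = torch.zeros(batch, d * d, d)
for _ in range(1000):                                           # 1,000 or 1,000,000 steps: same per-step cost
    k, v, q = torch.randn(batch, d), torch.randn(batch, d), torch.randn(batch, d)
    gate = torch.sigmoid(torch.randn(batch, 1))                 # stand-in for a learned gating signal
    S = retention_step(S, k, v, gate)
    y = retention_read(S, q)
```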

Retraining, Not Rebuilding

Perhaps the most striking aspect of Brumby-14B's training process is its efficiency. Manifest AI trained the model for only 60 hours on 32 Nvidia H100 GPUs, at a cost of roughly $4,000 — less than 2% of what a conventional model of this scale would cost to train from scratch.
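
Those figures are internally consistent: 60 hours on 32 GPUs works out to 1,920 H100 GPU-hours, and $4,000 / 1,920 ≈ $2.08 per GPU-hour, roughly in line with commodity H100 rental rates (the per-hour figure is inferred from the article's numbers, not a price Manifest AI quoted).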

Still, because it relied on an existing transformer model, it is safe to say that this advance alone will not end the transformer era of AI.

As Jacob Buckman, founder of Manifest AI, clarified in an email to VentureBeat: "The ability to train for $4,000 is indeed only possible when leveraging an existing transformer model," he said. "Brumby could not be trained from scratch for that price."

Nonetheless, Buckman emphasized the significance of that result: "The reason this is important is that the ability to build on the weights of the previous generation of model architectures is a critical accelerant for the adoption of a new modeling paradigm."

He argues this demonstrates how attention-free methods can catch up to transformer performance "for orders-of-magnitude less" investment.

In the loss curves released by Manifest AI, Brumby's training loss quickly converges to that of the Qwen3 baseline within 3,000 training steps, even as the architecture diverges significantly from its transformer origins.

Although Brumby-14B-Base began life as Qwen3-14B-Base, it did not stay identical for long. Manifest AI fundamentally altered Qwen3's architecture by removing its attention layers — the mathematical engine that defines how a transformer model processes information — and replacing them with the new "power retention" mechanism. This change restructured the model's internal wiring, effectively giving it a new brain while preserving much of its prior knowledge.

Because of that architectural swap, the existing Qwen3 weights no longer fit perfectly. They had been trained to operate within a transformer's attention dynamics, not the new retention-based system. As a result, the Brumby model initially "forgot" how to apply some of its learned knowledge effectively. The retraining process — about 3,000 steps of additional learning — served to recalibrate those weights, aligning them with the power retention framework without having to start from zero.

A helpful way to think about this is to imagine taking a world-class pianist and handing them a guitar. They already understand rhythm, harmony, and melody, but their hands must learn entirely new patterns to produce the same music. Similarly, Brumby had to relearn how to apply its existing knowledge through a new computational instrument. Those 3,000 training steps were, in effect, its crash course in guitar lessons.

By the end of this short retraining phase, Brumby had regained its full performance, reaching the same accuracy as the original Qwen3 model. That rapid recovery is what makes the result so significant: it shows that an attention-free system can inherit and adapt the capabilities of a transformer model with only a fraction of the training time and cost.
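
In code terms, that recalibration phase amounts to ordinary continued pretraining from the modified checkpoint. The outline below is a hypothetical sketch under that assumption — it presumes a Hugging Face-style causal-LM interface and a model whose attention layers have already been swapped for retention layers (see the snippet in the Integration section below); none of the names come from Manifest AI's code:

```python
import torch

def recalibrate(model, dataloader, steps=3000, lr=1e-5):
    # `model`: the Qwen3-14B-Base checkpoint with attention layers replaced by
    # retention layers; all other pretrained weights are kept as-is.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _, batch in zip(range(steps), dataloader):
        # Standard next-token (causal LM) objective: the new retention layers
        # learn to work with the knowledge already stored in the old weights.
        loss = model(input_ids=batch["input_ids"], labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return model
```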

The benchmark progression plots show a similar trend: the model rapidly approaches its target accuracy on core evaluations such as GSM8K, HellaSwag, and MMLU after just a few thousand steps, matching or even slightly surpassing Qwen3 on several tasks.

Benchmarking the Brumby

Across standard evaluation tasks, Brumby-14B-Base consistently performs at or near parity with transformer baselines of comparable scale.

Task               Brumby-14B   Qwen3-14B   GLM-4.5-Air   Nemotron Nano (12B)
ARC                0.89         0.94        0.92          0.93
GSM8K              0.88         0.84        0.83          0.84
GSM8K (Platinum)   0.87         0.88        0.85          0.87
HellaSwag          0.77         0.81        0.85          0.82
MATH               0.62         0.54        0.47          0.26
MBPP               0.57         0.75        0.73          0.71
MMLU               0.71         0.78        0.77          0.78
MMLU-Pro           0.36         0.55        0.51          0.53

While it lags slightly behind transformers on knowledge-heavy evaluations such as MMLU-Pro, it matches or outperforms them on mathematical reasoning and long-context reasoning tasks — precisely where attention architectures tend to falter. This pattern reinforces the idea that recurrent or retention-based systems may hold a structural advantage for reasoning over extended temporal or logical dependencies.

Hardware Efficiency and Inference Performance

Brumby's power retention design offers another major advantage: hardware efficiency.

Because the state update involves only local matrix operations, inference can be carried out with linear complexity in sequence length.

Manifest AI reports that its fastest kernels, developed through its in-house CUDA framework Vidrial, can deliver hundreds-fold speedups over attention on very long contexts.

Buckman said the alpha-stage Power Retention kernels "achieve typical hardware utilization of 80–85%, which is higher than FlashAttention2's 70–75% or Mamba's 50–60%."

(Mamba is another emerging "post-transformer" architecture, developed by Carnegie Mellon scientists back in 2023, that, like Power Retention, seeks to eliminate the computational bottleneck of attention. It replaces attention with a state-space mechanism that processes sequences linearly — updating an internal state over time rather than comparing every token to every other one. This makes it far more efficient for long inputs, though it typically achieves lower hardware utilization than Power Retention in early tests.)

Both Power Retention and Mamba, he added, "expend meaningfully fewer total FLOPs than FlashAttention2 on long contexts, as well as far less memory."

According to Buckman, the reported 100× speedup comes from this combined improvement in utilization and computational efficiency, though he noted that "we have not yet stress-tested it on production-scale workloads."

Training and Scaling Economics

Perhaps no statistic in the Brumby release generated more attention than the training cost.

A 14-billion-parameter model, trained for $4,000, represents a two-order-of-magnitude reduction in the cost of foundation model development.

Buckman confirmed that the low cost reflects a broader scaling pattern. "Far from diminishing returns, we have found that ease of retraining improves with scale," he said. "The number of steps required to successfully retrain a model decreases with its parameter count."

Manifest has not yet validated the cost of retraining models at 700B parameters, but Buckman projected a range of $10,000–$20,000 for models of that magnitude — still far below transformer training budgets.

He also reiterated that this approach could democratize large-scale experimentation by allowing smaller research groups or companies to retrain or repurpose existing transformer checkpoints without prohibitive compute costs.

Integration and Deployment

According to Buckman, converting an existing transformer into a Power Retention model is designed to be simple.

"It is straightforward for any company that is already retraining, post-training, or fine-tuning open-source models," he said. "Simply pip install retention, change one line of your architecture code, and resume training where you left off."

He added that after only a small number of GPU-hours, the model typically recovers its original performance — at which point it gains the efficiency benefits of the attention-free design.
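
In practice, that "one line" change would presumably look something like the sketch below. Only the pip install retention package name comes from Buckman's quote; the PowerRetention module name and its constructor arguments are hypothetical stand-ins for illustration:

```python
import torch.nn as nn
# from retention import PowerRetention   # hypothetical import from the `retention` package

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Before: a standard self-attention mixer.
        self.mixer = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # After (the "one line" change; class name and arguments are assumed):
        # self.mixer = PowerRetention(d_model, n_heads)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        h = self.norm1(x)
        # nn.MultiheadAttention returns (output, weights); a retention mixer
        # would presumably return just the mixed sequence.
        mixed, _ = self.mixer(h, h, h, need_weights=False)
        x = x + mixed
        return x + self.mlp(self.norm2(x))
```

The rest of the pretrained checkpoint (norms, MLPs, embeddings) is kept, which is what makes the short recalibration described above possible.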

"The resulting architecture will permit far faster long-context training and inference than previously," Buckman noted.

On infrastructure, Buckman said the first Brumby kernels are written in Triton, compatible with both NVIDIA and AMD accelerators. Specialized CUDA kernels are also available through the team's in-house Vidrial framework. Integration with vLLM and other inference engines remains a work in progress: "We have not yet integrated Power Retention into inference engines, but doing so is a major ongoing initiative at Manifest."

As for distributed inference, Buckman dismissed concerns about instability: "We have not found this difficulty to be exacerbated in any way by our recurrent-state architecture. In fact, context-parallel training and GPU partitioning for multi-user inference both become significantly cleaner technically when using our approach."

Mission and Long-Term Vision

Beyond the engineering details, Buckman also described Manifest's broader mission. "Our mission is to train a neural network to model all human output," he said.

The team's goal, he explained, is to move beyond modeling "artifacts of intelligence" toward modeling "the intelligent processes that generated them." This shift, he argued, requires "fundamentally rethinking" how models are designed and trained — work of which Power Retention represents only the beginning.

The Brumby-14B release, he said, is "one step forward in a long march" toward architectures that can model thought processes continuously and efficiently.

Public Debate and Industry Reception

The launch of Brumby-14B sparked immediate discussion on X (formerly Twitter), where researchers debated the framing of Manifest AI's announcement.

Some, including Meta researcher Ariel (@redtachyon), argued that the "$4,000 foundation model" tagline was misleading, since the training involved reusing pretrained transformer weights rather than training from scratch.

"They shuffled around the weights of Qwen, fine-tuned it a bit, and called it 'training a foundation model for $4k,'" Ariel wrote.

Buckman responded publicly, clarifying that the initial tweet had been part of a longer thread explaining the retraining approach. "It's not like I was being deceptive about it," he wrote. "I broke it up into separate tweets, and now everyone is mad about the first one."

In a follow-up email, Buckman took a measured view of the controversy. "The end of the transformer era is not yet here," he reiterated, "but the march has begun."

He also acknowledged that the $4,000 claim, though technically accurate in context, had drawn attention precisely because it challenged expectations about what it costs to experiment at frontier scale.

Conclusion: A Crack in the Transformer's Wall?

The release of Brumby-14B-Base is more than an engineering milestone; it is a proof of concept that the transformer's dominance may finally face credible competition.

By replacing attention with power retention, Manifest AI has demonstrated that performance parity with state-of-the-art transformers is possible at a fraction of the computational cost — and that the long-context bottleneck can be broken without exotic hardware.

The broader implications are twofold. First, the economics of training and serving large models could shift dramatically, lowering the barrier to entry for open research and smaller organizations.

Second, the architectural diversity of AI models may grow again, reigniting theoretical and empirical exploration after half a decade of transformer monoculture.

As Buckman put it: “The end of the transformer era is not yet here. Our release is just one step forward in a long march toward the future.”
