Inference is splitting in two — Nvidia's $20B Groq bet explains its next act

Last updated: January 3, 2026 2:09 am
By Editorial Board | Published January 3, 2026

Nvidia's $20 billion strategic licensing deal with Groq represents one of the first clear moves in a four-front battle over the future AI stack. 2026 is when that battle becomes apparent to enterprise builders.

For the technical decision-makers we speak to every day — the people building the AI applications and the data pipelines that drive them — this deal is a signal that the era of the one-size-fits-all GPU as the default AI inference answer is ending.

We're entering the age of the Disaggregated Inference Architecture, where the silicon itself is being split into two different types to accommodate a world that demands both huge context and instantaneous reasoning.

Why inference is breaking the GPU architecture in two

To understand why Nvidia CEO Jensen Huang dropped one-third of his reported $60 billion cash pile on a licensing deal, you have to look at the existential threats converging on his company's reported 92% market share.

The industry reached a tipping point in late 2025: For the first time, inference — the phase where trained models actually run — surpassed training in total data center revenue, according to Deloitte. In this new "Inference Flip," the metrics have changed. While accuracy remains the baseline, the battle is now being fought over latency and the ability to maintain "state" in autonomous agents.

There are four fronts to that battle, and each front points to the same conclusion: Inference workloads are fragmenting faster than GPUs can generalize.

1. Breaking the GPU in two: Prefill vs. decode

Gavin Baker, an investor in Groq (and therefore biased, but also unusually fluent on the architecture), summarized the core driver of the Groq deal cleanly: "Inference is disaggregating into prefill and decode."

Prefill and decode are two distinct phases:

The prefill phase: Think of this as the user's "prompt" stage. The model must ingest huge amounts of data — whether it's a 100,000-line codebase or an hour of video — and compute a contextual understanding. This is "compute-bound," requiring massive matrix multiplication that Nvidia's GPUs are historically excellent at.

The generation (decode) phase: This is the actual token-by-token "generation." Once the prompt is ingested, the model generates one word (or token) at a time, feeding each one back into the system to predict the next. This is "memory-bandwidth bound." If the data can't move from the memory to the processor fast enough, the model stutters, no matter how powerful the GPU is. (This is where Nvidia has been weak, and where Groq's special language processing unit (LPU) and its associated SRAM memory shine. More on that in a bit.)
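
To make the split concrete, here is a minimal sketch of the two phases as they appear in a typical autoregressive inference loop. The names (model.prefill, model.decode_step, the kv_cache object) are illustrative assumptions rather than any vendor's actual API; the point is that prefill is one large parallel pass over the prompt, while decode is a serial loop whose speed is gated by how quickly weights and cache can be streamed from memory.

```python
# Illustrative sketch of the two inference phases. The model/tokenizer methods
# used here (prefill, decode_step) are hypothetical, not a real library API.

def generate(model, tokenizer, prompt: str, max_new_tokens: int = 256) -> str:
    prompt_ids = tokenizer.encode(prompt)

    # PREFILL: ingest the entire prompt in one parallel pass (compute-bound,
    # dominated by large matrix multiplications) and build the KV cache.
    kv_cache, logits = model.prefill(prompt_ids)

    # DECODE: emit one token at a time (memory-bandwidth-bound, because every
    # step re-reads the model weights and the growing KV cache).
    output_ids = []
    next_id = logits.argmax()
    for _ in range(max_new_tokens):
        output_ids.append(next_id)
        if next_id == tokenizer.eos_token_id:
            break
        # Feed the previous token back in and extend the cache by one entry.
        kv_cache, logits = model.decode_step(next_id, kv_cache)
        next_id = logits.argmax()

    return tokenizer.decode(output_ids)
```

In a disaggregated deployment, the two halves of this loop can land on different silicon: prefill on a compute-dense part such as the Rubin CPX, decode on a bandwidth-optimized part such as a Groq-style LPU, with the KV cache handed off between them.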

Nvidia has announced an upcoming Vera Rubin family of chips that it’s architecting specifically to handle this split. The Rubin CPX component of this family is the designated "prefill" workhorse, optimized for massive context windows of 1 million tokens or more. To handle this scale affordably, it moves away from the eye-watering expense of high bandwidth memory (HBM) — Nvidia’s current gold-standard memory that sits right next to the GPU die — and instead utilizes 128GB of a new kind of memory, GDDR7. While HBM provides extreme speed (though not as quick as Groq’s static random-access memory (SRAM)), its supply on GPUs is limited and its cost is a barrier to scale; GDDR7 provides a more cost-effective way to ingest massive datasets.

Meanwhile, the "Groq-flavored" silicon, which Nvidia is integrating into its inference roadmap, will serve as the high-speed "decode" engine. This is about neutralizing a threat from alternative architectures like Google's TPUs and maintaining the dominance of CUDA, Nvidia’s software ecosystem that has served as its primary moat for over a decade.

All of this was enough for Baker, the Groq investor, to predict that Nvidia’s move to license Groq will cause all other specialized AI chips to be canceled — that is, outside of Google’s TPU, Tesla’s AI5, and AWS’s Trainium.

2. The differentiated power of SRAM

At the heart of Groq’s technology is SRAM. Unlike the DRAM found in your PC or the HBM on an Nvidia H100 GPU, SRAM is etched directly into the logic of the processor.

Michael Stewart, managing partner of Microsoft's venture fund, M12, describes SRAM as the best option for moving data over short distances with minimal energy. "The energy to move a bit in SRAM is like 0.1 picojoules or less," Stewart said. "To move it between DRAM and the processor is more like 20 to 100 times worse."

In the world of 2026, where agents must reason in real time, SRAM acts as the ultimate "scratchpad": a high-speed workspace where the model can manipulate symbolic operations and complex reasoning processes without the "wasted cycles" of external memory shuttling.
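
Stewart's figures lend themselves to a quick back-of-envelope comparison. The sketch below multiplies the quoted per-bit energies (roughly 0.1 picojoules within SRAM, and 20 to 100 times more for a DRAM round trip) by the size of a working set; the 1 GB working set is an arbitrary illustrative choice, not a figure from the article.

```python
# Back-of-envelope energy comparison using the per-bit figures quoted by M12's
# Michael Stewart. The 1 GB working-set size is an arbitrary illustrative choice.

SRAM_PJ_PER_BIT = 0.1            # ~0.1 picojoules to move a bit within SRAM
DRAM_PENALTY = (20, 100)         # DRAM round trip is ~20-100x worse, per the quote

working_set_bytes = 1 * 1024**3  # example: a 1 GB working set (e.g. a KV cache)
bits = working_set_bytes * 8

sram_joules = bits * SRAM_PJ_PER_BIT * 1e-12
dram_joules = tuple(sram_joules * penalty for penalty in DRAM_PENALTY)

print(f"SRAM: {sram_joules * 1000:.2f} mJ per full pass")
print(f"DRAM: {dram_joules[0] * 1000:.0f}-{dram_joules[1] * 1000:.0f} mJ per full pass")
# Roughly 0.86 mJ in SRAM versus 17-86 mJ via DRAM. Repeated on every decode
# step, that gap is the "wasted cycles" of external memory shuttling.
```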

However, SRAM has a serious downside: It's physically bulky and expensive to manufacture, which means its capacity is limited compared to DRAM. This is where Val Bercovici, chief AI officer at Weka, another company offering memory for GPUs, sees the market segmenting.

Groq-friendly AI workloads — where SRAM has the advantage — are those that use small models of 8 billion parameters and under, Bercovici said. This isn't a small market, though. "It's just a giant market segment that was not served by Nvidia, which was edge inference, low latency, robotics, voice, IoT devices — things we want running on our phones without the cloud for convenience, performance, or privacy," he said.

This 8B "sweet spot" is significant because 2025 saw an explosion in model distillation, where many enterprise companies are shrinking massive models into highly efficient smaller versions. While SRAM isn't practical for the trillion-parameter "frontier" models, it is perfect for these smaller, high-velocity models.
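
A rough way to see why roughly 8 billion parameters marks the dividing line: a model's weight footprint is approximately parameter count times bytes per parameter, and distillation plus quantization is what pulls capable models down into SRAM-friendly territory. The arithmetic below is generic, not Groq or Nvidia sizing data, and real deployments also need headroom for the KV cache and activations.

```python
# Rough weight-memory footprint by parameter count and numeric precision.
# Generic arithmetic only; it ignores KV cache, activations, and runtime overhead.

BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}

def weight_footprint_gb(params_billions: float, precision: str) -> float:
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9

for size_b in (1, 3, 8, 70):
    row = ", ".join(f"{p}: {weight_footprint_gb(size_b, p):.1f} GB" for p in BYTES_PER_PARAM)
    print(f"{size_b}B params -> {row}")
# An 8B model is ~16 GB at fp16 and ~4 GB at int4, small enough to shard across
# SRAM-heavy accelerators or push to the edge; a 70B model is an order of
# magnitude larger and stays in HBM/DRAM territory.
```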

3. The Anthropic threat: The rise of the ‘portable stack’

Perhaps the most under-appreciated driver of this deal is Anthropic’s success in making its stack portable across accelerators.

The company has pioneered a portable engineering approach for training and inference — basically a software layer that allows its Claude models to run across multiple AI accelerator families — including Nvidia's GPUs and Google's Ironwood TPUs. Until recently, Nvidia's dominance was protected because running high-performance models outside of the Nvidia stack was a technical nightmare. "It's Anthropic," Weka's Bercovici told me. "The fact that Anthropic was able to … build up a software stack that could work on TPUs as well as on GPUs, I don't think that's being appreciated enough in the marketplace."

(Disclosure: Weka has been a sponsor of VentureBeat events.)

Anthropic recently committed to accessing as many as 1 million TPUs from Google, representing over a gigawatt of compute capacity. This multi-platform strategy ensures the company isn't held hostage by Nvidia's pricing or supply constraints. So for Nvidia, the Groq deal is equally a defensive move. By integrating Groq's ultra-fast inference IP, Nvidia is ensuring that the most performance-sensitive workloads — like those running small models or as part of real-time agents — can be accommodated inside Nvidia's CUDA ecosystem, even as rivals try to jump ship to Google's Ironwood TPUs. CUDA is the software Nvidia provides to developers to integrate its GPUs.
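
The article does not describe Anthropic's internal design, but the general shape of such a portability layer is simple: serving code targets one thin interface, and each accelerator family gets its own backend. The sketch below is a hypothetical illustration of that pattern; none of the class or method names correspond to Anthropic's, Nvidia's, or Google's actual software.

```python
# Hypothetical illustration of a hardware-portability layer. Serving code targets
# one interface; each accelerator family supplies a backend. Not a real vendor API.

from typing import Protocol, Sequence

class InferenceBackend(Protocol):
    """The only surface the serving stack is allowed to depend on."""
    def load(self, checkpoint_path: str) -> None: ...
    def generate(self, prompt_ids: Sequence[int], max_new_tokens: int) -> list[int]: ...

class CudaBackend:
    """Backend for Nvidia GPUs (e.g. a CUDA-based runtime)."""
    def load(self, checkpoint_path: str) -> None:
        print(f"loading {checkpoint_path} onto GPUs")
    def generate(self, prompt_ids: Sequence[int], max_new_tokens: int) -> list[int]:
        return []  # placeholder

class TpuBackend:
    """Backend for Google TPUs (e.g. an XLA-based runtime)."""
    def load(self, checkpoint_path: str) -> None:
        print(f"loading {checkpoint_path} onto TPUs")
    def generate(self, prompt_ids: Sequence[int], max_new_tokens: int) -> list[int]:
        return []  # placeholder

def make_backend(target: str) -> InferenceBackend:
    # The only place the hardware choice appears; everything above it is unchanged.
    return {"cuda": CudaBackend, "tpu": TpuBackend}[target]()
```

The value of the pattern is that a pricing or supply problem on one accelerator family becomes a configuration change rather than a rewrite, which is exactly the leverage the Groq deal is meant to keep inside Nvidia's ecosystem.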

4. The agentic ‘statehood’ conflict: Manus and the KV Cache

The timing of this Groq deal coincides with Meta's acquisition of the agent pioneer Manus just two days ago. The importance of Manus was partly its obsession with statefulness.

If an agent can't remember what it did 10 steps ago, it's useless for real-world tasks like market research or software development. The KV Cache (Key-Value Cache) is the "short-term memory" that an LLM builds during the prefill phase.

Manus reported that for production-grade agents, the ratio of input tokens to output tokens can reach 100:1. This means for every word an agent says, it's "thinking" and "remembering" 100 others. In this environment, the KV Cache hit rate is the single most important metric for a production agent, Manus said. If that cache is "evicted" from memory, the agent loses its train of thought, and the model must burn huge amounts of energy to recompute the prompt.
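
To see why the cache hit rate matters so much, it helps to put rough numbers on what a miss costs. The sketch below uses the standard KV-cache sizing rule (2 x layers x KV heads x head dimension x bytes per value, per token) together with the 100:1 input-to-output ratio Manus reported; the model dimensions are illustrative, roughly in line with an 8B-class transformer, and are not figures from the article.

```python
# Rough KV-cache sizing and the cost of losing it. Model dimensions are
# illustrative (roughly an 8B-class transformer); the 100:1 ratio is Manus' figure.

def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    # Both keys and values (hence the factor of 2) are stored for every layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value

per_token = kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128)

output_tokens = 1_000
input_tokens = output_tokens * 100           # the reported 100:1 input:output ratio
cache_gb = input_tokens * per_token / 1e9

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Cache for a 100,000-token working context: {cache_gb:.1f} GB")
# If that ~13 GB of state is evicted, the entire 100,000-token prefill has to be
# recomputed before the agent can emit its next token: the lost train of thought
# and burned energy the article describes.
```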

Groq's SRAM can act as a "scratchpad" for these agents — though, again, mostly for smaller models — because it allows near-instant retrieval of that state. Combined with Nvidia's Dynamo framework and the KVBM, Nvidia is building an "inference operating system" that can tier this state across SRAM, DRAM, and other flash-based options like those from Bercovici's Weka.

Thomas Jorgensen, senior director of Technology Enablement at Supermicro, which specializes in building clusters of GPUs for large enterprise companies, told me in September that compute is no longer the primary bottleneck for advanced clusters. Feeding data to GPUs is the bottleneck, and breaking that bottleneck requires memory.

"The whole cluster is now the computer," Jorgensen mentioned. "Networking becomes an internal part of the beast … feeding the beast with data is becoming harder because the bandwidth between GPUs is growing faster than anything else."

This is why Nvidia is pushing into disaggregated inference. By separating the workloads, enterprise applications can use specialized storage tiers to feed data at memory-class performance, while the specialized "Groq-inside" silicon handles the high-speed token generation.

The decision for 2026

We're entering an era of extreme specialization. For decades, incumbents could win by shipping one dominant general-purpose architecture — and their blind spot was usually what they ignored at the edges. Intel's long neglect of low-power computing is the classic example, M12's Stewart told me. Nvidia is signaling it won't repeat that mistake. "If even the leader, even the lion of the jungle will acquire talent, will acquire technology — it's a sign that the whole market is just wanting more options," Stewart said.

For technical leaders, the message is to stop architecting your stack like it's one rack, one accelerator, one answer. In 2026, advantage will go to the teams that label workloads explicitly — and route them to the appropriate tier (a routing sketch follows the list):

• prefill-heavy vs. decode-heavy
• long-context vs. short-context
• interactive vs. batch
• small-model vs. large-model
• edge constraints vs. data-center assumptions
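
Here is a minimal sketch of what that explicit labeling and routing can look like, assuming a simple rule-based router; the tier names and thresholds are illustrative, not drawn from any vendor's product.

```python
# Minimal rule-based workload router. Tier names and thresholds are illustrative
# assumptions, not any vendor's product categories.

from dataclasses import dataclass

@dataclass
class Workload:
    phase: str             # "prefill" or "decode"
    context_tokens: int    # long-context vs. short-context
    interactive: bool      # interactive vs. batch
    model_params_b: float  # small-model vs. large-model
    at_edge: bool          # edge constraints vs. data-center assumptions

def route(w: Workload) -> str:
    if w.at_edge and w.model_params_b <= 8:
        return "edge accelerator (small model, no cloud round trip)"
    if w.phase == "prefill" and w.context_tokens >= 100_000:
        return "prefill tier (compute-dense, cheaper capacity memory)"
    if w.phase == "decode" and w.interactive:
        return "decode tier (bandwidth-optimized, SRAM-class memory)"
    return "general GPU pool (batch)"

print(route(Workload("decode", 4_000, True, 8, False)))
# -> decode tier (bandwidth-optimized, SRAM-class memory)
```

A production router would also weigh cost and availability, but even this toy version makes "where did this token run, and why" an answerable question.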

Your architecture will follow these labels. In 2026, "GPU strategy" stops being a buying decision and becomes a routing decision. The winners won't ask which chip they bought — they'll ask where each token ran, and why.
