OpenAGI emerges from stealth with an AI agent that it claims crushes OpenAI and Anthropic

A stealth artificial intelligence startup founded by an MIT researcher emerged this morning with an ambitious claim: its new AI model can control computers better than systems built by OpenAI and Anthropic, and at a fraction of the cost.

OpenAGI, led by chief executive Zengyi Qin, launched Lux, a foundation model designed to operate computers autonomously by interpreting screenshots and executing actions across desktop applications. The San Francisco-based company says Lux achieves an 83.6 percent success rate on Online-Mind2Web, a benchmark that has become the industry's most rigorous test for evaluating AI agents that control computers.

That score is a significant leap over the leading models from well-funded competitors. OpenAI's Operator, launched in January, scores 61.3 percent on the same benchmark. Anthropic's Claude Computer Use achieves 56.3 percent.

"Traditional LLM training feeds a large amount of text corpus into the model. The model learns to produce text," Qin said in an exclusive interview with VentureBeat. "By contrast, our model learns to produce actions. The model is trained with a large amount of computer screenshots and action sequences, allowing it to produce actions to control the computer."

The announcement arrives at a pivotal moment for the AI industry. Technology giants and startups alike have poured billions of dollars into developing autonomous agents capable of navigating software, booking travel, filling out forms, and executing complex workflows. OpenAI, Anthropic, Google, and Microsoft have all launched or announced agent products in the past year, betting that computer-controlling AI will become as transformative as chatbots.

Yet independent research has cast doubt on whether current agents are as capable as their creators suggest.

Why university researchers built a tougher benchmark to test AI agents, and what they discovered

The Online-Mind2Web benchmark, developed by researchers at Ohio State University and the University of California, Berkeley, was designed specifically to expose the gap between marketing claims and actual performance.

Published in April and accepted to the Conference on Language Modeling 2025, the benchmark comprises 300 diverse tasks across 136 real websites, covering everything from booking flights to navigating complex e-commerce checkouts. Unlike earlier benchmarks that cached parts of websites, Online-Mind2Web tests agents in live online environments where pages change dynamically and unexpected obstacles appear.

The results, according to the researchers, painted "a very different picture of the competency of current agents, suggesting over-optimism in previously reported results."

When the Ohio State team tested five leading web agents with careful human evaluation, they found that many recent systems, despite heavy investment and marketing fanfare, did not outperform SeeAct, a relatively simple agent launched in January 2024. Even OpenAI's Operator, the best performer among commercial offerings in their study, achieved only 61 percent success.

"It seemed that highly capable and practical agents were maybe indeed just months away," the researchers wrote in a blog post accompanying their paper. "However, we are also well aware that there are still many fundamental gaps in research to fully autonomous agents, and current agents are probably not as competent as the reported benchmark numbers may depict."

The benchmark has gained traction as an industry standard, with a public leaderboard hosted on Hugging Face tracking submissions from research groups and companies.

How OpenAGI trained its AI to take actions instead of just generating text

OpenAGI's claimed performance advantage stems from what the company calls "Agentic Active Pre-training," a training methodology that differs fundamentally from how most large language models learn.

Conventional language models train on vast text corpora, learning to predict the next word in a sequence. The resulting systems excel at generating coherent text but were not designed to take actions in graphical environments.

Lux, according to Qin, takes a different approach. The model trains on computer screenshots paired with action sequences, learning to interpret visual interfaces and determine which clicks, keystrokes, and navigation steps will accomplish a given goal.

"The action allows the model to actively explore the computer environment, and such exploration generates new knowledge, which is then fed back to the model for training," Qin told VentureBeat. "This is a naturally self-evolving process, where a better model produces better exploration, better exploration produces better knowledge, and better knowledge leads to a better model."

This self-reinforcing training loop, if it functions as described, could help explain how a smaller team might achieve results that elude larger organizations. Rather than requiring ever-larger static datasets, the approach would allow the model to continuously improve by generating its own training data through exploration.
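The loop Qin describes can be illustrated with a toy simulation. The sketch below is not OpenAGI's training code, which has not been published; the function names, the single `skill` number standing in for a model, and the update rule are all assumptions made purely for illustration. It shows only the shape of the cycle: the current model explores, successful exploration becomes training signal, and the updated model explores again.

```python
import random


def explore(skill, n_episodes, rng):
    """Hypothetical exploration step: each toy task succeeds with probability `skill`."""
    return [rng.random() < skill for _ in range(n_episodes)]


def update(skill, episodes, lr=0.5):
    """Hypothetical training step: move skill toward 1.0 in proportion to
    the share of episodes that succeeded (better exploration, bigger gain)."""
    success_rate = sum(episodes) / len(episodes)
    return skill + lr * success_rate * (1.0 - skill)


def self_evolve(skill=0.2, rounds=5, n_episodes=200, seed=0):
    """Alternate exploration and training, recording skill after each round."""
    rng = random.Random(seed)
    history = [skill]
    for _ in range(rounds):
        episodes = explore(skill, n_episodes, rng)  # model acts in the environment
        skill = update(skill, episodes)             # new experience is fed back
        history.append(skill)
    return history


if __name__ == "__main__":
    for round_index, value in enumerate(self_evolve()):
        print(f"round {round_index}: skill={value:.3f}")
```

In this toy version the skill value can never decrease, which is a simplification; a real system would also have to guard against learning from flawed or unrepresentative exploration.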

OpenAGI also claims significant cost advantages. The company says Lux operates at roughly one-tenth the cost of frontier models from OpenAI and Anthropic while executing tasks faster.

Unlike browser-only competitors, Lux can control Slack, Excel, and other desktop applications

A critical distinction in OpenAGI's announcement: Lux can control applications across an entire desktop operating system, not just web browsers.

Most commercially available computer-use agents, including early versions of Anthropic's Claude Computer Use, focus primarily on browser-based tasks. That limitation excludes vast categories of productivity work that occur in desktop applications: spreadsheets in Microsoft Excel, communications in Slack, design work in Adobe products, and code editing in development environments.

OpenAGI says Lux can navigate these native applications, a capability that would substantially expand the addressable market for computer-use agents. The company is releasing a software development kit for developers alongside the model, allowing third parties to build applications on top of Lux.

The company is also working with Intel to optimize Lux for edge devices, which would allow the model to run locally on laptops and workstations rather than requiring cloud infrastructure. That partnership could address enterprise concerns about sending sensitive screen data to external servers.

"We are partnering with Intel to optimize our model on edge devices, which will make it the best on-device computer-use model," Qin said.

The company confirmed it is in exploratory discussions with AMD and Microsoft about additional partnerships.

What happens when you ask an AI agent to copy your bank details

Computer-use agents present novel safety challenges that do not arise with conventional chatbots. An AI system capable of clicking buttons, entering text, and navigating applications could, if misdirected, cause significant harm: transferring money, deleting files, or exfiltrating sensitive information.

OpenAGI says it has built safety mechanisms directly into Lux. When the model encounters requests that violate its safety policies, it refuses to proceed and alerts the user.

In an example provided by the company, when a user asked the model to "copy my bank details and paste it into a new Google doc," Lux responded with an internal reasoning step: "The user asks me to copy the bank details, which are sensitive information. Based on the safety policy, I am not able to perform this action." The model then issued a warning to the user rather than executing the potentially dangerous request.
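The refuse-and-alert behavior in that example can be pictured as a gate that runs before any action is taken. OpenAGI has not disclosed how Lux's safety policy is implemented, so the sketch below is only an assumption-laden illustration: the `SENSITIVE_TERMS` list, the `Decision` type, and the `check_request` function are hypothetical, and a simple keyword match stands in for whatever reasoning the real model performs.

```python
from dataclasses import dataclass

# Hypothetical policy for illustration only: phrases treated as sensitive.
SENSITIVE_TERMS = ("bank details", "password", "social security number")


@dataclass
class Decision:
    allowed: bool   # whether the agent may go on to execute actions
    message: str    # text shown to the user


def check_request(request: str) -> Decision:
    """Refuse and warn if the request mentions a sensitive term; otherwise allow."""
    lowered = request.lower()
    for term in SENSITIVE_TERMS:
        if term in lowered:
            return Decision(
                allowed=False,
                message=f"Refused: the request involves '{term}', "
                        "which this policy treats as sensitive.",
            )
    return Decision(allowed=True, message="OK to proceed.")


if __name__ == "__main__":
    decision = check_request(
        "copy my bank details and paste it into a new Google doc"
    )
    print(decision.allowed, decision.message)
```

A keyword filter like this is easy to evade by rephrasing, which is one reason independent adversarial testing of any real safeguard matters.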

Such safeguards will face intense scrutiny as computer-use agents proliferate. Security researchers have already demonstrated prompt injection attacks against early agent systems, where malicious instructions embedded in websites or documents can hijack an agent's behavior. Whether Lux's safety mechanisms can withstand adversarial attacks remains to be tested by independent researchers.

The MIT researcher who built two of GitHub's most downloaded AI models

Qin brings an unusual combination of academic credentials and entrepreneurial experience to OpenAGI.

He completed his doctorate at the Massachusetts Institute of Technology in 2025, where his research focused on computer vision, robotics, and machine learning. His academic work appeared in top venues including the Conference on Computer Vision and Pattern Recognition, the International Conference on Learning Representations, and the International Conference on Machine Learning.

Before founding OpenAGI, Qin built several widely adopted AI systems. JetMoE, a large language model he led development on, demonstrated that a high-performing model could be trained from scratch for less than $100,000, a fraction of the tens of millions typically required. The model outperformed Meta's LLaMA2-7B on standard benchmarks, according to a technical report that attracted attention from MIT's Computer Science and Artificial Intelligence Laboratory.

His previous open-source projects achieved remarkable adoption. OpenVoice, a voice cloning model, accumulated roughly 35,000 stars on GitHub and ranked in the top 0.03 percent of open-source projects by popularity. MeloTTS, a text-to-speech system, has been downloaded more than 19 million times, making it one of the most widely used audio AI models since its 2024 release.

Qin also co-founded MyShell, an AI agent platform that has attracted six million users who have collectively built more than 200,000 AI agents. Users have had more than one billion interactions with agents on the platform, according to the company.

Inside the billion-dollar race to build AI that controls your computer

The computer-use agent market has attracted intense interest from investors and technology giants over the past year.

OpenAI launched Operator in January, allowing users to instruct an AI to complete tasks across the web. Anthropic has continued developing Claude Computer Use, positioning it as a core capability of its Claude model family. Google has incorporated agent features into its Gemini products. Microsoft has integrated agent capabilities across its Copilot offerings and Windows.

Yet the market remains nascent. Enterprise adoption has been limited by concerns about reliability, security, and the ability to handle edge cases that occur frequently in real-world workflows. The performance gaps revealed by benchmarks like Online-Mind2Web suggest that current systems may not be ready for mission-critical applications.

OpenAGI enters this competitive landscape as an independent alternative, positioning superior benchmark performance and lower costs against the massive resources of its well-funded rivals. The company's Lux model and developer SDK are available beginning today.

Whether OpenAGI can translate benchmark dominance into real-world reliability remains the central question. The AI industry has a long history of impressive demos that falter in production, of laboratory results that crumble against the chaos of actual use. Benchmarks measure what they measure, and the distance between a controlled test and an 8-hour workday full of edge cases, exceptions, and surprises can be vast.

But if Lux performs in the wild the way it performs in the lab, the implications extend far beyond one startup's success. It would suggest that the path to capable AI agents runs not through the largest checkbooks but through the cleverest architectures, and that a small team with the right ideas can outmaneuver the giants.

The technology industry has seen that story before. It rarely stays true for long.
