ByteDance’s UI-TARS can take over your computer, outperforms GPT-4o and Claude
Technology


Last updated: January 23, 2025 2:08 am
By Editorial Board | Published January 23, 2025

A new AI agent has emerged from the parent company of TikTok to take control of your computer and perform complex workflows.

Much like Anthropic’s Computer Use, ByteDance’s new UI-TARS understands graphical user interfaces (GUIs), applies reasoning and takes autonomous, step-by-step action.

Trained on roughly 50B tokens and offered in 7B and 72B parameter versions, the PC/MacOS agent achieves state-of-the-art (SOTA) performance on 10-plus GUI benchmarks across performance, perception, grounding and overall agent capabilities, consistently beating out OpenAI’s GPT-4o, Claude and Google’s Gemini.

“Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention,” researchers from ByteDance and Tsinghua University write in a new research paper.

Source: Arxiv

How UI-TARS explains its thinking

UI-TARS works across desktop, mobile and web applications, using multimodal inputs (text, images, interactions) to understand visual environments.

Its UI features two tabs: one on the left showing its step-by-step “thinking,” and a larger one on the right where it pulls up files, websites and apps and automatically takes action.

For example, in a demo video released today, the model is prompted to “Find round trip flights from SEA to NYC on the 5th and return on the 10th next month and filter by price in ascending order.”

In response, UI-TARS navigates to the Delta Air Lines website, fills in the “from” and “to” fields, clicks on the relevant dates, and sorts and filters by price, explaining each step in its thinking box before taking action.

In another scenario, it’s instructed to install an autoDocstring extension in VS Code. Here is its step-by-step thinking as it completes that task:

It first reports that “to proceed with installing the autoDocstring extension in VS Code, I first need to open the VS Code application.”

Recognizing that the VS Code window has not fully loaded, the model notes that “it is necessary to wait for the application to finish initializing before proceeding with the installation of the autoDocstring extension. This ensures that all required functionalities are available for the next steps.”

It then explains that it must access the Extensions view in VS Code, which can be done by clicking on a tab in the left sidebar.

Running into a small glitch, it reasons: “The previous attempt to click on the ‘Extensions’ tab may not have been precise enough, so I will try clicking on it again to ensure the Extensions view opens successfully.”

Once in the Extensions view, it types in ‘autoDocstring’ before concluding: “Since the installation process has already been initiated, there is no further action required at this moment. I will wait for the installation to complete and confirm its success once the process is finished.”
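
The walkthrough follows a simple interleaved pattern: the agent states a thought, issues a single GUI action, then observes the result before thinking again. As a rough illustration only, the sketch below models that kind of trace in Python; the Step schema and the click/type/wait action names are assumptions for the sake of the example, not UI-TARS’s actual interface.

```python
# Minimal sketch (not UI-TARS's actual API) of the interleaved
# "thought -> GUI action" loop described above. The Step schema and the
# click/type/wait action names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Step:
    thought: str          # the agent's visible reasoning for this step
    action: str           # e.g. "click", "type", "wait"
    target: str | None    # UI element the action applies to, if any
    argument: str | None  # text to type, time to wait, etc.

vscode_trace = [
    Step("Open the VS Code application to begin the install.",
         "click", "VS Code icon", None),
    Step("The window has not finished loading; wait before acting.",
         "wait", None, "2s"),
    Step("Open the Extensions view from the left sidebar.",
         "click", "Extensions tab", None),
    Step("The previous click may have missed; retry it.",
         "click", "Extensions tab", None),
    Step("Search for the extension and start the install.",
         "type", "Extensions search box", "autoDocstring"),
]

for i, step in enumerate(vscode_trace, start=1):
    print(f"{i}. [{step.action}] {step.thought}")
```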


Outperforming its rivals

Across a variety of benchmarks, researchers report that UI-TARS consistently outranked OpenAI’s GPT-4o; Anthropic’s Claude-3.5-Sonnet; Gemini-1.5-Pro and Gemini-2.0; four Qwen models; and numerous academic models.

For instance, in VisualWebBench, which measures a model’s ability to ground web elements including webpage quality assurance and optical character recognition, UI-TARS 72B scored 82.8%, outperforming GPT-4o (78.5%) and Claude 3.5 (78.2%).

It also did significantly better on WebSRC benchmarks (understanding of semantic content and layout in web contexts) and ScreenQA-short (comprehension of complex mobile screen layouts and web structure). UI-TARS-7B achieved leading scores of 93.6% on WebSRC, while UI-TARS-72B achieved 88.6% on ScreenQA-short, outperforming Qwen, Gemini, Claude 3.5 and GPT-4o.

“These results demonstrate the superior perception and comprehension capabilities of UI-TARS in web and mobile environments,” the researchers write. “Such perceptual ability lays the foundation for agent tasks, where accurate environmental understanding is crucial for task execution and decision-making.”

UI-TARS also showed impressive results in ScreenSpot Pro and ScreenSpot v2, which assess a model’s ability to understand and localize elements in GUIs. Further, researchers tested its capabilities in planning multi-step actions and low-level tasks in mobile environments, and benchmarked it on OSWorld (which assesses open-ended computer tasks) and AndroidWorld (which scores autonomous agents on 116 programmatic tasks across 20 mobile apps).

Source: Arxiv

Under the hood

To help it take step-by-step actions and recognize what it’s seeing, UI-TARS was trained on a large-scale dataset of screenshots that parsed metadata including element description and type, visual description, bounding boxes (position information), element function and text from various websites, applications and operating systems. This allows the model to produce a comprehensive, detailed description of a screenshot, capturing not only elements but spatial relationships and overall layout.
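
As a hedged sketch of what one record in such a dataset might look like, the structure below captures per-element description, type, bounding box, function and text; the field names and example values are illustrative assumptions, not the paper’s actual schema.

```python
# Illustrative sketch of the kind of per-element screenshot metadata the
# article describes. Field names are assumptions, not ByteDance's dataset schema.
from dataclasses import dataclass

@dataclass
class ScreenElement:
    element_type: str                        # "button", "text_field", ...
    visual_description: str                  # what the element looks like
    bounding_box: tuple[int, int, int, int]  # (x, y, width, height) in pixels
    function: str                            # what interacting with it does
    text: str                                # visible text, if any

@dataclass
class ScreenshotRecord:
    source: str                   # website, application or OS it came from
    elements: list[ScreenElement]

record = ScreenshotRecord(
    source="airline search page",
    elements=[
        ScreenElement("text_field", "white input box labeled 'From'",
                      (120, 240, 300, 40), "sets the departure airport", "SEA"),
        ScreenElement("button", "blue 'Search' button",
                      (460, 320, 120, 40), "submits the flight search", "Search"),
    ],
)
```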

The model also uses state transition captioning to identify and describe the differences between two consecutive screenshots and determine whether an action, such as a mouse click or keyboard input, has occurred. Meanwhile, set-of-mark (SoM) prompting allows it to overlay distinct marks (letters, numbers) on specific regions of an image.
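
Set-of-mark prompting is a general technique, and the minimal sketch below (using the Pillow imaging library) shows the basic idea: draw numbered marks over candidate regions of a screenshot so a model can refer to them by index. It illustrates the concept only, not ByteDance’s implementation; the file names and box coordinates are made up.

```python
# Minimal set-of-mark (SoM) sketch using Pillow: draw a numbered box over
# each candidate region so the model can reference regions by their mark.
from PIL import Image, ImageDraw

def overlay_marks(screenshot_path: str,
                  boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    """Return a copy of the screenshot with numbered boxes drawn on it."""
    image = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for index, (left, top, right, bottom) in enumerate(boxes, start=1):
        draw.rectangle((left, top, right, bottom), outline="red", width=3)
        draw.text((left + 4, top + 4), str(index), fill="red")
    return image

# Example: mark two regions, e.g. a search field and a submit button.
marked = overlay_marks("screenshot.png",
                       [(120, 240, 420, 280), (460, 320, 580, 360)])
marked.save("screenshot_marked.png")
```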

The model is equipped with both short-term and long-term memory to handle the task at hand while also retaining historical interactions to improve later decision-making. Researchers trained the model to perform both System 1 (fast, automatic and intuitive) and System 2 (slow and deliberate) reasoning. This allows for multi-step decision-making, “reflection” thinking, milestone recognition and error correction.
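
A minimal sketch of that memory split might look like the following: a small rolling working context for the current task alongside a persistent history of past interactions. The class and method names are assumptions for illustration, not how UI-TARS actually stores state.

```python
# Illustrative short-term vs. long-term memory split for a GUI agent.
# Structure is an assumption, not UI-TARS's actual design.
from collections import deque

class AgentMemory:
    def __init__(self, short_term_size: int = 5):
        # Short-term: only the most recent observations/actions for the task at hand.
        self.short_term = deque(maxlen=short_term_size)
        # Long-term: full interaction history retained across tasks.
        self.long_term: list[dict] = []

    def record(self, event: dict) -> None:
        self.short_term.append(event)
        self.long_term.append(event)

    def working_context(self) -> list[dict]:
        """What the model conditions on for its next step."""
        return list(self.short_term)

memory = AgentMemory()
memory.record({"thought": "Open the Extensions view", "action": "click"})
print(memory.working_context())
```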

Researchers emphasized that it’s critical that the model be able to maintain consistent goals and engage in trial and error to hypothesize, test and evaluate potential actions before completing a task. They introduced two types of data to support this: error-correction and post-reflection data. For error correction, they identified errors and labeled corrective actions; for post-reflection, they simulated recovery steps.

“This strategy ensures that the agent not only learns to avoid errors but also adapts dynamically when they occur,” the researchers write.
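
To make the two data types concrete, the sketch below shows what an error-correction example (a mistaken action paired with its labeled correction) and a post-reflection example (a reflection plus simulated recovery steps) might look like; the field names and values are illustrative assumptions, not the paper’s actual format.

```python
# Hypothetical examples of the two reflection-tuning data types described above.
error_correction_sample = {
    "state": "Extensions view did not open",
    "mistaken_action": {"action": "click", "target": "wrong sidebar icon"},
    "corrective_action": {"action": "click", "target": "Extensions tab"},
}

post_reflection_sample = {
    "state": "search results not sorted by price",
    "reflection": "The sort control was never applied; reopen the sort menu.",
    "recovery_steps": [
        {"action": "click", "target": "Sort by menu"},
        {"action": "click", "target": "Price: low to high"},
    ],
}

training_examples = [error_correction_sample, post_reflection_sample]
```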

Clearly, UI-TARS shows impressive capabilities, and it will be interesting to see its evolving use cases in the increasingly competitive AI agents space. As the researchers note: “Looking ahead, while native agents represent a significant leap forward, the future lies in the integration of active and lifelong learning, where agents autonomously drive their own learning through continuous, real-world interactions.”

Researchers point out that Claude Computer Use “performs strongly in web-based tasks but significantly struggles with mobile scenarios, indicating that the GUI operation ability of Claude has not been well transferred to the mobile domain.”

By contrast, “UI-TARS exhibits excellent performance in both website and mobile domain.”
