Terminal-Bench 2.0 launches alongside Harbor, a new framework for testing agents in containers

Technology

Editorial Board | Published November 8, 2025 | Last updated: November 8, 2025, 1:10 am

The developers of Terminal-Bench, a benchmark suite for evaluating the performance of autonomous AI agents on real-world terminal-based tasks, have released version 2.0 alongside Harbor, a new framework for testing, improving, and optimizing AI agents in containerized environments.

The dual release aims to address long-standing pain points in testing and optimizing AI agents, particularly those built to operate autonomously in realistic developer environments.

With a harder, rigorously verified task set, Terminal-Bench 2.0 replaces version 1.0 as the standard for assessing frontier model capabilities.

Harbor, the accompanying runtime framework, enables developers and researchers to scale evaluations across thousands of cloud containers and integrates with both open-source and proprietary agents and training pipelines.

“Harbor is the package we wish we had had while making Terminal-Bench,” wrote co-creator Alex Shaw on X. “It’s for agent, model, and benchmark developers and researchers who want to evaluate and improve agents and models.”

Higher Bar, Cleaner Data

Terminal-Bench 1.0 saw rapid adoption after its release in May 2025, becoming a default benchmark for AI-powered agents operating in developer-style terminal environments. These agents interact with systems through the command line, mimicking how developers work beneath the graphical user interface.
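
For illustration, a terminal-style task gives the agent a goal and grades the final state of the machine. The trace below is a hypothetical example of the genre, not an actual benchmark task; all paths and names are made up:

# Hypothetical agent session for a task like "make the failing test suite pass"
cd /app && pytest                              # reproduce the reported failure
grep -rn "parse_date" src/                     # locate the suspect function
sed -i 's/%d-%m-%Y/%Y-%m-%d/' src/utils.py     # apply a one-line fix
pytest                                         # verify the suite now passes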

However, its broad scope came with inconsistencies. Several tasks were identified by the community as poorly specified or unstable due to external service changes.

Version 2.0 addresses those issues directly. The updated suite includes 89 tasks, each subjected to several hours of manual and LLM-assisted validation. The emphasis is on making tasks solvable, realistic, and clearly specified, raising the difficulty ceiling while improving reliability and reproducibility.

A notable example is the download-youtube task, which was removed or refactored in 2.0 due to its dependence on unstable third-party APIs.

“Astute Terminal-Bench fans may notice that SOTA performance is comparable to TB1.0 despite our claim that TB2.0 is harder,” Shaw noted on X. “We believe this is because task quality is substantially higher in the new benchmark.”

Harbor: Unified Rollouts at Scale

Alongside the benchmark update, the team launched Harbor, a new framework for running and evaluating agents in cloud-deployed containers.

Harbor supports large-scale rollout infrastructure, with compatibility for major providers like Daytona and Modal.

Designed to generalize across agent architectures, Harbor supports:

  • Evaluation of any container-installable agent (sketched below)
  • Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines
  • Custom benchmark creation and deployment
  • Full integration with Terminal-Bench 2.0
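
What “container-installable” means in the first point can be sketched with plain Docker; the agent package name below is a placeholder, not a real Harbor or Terminal-Bench identifier:

# Hypothetical check that an agent installs and runs inside a container
docker run --rm python:3.12-slim \
  bash -c "pip install my-agent-cli && my-agent-cli --help"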

Harbor was used internally to run tens of thousands of rollouts during the creation of the new benchmark. It is now publicly available via harborframework.com, with documentation for testing and submitting agents to the public leaderboard.

Early Results: GPT-5 Leads in Task Success

Initial results from the Terminal-Bench 2.0 leaderboard show OpenAI's Codex CLI (command-line interface) agent running GPT-5 in the lead with a 49.6% success rate — the highest among all agents tested so far.

Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.

Top 5 Agent Results (Terminal-Bench 2.0):

  • Codex CLI (GPT-5) — 49.6%
  • Codex CLI (GPT-5-Codex) — 44.3%
  • OpenHands (GPT-5) — 43.8%
  • Terminus 2 (GPT-5-Codex) — 43.4%
  • Terminus 2 (Claude Sonnet 4.5) — 42.8%

The close clustering among top models indicates active competition across platforms, with no single agent solving more than half the tasks.

Submission and Use

To test or submit an agent, users install Harbor and run the benchmark using simple CLI commands. Submissions to the leaderboard require five benchmark runs, and results can be emailed to the developers along with job directories for validation.

harbor run -d <dataset> -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>
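
In this command, the angle-bracketed values are placeholders to fill in: --n-attempts 5 corresponds to the five benchmark runs required for a leaderboard submission, and the directory passed to --jobs-dir holds the job records to include when emailing results for validation.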

Terminal-Bench 2.0 is already being integrated into research workflows focused on agentic reasoning, code generation, and tool use. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford, a detailed preprint is in progress covering the verification process and design methodology behind the benchmark.

Aiming for Standardization

The combined release of Terminal-Bench 2.0 and Harbor marks a step toward more consistent and scalable agent-evaluation infrastructure. As LLM agents proliferate in developer and operational environments, the need for controlled, reproducible testing has grown.

These tools offer a potential foundation for a unified evaluation stack — supporting model improvement, environment simulation, and benchmark standardization across the AI ecosystem.
