NEW YORK DAWN™
AI can fix bugs, but can’t find them: OpenAI’s study highlights limits of LLMs in software engineering
Technology

Last updated: February 19, 2025 1:06 am
Editorial Board Published February 19, 2025

Large language models (LLMs) may have changed software development, but enterprises will need to think twice about entirely replacing human software engineers with LLMs, despite OpenAI CEO Sam Altman’s claim that models can replace “low-level” engineers.

In a new paper, OpenAI researchers detail how they developed an LLM benchmark called SWE-Lancer to test how much foundation models can earn from real-life freelance software engineering tasks. The test found that, while the models can fix bugs, they can’t see why a bug exists and continue to make more mistakes.

The researchers tasked three LLMs (OpenAI’s GPT-4o and o1, and Anthropic’s Claude 3.5 Sonnet) with 1,488 freelance software engineering tasks from the freelance platform Upwork, amounting to $1 million in payouts. They divided the tasks into two categories: individual contributor tasks (resolving bugs or implementing features) and management tasks (where the model role-plays as a manager who chooses the best proposal to resolve an issue).

“Results indicate that the real-world freelance work in our benchmark remains challenging for frontier language models,” the researchers write. 

The test shows that foundation models cannot fully replace human engineers. While they can help fix bugs, they are not quite at the level where they can start earning freelance income by themselves.

Benchmarking freelancing models

The researchers and 100 other professional software engineers identified potential tasks on Upwork and, without changing any words, fed these to a Docker container to create the SWE-Lancer dataset. The container does not have internet access and cannot access GitHub “to avoid the possible of models scraping code diffs or pull request details,” they explained.

The team identified 764 individual contributor tasks, totaling about $414,775, ranging from 15-minute bug fixes to weeklong feature requests. The remaining 724 tasks were management tasks, which included reviewing freelancer proposals and job postings, and would pay out $585,225.
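
The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in this article; the management-task count is derived as a remainder, on the assumption that the two categories together account for all 1,488 tasks. The variable and function names are illustrative.

```python
# Figures as quoted in the article.
TOTAL_TASKS = 1_488
TOTAL_PAYOUT = 1_000_000        # USD
IC_TASKS = 764                  # individual contributor tasks
IC_PAYOUT = 414_775             # USD
MANAGEMENT_PAYOUT = 585_225     # USD

# Assumption: every task is either an individual contributor task
# or a management task, so the management count is the remainder.
management_tasks = TOTAL_TASKS - IC_TASKS


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# The two payout pools should account for the full $1 million.
payouts_add_up = IC_PAYOUT + MANAGEMENT_PAYOUT == TOTAL_PAYOUT
ic_share = share(IC_PAYOUT, TOTAL_PAYOUT)
management_share = share(MANAGEMENT_PAYOUT, TOTAL_PAYOUT)

print(f"management tasks: {management_tasks}")
print(f"payout pools sum to total: {payouts_add_up}")
print(f"IC share of payouts: {ic_share}%")
print(f"management share of payouts: {management_share}%")
```

On these numbers, management tasks make up a little under half of the task count but a little over half of the total payout.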

The tasks were added to the expensing platform Expensify.

The researchers generated prompts based on the task title and description and a snapshot of the codebase. If there were additional proposals to resolve the issue, “we also generated a management task using the issue description and list of proposals,” they explained.
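
The prompt-generation step described above can be sketched roughly as follows. This is a minimal illustration, not the researchers' code: the `Issue` and `Task` classes, the `build_tasks` function, the prompt wording, and the sample data are all hypothetical, and the only behavior taken from the article is that each issue yields a prompt built from its title, description, and codebase snapshot, plus a management task when proposals exist.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    """Hypothetical container for one freelance issue."""
    title: str
    description: str
    codebase_snapshot: str  # assumed identifier for the snapshot, e.g. a commit hash
    proposals: list[str] = field(default_factory=list)


@dataclass
class Task:
    kind: str    # "ic" (individual contributor) or "management"
    prompt: str


def build_tasks(issue: Issue) -> list[Task]:
    """Build one IC task per issue, plus a management task if proposals exist."""
    ic_prompt = (
        f"Title: {issue.title}\n"
        f"Description: {issue.description}\n"
        f"Codebase snapshot: {issue.codebase_snapshot}"
    )
    tasks = [Task(kind="ic", prompt=ic_prompt)]

    # A management task is generated only when there are proposals to choose from.
    if issue.proposals:
        numbered = "\n".join(
            f"{i}. {text}" for i, text in enumerate(issue.proposals, start=1)
        )
        manager_prompt = (
            f"Description: {issue.description}\n"
            f"Proposals:\n{numbered}\n"
            "Choose the best proposal."
        )
        tasks.append(Task(kind="management", prompt=manager_prompt))
    return tasks


# Made-up sample data, for illustration only.
example = Issue(
    title="Button does nothing on second click",
    description="Clicking Save twice leaves the form in a loading state.",
    codebase_snapshot="snapshot-001",
    proposals=["Debounce the click handler", "Reset loading state on error"],
)
kinds = [task.kind for task in build_tasks(example)]
print(kinds)
```

An issue without proposals would produce only the individual contributor task under this sketch.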

From here, the researchers moved to end-to-end test development. For each task, they wrote Playwright tests that apply the generated patches; the tests were then “triple-verified” by professional software engineers.

“Tests simulate real-world user flows, such as logging into the application, performing complex actions (making financial transactions) and verifying that the model’s solution works as expected,” the paper explains. 

Test results

After running the test, the researchers found that none of the models earned the full $1 million value of the tasks. Claude 3.5 Sonnet, the best-performing model, earned only $208,050 and resolved 26.2% of the individual contributor issues. However, the researchers point out, “the majority of its solutions are incorrect, and higher reliability is needed for trustworthy deployment.”

The models performed well across most individual contributor tasks, with Claude 3.5 Sonnet performing best, followed by o1 and GPT-4o.

“Agents excel at localizing, but fail to root cause, resulting in partial or flawed solutions,” the report explains. “Agents pinpoint the source of an issue remarkably quickly, using keyword searches across the whole repository to quickly locate the relevant file and functions — often far faster than a human would. However, they often exhibit a limited understanding of how the issue spans multiple components or files, and fail to address the root cause, leading to solutions that are incorrect or insufficiently comprehensive. We rarely find cases where the agent aims to reproduce the issue or fails due to not finding the right file or location to edit.”

Interestingly, the models all performed better on manager tasks that required reasoning to evaluate technical understanding.

These benchmark tests showed that AI models can solve some “low-level” coding problems but can’t replace “low-level” software engineers yet. The models still took time, often made mistakes, and couldn’t chase a bug around to find the root cause of coding problems. Many “low-level” engineers work better, but the researchers said this may not be the case for very long.
