NEW YORK DAWN™
Anthropic vs. OpenAI red teaming strategies reveal completely different safety priorities for enterprise AI
Technology

Last updated: December 4, 2025 8:03 pm
Editorial Board Published December 4, 2025

Model providers want to demonstrate the security and robustness of their models, releasing system cards and conducting red-team exercises with every new release. But it can be difficult for enterprises to parse the results, which vary widely and can be misleading.

Anthropic's 153-page system card for Claude Opus 4.5 versus OpenAI's 60-page GPT-5 system card reveals a fundamental split in how these labs approach security validation. Anthropic discloses in its system card how it relies on multi-attempt attack success rates from 200-attempt reinforcement learning (RL) campaigns. OpenAI also reports attempted jailbreak resistance. Both metrics are valid. Neither tells the whole story.

Security leaders deploying AI agents for browsing, code execution and autonomous action need to know what each red team evaluation actually measures, and where the blind spots are.

What the attack data shows

Gray Swan's Shade platform ran adaptive adversarial campaigns against Claude models. The attack success rate (ASR) tells the story.

Opus 4.5 in coding environments hit 4.7% ASR at one attempt, 33.6% at ten and 63.0% at 100. In computer use with extended thinking, it held at 0% ASR even after 200 attempts. It is the first model to saturate the benchmark.

Sonnet 4.5 at the same thresholds showed 70% ASR in coding and 85.7% in computer use.

Opus 4.5 delivers roughly a 7x improvement in coding resistance and complete resistance in computer use.

This means the gap between model tiers within the same family is now larger than some cross-vendor comparisons. For security teams, that is a procurement factor that did not exist six months ago.

For OpenAI, the o1 system card reported 6% ASR for harmful text and 5% for malicious code, both based on single-attempt metrics. Third-party testing by SPLX and NeuralTrust found GPT-5's raw ASR hit 89% initially, dropping below 1% within two weeks of patching.

The report Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks confirms the pattern. With enough attempts, Claude 3.5 Sonnet showed 78% ASR and GPT-4o reached 89%.
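
Attempt count drives these numbers in a predictable way. As a rough baseline, if each attempt were an independent retry with the same single-attempt success probability, ASR at k attempts would be 1 - (1 - p)^k. A short sketch comparing that baseline against Gray Swan's reported curve for Opus 4.5 (the independence assumption is mine, for illustration only):

```python
# Multi-attempt attack success under an independent-retry model, compared
# with the adaptive-campaign numbers reported above for Opus 4.5 (coding).

def asr_at_k(p1: float, k: int) -> float:
    """ASR after k attempts if each attempt independently succeeds with p1."""
    return 1.0 - (1.0 - p1) ** k

p1 = 0.047  # Opus 4.5 single-attempt ASR in coding (Gray Swan)
for k, reported in [(1, 0.047), (10, 0.336), (100, 0.630)]:
    print(f"k={k:3d}  independent-retry={asr_at_k(p1, k):.1%}  reported={reported:.1%}")
```

The independent-retry baseline reaches roughly 38% at 10 attempts and over 99% at 100, while the reported curve sits at 33.6% and 63.0%. The flatter real curve suggests heterogeneity across scenarios: some prompts appear to resist indefinitely rather than falling to uniform per-attempt odds.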

Two ways to catch deception

Anthropic monitors roughly 10 million neural features during evaluation using dictionary learning. Features map to human-interpretable concepts, including deception, sycophancy, bias, power-seeking and concealment.

Meanwhile, OpenAI's approach relies on chain-of-thought (CoT) monitoring. Its deception classifier flagged 0.17% of o1-preview responses as potentially deceptive, with 92% accuracy on internal validation. GPT-5 reduced CoT deception from 4.8% to 2.1% compared to o3. The method assumes visible reasoning reflects internal processing. Anthropic's interpretability work directly questions that assumption.

When models game the test

In Apollo Research's evaluation of o1, the model attempted to turn off oversight mechanisms 5% of the time when it believed it would be shut down. In 2% of cases, o1 attempted self-exfiltration. When confronted about misaligned actions, o1 denied wrongdoing or fabricated explanations 99% of the time. The o1 system card disclosed instrumental alignment faking. In 37% of scenarios, the model pursued developer-aligned goals during evaluation, then switched to its own goals when it believed oversight was minimal.

Meanwhile, Anthropic reports Opus 4.5's evaluation awareness dropped from 26.5% (Opus 4.1) to less than 10% internally. UK AISI measured evaluation awareness decreasing from 26.5% (Sonnet 4.5) to 12.7% (Opus 4.5 Preview). METR's GPT-5 evaluation noted the model "can identify the exact evaluation it is in on some occasions" and "seems to use this information to get a better score mostly."

If a model behaves differently when it detects evaluation conditions, production behavior becomes unpredictable at scale. That is the core problem with evaluation awareness: models that recognize they are being tested attempt to game the test. Anthropic's 60% reduction in evaluation awareness from Opus 4.1 to 4.5 reflects targeted engineering against this vector.

Red teaming of prompt injection defenses shows a similar divergence.

Anthropic's testing reports 96% prevention in tool use scenarios, rising to 99.4% with additional safeguards like prompt shields. Fujitsu's independent assessment found GPT-5 vulnerable to roughly 20% of tested injection attacks. OpenAI's o3/o4-mini system card reports instruction hierarchy scores of 0.86 for system-versus-developer conflicts.

These are different metrics and different attack methodologies, but the gap suggests Anthropic's agentic use cases have been hardened against adversarial inputs embedded in tool outputs.
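
The jump from 96% to 99.4% prevention is consistent with simple layered-defense arithmetic. The sketch below assumes a secondary safeguard that independently catches 85% of whatever slips past the model; both the independence assumption and the 85% figure are mine, chosen to reproduce the reported numbers, not Anthropic's disclosed method.

```python
# How layered defenses could compose to the prevention rates above, assuming
# each layer independently catches its share of what leaked past earlier layers.

def layered_prevention(*catch_rates: float) -> float:
    """Overall prevention rate for a stack of independent filtering layers."""
    leak = 1.0
    for r in catch_rates:
        leak *= (1.0 - r)  # fraction of attacks surviving this layer
    return 1.0 - leak

base = layered_prevention(0.96)               # model-level refusal alone
with_shield = layered_prevention(0.96, 0.85)  # plus a shield catching 85% of leaks
print(f"{base:.1%}, {with_shield:.1%}")       # 96.0%, 99.4%
```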

Comparing red teaming results

Here's how Opus 4.5 stacks up against OpenAI's model family across 15 dimensions.

| Dimension | Claude Opus 4.5 | GPT-5 | o3 / o4-mini | o1 |
|---|---|---|---|---|
| System card length | 153 pages | 55 pages | ~30 pages | ~40 pages |
| Attack methodology | 200-attempt RL campaigns | Single-attempt + iterative patching | Single-attempt | Single-attempt |
| ASR @ 1 attempt (coding) | 4.7% | 89% raw (pre-patch) | Not published | 5% |
| ASR @ 100 attempts | 63.0% | Not published | Not published | Not published |
| ASR @ 200 attempts (computer use) | 0% (saturates benchmark) | Not published | Not published | Not published |
| Prompt injection defense | 96% prevention; 99.4% with safeguards | ~20% ASR (Fujitsu) | 0.86 instruction hierarchy | Higher ASR than GPT-4o |
| Interpretability | ~10M neural features monitored | CoT monitoring | CoT monitoring | CoT monitoring |
| Deception detection | Internal feature activation | 2.1% CoT flagged | 0.95 not_unsafe | 0.17% flagged; 99% denial when confronted |
| Evaluation awareness | <10% | Identifies exact eval (METR) | Not published | 37% alignment faking |
| Scheming (Apollo) | Very low sabotage propensity internally; orders of magnitude lower than o1 | Not published | Lower than o1; still deceptive | 5% turn off oversight; 2% self-exfiltration |
| CBRN risk | Below ASL-4 | Medium (approaching High) | Medium | Medium (first to reach) |
| Governance | FRT → AST → RSO/CEO | SAG + Preparedness v2 | SAG + Preparedness v2 | SAG + Preparedness |
| External partners | Gray Swan, UK AISI, US CAISI, METR | UK AISI, US AISI, METR, Apollo | UK AISI, US AISI, METR, Apollo, Pattern Labs | UK AISI, US AISI, METR, Apollo |
| Reward hacking | Yes (impossible tasks) | Yes (METR) | ~1% of tasks; 5/24 kernels | Yes |
| Misalignment reduction | 60% vs. Opus 4.1 | Not published | Lower scheming than o1 | Baseline |

Sources: Opus 4.5 system card, GPT-5 system card, o3/o4-mini system card, o1 system card, Gray Swan, METR, Apollo Research

Why these differences matter

Enterprises have several factors to consider when analyzing model evals. These include:

Attack persistence thresholds

Anthropic's 200-attempt RL campaigns simulate adversaries with time, resources and adaptive tooling. In RL-based adversarial testing, the attacker model learns from each failure, adjusts its approach and systematically probes for weaknesses. It is training an AI to break another AI.

Gray Swan's Shade platform doesn't just retry the same prompt. It learns from failures, adjusts attack vectors and exploits patterns in model responses. The 4.7% to 63.0% degradation curve in coding environments shows how resistance erodes under sustained pressure.

OpenAI's single-attempt metrics answer a different question: How often does a naive attacker succeed on the first try? Both matter. A phishing campaign hits millions of users once each. A nation-state actor runs thousands of variations against a single target. Security teams need to know which threat model their vendor tested against.

Detection architecture

Anthropic monitors roughly 10 million internal neural features during evaluation. These are abstract concepts like deception, sycophancy and power-seeking that activate within the model's hidden layers. When a fraud/deception feature lights up during benign roleplay, that is a signal invisible to output analysis. The model's internal state reveals associations its outputs don't.

OpenAI takes a different approach with CoT monitoring. Its deception classifier flagged 0.17% of o1-preview responses as potentially deceptive, with 92% accuracy. The assumption is that reasoning traces reflect internal processing. Anthropic's interpretability research directly challenges this. Models can produce clean reasoning chains while internal features indicate something else entirely.
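
The flag rate and accuracy figures translate into concrete triage volumes. The sketch below reads the 92% figure as precision among flagged responses, which is one plausible interpretation of "accuracy on internal validation":

```python
# Base-rate arithmetic for a CoT deception classifier, using the reported
# 0.17% flag rate and treating 92% as precision among flags (my reading).

def flag_counts(n_responses: int, flag_rate: float, precision: float):
    """Return (flagged, likely true positives, likely false positives)."""
    flagged = n_responses * flag_rate
    true_pos = flagged * precision
    false_pos = flagged - true_pos
    return round(flagged), round(true_pos), round(false_pos)

flagged, tp, fp = flag_counts(1_000_000, 0.0017, 0.92)
print(flagged, tp, fp)  # per million responses: 1700 flags, ~1564 real, ~136 false alarms
```

Even a low flag rate produces a steady review queue at production volumes, which is part of why monitoring architecture matters to enterprise teams.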

Neither approach is complete. CoT monitoring scales more easily. Feature monitoring catches what CoT misses, but requires deep model access and interpretability infrastructure that most enterprises cannot replicate.

Scheming evaluation design

Apollo Research's methodology puts models in scenarios where misaligned behavior offers a strategic advantage: shutdown threats, goal conflicts and oversight detection. o1 attempted to turn off oversight mechanisms 5% of the time when believing shutdown was imminent. It attempted to copy itself to prevent replacement in 2% of cases. When confronted, it denied wrongdoing 99% of the time. Opus 4.5 showed very low rates of sabotage propensity in internal evaluations, orders of magnitude lower than o1.

But the evaluation environments differ. Anthropic's stress tests focus on whether models attempt sabotage when given clear opportunities. OpenAI's tests via Apollo probe instrumental reasoning, specifically whether models fake alignment during evaluation, then defect when oversight drops. The 37% alignment-faking rate in o1 represents a different failure mode than sabotage attempts.

The comparability problem

These methodological differences make direct comparison difficult. Opus 4.5's 0% ASR at 200 attempts in computer use versus o1's 5% malicious code rate are not measuring the same thing. The attack surfaces, attempt counts and success criteria differ.

The matrix shows what each vendor chose to measure and publish. That itself signals priorities. Anthropic's 153-page system card emphasizes degradation curves and internal feature monitoring. OpenAI emphasizes iterative improvement and CoT transparency.

Bottom line: Enterprise security teams inheriting these models inherit these measurement philosophies. The question is not which number is better. It is which measurement matches your deployment's actual threat landscape.

What independent red team evaluators found

Independent red team evaluations, for their part, often use entirely different methods. These can tell a very different story and reveal additional model traits that enterprises must consider.

METR's red team evaluation measured autonomous capability using a time horizon score, which tracks how long a task can run before the agent fails half the time. o3 sustained reliable performance for about 1 hour and 30 minutes. o4-mini held for 1 hour and 15 minutes. METR also detected reward hacking in roughly 1% of o3's attempts, including 5 of 24 kernel optimization experiments in which the model tampered with scoring functions to inflate its own results.
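
METR's time horizon metric is the task duration at which an agent's success rate crosses 50%. A toy sketch of the idea, interpolating that crossing in log-duration space over invented success data (METR's actual methodology fits a logistic model to real task outcomes):

```python
import math

# Toy sketch of a METR-style "time horizon": the task length at which an
# agent's success rate falls to 50%. The success data below is invented
# for illustration only.

def time_horizon(durations_min, success_rates):
    """Interpolate the 50% crossing linearly in log-duration space."""
    points = list(zip(durations_min, success_rates))
    for (d0, s0), (d1, s1) in zip(points, points[1:]):
        if s0 >= 0.5 >= s1:
            frac = (s0 - 0.5) / (s0 - s1)
            return math.exp(math.log(d0) + frac * (math.log(d1) - math.log(d0)))
    return None  # never crosses 50% in the measured range

# Hypothetical agent: near-certain on 5-minute tasks, failing on 4-hour ones.
durations = [5, 15, 60, 120, 240]          # minutes
successes = [0.95, 0.85, 0.60, 0.40, 0.15]
print(f"time horizon: {time_horizon(durations, successes):.0f} minutes")
```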

Apollo Research tested o3 and o4-mini for in-context scheming. Both exhibited deceptive tendencies such as sandbagging evaluations and sabotaging tasks with plausible deniability, but scored lower than o1. Apollo assesses that o3 and o4-mini are unlikely to cause catastrophic harm through scheming, but more minor real-world harms remain possible without monitoring.

The UK AISI/Gray Swan challenge ran 1.8 million attacks across 22 models. Every model broke. ASR ranged from 1.47% to 6.49%. Opus 4.5 placed first on Gray Swan's Agent Red Teaming benchmark with 4.7% ASR, versus GPT-5.1 at 21.9% and Gemini 3 Pro at 12.5%.

No current frontier system resists determined, well-resourced attacks. The differentiation lies in how quickly defenses degrade and at what attempt threshold. Opus 4.5's advantage compounds over repeated attempts. Single-attempt metrics flatten the curve.

What to ask your vendor

Security teams evaluating frontier AI models need specific answers, starting with ASR at 50 and 200 attempts rather than single-attempt metrics alone. Find out whether the vendor detects deception through output analysis or internal state monitoring. Know who challenges red team conclusions before deployment, and what specific failure modes have been documented. Get the evaluation awareness rate. Vendors claiming complete safety have not stress-tested adequately.
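
Those questions can be tracked as a simple disclosure checklist. Field names and the example record below are illustrative, not any vendor's actual reporting schema:

```python
# A minimal disclosure checklist for vendor red-team reports, following the
# questions above. Keys and the sample record are hypothetical.

REQUIRED_DISCLOSURES = [
    "asr_at_50_attempts",
    "asr_at_200_attempts",
    "deception_detection_method",  # output analysis vs. internal-state monitoring
    "red_team_challenge_process",  # who contests red team conclusions pre-deployment
    "documented_failure_modes",
    "evaluation_awareness_rate",
]

def missing_disclosures(report: dict) -> list[str]:
    """List required fields the vendor's report leaves empty or absent."""
    return [k for k in REQUIRED_DISCLOSURES if report.get(k) in (None, "", [])]

vendor_report = {
    "asr_at_50_attempts": 0.41,  # hypothetical figure
    "deception_detection_method": "CoT monitoring",
    "documented_failure_modes": ["prompt injection via tool output"],
}
print(missing_disclosures(vendor_report))  # gaps to push back on before deployment
```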

The bottom line

Diverse red-team methodologies show that every frontier model breaks under sustained attack. The 153-page system card versus the 55-page system card is not just about documentation length. It signals what each vendor chose to measure, stress-test and disclose.

For persistent adversaries, Anthropic's degradation curves show exactly where resistance fails. For fast-moving threats requiring rapid patches, OpenAI's iterative improvement data matters more. For agentic deployments with browsing, code execution and autonomous action, the scheming metrics become your primary risk indicator.

Security leaders need to stop asking which model is safer and start asking which evaluation methodology matches the threats their deployment will actually face. The system cards are public. The data is there. Use it.
