SEC-bench Pro

Can Language Models Solve Long-Horizon Software Security Tasks?

SEC-bench Pro is a self-evolving security benchmark that measures agents' ability to hunt security bugs in critical software systems. More projects and security tasks will be added over time. Use the Version selector to switch between benchmark snapshots.

V8 is Google's open-source JavaScript and WebAssembly engine. This track includes 103 source-file instances.

| # | Model | Success | Completed | Provider | Backend |
|---|-------|---------|-----------|----------|---------|
| 1 | | 47.6% (49/103) | 102/103 (1 timed out) | OpenAI | OpenAI |
| 2 | | 35.0% (36/103) | 98/103 (5 timed out) | OpenAI | OpenAI |
| 3 | Opus 4.6 (max) Claude Code | 22.3% (23/103) | 36/103 (67 timed out) | Anthropic | AWS Bedrock |
| 4 | GLM-5 (high) OpenCode (Open) | 1.9% (2/103) | 84/103 (19 timed out) | Z.ai | AWS Bedrock |
| 5 | Kimi K2.5 (high) OpenCode (Open) | 1.9% (2/103) | 101/103 (2 timed out) | Moonshot AI | AWS Bedrock |
| 6 | (Open) | 0.0% (0/103) | 98/103 (5 timed out) | MiniMax | AWS Bedrock |
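
The published figures are consistent with Success being the number of solved instances divided by all 103 instances in the track, with timed-out runs still counted in the denominator. The short check below is only a sketch under that assumption; the `rows` mapping and the `success_rate` helper are illustrative names, and the numbers are copied from the leaderboard above.

```python
TOTAL = 103  # source-file instances in the V8 track

# rank -> (solved, completed, timed_out), copied from the leaderboard
rows = {
    1: (49, 102, 1),
    2: (36, 98, 5),
    3: (23, 36, 67),
    4: (2, 84, 19),
    5: (2, 101, 2),
    6: (0, 98, 5),
}


def success_rate(solved: int, total: int = TOTAL) -> str:
    """Format solved/total as a percentage with one decimal place."""
    return f"{100 * solved / total:.1f}%"


for rank, (solved, completed, timed_out) in rows.items():
    # Every instance either completed or timed out.
    assert completed + timed_out == TOTAL
    print(rank, success_rate(solved), f"{solved}/{TOTAL}")
```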

Newer results for Claude Opus and Mythos models are not available due to safeguard restrictions.

News

  • [June 17, 2026]: Overall and Linux leaderboards are now built.
  • [May 5, 2026]: SpiderMonkey leaderboard has been released.
  • [May 1, 2026]: SEC-bench Pro launches with the V8 leaderboard.

Overview

SEC-bench Pro evaluates agents on vulnerability discovery and proof-of-concept (PoC) generation tasks against challenging targets. Each target instance ships with a Docker image, metadata, a rendered prompt, and a harness. For each run, the harness renders the prompt from meta.json, starts the instance container, runs the configured agent under a timeout, and collects the generated files and run artifacts for checker evaluation.
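
The harness flow described above (render the prompt, start the container, run the agent under a timeout, collect artifacts) can be sketched roughly as follows. This is not the SEC-bench Pro harness itself: the function names, the `RunResult` type, the `$`-style template, and the meta.json field names are all hypothetical, chosen only to illustrate the steps. The `docker run` flags used are standard Docker CLI options.

```python
import json
import subprocess
from dataclasses import dataclass
from pathlib import Path
from string import Template
from typing import List, Optional


@dataclass
class RunResult:
    timed_out: bool
    returncode: Optional[int]  # None when the agent hit the timeout
    artifacts: List[str]       # files found in the working directory


def render_prompt(meta_path: Path, template: str) -> str:
    """Fill a $-style template from meta.json (field names are hypothetical)."""
    meta = json.loads(meta_path.read_text())
    return Template(template).safe_substitute(meta)


def docker_run_command(image: str, name: str) -> List[str]:
    """Build the command that would start the instance container in the background."""
    return ["docker", "run", "--detach", "--rm", "--name", name, image]


def run_agent(cmd: List[str], workdir: Path, timeout_s: float) -> RunResult:
    """Run the agent command under a wall-clock limit, then list the files it left."""
    try:
        proc = subprocess.run(cmd, cwd=workdir, timeout=timeout_s)
        timed_out, returncode = False, proc.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        timed_out, returncode = True, None
    artifacts = sorted(
        str(p.relative_to(workdir)) for p in workdir.rglob("*") if p.is_file()
    )
    return RunResult(timed_out, returncode, artifacts)
```

A caller would typically start the container with `subprocess.run(docker_run_command(image, name), check=True)`, pass the rendered prompt to the agent command, and hand `RunResult.artifacts` to the checker. A timed-out run is recorded rather than raised, which matches how the leaderboard reports "timed out" counts alongside completed runs.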