SEC-bench Pro

Can Language Models Solve Long-Horizon Software Security Tasks?

SEC-bench Pro is a self-evolving security benchmark that measures agents' ability to hunt security bugs in critical software systems. More projects and security tasks will be added over time. Use Version to switch between benchmark snapshots.

The Overall leaderboard aggregates the V8, Firefox, and Linux instances into a single 344-instance ranking. Split bars show how each project contributes to the score.

#  Model                                Success          Completed                Provider     Backend
1  GPT-5.5 (xhigh) Codex                63.1% (217/344)  340/344 (4 timed out)    OpenAI       OpenAI
2  GPT-5.4 (xhigh) Codex                48.5% (167/344)  333/344 (11 timed out)   OpenAI       OpenAI
3  Opus 4.6 (max) Claude Code           29.9% (103/344)  140/344 (204 timed out)  Anthropic    AWS Bedrock
4  GLM-5 (high) OpenCode [Open]         6.4% (22/344)    303/344 (41 timed out)   Z.ai         AWS Bedrock
5  Kimi K2.5 (high) OpenCode [Open]     4.4% (15/344)    337/344 (7 timed out)    Moonshot AI  AWS Bedrock
6  MiniMax M2.5 (high) OpenCode [Open]  0.6% (2/344)     335/344 (9 timed out)    MiniMax      AWS Bedrock
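
The percentages in the Success column are consistent with the solved count divided by all 344 instances, timed-out runs included, and each row's completed and timed-out counts sum to 344. A minimal Python sketch that re-derives the figures from the counts listed above (the helper name `success_rate` is illustrative, not part of the benchmark):

```python
TOTAL = 344  # instances in the Overall leaderboard

# (model, solved, completed, timed_out), copied from the leaderboard rows
ROWS = [
    ("GPT-5.5 (xhigh) Codex", 217, 340, 4),
    ("GPT-5.4 (xhigh) Codex", 167, 333, 11),
    ("Opus 4.6 (max) Claude Code", 103, 140, 204),
    ("GLM-5 (high) OpenCode", 22, 303, 41),
    ("Kimi K2.5 (high) OpenCode", 15, 337, 7),
    ("MiniMax M2.5 (high) OpenCode", 2, 335, 9),
]


def success_rate(solved: int, total: int = TOTAL) -> float:
    """Solved instances as a percentage of all instances, to one decimal."""
    return round(100 * solved / total, 1)


for model, solved, completed, timed_out in ROWS:
    # Every instance either completed or timed out.
    assert completed + timed_out == TOTAL
    print(f"{model}: {success_rate(solved)}% ({solved}/{TOTAL})")
```

Because the denominator is always 344, a run that times out counts against the score in the same way as a completed run that fails the checker.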

Newer results for Claude Opus and Mythos models are not available due to safeguard restrictions.

News

  • [June 17, 2026]: Overall and Linux leaderboards are now built.
  • [May 5, 2026]: SpiderMonkey leaderboard has been released.
  • [May 1, 2026]: SEC-bench Pro launches with the V8 leaderboard.

Overview

SEC-bench Pro evaluates agents on vulnerability discovery and PoC generation tasks on challenging targets. Each target instance ships with a Docker image, metadata, a rendered prompt, and a harness. The harness renders the prompt from meta.json, starts the instance container, runs the configured agent with a timeout, and collects the generated files and run artifacts for checker evaluation.
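
The harness flow described above can be sketched roughly as follows. This is an illustrative Python outline, not the SEC-bench Pro implementation: the function names, the `$project` placeholder, and the returned dictionary layout are all assumptions, and the container start is left out (in a real harness the agent command would presumably be wrapped in something like a `docker run` invocation for the instance image).

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path
from string import Template


def render_prompt(meta_path: Path, template: str) -> str:
    """Fill a prompt template from meta.json (field names are hypothetical)."""
    meta = json.loads(meta_path.read_text())
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(template).safe_substitute(meta)


def run_agent(cmd: list[str], timeout_s: float, workdir: Path) -> dict:
    """Run the agent command with a wall-clock timeout and record the outcome."""
    try:
        proc = subprocess.run(
            cmd, cwd=workdir, capture_output=True, text=True, timeout=timeout_s
        )
        status, stdout = "completed", proc.stdout
    except subprocess.TimeoutExpired:
        status, stdout = "timed_out", ""
    # Collect whatever files are in the working directory for the checker.
    files = sorted(p.name for p in workdir.iterdir() if p.is_file())
    return {"status": status, "stdout": stdout, "files": files}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "meta.json").write_text(json.dumps({"project": "V8"}))
        prompt = render_prompt(workdir / "meta.json", "Find a bug in $project.")
        # Stand-in "agent": a Python one-liner that writes a PoC file.
        agent = [sys.executable, "-c", "open('poc.js', 'w').write('// poc')"]
        print(prompt)
        print(run_agent(agent, timeout_s=30, workdir=workdir))
```

Recording a timeout as a status rather than raising keeps the run in the results, which matches how the leaderboard reports timed-out instances alongside completed ones.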