SEC-bench Pro

Can Language Models Solve Long-Horizon Software Security Tasks?

SEC-bench Pro is a self-evolving security benchmark that measures agents' ability to hunt security bugs in critical software systems. More projects and security tasks will be added over time. Use Version to switch between benchmark snapshots.

The Firefox track covers vulnerabilities in SpiderMonkey, Firefox's JavaScript and WebAssembly engine. It includes 104 source-file instances.

#  Model                             Success          Completed                Provider     Backend
1  (not shown)                       44.2% (46/104)   103/104 (1 timed out)    OpenAI       OpenAI
2  Opus 4.6 (max) Claude Code        27.9% (29/104)   43/104 (61 timed out)    Anthropic    AWS Bedrock
3  (not shown)                       25.0% (26/104)   104/104 (0 timed out)    OpenAI       OpenAI
4  GLM-5 (high) OpenCode [Open]      5.8% (6/104)     96/104 (8 timed out)     Z.ai         AWS Bedrock
5  Kimi K2.5 (high) OpenCode [Open]  2.9% (3/104)     103/104 (1 timed out)    Moonshot AI  AWS Bedrock
6  (not shown) [Open]                0.0% (0/104)     100/104 (4 timed out)    MiniMax      AWS Bedrock
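
The percentages in the Success column can be cross-checked from the raw counts. The short sketch below recomputes them, assuming (as the numbers themselves suggest) that success is measured against all 104 instances rather than against completed runs only, and that completed plus timed-out runs add up to 104. The variable and function names are illustrative and not part of SEC-bench Pro.

```python
TOTAL_INSTANCES = 104  # size of the Firefox track, from the text above

# (rank, solved, completed, timed_out), copied from the leaderboard rows
rows = [
    (1, 46, 103, 1),
    (2, 29, 43, 61),
    (3, 26, 104, 0),
    (4, 6, 96, 8),
    (5, 3, 103, 1),
    (6, 0, 100, 4),
]


def success_rate(solved: int, total: int = TOTAL_INSTANCES) -> str:
    """Format solved/total as a percentage with one decimal place."""
    return f"{solved / total:.1%}"


rates = [success_rate(solved) for _, solved, _, _ in rows]

# Every run either completed or timed out, so the two counts cover the track.
accounted = [completed + timed_out for _, _, completed, timed_out in rows]

for (rank, *_), rate in zip(rows, rates):
    print(rank, rate)
```

Under this reading, a timed-out run counts against the success rate, which is why the rank 2 entry scores 27.9% overall despite completing only 43 of its 104 runs.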

Newer results for Claude Opus and Mythos models are not available due to safeguard restrictions.

News

  • [June 17, 2026]: Overall and Linux leaderboards are now built.
  • [May 5, 2026]: SpiderMonkey leaderboard has been released.
  • [May 1, 2026]: SEC-bench Pro launches with the V8 leaderboard.

Overview

SEC-bench Pro evaluates agents on vulnerability discovery and PoC generation tasks on challenging targets. Each target instance ships with a Docker image, metadata, a rendered prompt, and a harness. The harness renders the prompt from meta.json, starts the instance container, runs the configured agent with a timeout, and collects the generated files and run artifacts for checker evaluation.
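
The harness steps described above (render the prompt, start the container, run the agent under a timeout, collect artifacts) can be sketched roughly as follows. This is a minimal illustration and not the actual SEC-bench Pro harness: the `meta.json` fields (`id`, `image`, `workdir`), the `prompt.tmpl` file name, the `$placeholder` template syntax, and the way the agent command is invoked are all assumptions made for the example. Only standard `docker run`, `docker exec`, `docker cp`, and `docker rm` commands are used.

```python
import json
import subprocess
from pathlib import Path
from string import Template


def render_prompt(meta: dict, template: str) -> str:
    """Fill $placeholders in a prompt template from meta.json fields."""
    return Template(template).safe_substitute(meta)


def run_instance(instance_dir: Path, agent_cmd: list, timeout_s: int,
                 out_dir: Path, run=subprocess.run) -> dict:
    """Run one benchmark instance; `run` is injectable so it can be faked."""
    # Hypothetical layout: meta.json and prompt.tmpl live next to each other.
    meta = json.loads((instance_dir / "meta.json").read_text())
    template = (instance_dir / "prompt.tmpl").read_text()
    prompt = render_prompt(meta, template)

    # Start the instance container detached; assumes the image has `sleep`.
    started = run(["docker", "run", "-d", meta["image"], "sleep", "infinity"],
                  capture_output=True, text=True, check=True)
    container = started.stdout.strip()

    timed_out = False
    try:
        try:
            # Run the configured agent inside the container with a time limit.
            run(["docker", "exec", container, *agent_cmd, prompt],
                timeout=timeout_s)
        except subprocess.TimeoutExpired:
            timed_out = True
        # Copy generated files out, even after a timeout, for the checker.
        out_dir.mkdir(parents=True, exist_ok=True)
        run(["docker", "cp", f"{container}:{meta['workdir']}", str(out_dir)])
    finally:
        # Always remove the container, which also stops a still-running agent.
        run(["docker", "rm", "-f", container])

    return {"instance": meta["id"], "timed_out": timed_out,
            "artifacts": str(out_dir)}
```

Recording `timed_out` separately from the collected artifacts mirrors the leaderboard's split between completed and timed-out runs; a checker would then evaluate whatever files were copied into `out_dir`.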