SEC-bench Pro

Can Language Models Solve Long-Horizon Software Security Tasks?

SEC-bench Pro is a self-evolving security benchmark that measures agents' ability to hunt security bugs in critical software systems. More projects and security tasks will be added over time. Use Version to switch between benchmark snapshots.

The Linux track evaluates kernel vulnerability discovery and PoC generation. It includes 137 source-file instances.

#  Model                                   Success          Completed                Provider     Backend
1  (name missing)                          89.1% (122/137)  135/137 (2 timed out)    OpenAI       OpenAI
2  (name missing)                          76.6% (105/137)  131/137 (6 timed out)    OpenAI       OpenAI
3  Opus 4.6 (max) Claude Code              37.2% (51/137)   61/137 (76 timed out)    Anthropic    AWS Bedrock
4  GLM-5 (high) OpenCode [Open]            10.2% (14/137)   123/137 (14 timed out)   Z.ai         AWS Bedrock
5  Kimi K2.5 (high) OpenCode [Open]        7.3% (10/137)    133/137 (4 timed out)    Moonshot AI  AWS Bedrock
6  (name missing) [Open]                   1.5% (2/137)     137/137 (0 timed out)    MiniMax      AWS Bedrock

Newer results for Claude Opus and Mythos models are not available due to safeguard restrictions.
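
The Success and Completed figures in the leaderboard above appear to follow from simple ratios over the 137 instances. The following minimal sketch reproduces them under that assumption; the function names success_rate and timed_out are hypothetical and not part of SEC-bench Pro.

```python
TOTAL_INSTANCES = 137  # size of the Linux track, as stated above


def success_rate(solved: int, total: int = TOTAL_INSTANCES) -> float:
    """Percentage of instances solved, rounded to one decimal place."""
    return round(100 * solved / total, 1)


def timed_out(completed: int, total: int = TOTAL_INSTANCES) -> int:
    """Instances that did not complete (assumed to equal the timeouts)."""
    return total - completed


# (solved, completed) pairs copied from the leaderboard rows 1 to 6.
rows = [(122, 135), (105, 131), (51, 61), (14, 123), (10, 133), (2, 137)]
for rank, (solved, completed) in enumerate(rows, start=1):
    print(rank, success_rate(solved), timed_out(completed))
```

Note that the success rate is taken over all 137 instances, not only over completed runs, so a run that times out counts against the score.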

News

  • [June 17, 2026]: Overall and Linux leaderboards are now built.
  • [May 5, 2026]: SpiderMonkey leaderboard has been released.
  • [May 1, 2026]: SEC-bench Pro launches with the V8 leaderboard.

Overview

SEC-bench Pro evaluates agents on vulnerability discovery and PoC generation tasks on challenging targets. Each target instance ships with a Docker image, metadata, a rendered prompt, and a harness. The harness renders the prompt from meta.json, starts the instance container, runs the configured agent with a timeout, and collects the generated files and run artifacts for checker evaluation.
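
The harness flow described above (render the prompt from meta.json, run the agent with a timeout, collect the generated files) can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual SEC-bench Pro harness: the function names, the "$field" template syntax, and the meta.json field "target" are all hypothetical, and container startup is omitted so that a local subprocess stands in for the agent running inside the instance container.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path
from string import Template


def render_prompt(meta_path: Path, template: str) -> str:
    """Fill a prompt template with fields read from meta.json (hypothetical format)."""
    meta = json.loads(meta_path.read_text())
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(template).safe_substitute(meta)


def run_agent(cmd: list[str], workdir: Path, timeout_s: float) -> dict:
    """Run the agent command in workdir and record whether it hit the timeout."""
    try:
        proc = subprocess.run(
            cmd, cwd=workdir, capture_output=True, text=True, timeout=timeout_s
        )
        return {"timed_out": False, "returncode": proc.returncode}
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return {"timed_out": True, "returncode": None}


def collect_artifacts(workdir: Path) -> list[str]:
    """List the files left in workdir, as paths relative to it."""
    return sorted(
        str(p.relative_to(workdir)) for p in workdir.rglob("*") if p.is_file()
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        (work / "meta.json").write_text(json.dumps({"target": "linux"}))
        prompt = render_prompt(work / "meta.json", "Find a bug in $target")
        # Stand-in agent: a Python one-liner that writes a PoC file.
        agent = [sys.executable, "-c", "open('poc.txt', 'w').write('poc')"]
        result = run_agent(agent, work, timeout_s=30)
        print(prompt, result, collect_artifacts(work))
```

Recording a timeout as a result rather than raising matches how the leaderboard reports timed-out runs separately from completed ones; a checker would then evaluate only the collected files.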