News
- [June 17, 2026]: Overall and Linux leaderboards are now built.
- [May 5, 2026]: SpiderMonkey leaderboard has been released.
- [May 1, 2026]: SEC-bench Pro launches with the V8 leaderboard.
Can Language Models Solve Long-Horizon Software Security Tasks?
SEC-bench Pro is a self-evolving security benchmark that measures agents' ability to hunt security bugs in critical software systems. More projects and security tasks will be added over time. Use Version to switch between benchmark snapshots.
Overall aggregates V8, Firefox, and Linux into a 344-instance leaderboard. Split bars show how each project contributes to the score.
| # | Model | Success | Completed | Provider | Backend |
|---|---|---|---|---|---|
| 1 | GPT-5.5 (xhigh) | 63.1% (217/344) | 340/344 (4 timed out) | OpenAI | OpenAI |
| 2 | GPT-5.4 (xhigh) | 48.5% (167/344) | 333/344 (11 timed out) | OpenAI | OpenAI |
| 3 | Opus 4.6 (max) | 29.9% (103/344) | 140/344 (204 timed out) | Anthropic | AWS Bedrock |
| 4 | GLM-5 (high), Open | 6.4% (22/344) | 303/344 (41 timed out) | Z.ai | AWS Bedrock |
| 5 | Kimi K2.5 (high), Open | 4.4% (15/344) | 337/344 (7 timed out) | Moonshot AI | AWS Bedrock |
| 6 | MiniMax M2.5 (high), Open | 0.6% (2/344) | 335/344 (9 timed out) | MiniMax | AWS Bedrock |
Newer results for Claude Opus and Mythos models are not available due to safeguard restrictions.
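The figures in the leaderboard are consistent with Success being computed over all 344 instances rather than over completed runs only, so a timed-out run counts as a failure (for example, 217/344 ≈ 63.1%). The sketch below reproduces that arithmetic from the published counts. The names `Row` and `success_rate` are illustrative assumptions, not part of the benchmark's code.

```python
from dataclasses import dataclass

TOTAL_INSTANCES = 344  # size of the Overall leaderboard, as stated above


@dataclass(frozen=True)
class Row:
    """One leaderboard row; field names are illustrative, not the benchmark's schema."""
    model: str
    solved: int
    completed: int

    @property
    def timed_out(self) -> int:
        # Runs that did not complete are reported as timed out.
        return TOTAL_INSTANCES - self.completed


def success_rate(row: Row) -> float:
    """Percentage of all instances solved, rounded to one decimal place."""
    return round(100 * row.solved / TOTAL_INSTANCES, 1)


# Counts copied from the leaderboard table.
ROWS = [
    Row("GPT-5.5 (xhigh)", 217, 340),
    Row("GPT-5.4 (xhigh)", 167, 333),
    Row("Opus 4.6 (max)", 103, 140),
    Row("GLM-5 (high)", 22, 303),
    Row("Kimi K2.5 (high)", 15, 337),
    Row("MiniMax M2.5 (high)", 2, 335),
]

for row in ROWS:
    print(f"{row.model}: {success_rate(row)}% solved, {row.timed_out} timed out")
```

Because the denominator is fixed at 344, a model with many timeouts is penalized for every instance it did not finish.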
meta.json, starts the instance container, runs the configured agent with a timeout, and collects the generated files and run artifacts for checker evaluation.
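The steps named above (reading `meta.json`, running the agent with a timeout, collecting generated files) could be sketched roughly as follows. This is a hypothetical illustration, not the benchmark's harness: the function name `run_instance`, the `id` key in `meta.json`, and the `work` directory are all assumptions, and the container start is left as a comment so the sketch runs without a container runtime.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Sequence


def run_instance(instance_dir: Path, agent_cmd: Sequence[str], timeout_s: float) -> dict:
    """Run one instance and gather what a checker would need (hypothetical sketch)."""
    # 1. Read the instance metadata. The "id" key is an assumed field name.
    meta = json.loads((instance_dir / "meta.json").read_text())

    # 2. The real harness starts the instance container here; this sketch
    #    uses a plain working directory instead.
    workdir = instance_dir / "work"
    workdir.mkdir(exist_ok=True)

    # 3. Run the configured agent with a timeout.
    timed_out = False
    log = ""
    try:
        proc = subprocess.run(
            list(agent_cmd),
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        log = proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        timed_out = True

    # 4. Collect the generated files for later checker evaluation.
    files = sorted(p.name for p in workdir.iterdir() if p.is_file())
    return {"instance": meta.get("id"), "timed_out": timed_out, "files": files, "log": log}


def demo() -> dict:
    """Run a stand-in 'agent' that just writes one file."""
    with tempfile.TemporaryDirectory() as tmp:
        inst = Path(tmp)
        (inst / "meta.json").write_text(json.dumps({"id": "demo-1"}))
        agent = [sys.executable, "-c", "open('poc.js', 'w').write('// poc')"]
        return run_instance(inst, agent, timeout_s=30)
```

Recording `timed_out` separately from the collected files mirrors how the leaderboard reports completed and timed-out runs as distinct counts.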
Can LLM Agents Solve Critical Security Challenges?
Performance of LLM agents on security engineering tasks.
PoC Generation evaluates agents on generating working security proof-of-concepts for real vulnerabilities.
| Mode | Information Provided | Description |
|---|---|---|
| poc-repo | Repository only | Agent receives only the vulnerable code. Must discover and validate the bug from scratch. |
| poc-desc | Repository + Description | Agent receives repository plus high-level vulnerability description. |
| poc-san | Repository + Description + Sanitizer Report | Agent receives all available information including crash stack trace. |
Note: Current leaderboard results are evaluated using the poc-san mode. Read more about evaluation modes in our blog post.
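The three modes differ only in which inputs the agent is allowed to see, which can be expressed as a small lookup. The sketch below is an illustration under assumed names (`MODE_INPUTS`, `visible_inputs`, and the task field names are hypothetical, and the task values are placeholders), not the benchmark's configuration format.

```python
# Inputs each PoC Generation mode exposes to the agent, per the modes table.
# Key names such as "sanitizer_report" are assumed for illustration.
MODE_INPUTS = {
    "poc-repo": ("repository",),
    "poc-desc": ("repository", "description"),
    "poc-san": ("repository", "description", "sanitizer_report"),
}


def visible_inputs(mode: str, task: dict) -> dict:
    """Return only the task fields the given mode lets the agent see."""
    if mode not in MODE_INPUTS:
        raise ValueError(f"unknown mode: {mode!r}")
    return {key: task[key] for key in MODE_INPUTS[mode]}


# Placeholder task; none of these values come from the benchmark.
task = {
    "repository": "/path/to/vulnerable/checkout",
    "description": "heap overflow in a parser",
    "sanitizer_report": "ERROR: AddressSanitizer: heap-buffer-overflow ...",
}

for mode in MODE_INPUTS:
    print(mode, "->", sorted(visible_inputs(mode, task)))
```

Each mode is a superset of the previous one, so poc-san gives the agent the most information and poc-repo the least.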