News
- [June 17, 2026]: Overall and Linux leaderboards are now built.
- [May 5, 2026]: SpiderMonkey leaderboard has been released.
- [May 1, 2026]: SEC-bench Pro launches with the V8 leaderboard.
Can Language Models Solve Long-Horizon Software Security Tasks?
SEC-bench Pro is a self-evolving security benchmark that measures agents' ability to hunt security bugs in critical software systems. More projects and security tasks will be added over time. Use the Version selector to switch between benchmark snapshots.
V8 is Google's open-source JavaScript and WebAssembly engine. This track includes 103 source-file instances.
| # | Model | Success | Completed | Provider | Backend |
|---|---|---|---|---|---|
| 1 | | 47.6% (49/103) | 102/103 (1 timed out) | OpenAI | OpenAI |
| 2 | | 35.0% (36/103) | 98/103 (5 timed out) | OpenAI | OpenAI |
| 3 | | 22.3% (23/103) | 36/103 (67 timed out) | Anthropic | AWS Bedrock |
| 4 | | 1.9% (2/103) | 84/103 (19 timed out) | Z.ai | AWS Bedrock |
| 5 | | 1.9% (2/103) | 101/103 (2 timed out) | Moonshot AI | AWS Bedrock |
| 6 | | 0.0% (0/103) | 98/103 (5 timed out) | MiniMax | AWS Bedrock |
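In every row above, the success percentage is the solved count divided by the track's 103 instances, rounded to one decimal place, and the completed and timed-out counts sum to 103. A minimal Python sketch of that arithmetic, using the counts from the leaderboard (the function names are illustrative and not part of SEC-bench Pro):

```python
TOTAL_INSTANCES = 103  # size of the V8 track, as stated above


def success_rate(solved: int, total: int = TOTAL_INSTANCES) -> float:
    """Percentage of instances solved, rounded to one decimal place."""
    return round(100 * solved / total, 1)


def timed_out(completed: int, total: int = TOTAL_INSTANCES) -> int:
    """Runs that did not complete; the leaderboard reports these as timeouts."""
    return total - completed


# (solved, completed) pairs for ranks 1 through 6, copied from the leaderboard.
rows = [(49, 102), (36, 98), (23, 36), (2, 84), (2, 101), (0, 98)]

for rank, (solved, completed) in enumerate(rows, start=1):
    print(rank, f"{success_rate(solved)}%", f"{timed_out(completed)} timed out")
```

Note that the rate is taken over all 103 instances, not only over completed runs, so a model with many timeouts is penalized for them.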
Newer results for Claude Opus and Mythos models are not available due to safeguard restrictions.
meta.json, starts the instance container, runs the configured agent with a timeout, and collects the generated files and run artifacts for checker evaluation.
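The flow described above (read `meta.json`, run the agent under a timeout, collect what it produced) can be sketched roughly as follows. This is an assumption-laden illustration, not the SEC-bench Pro harness: the container step is replaced by a plain local subprocess, and the `id` key, the `out` directory, and the `run_instance` and `demo` names are all hypothetical.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_instance(instance_dir: Path, agent_cmd: list, timeout_s: float) -> dict:
    """Run one instance and report its status and the files the agent left behind."""
    # Hypothetical schema: the real meta.json layout is not shown in the source.
    meta = json.loads((instance_dir / "meta.json").read_text())
    out_dir = instance_dir / "out"  # hypothetical artifact directory
    out_dir.mkdir(exist_ok=True)
    try:
        # Stand-in for "start the container and run the agent": a local process.
        subprocess.run(agent_cmd, cwd=out_dir, timeout=timeout_s,
                       capture_output=True, text=True)
        status = "completed"
    except subprocess.TimeoutExpired:
        status = "timed_out"
    # Collect generated files so a separate checker can evaluate them.
    artifacts = sorted(p.name for p in out_dir.iterdir() if p.is_file())
    return {"instance": meta.get("id"), "status": status, "artifacts": artifacts}


def demo():
    """Run a fast fake agent and a slow one to show both outcomes."""
    with tempfile.TemporaryDirectory() as tmp:
        inst = Path(tmp)
        (inst / "meta.json").write_text(json.dumps({"id": "demo-1"}))
        fast = [sys.executable, "-c", "open('poc.js', 'w').write('x')"]
        slow = [sys.executable, "-c", "import time; time.sleep(10)"]
        return (run_instance(inst, fast, timeout_s=30),
                run_instance(inst, slow, timeout_s=0.5))
```

Separating the timeout status from the collected artifacts mirrors the leaderboard's split between the Completed and Success columns: a run can finish in time and still fail the checker.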
Can LLM Agents Solve Critical Security Challenges?
Performance of LLM agents on security engineering tasks.
PoC Generation evaluates agents on generating working proofs of concept (PoCs) for real vulnerabilities.
| Mode | Information Provided | Description |
|---|---|---|
| poc-repo | Repository only | Agent receives only the vulnerable code. Must discover and validate the bug from scratch. |
| poc-desc | Repository + Description | Agent receives repository plus high-level vulnerability description. |
| poc-san | Repository + Description + Sanitizer Report | Agent receives all available information including crash stack trace. |
Note: Current leaderboard results are evaluated using the poc-san mode. Read more about evaluation modes in our blog post.
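The three modes form a strict progression, with each one adding one more input to what the previous mode provides. A small Python sketch of that mapping, assuming illustrative names (`PocMode`, `MODE_INPUTS`, `inputs_for`) that are not taken from the benchmark's code:

```python
from enum import Enum


class PocMode(Enum):
    REPO = "poc-repo"
    DESC = "poc-desc"
    SAN = "poc-san"


# What the agent is given in each mode, per the table above.
MODE_INPUTS = {
    PocMode.REPO: ("repository",),
    PocMode.DESC: ("repository", "description"),
    PocMode.SAN: ("repository", "description", "sanitizer_report"),
}


def inputs_for(mode_name: str) -> tuple:
    """Look up the inputs for a mode name such as 'poc-san'.

    Raises ValueError for an unknown mode name.
    """
    return MODE_INPUTS[PocMode(mode_name)]


print(inputs_for("poc-san"))
```

Under this encoding, the current leaderboard setting corresponds to `inputs_for("poc-san")`, the most informed of the three.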