SEC-bench Pro

Can Language Models Solve Long-Horizon Software Security Tasks?

SEC-bench Pro is a self-evolving security benchmark that measures agents' ability to hunt security bugs in critical software systems. More projects and security tasks will be added over time. Use Version to switch between benchmark snapshots.

The Firefox track covers vulnerabilities in the SpiderMonkey JavaScript and WebAssembly engine. This track includes 104 source-file instances.

#	Model	Success	Completed	Provider	Backend
1	GPT-5.5 (xhigh) Codex	63.1% 217/344	340/344 4 timed out	OpenAI	OpenAI
2	GPT-5.4 (xhigh) Codex	48.5% 167/344	333/344 11 timed out	OpenAI	OpenAI
3	Opus 4.6 (max) Claude Code	29.9% 103/344	140/344 204 timed out	Anthropic	AWS Bedrock
4	GLM-5 (high) OpenCode Open	6.4% 22/344	303/344 41 timed out	Z.ai	AWS Bedrock
5	Kimi K2.5 (high) OpenCode Open	4.4% 15/344	337/344 7 timed out	Moonshot AI	AWS Bedrock
6	MiniMax M2.5 (high) OpenCode Open	0.6% 2/344	335/344 9 timed out	MiniMax	AWS Bedrock

Newer results for Claude Opus and Mythos models are not available due to safeguard restrictions.

#	Model	Success	Completed	Provider	Backend
1	GPT-5.5 (xhigh) Codex	47.6% 49/103	102/103 1 timed out	OpenAI	OpenAI
2	GPT-5.4 (xhigh) Codex	35.0% 36/103	98/103 5 timed out	OpenAI	OpenAI
3	Opus 4.6 (max) Claude Code	22.3% 23/103	36/103 67 timed out	Anthropic	AWS Bedrock
4	GLM-5 (high) OpenCode Open	1.9% 2/103	84/103 19 timed out	Z.ai	AWS Bedrock
5	Kimi K2.5 (high) OpenCode Open	1.9% 2/103	101/103 2 timed out	Moonshot AI	AWS Bedrock
6	MiniMax M2.5 (high) OpenCode Open	0.0% 0/103	98/103 5 timed out	MiniMax	AWS Bedrock

#	Model	Success	Completed	Provider	Backend
1	GPT-5.5 (xhigh) Codex	44.2% 46/104	103/104 1 timed out	OpenAI	OpenAI
2	Opus 4.6 (max) Claude Code	27.9% 29/104	43/104 61 timed out	Anthropic	AWS Bedrock
3	GPT-5.4 (xhigh) Codex	25.0% 26/104	104/104 0 timed out	OpenAI	OpenAI
4	GLM-5 (high) OpenCode Open	5.8% 6/104	96/104 8 timed out	Z.ai	AWS Bedrock
5	Kimi K2.5 (high) OpenCode Open	2.9% 3/104	103/104 1 timed out	Moonshot AI	AWS Bedrock
6	MiniMax M2.5 (high) OpenCode Open	0.0% 0/104	100/104 4 timed out	MiniMax	AWS Bedrock

#	Model	Success	Completed	Provider	Backend
1	GPT-5.5 (xhigh) Codex	89.1% 122/137	135/137 2 timed out	OpenAI	OpenAI
2	GPT-5.4 (xhigh) Codex	76.6% 105/137	131/137 6 timed out	OpenAI	OpenAI
3	Opus 4.6 (max) Claude Code	37.2% 51/137	61/137 76 timed out	Anthropic	AWS Bedrock
4	GLM-5 (high) OpenCode Open	10.2% 14/137	123/137 14 timed out	Z.ai	AWS Bedrock
5	Kimi K2.5 (high) OpenCode Open	7.3% 10/137	133/137 4 timed out	Moonshot AI	AWS Bedrock
6	MiniMax M2.5 (high) OpenCode Open	1.5% 2/137	137/137 0 timed out	MiniMax	AWS Bedrock
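
The 344-instance table appears to aggregate the three per-track tables (103 + 104 + 137 = 344). The short Python check below, using the GPT-5.5 (xhigh) rows as copied from the tables above, recomputes the aggregate counts and the success percentage. The `Row` class and `aggregate` helper are illustrative names, not part of the benchmark tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    solved: int
    completed: int
    timed_out: int
    total: int

    def success_pct(self) -> float:
        # Success is reported over all instances, not only completed runs.
        return round(100 * self.solved / self.total, 1)


# GPT-5.5 (xhigh) rows from the three per-track tables.
tracks = [
    Row(solved=49, completed=102, timed_out=1, total=103),
    Row(solved=46, completed=103, timed_out=1, total=104),
    Row(solved=122, completed=135, timed_out=2, total=137),
]


def aggregate(rows):
    """Sum per-track counts into a single combined row."""
    return Row(
        solved=sum(r.solved for r in rows),
        completed=sum(r.completed for r in rows),
        timed_out=sum(r.timed_out for r in rows),
        total=sum(r.total for r in rows),
    )


overall = aggregate(tracks)
print(overall, overall.success_pct())
```

In every row, completed runs plus timed-out runs equal the instance count, so a run that times out counts against the success rate.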

News

[June 17, 2026]: Overall and Linux leaderboards are now available.
[May 5, 2026]: SpiderMonkey leaderboard has been released.
[May 1, 2026]: SEC-bench Pro launches with the V8 leaderboard.

Overview

SEC-bench Pro evaluates agents on vulnerability discovery and PoC generation tasks on challenging targets. Each target instance ships with a Docker image, metadata, a rendered prompt, and a harness. The harness renders the prompt from meta.json, starts the instance container, runs the configured agent with a timeout, and collects the generated files and run artifacts for checker evaluation.
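
A minimal sketch of that harness flow, assuming a `$placeholder` prompt template and two hypothetical metadata fields (`project`, `instance_id`); the real meta.json schema and agent invocation are not shown here. A Python child process stands in for the agent, where a real harness would start the instance's Docker container.

```python
import json
import subprocess
import sys
from string import Template


def render_prompt(template: str, meta: dict) -> str:
    """Fill $placeholders in the template from the instance metadata."""
    return Template(template).safe_substitute(meta)


def run_agent(cmd, timeout_s: float) -> dict:
    """Run the agent command, reporting a timeout instead of raising."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return {"status": "timed_out", "returncode": None, "stdout": ""}
    return {
        "status": "completed",
        "returncode": proc.returncode,
        "stdout": proc.stdout,
    }


# Hypothetical metadata standing in for the contents of meta.json.
meta = json.loads('{"project": "example-engine", "instance_id": "demo-001"}')
prompt = render_prompt("Find a vulnerability in $project ($instance_id).", meta)

# Stand-in agent: a child process that simply echoes the prompt it was given.
result = run_agent(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", prompt],
    timeout_s=30,
)

# A run that exceeds its budget is recorded as timed out rather than failing the harness.
slow = run_agent(
    [sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5
)
print(result["status"], slow["status"])
```

Returning a status record instead of raising keeps timed-out runs in the results, which matches how the leaderboard reports them separately from completed runs.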

SEC-bench

Can LLM Agents Solve Critical Security Challenges?

Leaderboard

Performance of LLM agents on security engineering tasks.

PoC Generation evaluates agents on generating working security proof-of-concepts for real vulnerabilities.

#	System	Score	Provider	Date	Artifacts
1	OpenHands + Claude-3.7-Sonnet Verified	18%	All Hands	2025-05-28	Logs
2	SWE-agent + Claude-3.7-Sonnet Verified	12.5%	Princeton	2025-05-10	Logs
3	Aider + Claude-3.7-Sonnet Verified	3%	Aider	2025-05-28	Logs

Vulnerability Patching

#	System	Score	Provider	Date	Artifacts
1	AgenticRepair + GPT-5.2 Verified	75%	University of Melbourne	2026-01-15	Logs
2	AgenticRepair + GPT-5-mini Verified	50%	University of Melbourne	2026-01-15	Logs
3	OpenHands + Claude-3.7-Sonnet Verified	34%	All Hands	2025-05-28	Logs
4	SWE-agent + Claude-3.7-Sonnet Verified	31.5%	Princeton	2025-05-10	Logs
5	Aider + Claude-3.7-Sonnet Verified	23.5%	Aider	2025-05-28	Logs

News

[Dec 2025]: OSS-Fuzz instances are now integrated into SEC-bench!
[Sep 2025]: SEC-bench has been accepted to NeurIPS 2025!
[June 2025]: We release the SEC-bench benchmark with PoC Generation and Vulnerability Patching tasks.

SEC-bench Overview

SEC-bench is the first fully automated benchmarking framework for evaluating LLM agents on authentic security engineering tasks.

PoC Generation Task

PoC Generation evaluates the ability of LLM agents to craft proof-of-concept (PoC) artifacts that trigger specified vulnerabilities. A submission is considered successful only if it creates a working PoC artifact that triggers a valid sanitizer error.
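
As a rough illustration of the success criterion, the sketch below scans a target's output for an AddressSanitizer error header. It is an assumption that a check of this shape is involved; which sanitizers and error classes the actual checker accepts is not stated here, and `sanitizer_error` and `poc_succeeded` are hypothetical helper names.

```python
import re
from typing import Optional

# AddressSanitizer reports begin with a line such as
# "==12345==ERROR: AddressSanitizer: heap-buffer-overflow on address ...".
ASAN_HEADER = re.compile(r"^==\d+==ERROR: AddressSanitizer: (\S+)", re.MULTILINE)


def sanitizer_error(output: str) -> Optional[str]:
    """Return the reported error class (e.g. 'heap-buffer-overflow'), or None."""
    match = ASAN_HEADER.search(output)
    return match.group(1) if match else None


def poc_succeeded(output: str, expected: Optional[str] = None) -> bool:
    """True if a sanitizer error was reported and, if given, matches `expected`."""
    found = sanitizer_error(output)
    if found is None:
        return False
    return expected is None or found == expected


crash_log = (
    "==4242==ERROR: AddressSanitizer: heap-buffer-overflow on address "
    "0x602000000014 at pc 0x000000401234\n"
    "READ of size 4 at 0x602000000014 thread T0\n"
)

print(sanitizer_error(crash_log))
print(poc_succeeded("Segmentation fault (core dumped)"))
```

A plain crash without a sanitizer report does not count under this sketch, mirroring the requirement that the PoC trigger a valid sanitizer error.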

Evaluation Modes:

SEC-bench supports three evaluation modes with varying levels of information provided to agents:

Mode	Information Provided	Description
`poc-repo`	Repository only	Agent receives only the vulnerable code. Must discover and validate the bug from scratch.
`poc-desc`	Repository + Description	Agent receives repository plus high-level vulnerability description.
`poc-san`	Repository + Description + Sanitizer Report	Agent receives all available information including crash stack trace.

Note: Current leaderboard results are evaluated using the `poc-san` mode. Read more about evaluation modes in our blog post.
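
One way to picture the three modes is as progressively larger subsets of the same instance data. The sketch below encodes the table as a lookup; the field names (`repository`, `description`, `sanitizer_report`) and the example instance are assumptions for illustration, not the benchmark's actual schema.

```python
from enum import Enum


class Mode(Enum):
    POC_REPO = "poc-repo"
    POC_DESC = "poc-desc"
    POC_SAN = "poc-san"


# Which pieces of an instance each mode exposes to the agent.
PROVIDED = {
    Mode.POC_REPO: ("repository",),
    Mode.POC_DESC: ("repository", "description"),
    Mode.POC_SAN: ("repository", "description", "sanitizer_report"),
}


def agent_inputs(instance: dict, mode: Mode) -> dict:
    """Return only the fields the given mode allows the agent to see."""
    return {key: instance[key] for key in PROVIDED[mode]}


# Hypothetical instance record.
instance = {
    "repository": "/src/example-project",
    "description": "Out-of-bounds read when parsing a malformed header.",
    "sanitizer_report": "==1==ERROR: AddressSanitizer: heap-buffer-overflow ...",
}

for mode in Mode:
    print(mode.value, sorted(agent_inputs(instance, mode)))
```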
