SEC-bench Leaderboard

Leaderboard

Performance of LLM agents on security engineering tasks.
The ✓ badge indicates results verified by the SEC-bench team. The OSS badge indicates open-source submissions.

Model	% Resolved	Date
OpenHands + Claude-3.7-Sonnet ✓	18.0%	2025-05-28
SWE-agent + Claude-3.7-Sonnet ✓	12.5%	2025-05-10
Aider + Claude-3.7-Sonnet ✓	3.0%	2025-05-28

📣 News

[Dec 2025]: OSS-Fuzz instances are now integrated into SEC-bench!
[Sep 2025]: SEC-bench has been accepted to NeurIPS 2025!
[June 2025]: We released the SEC-bench benchmark with PoC Generation and Vulnerability Patching tasks.
[June 2025]: SEC-bench dataset is now available on HuggingFace!

SEC-bench Overview

SEC-bench is the first fully automated benchmarking framework for evaluating LLM agents on authentic security engineering tasks.

SEC-bench employs a novel multi-agent scaffold that automatically:

1. Constructs code repositories with harnesses from CVE reports
2. Reproduces vulnerabilities in isolated Docker environments
3. Generates gold patches for reliable evaluation

The benchmark includes two main tasks: PoC Generation (generating proof-of-concept exploits) and Vulnerability Patching (fixing security vulnerabilities).

PoC Generation Task

PoC Generation evaluates the ability of LLM agents to craft proof-of-concept (PoC) artifacts that trigger specified vulnerabilities. A submission is considered successful only if it creates a working PoC artifact that triggers a valid sanitizer error.
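The success criterion above can be illustrated with a minimal sketch that scans a harness's output for a sanitizer report. The regular expression assumes the ASan/MSan-style banner (for example `==1234==ERROR: AddressSanitizer: heap-buffer-overflow ...`); the function name `find_sanitizer_error` is hypothetical, and what SEC-bench itself counts as a "valid" sanitizer error is not specified here.

```python
import re

# Assumption: the report starts with an ASan/MSan-style banner such as
# "==1234==ERROR: AddressSanitizer: heap-buffer-overflow on address ...".
SANITIZER_RE = re.compile(
    r"==\d+==\s*(?:ERROR|WARNING): (\w+Sanitizer): ([\w-]+)"
)


def find_sanitizer_error(output: str):
    """Return (sanitizer, bug_type) for the first report found, else None."""
    match = SANITIZER_RE.search(output)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Under this sketch, a PoC would count as working when the captured stderr of the harness run yields a non-`None` result. A real check would likely also compare the reported bug type against the expected one.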

Evaluation Modes:

SEC-bench supports three evaluation modes with varying levels of information provided to agents:

Mode	Information Provided	Description
`poc-repo`	Repository only	Agent receives only the vulnerable code and must discover and validate the bug from scratch.
`poc-desc`	Repository + Description	Agent receives the repository plus a high-level vulnerability description.
`poc-san`	Repository + Description + Sanitizer Report	Agent receives all available information, including the crash stack trace.
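One way to read the table is as a mapping from mode name to the inputs an agent is allowed to see. The sketch below encodes that mapping; only the three mode names come from the table, while the field names (`repository`, `description`, `sanitizer_report`) and the function `visible_inputs` are hypothetical and not part of any SEC-bench API.

```python
# Mode names are taken from the table; field names are illustrative only.
MODE_FIELDS = {
    "poc-repo": ("repository",),
    "poc-desc": ("repository", "description"),
    "poc-san": ("repository", "description", "sanitizer_report"),
}


def visible_inputs(mode: str, instance: dict) -> dict:
    """Return only the fields of `instance` that `mode` exposes to the agent."""
    if mode not in MODE_FIELDS:
        raise ValueError(f"unknown evaluation mode: {mode!r}")
    return {field: instance[field] for field in MODE_FIELDS[mode]}
```

Each mode is a strict superset of the previous one, so filtering a single full instance record keeps the three settings comparable.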

Note: Current leaderboard results are evaluated using the `poc-san` mode. Read more about evaluation modes in our blog post.

Model	% Resolved	Date
AgentMem + GPT-5.2 ✓	75.0%	2026-01-15
AgentMem + GPT-5-mini ✓	50.0%	2026-01-15
OpenHands + Claude-3.7-Sonnet ✓	34.0%	2025-05-28
SWE-agent + Claude-3.7-Sonnet ✓	31.5%	2025-05-10
Aider + Claude-3.7-Sonnet ✓	23.5%	2025-05-28

Vulnerability Patching Task

Vulnerability Patching evaluates the ability of LLM agents to generate patches that fix security vulnerabilities in real-world software projects.

Given a CVE description and the vulnerable code repository, LLM agents must:

1. Understand the root cause of the vulnerability
2. Locate the vulnerable code in the repository
3. Generate a correct patch that fixes the vulnerability without breaking functionality
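The last requirement suggests a two-sided check: the vulnerability must no longer trigger, and the project must still work. The sketch below is one plausible way to express that, not SEC-bench's actual evaluation procedure, which this page does not describe. The `PatchRun` record, its fields, and the `"Sanitizer:"` substring test are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PatchRun:
    """Hypothetical record of one patched build (all fields are assumptions)."""
    build_ok: bool    # the patched repository compiled
    poc_output: str   # output of replaying the known PoC on the patched build
    tests_ok: bool    # the project's functionality checks still pass


def is_resolved(run: PatchRun) -> bool:
    """A patch counts as resolved if the PoC no longer crashes and nothing broke."""
    # Assumption: any sanitizer banner in the output means the bug still triggers.
    still_crashes = "Sanitizer:" in run.poc_output
    return run.build_ok and run.tests_ok and not still_crashes
```

Requiring `tests_ok` guards against trivial patches, for example ones that delete the vulnerable feature outright so that the PoC can no longer reach it.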
