SEC-bench
Can LLM Agents Solve Critical Security Challenges?
Leaderboard
Performance of LLM agents on security engineering tasks.
The ✓ badge indicates results verified by the SEC-bench team. The OSS badge indicates open-source submissions.
| # | Model | % Resolved | Org | Date | Logs |
|---|---|---|---|---|---|
| | OpenHands + Claude-3.7-Sonnet ✓ | | | 2025-05-28 | |
| | SWE-agent + Claude-3.7-Sonnet ✓ | | | 2025-05-10 | |
| | Aider + Claude-3.7-Sonnet ✓ | | | 2025-05-28 | |
📣 News
- [Dec 2025]: OSS-Fuzz instances are now integrated into SEC-bench!
- [Sep 2025]: SEC-bench has been accepted to NeurIPS 2025!
- [June 2025]: We release the SEC-bench benchmark with PoC Generation and Vulnerability Patching tasks.
- [June 2025]: SEC-bench dataset is now available on HuggingFace!
SEC-bench Overview
SEC-bench is the first fully automated benchmarking framework for evaluating LLM agents on authentic security engineering tasks.
SEC-bench employs a novel multi-agent scaffold that automatically:
1. Constructs code repositories with harnesses from CVE reports
2. Reproduces vulnerabilities in isolated Docker environments
3. Generates gold patches for reliable evaluation
The benchmark includes two main tasks: PoC Generation (generating proof-of-concept exploits) and Vulnerability Patching (fixing security vulnerabilities).
PoC Generation Task
PoC Generation evaluates the ability of LLM agents to craft proof-of-concept (PoC) artifacts that trigger specified vulnerabilities.
A submission is considered successful only if it creates a working PoC artifact that triggers a valid sanitizer error.
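As a rough illustration of the success criterion, the sketch below scans a program's captured output for the header line that AddressSanitizer-style sanitizers print when they report an error. It is not SEC-bench's actual evaluator: the function names are hypothetical, and the benchmark's own definition of a "valid" sanitizer error may be stricter (for example, requiring a match against the expected bug type or stack trace).

```python
import re
from typing import Optional, Tuple

# Header line printed by ASan-family sanitizers, e.g.
# "==1234==ERROR: AddressSanitizer: heap-buffer-overflow on address ..."
_SANITIZER_HEADER = re.compile(r"==\d+==ERROR: (\w+Sanitizer): ([\w-]+)")


def find_sanitizer_error(output: str) -> Optional[Tuple[str, str]]:
    """Return (sanitizer name, first token of the error summary), or None.

    Hypothetical helper: it only looks for the report header and does not
    check that the error matches the vulnerability being targeted.
    """
    match = _SANITIZER_HEADER.search(output)
    if match is None:
        return None
    return match.group(1), match.group(2)


def poc_succeeded(output: str) -> bool:
    """Assumed success rule: the run produced at least one sanitizer error."""
    return find_sanitizer_error(output) is not None
```

Under this assumed rule, a plain crash such as a segmentation fault without a sanitizer report would not count as a successful PoC.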
Evaluation Modes:
SEC-bench supports three evaluation modes with varying levels of information provided to agents:
| Mode | Information Provided | Description |
|---|---|---|
| poc-repo | Repository only | Agent receives only the vulnerable code. Must discover and validate the bug from scratch. |
| poc-desc | Repository + Description | Agent receives repository plus high-level vulnerability description. |
| poc-san | Repository + Description + Sanitizer Report | Agent receives all available information including crash stack trace. |
Note: Current leaderboard results are evaluated using the poc-san mode.
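One way to picture the three modes is as progressively larger subsets of the same instance data. The sketch below is illustrative only: the mode names come from the table, while the `Instance` class, its field names, and `build_agent_inputs` are assumptions and not part of the SEC-bench codebase.

```python
from dataclasses import dataclass
from typing import Dict

# Inputs exposed to the agent in each evaluation mode, following the table.
# The field names are hypothetical labels chosen for this sketch.
MODE_INPUTS = {
    "poc-repo": ("repository",),
    "poc-desc": ("repository", "description"),
    "poc-san": ("repository", "description", "sanitizer_report"),
}


@dataclass
class Instance:
    """Hypothetical container for one benchmark instance."""
    repository: str        # path or URL of the vulnerable code
    description: str       # high-level vulnerability description
    sanitizer_report: str  # sanitizer output including the crash stack trace


def build_agent_inputs(instance: Instance, mode: str) -> Dict[str, str]:
    """Return only the fields the given mode allows the agent to see."""
    if mode not in MODE_INPUTS:
        raise ValueError(f"unknown evaluation mode: {mode!r}")
    return {name: getattr(instance, name) for name in MODE_INPUTS[mode]}
```

For example, `build_agent_inputs(instance, "poc-repo")` would yield a dictionary containing only the `repository` entry, while `"poc-san"` would include all three fields.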
Read more about evaluation modes in our blog post.