Leaderboard

Performance of LLM agents on security engineering tasks.
The badge indicates results verified by the SEC-bench team. The OSS badge indicates open-source submissions.

# Model % Resolved Org Date Logs
OpenHands + Claude-3.7-Sonnet
18.0%
2025-05-28
SWE-agent + Claude-3.7-Sonnet
12.5%
2025-05-10
Aider + Claude-3.7-Sonnet
3.0%
2025-05-28

📣 News

  • [Dec 2025]: OSS-Fuzz instances are now integrated into SEC-bench!
  • [Sep 2025]: SEC-bench has been accepted to NeurIPS 2025!
  • [June 2025]: We release the SEC-bench benchmark with PoC Generation and Vulnerability Patching tasks
  • [June 2025]: SEC-bench dataset is now available on HuggingFace!

SEC-bench Overview

SEC-bench is the first fully automated benchmarking framework for evaluating LLM agents on authentic security engineering tasks.

SEC-bench Overview

SEC-bench employs a novel multi-agent scaffold that automatically:

  • 1 Constructs code repositories with harnesses from CVE reports
  • 2 Reproduces vulnerabilities in isolated Docker environments
  • 3 Generates gold patches for reliable evaluation

The benchmark includes two main tasks: PoC Generation (generating proof-of-concept exploits) and Vulnerability Patching (fixing security vulnerabilities).

PoC Generation Task

PoC Generation evaluates the ability of LLM agents to craft proof-of-concept (PoC) artifacts that trigger specified vulnerabilities. A submission is considered successful only if it creates a working PoC artifact that triggers a valid sanitizer error.

Evaluation Modes:

SEC-bench supports three evaluation modes with varying levels of information provided to agents:

Mode Information Provided Description
poc-repo Repository only Agent receives only the vulnerable code. Must discover and validate the bug from scratch.
poc-desc Repository + Description Agent receives repository plus high-level vulnerability description.
poc-san Repository + Description + Sanitizer Report Agent receives all available information including crash stack trace.

Note: Current leaderboard results are evaluated using the poc-san mode. Read more about evaluation modes in our blog post.