CRYPTO · 2026-03-03

OpenZeppelin finds data contamination in OpenAI’s EVMbench

A recent audit by blockchain security firm OpenZeppelin has revealed significant flaws in OpenAI's new EVMbench, an artificial intelligence benchmark designed to test how well AI models can identify, patch, and exploit smart contract vulnerabilities. The findings point to data contamination and invalid vulnerability classifications, raising questions about the benchmark's effectiveness as a genuine measure of cybersecurity capability in the crypto space.

OpenZeppelin's analysis found that the benchmark's dataset suffers from training data leakage: the top-performing AI agents had likely already been exposed to the benchmark's specific vulnerability reports during pretraining. This contamination undermines a core goal of AI-driven security, the ability to discover novel, previously unseen vulnerabilities, often called zero-day threats, in smart contract code.
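To illustrate the kind of overlap at issue: if a benchmark task reproduces text from a published audit report, a model that saw that report during pretraining can pattern-match rather than reason. The sketch below is purely illustrative and not drawn from OpenZeppelin's methodology; it shows a naive word n-gram overlap check, with the sample strings and the 0.5 threshold chosen arbitrarily for demonstration.

```python
# Illustrative only: a naive n-gram overlap check between a benchmark item
# and a public audit report. All names, strings, and thresholds are hypothetical.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_item: str, public_report: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also appear in the public report."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(public_report, n)) / len(item_grams)

if __name__ == "__main__":
    task = "the withdraw function fails to update the user balance before making the external call"
    published = "Finding 3: the withdraw function fails to update the user balance before making the external call, allowing reentrancy"
    # Arbitrary threshold: flag the task if most of it appears verbatim in a published report.
    if overlap_ratio(task, published) > 0.5:
        print("possible contamination: benchmark task overlaps a public audit report")
```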

The audit also highlighted classification problems, noting at least four issues labeled as high-severity vulnerabilities that are not practically exploitable. Such misclassifications can misdirect remediation resources and distort a protocol's apparent risk profile, a serious concern for overall blockchain security.

EVMbench was created in collaboration with Paradigm to evaluate AI capabilities in identifying, patching, and exploiting smart contract weaknesses. During testing, AI agents were cut off from the internet, in theory forcing them to rely on reasoning rather than searching for known solutions. However, the benchmark's construction from past audit data appears to have undermined that goal.

These methodological flaws have direct implications for real-world defense. If AI tools are evaluated on contaminated data, their performance in detecting genuine, novel exploits or sophisticated phishing attempts may be overstated. This gap is critical as the industry battles increasingly complex malware and ransomware attacks targeting digital assets.

The situation underscores a broader challenge in crypto and cybersecurity: ensuring that evaluation frameworks themselves are robust and unbiased. Accurate benchmarks are vital for developing AI that can proactively defend against data breach incidents and sophisticated exploits, rather than simply recognizing old patterns.

For developers and auditors, this report serves as a reminder that vigilance must extend to the very tools used to measure security. Relying on flawed benchmarks could leave smart contracts exposed to risks that AI models are falsely certified to catch. The path forward requires transparent, uncontaminated datasets to train and test the next generation of security AI.
