ChatGPT Accuracy in Smart Contract Vulnerability Detection Exceeds 75%

While cybersecurity teams can leverage ChatGPT's ability to scan code for vulnerabilities, its capacity to compose a wide range of content, including sophisticated malicious code, raises serious concerns.

Halborn's research has revealed that in certain cases, ChatGPT 4 demonstrated an impressive 86.6% accuracy in detecting vulnerable code when prompted about specific vulnerability types.

Yesterday, Halborn, a leading blockchain security firm, unveiled the findings of its summer research, which involved an analysis of 134 smart contracts to assess the capabilities of the top AI-powered chatbot, ChatGPT.

The scope of the study was broad, centering on whether ChatGPT could potentially replace human smart contract auditors. Halborn's team sought to answer questions about the chatbot's utility for students learning about smart contracts, its effectiveness in solving Capture the Flag (CTF) tasks commonly assigned in cybersecurity competitions, and its ability to identify vulnerabilities in code, among other considerations.

One aspect scrutinized by Halborn was ChatGPT's proficiency in detecting fundamental textbook vulnerabilities. The team curated a set of sample contracts typically used to illustrate various types of attacks, vulnerabilities, and coding pitfalls. After preprocessing, which involved removing vulnerabilities that appeared too similar to each other and could be considered duplicates, the team narrowed down the sample to 134 contracts.

These contracts were then categorized based on the types of attacks they represented, resulting in a total of 41 attack categories. Examples of these categories include arbitrary jump with a function type variable, denial of service (DoS), insufficient gas griefing, outdated compiler version, reentrancy, and others.
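Reentrancy, the best-known entry on that list, illustrates what such textbook samples look like. The sketch below is a hypothetical minimal example, not one of Halborn's actual test contracts: the vault pays out before updating its own bookkeeping, so a malicious recipient can call back in and withdraw repeatedly.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

// Hypothetical textbook example, not from Halborn's sample set.
contract VulnerableVault {
    mapping(address => uint256) public balances;

    function deposit() external payable {
        balances[msg.sender] += msg.value;
    }

    function withdraw() external {
        uint256 amount = balances[msg.sender];
        // Bug: Ether is sent before the balance is zeroed, so a
        // malicious contract's receive() hook can re-enter withdraw()
        // while balances[msg.sender] is still nonzero and drain funds.
        (bool ok, ) = msg.sender.call{value: amount}("");
        require(ok, "transfer failed");
        balances[msg.sender] = 0; // state update arrives too late
    }
}
```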


Since chatbots tend to give varying responses to the same question depending on how the prompt is phrased, Halborn implemented a testing strategy that divided the assessments into two categories: one where ChatGPT was questioned directly, and another where the chatbot was asked to role-play as a genuine cybersecurity auditor.

Halborn's methodology included evaluating the performance of both ChatGPT versions 3.5 and 4. The chatbots were given the same prompt multiple times, with each prompt corresponding to a specific question. To account for the statistical nature of language models, each version was asked to generate an answer up to five times. The objective was to observe how many attempts were required to obtain a correct answer and to assess ChatGPT's ability to identify vulnerabilities.

Distribution of vulnerabilities (Source: Halborn)

The initial message in the conversation served as the prompt, and results were categorized as follows: "Yes" (correct detection with the cause identified), "Partial" (correct detection without the cause identified), or "No" (vulnerability not detected). To keep the responses generic, no feedback was provided to ChatGPT. Contracts that were too large were shortened so that ChatGPT could process the prompt.

Although the team did not identify any significant differences in accuracy between direct questions and role-playing scenarios, it acknowledged that direct questions yielded slightly better results. Additionally, a similar assessment was conducted by shifting the focus to prompt specificity, comparing the accuracy of results generated by general prompts seeking all vulnerabilities in the code versus more specific prompts targeting a particular type of vulnerability.

"Regarding the rate of detection of vulnerabilities, ChatGPT 3.5 is capable of detecting correctly 73.1% of them with the direct question, while ChatGPT 4 is capable of detecting 76.1%," Halborn concludes, adding that "With the role-playing prompt the numbers are slightly lower, 70.1% for ChatGPT 3.5 and 67.9% for ChatGPT 4."

The study results reveal that, with a direct prompt, ChatGPT 3.5 exhibited varying levels of detection across vulnerability types. Among the types assessed, forced reception of Ether, abuse of global semantics, insufficient gas griefing, storage collisions, unencrypted private data on-chain, hash collisions with multiple variable-length arguments, short address attacks, and references to an external malicious contract were the ones ChatGPT 3.5 detected least often.
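Forced reception of Ether, the first item on that list, shows why these cases are harder: the exploit never calls any of the victim's functions, since selfdestruct can push Ether into an address regardless of its code. A hypothetical sketch (contract names are illustrative):

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

// Hypothetical example: logic that assumes Ether only arrives via deposit().
contract Target {
    uint256 public totalDeposits;

    function deposit() external payable {
        totalDeposits += msg.value;
    }

    function isBalanced() external view returns (bool) {
        // Bug: selfdestruct can force Ether in without calling deposit(),
        // so address(this).balance can permanently exceed totalDeposits,
        // breaking any logic that relies on this invariant.
        return address(this).balance == totalDeposits;
    }
}

contract Attacker {
    // Destroying this contract transfers its entire balance to `target`
    // without invoking any of Target's functions.
    function force(address payable target) external payable {
        selfdestruct(target);
    }
}
```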

Conversely, ChatGPT 3.5 exhibited better detection for certain vulnerabilities, particularly those of a simpler nature such as bad randomness, variable shadowing, or integer overflow. Additionally, it was observed that ChatGPT 3.5 demonstrated an ability to detect some more intricate vulnerabilities, including signature malleability or arbitrary jumps with function-type variables.
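Bad randomness is a good example of the simpler class: the "random" value is derived from block data that anyone, including block producers, can read or influence, and the telltale pattern sits in a single line. A hypothetical sketch:

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

// Hypothetical example of predictable on-chain "randomness".
contract CoinFlip {
    function flip() external view returns (bool) {
        // Bug: block.timestamp and the previous block hash are public
        // and partly controllable by block producers, so the "random"
        // outcome can be predicted before the transaction is mined.
        uint256 seed = uint256(
            keccak256(abi.encodePacked(block.timestamp, blockhash(block.number - 1)))
        );
        return seed % 2 == 0;
    }
}
```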

The findings suggest that ChatGPT 3.5's performance varied across different types of vulnerabilities, with a higher success rate in detecting simpler issues compared to more complex ones.


"In general, we can observe that those vulnerabilities that entail a deeper understanding of the code as well as the capacity of correlation among different parts of the contract, like cross-function reentrancy, storage collisions, or insufficient gas griefing, are not detected correctly," the team concludes, adding that the similar trend was observed in the cases that required a deep understanding of the work of the Ethereum Virtual Machine and Solidity, a popular programming language used for writing smart contracts.

Percentage of vulnerabilities detected per prompt (Source: Halborn)
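Storage collisions illustrate why such cross-contract correlation is hard: the bug is invisible in either contract alone and only appears when one delegatecalls into the other with a mismatched storage layout. A hypothetical sketch:

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

// Hypothetical example: a proxy and its logic contract disagree on
// storage layout, so writes via delegatecall corrupt the proxy's state.
contract Proxy {
    address public implementation; // storage slot 0
    address public owner;          // storage slot 1

    constructor(address impl) {
        implementation = impl;
        owner = msg.sender;
    }

    fallback() external payable {
        // The logic contract's code runs in the proxy's storage context.
        (bool ok, ) = implementation.delegatecall(msg.data);
        require(ok, "delegatecall failed");
    }
}

contract Logic {
    uint256 public counter; // also slot 0 -- collides with `implementation`

    function increment() external {
        // Bug: executed via delegatecall, this writes the proxy's slot 0,
        // silently overwriting the implementation address.
        counter += 1;
    }
}
```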

Meanwhile, the study has uncovered that ChatGPT performs well in identifying simpler vulnerabilities that are syntax or code-based, particularly those exhibiting clear patterns such as DoS and reentrancy attacks.
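A DoS with the kind of clear syntactic signature the study describes is the unbounded payout loop: gas costs grow with an attacker-controllable array until the function can no longer complete. A hypothetical sketch:

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

// Hypothetical example of denial of service via an unbounded loop.
contract Airdrop {
    address payable[] public recipients;

    receive() external payable {}

    function register() external {
        recipients.push(payable(msg.sender));
    }

    function payAll() external {
        // Bug: gas usage grows with recipients.length, and a single
        // reverting transfer blocks everyone; an attacker can register
        // enough addresses to make payAll() exceed the block gas limit.
        for (uint256 i = 0; i < recipients.length; i++) {
            recipients[i].transfer(1 ether);
        }
    }
}
```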

Halborn adds that "Specific prompts work better than general ones," explaining that "Asking if a piece of code has a specific vulnerability works better than asking for all vulnerabilities," and that mentioning a "vulnerability subtype" in the prompt, where one exists, gives even better results.

"For a model not designed to identify vulnerabilities in code, it has a high detection rate of 74.6% for ChatGPT 3.5 and 86.6% for ChatGPT 4 in the best-case scenario (specific prompt)," Halborn claims.

ChatGPT solves CTFs

To assess ChatGPT's proficiency in solving Capture the Flag (CTF) challenges, Halborn employed tasks primarily sourced from three CTF repositories: Ethernaut, Damn Vulnerable DeFi, and Capture the Ether.

"ChatGPT 4 is capable of completely solving the challenges in 43.3% of the cases and partially 20% of the time," the team has found out.

According to the study results, ChatGPT 4 is particularly effective in identifying specific types of vulnerabilities in code, especially issues related to authentication through tx.origin, governance scheme flaws, and errors in protocol logic. However, the study also highlights certain vulnerabilities that ChatGPT 4 struggles to detect. These include problems associated with faulty initialization, scenarios involving double entry points, and situations where exploitation arises from a delegatecall to an untrusted callee.
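Authentication through tx.origin, which ChatGPT 4 handled well, is another single-line pattern: tx.origin names the externally owned account that started the transaction rather than the immediate caller, so a phishing contract can relay a victim's call and pass the check. A hypothetical sketch:

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

// Hypothetical example of an ownership check that trusts tx.origin.
contract Wallet {
    address public owner = msg.sender; // set at deployment

    receive() external payable {}

    function transferTo(address payable to, uint256 amount) external {
        // Bug: if the owner is lured into calling a malicious contract,
        // that contract can invoke transferTo() and this check still
        // passes, because tx.origin is the owner even though
        // msg.sender is the attacker's contract.
        require(tx.origin == owner, "not owner");
        to.transfer(amount);
    }
}
```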

Halborn also invites the community to explore the research-related materials gathered in the dedicated GitHub repository, which includes chat logs and tested samples.

ChatGPT assists both cybersecurity specialists and hackers

Halborn also references previous efforts by other cybersecurity teams to assess ChatGPT's capabilities in scanning code for malicious fragments. Those attempts reportedly showed questionable effectiveness, which may have been a consequence of the relatively limited scope of the studies.

Indeed, in a February 2023 interview with eSecurity Planet, Shiran Grinberg, the Research and Cyber Operations Director at cybersecurity firm Cynet, mentioned that the company had already been utilizing the chatbot.

"We are able to take a machine learning model and to turn it into an AI mechanism which basically learns many types of legitimate files versus many malicious files," Grinberg explained, stressing the effectiveness of using a large amount of training data, which helps Cynet to differentiate malicious files from legitimate.

However, Grinberg also cautioned about the potential of ChatGPT to assist hackers in creating new types of threats, regardless of their level of technical knowledge. Despite ChatGPT being designed to flag any attempt to engage in illegal activities, Grinberg noted the possibility of crafting prompts that could deceive the chatbot into sharing code containing malicious fragments.