Can LLMs find bugs in large codebases?

TLDR

We built a new benchmark called "Bug In The Code Stack" (BICS) to test how well LLMs can find syntactic bugs in large Python codebases.
GPT-3.5-Turbo showed lower accuracy on the BICS benchmark than the BABILONG benchmark at the same context length and target depth, indicating that LLMs struggle more on code-based tasks than text-based tasks at long context length.
The hype is real. GPT-4o showed the best performance, closely followed by GPT-4-Turbo. The GPT-4-Series especially performed well at long context lengths compared to other models.
Generally, longer context length resulted in lower accuracy. However, there were some exceptions to this.
Models react differently to the placement of the bug within the source code. GPT-3.5-Turbo and Claude 3 Opus were the most sensitive, and GPT-4-Series was the least sensitive. Generally, less sensitivity means a more robust model.

Motivation

As LLMs' context window sizes grow, their use as coding assistants for large codebases is increasing. It's crucial to understand how longer context lengths impact their performance.

The "needle in the haystack" analysis tests LLMs' ability to find specific information in long documents. Previous benchmarks like BABILONG focused on text tasks. Now, as LLMs are used more for coding, it's important to see how they perform on code tasks and if the task type affects their accuracy.

Experimental design

We developed a new benchmark called Bug In The Code Stack (BICS), which contains auto-assembled Python source code as the haystack and a syntactic bug placed within the source code as the needle. The LLM is tasked with finding the line number and the type of the bug.

Figure 1. Sample source code with a syntactic bug, the LLM is tasked with retrieving both the line number (e.g. Line 4) and the type of the bug (e.g. Missing Parenthesis) that occurs in the input code. We place the bug at various depths within the source code to evaluate the LLM's ability to retrieve the bug accurately.

Each model was run on context lengths ranging from 500 tokens to 16K tokens and target depths ranging from 0% to 100%. We ran each experiment 25 times, and the average accuracy is shown in the following charts.

To give context, 16K tokens are around 25 pages long. The models are challenged to find a single syntactic bug, which could be as small as a missing parenthesis, within 25 pages of code! This benchmark poses quite a challenge to many of the models.

Comparing results on most popular models

Figure 2. Comparing the average accuracy of each model at a target length of 50%. The performance for GPT-4o and GPT-4-Turbo stays consistently high at various context lengths. Note: Llama3-70B has a context window of 8192 tokens; the 16K tokens tests are omitted.

Figure 4. Comparing the variance in average accuracy per target depth for each model, a higher variance indicates that the model is more sensitive to the placement of the bug within the source code. Claude 3 Opus, 3.5 Turbo and Command-R+ are the most sensitive.

From the charts above, we can see the performance gap between different models, with GPT-4o performing the best at both short and long context lengths, closely followed by GPT-4-Turbo. Claude 3 Opus shows a similar level of performance at short context lengths but struggles at long context lengths. Additionally, GPT-3.5-Turbo, Llama3-70B, and Command-R+ all show similar performance levels, while Gemini-1.0-Pro struggles the most in the benchmark.

Comparing BICS and BABILONG

Figure 5. GPT-3.5-Turbo performs 5-10x worse on BICS than BABILONG at a target depth of 75%; i.e. our benchmark is harder than BABILONG.

In addition, we see that LLMs display much lower accuracy on the BICS benchmark than the BABILONG benchmark. This indicates that LLMs struggle more at understanding long codebases than long text, hinting at a future improvement to the models for code comprehension capabilities.

Detailed results

Here are the detailed results for each model.

Figure 6. GPT-4o: Extremely high retrieval performance across all context lengths and target depths.

Figure 7. GPT-4-Turbo: High retrieval performance at short context lengths; seems to struggle at 50% depth for longer contexts.

Figure 8. Claude 3 Opus: Sharp drop in accuracy at most depths for longer context lengths.

Figure 9. GPT-3.5-Turbo: Weirdly better at bugs placed at the end of the code.

Figure 10. Command-R+: Poor performance at target depth < 0.25, worst at longer context lengths.

Figure 11. Gemini-1.0-Pro: Bad at everything.

Figure 12. Llama3-70B: On par with GPT-3.5-Turbo; very impressive for a small model.

Future experiments

The "Bug In The Code Stack" benchmark presents a new challenge measuring LLMs' capability at long context lengths. In the future, we would also like to extend the benchmark by adding logical errors that cannot be detected using static code analyzers, which further helps evaluate the capabilities of the models. In addition, we can run experiments with different programming languages, such as Javascript or C++, and observe the performance difference.

About the Authors

Sumanyu is the Co-Founder & CEO @ Hamming. Previously helped Citizen grow its MAU by 4X and helped bootstrap revenue from 0 to millions in ARR in under 6 months. Before that, grew an AI-powered sales program @ Tesla to 100s of millions in revenue/year as a Senior Staff Data Scientist. Published a first-author paper in AI during undergrad. BASc from UWaterloo w/ dean's list.

Hokyung (Andy) Lee is a third-year Computer Science student at the University of Waterloo with previous ML experience at Environment Canada and is currently benchmarking LLM on real-world tasks with Hamming.ai.

This article was originally published on Hamming.ai.