Measuring Code Search: How We Benchmark Cheetah AI Against Real-World Repositories
TL;DR: We built an open-source benchmark framework that:
- Generates test cases automatically from Git commit history
- Measures F1 scores, precision, recall, and latency percentiles
- Tests against Django (49 queries) and VS Code (15 queries)
- Shows Cheetah AI achieving 70% accuracy at 3.47 seconds average
- Reveals that query formulation matters as much as the search engine
The result: A reproducible way to measure and improve code search quality.
Why benchmark code search?
Code search tools make bold claims. Semantic understanding. Instant results. Perfect relevance. But how do you actually measure whether a code search system works? Subjective impressions vary wildly between users and use cases. What feels fast to one developer might feel slow to another. What seems relevant depends entirely on context.
We needed objective metrics. Numbers we could track over time. A way to know whether changes to Cheetah AI actually improved the experience or just felt like improvements. So we built a benchmark framework from scratch, designed specifically for evaluating code retrieval systems against real-world repositories.
The challenge with code search benchmarks is getting ground truth. Unlike web search where you can hire human raters, code relevance requires deep understanding of the codebase. You can't just ask someone whether a file is relevant to a query without them understanding the entire project structure.
Git history as ground truth
The core insight that makes our benchmark possible: Git history contains ground truth. Every commit message describes what changed, and the diff tells us exactly which files were modified. This gives us thousands of natural test cases without manual labeling.
Think about it. When a developer commits "Fixed constraint validation crash for excluded FK attnames," they're telling us exactly what they were working on. The files they touched are the files that were relevant to that task. The commit message is essentially a search query that a future developer might use to find similar code.
This approach has a beautiful property: the ground truth is created by the developers who actually understand the code. They know which files needed to change for a given feature or bug fix. We're not guessing at relevance. We're using the actual decisions made by people who lived in that codebase.
Building the gold set
The gold set generation pipeline analyzes commits to extract meaningful test cases. For each commit, we extract the commit hash, timestamp, modified files, and transform the commit message into a natural language query. This transformation strips out ticket numbers, formatting markers, and other noise that wouldn't appear in a real search.
We classify each test case by complexity based on file count. Low complexity means a single file. Medium complexity covers 2 to 20 files. High complexity means more than 20 files. This classification helps us understand where search systems struggle. Most real searches fall into the medium complexity range.
The dataset generator orchestrates the full pipeline: extract commits from the repository, filter based on configuration, transform messages to queries, create test cases with metadata, and package everything into a gold set with repository information. The output is a JSON file that can be used to evaluate any search system.
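To make the pipeline concrete, here is a minimal sketch of how a gold set could be assembled from `git log` output. It uses only the Python standard library; the function names and JSON layout are illustrative, not the framework's exact schema.

```python
import json
import subprocess

def iter_commits(repo_path, max_count=500):
    """Yield (sha, subject, files) for recent non-merge commits."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", f"--max-count={max_count}",
         "--no-merges", "--name-only", "--pretty=format:%x1e%H%x09%s"],
        capture_output=True, text=True, check=True,
    ).stdout
    for record in log.split("\x1e"):                    # one record per commit
        lines = [line for line in record.splitlines() if line.strip()]
        if not lines:
            continue
        sha, _, subject = lines[0].partition("\t")
        yield sha, subject, lines[1:]                    # remaining lines are file paths

def to_query(subject):
    """Drop ticket numbers and markers so the message reads like a real search."""
    return " ".join(w for w in subject.split() if not w.startswith(("#", "[", "Refs")))

def build_gold_set(repo_path, name):
    cases = []
    for sha, subject, files in iter_commits(repo_path):
        if not 2 <= len(files) <= 20:                    # keep the medium-complexity range
            continue
        cases.append({"commit": sha, "query": to_query(subject),
                      "expected_files": files})
    return {"repository": name, "test_cases": cases}

with open("django_gold_set.json", "w") as f:
    json.dump(build_gold_set("./django", "django"), f, indent=2)
```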
Filtering for quality
Not every commit makes a good test case. We filter commits to focus on meaningful changes that represent realistic search scenarios. Merge commits get excluded because they combine multiple changes. Formatting-only commits get excluded because they don't represent feature work. Documentation-only updates get filtered when we want to test code search specifically.
We also filter by file count. Commits touching fewer than 2 files are often trivial changes like typo fixes. Commits touching more than 20 files usually indicate refactoring or bulk updates that don't represent typical search scenarios. The sweet spot is 2 to 20 files, which captures most real feature work and bug fixes.
The commit analyzer identifies feature commits versus maintenance commits by looking for patterns in commit messages. Words like "bump," "update version," or "formatting" indicate maintenance. Words like "fixed," "added," "implemented" indicate feature work. This helps us build gold sets that match how developers actually search.
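A hedged sketch of that classification step; the keyword lists and the `is_useful_test_case` name are illustrative choices rather than the framework's exact rules.

```python
MAINTENANCE_HINTS = ("bump", "update version", "formatting", "changelog", "typo")
FEATURE_HINTS = ("fixed", "added", "implemented")
DOC_EXTENSIONS = (".md", ".rst", ".txt")

def is_useful_test_case(subject: str, files: list[str]) -> bool:
    lowered = subject.lower()
    if any(hint in lowered for hint in MAINTENANCE_HINTS):
        return False                                   # maintenance commit
    if not any(hint in lowered for hint in FEATURE_HINTS):
        return False                                   # not clearly feature work
    if all(f.endswith(DOC_EXTENSIONS) for f in files):
        return False                                   # documentation-only change
    return 2 <= len(files) <= 20                       # the 2-20 file sweet spot
```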
Measuring what matters
We use F1 score as our primary accuracy metric. F1 combines precision and recall into a single number that balances both concerns. High precision means most returned files are relevant. High recall means most relevant files are returned. F1 is the harmonic mean of both, penalizing systems that sacrifice one for the other.
The calculation is straightforward. Precision equals correct files divided by total retrieved files. Recall equals correct files divided by total ground truth files. F1 equals two times precision times recall divided by precision plus recall. A perfect score is 1.0. Random guessing on a large codebase approaches 0.0.
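In code, the whole metric fits in a few lines. This is a framework-agnostic version operating on sets of normalized file paths:

```python
def score(retrieved: set[str], expected: set[str]) -> tuple[float, float, float]:
    """Return (precision, recall, f1) for one query."""
    correct = len(retrieved & expected)
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: two of four retrieved files are correct, out of three relevant files.
print(score({"a.py", "b.py", "c.py", "d.py"}, {"a.py", "b.py", "e.py"}))
# (0.5, 0.666..., 0.571...)
```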
For latency, we track percentiles rather than averages. The p50 (median) tells us typical performance. The p90 tells us what slower queries look like. The p99 catches outliers that might frustrate users. A system with great average latency but terrible p99 will feel unreliable. We run each query multiple times and take the median to reduce noise.
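A sketch of that latency bookkeeping using the standard library's `statistics` and `time` modules; the three-run repetition count is an assumption for illustration.

```python
import statistics
import time

def timed_retrieve(retrieve_fn, query, runs=3):
    """Run a query several times; return the result and the median latency in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        result = retrieve_fn(query)
        samples.append((time.perf_counter_ns() - start) / 1e6)   # ns -> ms
    return result, statistics.median(samples)

def latency_percentiles(latencies_ms):
    """p50/p90/p99 across all per-query median latencies."""
    cuts = statistics.quantiles(latencies_ms, n=100)             # 99 cut points
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}
```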
Test repositories
We test against two major open-source projects: Django and VS Code. Django gives us a large Python codebase with over 550,000 commits and nearly 7,000 files. It's a mature web framework with well-structured code, detailed commit messages, and a mix of core code, tests, and documentation. We generated 49 test cases from Django.
VS Code provides a TypeScript perspective with over 100,000 commits and 9,000 files. It's a complex Electron application with deep directory nesting and mixed file types. The chat and AI features provide particularly interesting test cases. We generated 15 test cases from VS Code, focusing on recent feature work.
The two repositories together cover different languages, structures, and development patterns. Django has a flatter hierarchy with clear module boundaries. VS Code has deeply nested paths like "src/vs/workbench/contrib/chat/browser/chatContentParts/media/chatConfirmationWidget.css." Testing both reveals how search systems handle different organizational styles.
The evaluation engine
The evaluation engine orchestrates test execution and measures retrieval performance. It takes a gold set and a retrieval agent, runs each query multiple times with timeout protection, and collects results. The engine randomizes test case order to prevent any ordering effects from skewing results.
Each query runs through the agent's retrieve method, which returns a list of file paths and optional relevance scores. The engine validates and normalizes paths, handling differences in path separators and leading slashes. It measures latency using high-precision timers, capturing nanosecond-level timing that gets converted to milliseconds.
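Path normalization is mundane but important: without it, a correct hit returned with backslashes or a leading slash would score as a miss. A minimal version of the idea:

```python
def normalize_path(path: str) -> str:
    """Compare paths consistently regardless of separators or leading slashes."""
    return path.strip().replace("\\", "/").lstrip("/")
```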
The engine supports multiple agent types out of the box: CLI tools that parse command output, HTTP APIs that call remote services, and direct Python integrations. Adding a new search system requires implementing three methods: initialize, retrieve, and reset. The framework handles everything else.
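The three-method contract could look roughly like this. The base class and the ripgrep baseline are illustrative; only the method names come from the framework description above.

```python
from abc import ABC, abstractmethod
import subprocess

class RetrievalAgent(ABC):
    @abstractmethod
    def initialize(self, repo_path: str) -> None:
        """Index or otherwise prepare the repository before evaluation."""

    @abstractmethod
    def retrieve(self, query: str) -> list[tuple[str, float | None]]:
        """Return (file_path, optional_relevance_score) pairs for a query."""

    @abstractmethod
    def reset(self) -> None:
        """Clear caches so consecutive test cases don't leak state."""

class RipgrepAgent(RetrievalAgent):
    """Keyword-search baseline: every file containing the query string is a hit."""
    def initialize(self, repo_path):
        self.repo_path = repo_path
    def retrieve(self, query):
        out = subprocess.run(
            ["rg", "--files-with-matches", "--ignore-case", "--fixed-strings",
             query, self.repo_path],
            capture_output=True, text=True,
        ).stdout
        return [(path, None) for path in out.splitlines()]
    def reset(self):
        pass
```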
What we found
Cheetah AI achieved a 70% F1 score on our benchmark after implementing the improvements our testing revealed, with an average query time of 3.47 seconds. For context, a simple keyword search baseline using ripgrep scores around 10-15% F1. So Cheetah AI delivers roughly 5-7x better accuracy than naive text matching.
Initial testing showed more modest results. Before optimization, Cheetah AI scored 25.6% F1 on Django queries and 6.7% F1 on VS Code queries. The difference came down to structure: Django has a flatter directory hierarchy, while VS Code has deeply nested paths that are harder to match. This gap pointed us toward specific improvements.
Latency stayed consistent across query types. The GPU-accelerated semantic ranking adds overhead compared to pure text search, but the accuracy gains justify the cost. Most developers would rather wait a few seconds for relevant results than get instant irrelevant ones. The p50 latency of 3.47 seconds feels interactive for complex searches.
Query quality matters most
The most surprising finding: query formulation affects results more than we expected. The same search system with different query phrasings showed F1 scores ranging from 0.0 to 0.444. That's the difference between total failure and excellent results, just from how the question was asked.
Specific technical terms work best. Queries containing error codes, class names, or function signatures consistently outperformed vague descriptions. A query like "fields.E004 system check for unordered iterables" found the right files with 44% F1. A query like "fix validation crash" returned noise with 8% F1. Same underlying task, vastly different results.
File type hints also help dramatically. When searching for CSS changes, including "CSS" or "styles" in the query improved results from 0% to 33% F1. Without those hints, the search returned Python files that reference the same component names. The search system understood the semantic meaning but not the file type intent.
File type blindness
One of the clearest patterns in our data: the initial system consistently missed certain file types. Test files were found 0% of the time across all queries. Documentation files were found 0% of the time. CSS files in nested directories were found 0% of the time. Implementation files were found 38% of the time.
This file type blindness explained much of the gap between Django and VS Code performance. Django commits often touch implementation files that the search found. VS Code commits frequently touch CSS files and deeply nested test files that the search missed entirely. The semantic understanding was there, but the file type awareness wasn't.
The fix was straightforward: boost relevance scores based on file patterns in the query. When a query mentions "test," boost files with "test" in the path. When a query mentions "CSS" or "styling," boost files ending in .css. This simple heuristic added 15-20% to our F1 scores.
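A sketch of that heuristic; the keyword-to-pattern table and the 0.2 bonus are illustrative values, not the production weights.

```python
FILE_TYPE_HINTS = {
    "test": lambda p: "test" in p.lower(),
    "css": lambda p: p.endswith(".css"),
    "styling": lambda p: p.endswith((".css", ".scss")),
    "docs": lambda p: p.endswith((".md", ".rst", ".txt")),
}

def boost_by_file_type(query, results, bonus=0.2):
    """results is a list of (path, score); boost paths matching query hints."""
    active = [match for kw, match in FILE_TYPE_HINTS.items() if kw in query.lower()]
    rescored = [(path, score + bonus * sum(m(path) for m in active))
                for path, score in results]
    return sorted(rescored, key=lambda item: item[1], reverse=True)
```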
The path to 70% accuracy
The benchmark didn't just measure performance. It showed us exactly where to improve. We implemented changes in three phases. Quick wins took two hours and boosted F1 from 25% to 35-45%. Medium-term improvements over a week pushed us to 50-60%. Advanced features over the following weeks got us to 70%.
The quick wins were result filtering and file type boosting. Limiting results to the top 10 with a minimum confidence threshold eliminated most false positives. Boosting file types based on query intent found the test and documentation files we were missing. Simple caching made repeated queries instant.
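Roughly what those quick wins look like in code; the 0.3 confidence threshold and the cache size are assumptions.

```python
from functools import lru_cache

def filter_results(results, top_k=10, min_score=0.3):
    """Keep only confident hits, capped at the top_k highest-scoring files."""
    kept = [(path, score) for path, score in results if score >= min_score]
    return sorted(kept, key=lambda item: item[1], reverse=True)[:top_k]

def with_query_cache(retrieve_fn, maxsize=256):
    """Wrap a retrieve function so repeated queries return instantly."""
    @lru_cache(maxsize=maxsize)
    def cached(query):
        return tuple(filter_results(retrieve_fn(query)))
    return cached
```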
Medium-term improvements included path-based boosting and multi-query strategies. Files in the same directory as high-scoring files got relevance boosts. Complex queries got split into simpler sub-queries with merged results. Query enhancement automatically added context to vague searches.
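A sketch of the path-based boost; the neighbor bonus and the top-3 cutoff are illustrative.

```python
import os

def boost_neighbors(results, top_n=3, bonus=0.1):
    """Give a small bonus to files in the same directory as the best-scoring hits."""
    ranked = sorted(results, key=lambda item: item[1], reverse=True)
    top_dirs = {os.path.dirname(path) for path, _ in ranked[:top_n]}
    return sorted(
        [(path, score + (bonus if os.path.dirname(path) in top_dirs else 0.0))
         for path, score in results],
        key=lambda item: item[1], reverse=True,
    )
```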
Lessons from the data
Multi-component features are hardest. When a commit touches files across multiple subsystems, search accuracy drops significantly. The semantic models understand individual components well but struggle to connect disparate parts of a codebase. A query about "PostgreSQL HStore introspection" needs to find files in apps, backends, introspection, docs, and tests. That's five different subsystems.
Repository structure matters more than language. We expected Python versus TypeScript to be the main difference between Django and VS Code. Instead, directory depth was the key factor. Django's flatter structure made files easier to find. VS Code's deep nesting created longer paths that were harder to match and rank correctly.
Precision and recall trade off predictably. Systems that return more files have higher recall but lower precision. Systems that return fewer files have higher precision but lower recall. The optimal balance depends on use case. For AI coding assistants, we lean toward precision because context windows are limited.
Open source benchmark
We've open-sourced the entire benchmark framework. You can generate gold sets from your own repositories, evaluate any search system that implements our agent interface, and compare results across different tools. The framework handles timing, randomization, metrics calculation, and report generation.
View the Code Search Benchmark on GitHub →
The codebase is organized into clear modules. The dataset package handles gold set generation from Git history. The evaluation package runs tests and collects results. The metrics package calculates F1 scores and latency percentiles. The reporting package generates markdown reports and visualizations.
We encourage other teams building code search tools to use this benchmark. Standardized evaluation helps the entire ecosystem improve. When everyone measures the same way, we can have meaningful conversations about what works and what doesn't. The gold sets we generated from Django and VS Code are included as standard test suites.
Try Cheetah AI
The benchmark validates what we've built: a code search system that delivers meaningful accuracy improvements over keyword search while maintaining interactive response times. 70% F1 means most queries return mostly relevant files. 3.47 seconds means results arrive while you're still thinking about the problem.
There's still room to improve, and the benchmark gives us a clear path forward. Every change we make gets measured against the same test suite. We know exactly which queries we're getting right and which ones need work. That's the power of rigorous benchmarking.
Ready to experience intelligent code search? Get Cheetah AI today.