Cheetah Architecture: Building Intelligent Code Search
Discover how Cheetah AI built a hybrid code search system that combines local text search for instant response with GPU-accelerated semantic ranking in the cloud.



Building Intelligent Code Search: A Hybrid Approach to Speed and Relevance
TL;DR: We built a hybrid code search system that:
- Runs initial text search locally for instant response
- Uses GPU-accelerated semantic ranking in the cloud
- Keeps your code private with minimal data transfer
- Integrates natively with AI coding assistants via MCP
- Delivers relevant results in under 4 seconds
The result: Fast, intelligent code search that understands what you're looking for.
A different approach to retrieval
Code search today falls into two camps: traditional text-based tools that are fast but semantically unaware, and cloud-based semantic search systems that understand context but suffer from latency and infrastructure costs.
Traditional tools like grep and ripgrep excel at speed, scanning millions of lines in milliseconds, but they're fundamentally pattern matchers. They find exact matches without understanding what the code does. On the other end, modern embedding-based search systems powered by transformer models can grasp semantic meaning but require extensive preprocessing, large vector databases, and significant computational resources.
At Cheetah AI, we've taken a different path: a hybrid architecture that leverages local text search for candidate generation and GPU-accelerated semantic models for intelligent reranking. This approach combines the millisecond response times developers expect with the contextual understanding modern AI enables.
The challenge of real-time code understanding
Developers don't search code in isolation. They're working in specific contexts, on particular branches, with recent changes, debugging specific issues. A search system needs to understand not just what code exists, but what code is relevant right now.
Consider a developer searching for "authentication middleware" in a large codebase. A naive text search might return dozens of matches: import statements, comments, test files, deprecated code. What they actually need is the active implementation, the one being used in the current execution path.
This requires more than keyword matching. It demands understanding of code structure, call hierarchies, and contextual relevance. Traditional search can't provide this. Pure semantic search is too slow for interactive use. We needed something in between.

Client-side candidate generation
The foundation of our approach is aggressive candidate generation on the client machine. Using optimized ripgrep binaries bundled with our tool, we perform parallel text searches across multiple derived search terms. This happens entirely locally, with no network latency and no data leaving the developer's machine.
The key insight is that text search, while semantically naive, has an extraordinarily low false negative rate when used correctly. If authentication middleware exists in your codebase and contains those words, ripgrep will find it. The challenge isn't missing relevant code; it's filtering out the irrelevant matches.
Rather than searching for a single term, we extract primary search terms and variations, running parallel searches with configurable context windows. Each match includes surrounding lines of code, giving us enough context for later semantic analysis. With careful tuning, this initial phase completes quickly even on large codebases.
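As a rough illustration of this stage, the sketch below fans out one ripgrep query per derived search term and collects the JSON match events. It assumes an `rg` binary on the PATH and is a simplified stand-in for the bundled, tuned binaries described above; the function names are hypothetical.

```python
# Illustrative sketch: run ripgrep in parallel over several derived search terms.
# Assumes an `rg` binary on PATH; -C (context lines) and --json are standard ripgrep flags.
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor

def rg_search(term: str, root: str, context: int = 3) -> list[dict]:
    """Run one ripgrep query and return its JSON match events."""
    proc = subprocess.run(
        ["rg", "--json", "-C", str(context), "-i", term, root],
        capture_output=True, text=True,
    )
    matches = []
    for line in proc.stdout.splitlines():
        event = json.loads(line)
        if event.get("type") == "match":
            matches.append(event["data"])
    return matches

def candidate_generation(terms: list[str], root: str) -> dict[str, list[dict]]:
    """Fan out one search per derived term; everything stays on the local machine."""
    with ThreadPoolExecutor(max_workers=max(1, len(terms))) as pool:
        results = pool.map(lambda t: rg_search(t, root), terms)
    return dict(zip(terms, results))

# e.g. candidate_generation(["authentication middleware", "auth_middleware", "AuthMiddleware"], ".")
```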
Smart deduplication and context expansion
Parallel text searches naturally generate overlapping results. The same function might match multiple search terms. We've developed a deduplication strategy that preserves the most contextually rich matches while eliminating redundancy.
The system tracks which keywords each match corresponds to, ensuring we keep diverse results that cover different aspects of the query rather than multiple matches for the same code. For each file, we select the best match per keyword based on context size and relevance indicators.
After deduplication, we expand context by reading the full code spans around each match. This gives us complete functions or class definitions rather than just snippets, providing the semantic models with enough information to make intelligent decisions. The read phase uses parallel file I/O with bounded concurrency to keep latency low.
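A minimal sketch of that idea, with hypothetical names: keep the most context-rich match per (file, keyword) pair, then read the surrounding span for each survivor using bounded parallel I/O.

```python
# Illustrative deduplication and context expansion; not the production heuristics.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Match:
    path: str
    keyword: str
    line: int        # 0-based line of the match
    context: str     # matched line plus surrounding lines from the grep stage

def deduplicate(matches: list[Match]) -> list[Match]:
    """Keep one best match per (file, keyword), scored here by context richness."""
    best: dict[tuple[str, str], Match] = {}
    for m in matches:
        key = (m.path, m.keyword)
        if key not in best or len(m.context) > len(best[key].context):
            best[key] = m
    return list(best.values())

def expand_context(matches: list[Match], span: int = 40, workers: int = 8) -> list[str]:
    """Read the full code span around each surviving match with bounded parallel I/O."""
    def read_span(m: Match) -> str:
        with open(m.path, encoding="utf-8", errors="replace") as f:
            lines = f.readlines()
        start = max(0, m.line - span // 2)
        return "".join(lines[start:start + span])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_span, deduplicate(matches)))
```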
Code structure awareness
One of the most powerful aspects of our pipeline is its understanding of code structure. Using Tree-sitter parsers, we extract semantic references from high-quality matches: imports, function definitions, class declarations, and function calls.
This enables a second wave of targeted searches. If the initial search finds an important function, we automatically search for its definition, its callers, and related imports. This contextual expansion happens in parallel with the primary search results being processed, adding minimal latency while significantly improving result quality.
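The extraction step might look roughly like the sketch below. It uses the third-party `tree_sitter_languages` helper for ready-made parsers; the node type names come from the Python grammar and differ per language, and the reference types extracted in practice can be richer.

```python
# Hedged sketch of structural reference extraction with Tree-sitter.
# Assumes the `tree_sitter_languages` package; node types shown are Python-specific.
from tree_sitter_languages import get_parser

INTERESTING = {"import_statement", "import_from_statement",
               "function_definition", "class_definition", "call"}

def extract_references(source: bytes, language: str = "python") -> list[tuple[str, str]]:
    """Walk the syntax tree and collect imports, definitions, and calls."""
    parser = get_parser(language)
    tree = parser.parse(source)
    refs = []
    stack = [tree.root_node]
    while stack:
        node = stack.pop()
        if node.type in INTERESTING:
            refs.append((node.type, node.text.decode("utf-8", "replace")[:80]))
        stack.extend(node.children)
    return refs

# Each extracted name can seed a second, targeted search for definitions and callers.
```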
The system maintains a search context throughout the process, tracking which patterns have been seen to avoid redundant searches while building a semantic map of the relevant code. This approach mimics how experienced developers navigate codebases: starting with a keyword, then following references to understand the broader context.
GPU-accelerated semantic ranking
With code spans generated locally, the critical question becomes: which ones actually answer the developer's question? This is where GPU acceleration makes the difference between a research project and a production tool.
We employ a two-stage GPU pipeline. The first stage uses sparse embeddings to score all matches based on lexical-semantic similarity. This technique captures both exact keyword matches and semantic relationships while being extremely efficient to compute on GPU hardware.
The sparse embedding approach provides fast initial filtering that's critical for interactive response times. The top matches from this stage proceed to deeper analysis.
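Conceptually, the sparse stage reduces to a dot product over token-weight vectors. The sketch below abstracts away the encoder (which runs batched on GPU) and shows only the scoring and top-k cut; all names are illustrative.

```python
# Minimal sketch of sparse scoring: query and code spans as sparse token->weight vectors
# (e.g. from a SPLADE-style model), ranked by dot product. The encoder is assumed.
def sparse_dot(query_vec: dict[int, float], doc_vec: dict[int, float]) -> float:
    # Iterate over the smaller vector; only shared tokens contribute to the score.
    small, large = sorted((query_vec, doc_vec), key=len)
    return sum(w * large.get(tok, 0.0) for tok, w in small.items())

def sparse_filter(query_vec: dict[int, float],
                  doc_vecs: list[dict[int, float]],
                  top_k: int = 50) -> list[int]:
    """Return indices of the top_k candidate spans by lexical-semantic similarity."""
    scored = sorted(enumerate(doc_vecs),
                    key=lambda iv: sparse_dot(query_vec, iv[1]),
                    reverse=True)
    return [idx for idx, _ in scored[:top_k]]
```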
Cross-encoder reranking
The final reranking stage uses cross-encoder models optimized for ONNX Runtime with CUDA execution. These models consider the query and code together, capturing interaction effects that bi-encoder approaches miss.
Cross-encoders are computationally expensive, too expensive to run on every match from the initial search. By using sparse embeddings to filter down to the most promising results first, we make cross-encoder reranking practical for interactive use.
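A hedged sketch of this stage using ONNX Runtime's Python API with the CUDA execution provider; the model file, tokenizer, and input tensor names are placeholders that depend on how the reranker was exported.

```python
# Cross-encoder reranking sketch with ONNX Runtime on CUDA (falls back to CPU if absent).
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")  # placeholder model
session = ort.InferenceSession(
    "reranker.onnx",  # placeholder export path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

def rerank(query: str, spans: list[str], top_k: int = 10) -> list[int]:
    """Score each (query, span) pair jointly and return indices of the best spans."""
    enc = tokenizer([query] * len(spans), spans, padding=True, truncation=True,
                    max_length=512, return_tensors="np")
    # Only feed the tensors the exported model actually expects.
    expected = {i.name for i in session.get_inputs()}
    inputs = {k: v for k, v in enc.items() if k in expected}
    logits = session.run(None, inputs)[0].squeeze(-1)
    order = np.argsort(-logits)
    return order[:top_k].tolist()
```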
This staged approach (fast sparse scoring followed by deeper cross-encoding) mirrors how humans process information: quickly eliminating obviously irrelevant options, then carefully comparing the promising ones.
Intelligent context truncation
A key challenge in any search pipeline is balancing context size against processing cost. More context enables better semantic understanding, but increases latency and computational requirements.
Our system uses strategic truncation at each stage. Initial grep matches include small context windows. Span reading expands to full functions. GPU scoring uses the critical portions of each span that capture the essential semantics without overwhelming the models.
This isn't arbitrary truncation. We preserve the matched text and surrounding context in a way that maintains semantic coherence. For final reasoning, we pass full context to language models, but only for the top results. Each stage trades off between breadth and depth appropriately for its role in the pipeline.
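The sketch below shows one way to implement that kind of truncation: keep the matched text, trim the span to a per-stage character budget, and snap to line boundaries so the snippet stays readable. It illustrates the principle rather than the exact heuristics.

```python
# Illustrative truncation around the matched text; the budget is a per-stage knob.
def truncate_span(span: str, matched: str, budget: int = 1200) -> str:
    if len(span) <= budget:
        return span
    center = span.find(matched)
    if center == -1:
        return span[:budget]
    start = max(0, center - (budget - len(matched)) // 2)
    end = min(len(span), start + budget)
    # Snap to line boundaries to avoid cutting a statement in half.
    start = span.rfind("\n", 0, start) + 1
    next_newline = span.find("\n", end)
    end = next_newline if next_newline != -1 else len(span)
    return span[start:end]
```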
Privacy and security by design
One of the fundamental constraints we embraced is that developers' code should stay on their machines as much as possible. The hybrid architecture naturally supports this: all initial filtering, parsing, and search happens locally.
Only the selected code spans are sent to cloud services for semantic ranking. Developers maintain complete control over what code leaves their environment, as the local grep and AST parsing stages require no network connectivity.
This design also has performance benefits. Local processing eliminates network round-trips for the most computationally light stages (text search and parsing), while reserving expensive network calls for the stages that genuinely benefit from specialized GPU infrastructure.
Scaling considerations
The hybrid architecture scales in interesting ways. Client-side text search scales linearly with codebase size but benefits from modern SSD speeds and multicore parallelism. With properly configured ignore patterns, even large codebases stay fast to search.
Server-side GPU scaling is more conventional. We can shard large workloads across multiple GPUs or handle concurrent requests by batching. The stateless nature of the semantic ranking stages makes horizontal scaling straightforward.
The most interesting scaling challenge is managing result volume. Very broad queries might generate many initial matches. We handle this through aggressive early filtering, smart deduplication, and cutoffs that maintain quality while capping computational cost. The system degrades gracefully, returning the best results it can within time budgets rather than failing on large result sets.
The integration advantage
One aspect that's harder to quantify but crucial in practice is integration with developer workflows. By running the first stages locally, we integrate naturally with existing tools. Developers using Cursor, Kiro, or other AI coding assistants get search results that respect their ignore files, honor their directory structure, and understand their branch state.
The MCP (Model Context Protocol) integration makes code search a native capability for AI assistants. The assistant handles query formulation and keyword extraction, the search tool handles finding relevant code, and the GPU services handle semantic understanding. Each component focuses on what it does best.
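To make the shape of that integration concrete, here is a minimal MCP server sketch using the FastMCP helper from the `mcp` Python SDK; the tool name and the stubbed `hybrid_search` backend are hypothetical, not Cheetah's actual implementation.

```python
# Hypothetical MCP server exposing code search as a native tool for AI assistants.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("code-search")  # placeholder server name

def hybrid_search(query: str, root: str) -> list[str]:
    """Stand-in for the local grep/read/parse stages plus remote semantic ranking."""
    return []  # a real implementation would return ranked code spans

@mcp.tool()
def search_code(query: str, root: str = ".", max_results: int = 10) -> list[str]:
    """Find code spans relevant to a natural-language query in the local repository."""
    return hybrid_search(query, root)[:max_results]

if __name__ == "__main__":
    mcp.run()  # stdio transport by default, which is how assistants typically connect
```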
This composability extends to the architecture itself. The local stages (grep, read, parse) are reusable for other tools. The GPU services can serve multiple client types. The semantic models can be swapped or upgraded independently. Hybrid architectures with clear component boundaries are easier to evolve than monolithic systems.
Lessons learned
Building this system taught us several valuable lessons. First, GPU acceleration matters most when the baseline is already fast. Shaving off seconds changes user behavior more than shaving off milliseconds. We optimized the initial text search stage to an extreme because it sets expectations for the entire pipeline.
Second, semantic models need the right context size. Too small and they miss important relationships. Too large and they get confused by irrelevant details. The sweet spot varies by model and task, requiring careful empirical testing rather than theoretical optimization.
Third, developer tools need to degrade gracefully. Network issues shouldn't break code search. GPU unavailability shouldn't stop local text search from working. We architected fallbacks at every stage, ensuring developers always get some results even when components fail.
Finally, performance predictability matters as much as average performance. A search that's fast most of the time but occasionally slow feels unreliable. We implemented timeouts, caches, and circuit breakers to keep latency distributions tight.
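As an example of the pattern, the sketch below wraps the remote reranking call in a timeout and falls back to a locally computed order when the GPU service is slow or unreachable; the function names and the time budget are illustrative.

```python
# Hypothetical graceful-degradation wrapper: prefer remote reranking, but never let a
# slow or unreachable service block the search.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def remote_rerank(query: str, spans: list[str]) -> list[str]:
    """Stand-in for the network call to the GPU ranking service."""
    raise ConnectionError("service unreachable")  # simulate an outage

def local_order(query: str, spans: list[str]) -> list[str]:
    """Cheap lexical fallback: rank by how many query words a span contains."""
    words = query.lower().split()
    return sorted(spans, key=lambda s: -sum(w in s.lower() for w in words))

def rank_with_fallback(query: str, spans: list[str], budget_s: float = 3.0) -> list[str]:
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(remote_rerank, query, spans)
        try:
            return future.result(timeout=budget_s)
        except (FutureTimeout, ConnectionError, OSError):
            return local_order(query, spans)  # degrade to local text-based ordering
```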
Looking forward
The hybrid architecture opens several interesting directions for future work. Local caching of semantic embeddings could eliminate redundant GPU processing for frequently searched code. Pre-indexing commonly accessed repositories could further reduce latency for popular open-source dependencies.
Smarter search using historical patterns could improve initial recall while reducing the volume that needs semantic processing. Learning from user interactions (which results get clicked, which searches get refined) could provide training signal for better ranking models.
Integration with version control systems could make search branch-aware and time-aware, helping developers understand not just what code exists but how it evolved. The local execution model makes these kinds of rich integrations practical in ways that pure cloud solutions cannot match.
Try Cheetah AI
We've built Cheetah AI to solve a real problem: helping developers find code that matters in increasingly complex codebases. The hybrid architecture is our answer to the speed versus intelligence trade-off that has plagued code search for years.
If you're interested in seeing how fast, intelligent code search can work, we'd love to have you try Cheetah AI. It's designed to integrate seamlessly with AI coding assistants like Cursor, Kiro, and Claude Code.