Large Language Models (LLMs) are incredibly powerful, but they have a fundamental blind spot: they have no idea what’s in your project’s codebase. It’s like having a brilliant but amnesiac assistant. You can paste code into the prompt, but that’s slow and clunky. So, how can an AI assistant give truly helpful, context-aware answers about your code?

The answer is a technique called Retrieval-Augmented Generation (RAG). In a nutshell, RAG supercharges an LLM’s prompt with relevant information. For Aye Chat, this means automatically finding the most relevant bits of your code and bundling them with your question.

This post is a deep dive into how we built the RAG system inside Aye Chat. We’ll walk you through the key decisions, the technical hurdles, and the little details that make our approach private, fast, and - most importantly - unobtrusive.

Part 1: Choosing the Right Tools for the Job

A RAG system has two primary components: an embedding model to convert text into numerical representations (vectors) and a vector database to store and efficiently search these vectors.

The Embedding Model

Our first major decision was whether to use a public API for embeddings (like OpenAI’s) or a local, on-device model. We chose the local-first approach for several key reasons:

  1. Your Code Stays Your Code: This was non-negotiable for us. Your source code is your secret sauce, and we believe it should never leave your machine. Using a third-party API for a core feature like this was simply not an option.
  2. Cost & Rate-Limiting: API-based embeddings can become expensive, especially for large projects that require frequent re-indexing. They are also subject to rate limits, which can slow down the initial indexing process.
  3. Simplicity: We wanted Aye Chat to be a self-contained, easy-to-install command-line tool. Adding dependencies on external APIs for core functionality complicates the setup and user experience.

With the local approach decided, we needed a model that was effective but also lightweight. A major goal was to avoid forcing users to install heavy frameworks like PyTorch or TensorFlow, which can be a significant barrier. This led us to select ONNXMiniLM_L6_V2, a model available in the ONNX (Open Neural Network Exchange) format. As seen in aye/model/vector_db.py, this model is conveniently packaged with ChromaDB and runs on a lightweight ONNX runtime, sidestepping the need for multi-gigabyte deep learning libraries.

# aye/model/vector_db.py

# Use the lightweight ONNX embedding function included with chromadb
from chromadb.utils.embedding_functions import ONNXMiniLM_L6_V2

# ...

    # Instantiate the lightweight ONNX embedding function.
    # This avoids pulling in PyTorch and is much smaller.
    embedding_function = ONNXMiniLM_L6_V2()

The Vector Database

Next, we needed a vector database. The requirements were clear: it had to be embeddable, run locally, be easy to install (pip install ...), and perform well.

We evaluated several options:

  • Milvus Lite: While promising, it felt more like a stepping stone to the full, distributed Milvus. For a purely local, single-user CLI tool, it seemed like overkill and potentially more complex to manage.
  • LanceDB: A strong, modern contender built on the Lance file format. It’s very fast and efficient.
  • ChromaDB: An open-source, embeddable vector database that has gained significant popularity.

We ultimately chose ChromaDB. The decision was based on its excellent balance of features, ease of use, and maturity. The chromadb.PersistentClient allows us to create a self-contained database right inside the project’s .aye/ directory, requiring no external services. Furthermore, its seamless integration with the ONNXMiniLM_L6_V2 embedding function was a significant advantage, simplifying the implementation. We configured it to use cosine similarity, a standard metric for measuring the similarity between text vectors.

# aye/model/vector_db.py

from pathlib import Path
from typing import Any

import chromadb
from chromadb.utils.embedding_functions import ONNXMiniLM_L6_V2

def initialize_index(root_path: Path) -> Any:
    db_path = root_path / ".aye" / "chroma_db"
    db_path.mkdir(parents=True, exist_ok=True)

    client = chromadb.PersistentClient(path=str(db_path))
    
    embedding_function = ONNXMiniLM_L6_V2()

    collection = client.get_or_create_collection(
        name="project_code_index",
        embedding_function=embedding_function,
        metadata={"hnsw:space": "cosine"}  # Cosine similarity is good for text
    )
    return collection

Part 2: The Indexing Architecture

With the tools selected, we designed the indexing and search process. The IndexManager class in aye/model/index_manager.py is the heart of this system.

Creating the Index: A Two-Phase Approach

We didn’t want you staring at a progress bar, so we came up with a two-phase strategy. Think of it like unpacking after a move.

  • Phase 1 (Coarse Indexing): First, you quickly dump all the boxes in the correct rooms. This is our ‘coarse’ pass where we index each entire file. It’s super fast and gives you a usable index almost instantly.
  • Phase 2 (Refined Indexing): Then, in the background, you start the real work: unpacking each box. This is our ‘refined’ pass, where we use tree-sitter to break files down into smart, semantic chunks like functions and classes. The search quality gets better and better as this process runs.

Phase 1: Coarse Indexing

When Aye Chat starts, it performs a quick scan of the project to find new or modified files. For each of these files, it creates a single vector for the entire file content and adds it to the index. The document ID is simply the file path.

# aye/model/vector_db.py

def update_index_coarse(
    collection: Any, 
    files_to_update: Dict[str, str]
) -> None:
    # ...
    ids = list(files_to_update.keys())
    documents = list(files_to_update.values())
    # ...
    collection.upsert(ids=ids, documents=documents, metadatas=metadatas)

This process is very fast. It gives the user a usable, albeit imprecise, search index almost immediately. Searching at this stage can identify which files are relevant, even if it can’t pinpoint the exact lines of code.

Phase 2: Refined Indexing with tree-sitter

After the coarse pass is complete, another background process kicks in. This is where the semantic processing happens. Instead of simply splitting files by lines, we now use tree-sitter to perform semantic chunking.

As seen in aye/model/ast_chunker.py, we parse the source code into an Abstract Syntax Tree (AST). We then run language-specific queries against this tree to extract meaningful, self-contained code blocks like functions, classes, methods, or interfaces.

# aye/model/ast_chunker.py

CHUNK_QUERIES = {
    "python": """
    (function_definition) @chunk
    (class_definition) @chunk
    """,
    "javascript": """
    (function_declaration) @chunk
    (class_declaration) @chunk
    ...
    """,
    # ... and so on for other languages
}

Each of these AST nodes becomes a “document” in our vector database. This is a massive improvement over naive chunking because each vector now represents a complete logical unit of code, leading to far more precise and relevant search results.
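
To make that concrete, here is a simplified sketch of what AST-based chunking can look like. It walks the parse tree directly and collects top-level function and class nodes instead of running the capture queries, and it assumes the tree_sitter_languages helper package and Python node types; the actual ast_chunker in Aye Chat is more involved.

# Simplified stand-in for ast_chunker (illustrative, not the real implementation).
from tree_sitter_languages import get_parser  # assumed helper package with prebuilt grammars

CHUNK_NODE_TYPES = {"function_definition", "class_definition"}  # Python-only example

def sketch_ast_chunker(content: str, language_name: str) -> list:
    source = content.encode("utf-8")
    tree = get_parser(language_name).parse(source)

    chunks = []
    for node in tree.root_node.children:
        if node.type in CHUNK_NODE_TYPES:
            # Slice the original bytes so each chunk matches the source exactly.
            chunks.append(source[node.start_byte:node.end_byte].decode("utf-8"))
    return chunks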

The refine_file_in_index function in aye/model/vector_db.py orchestrates this: it deletes the old whole-file entry and upserts the new, semantically chunked documents. For languages not yet supported by our AST chunker, or if parsing fails, we gracefully fall back to a simple line-based chunker to ensure all files are indexed.

# aye/model/vector_db.py

def refine_file_in_index(collection: Any, file_path: str, content: str):
    # 1. Delete the old coarse chunk...
    collection.delete(ids=[file_path])

    # 2. Create and upsert the new fine-grained chunks.
    language_name = get_language_from_file_path(file_path)
    chunks = []
    if language_name:
        chunks = ast_chunker(content, language_name)

    # Fallback to line-based chunking...
    if not chunks:
        chunks = _chunk_file(content)
    # ...
    collection.upsert(documents=chunks, metadatas=metadatas, ids=ids)

This progressive approach provides the best of both worlds: immediate availability and progressively improving search quality, all without blocking the user.

The Search: Finding Needles in a Haystack

When the user enters a prompt, the llm_invoker calls the query method of the IndexManager. This in turn calls ChromaDB’s query function. We ask for a generous 300 results to ensure we have a wide pool of potentially relevant code chunks. ChromaDB returns the chunks and their “distance” from the query. We convert this to a more intuitive similarity score (where 1.0 is a perfect match) by calculating 1 - distance.
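
Here is a minimal sketch of that lookup against the ChromaDB collection; user_prompt and ranked_chunks are illustrative names, not the exact IndexManager code.

# Minimal sketch of the lookup and score conversion.
results = collection.query(
    query_texts=[user_prompt],   # the raw prompt text
    n_results=300,               # a wide pool of candidate chunks
    include=["metadatas", "distances"],
)

# ChromaDB returns one result list per query text; we sent exactly one.
ranked_chunks = []
for chunk_id, metadata, distance in zip(
    results["ids"][0], results["metadatas"][0], results["distances"][0]
):
    similarity = 1.0 - distance  # cosine distance -> score where 1.0 is a perfect match
    ranked_chunks.append((chunk_id, metadata, similarity))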

Part 3: Engineering for a Seamless User Experience

A powerful RAG system is useless if it makes the host application slow or unreliable. We invested significant effort into making the indexing process as unobtrusive as possible.

Background Processing

All indexing work happens in a background thread, as initiated in aye/controller/repl.py. A standard ThreadPoolExecutor would create non-daemon threads, which would prevent the application from exiting until the indexing was complete. To fix this, we implemented a custom DaemonThreadPoolExecutor in aye/model/index_manager.py. This small but critical change ensures that background indexing is automatically terminated when the user quits the chat.
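
The underlying idea is simple: daemon threads never block interpreter shutdown. Here is a toy illustration of that behavior, not the actual DaemonThreadPoolExecutor implementation:

import queue
import threading

tasks: queue.Queue = queue.Queue()

def worker() -> None:
    while True:
        job = tasks.get()   # blocks until an indexing job is queued
        job()
        tasks.task_done()

# daemon=True means these workers are killed when the main thread exits,
# so quitting the chat never has to wait for a long indexing run to finish.
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()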

Limiting CPU Impact

Calculating embeddings is CPU-intensive. To prevent Aye Chat from hogging system resources and causing UI lag, we implemented two key constraints:

  1. Worker Count: We limit the number of background indexing threads to half the available CPU cores, with a maximum of 4. This leaves plenty of CPU cycles for the main application and other user tasks.
  2. Process Priority: On POSIX-compliant systems (like Linux and macOS), we use os.nice(5) to lower the priority of the background worker threads. We’re essentially telling the operating system, ‘Hey, this indexing is important, but don’t let it slow down whatever the user is actively doing.’ It’s all about being a good citizen on your machine.

# aye/model/index_manager.py

def _set_low_priority():
    if hasattr(os, 'nice'):
        os.nice(5)

# ...
MAX_WORKERS = min(4, max(1, CPU_COUNT // 2))

Robust, Resumable Indexing

Initial indexing of a large project can and will still take time. If the user quits halfway through, they shouldn’t have to start from scratch next time. We built robustness into the process by regularly saving the state of our file hash index to disk. The IndexManager saves its progress to .aye/file_index.json after every 20 files (SAVE_INTERVAL). If the process is interrupted, the next run will pick up right where it left off, only needing to process the remaining files.
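
A rough sketch of the pattern, with illustrative names rather than the actual IndexManager internals:

import hashlib
import json
from pathlib import Path

SAVE_INTERVAL = 20  # persist progress every N files

def index_files(file_paths, index_path: Path, process_file) -> None:
    # Load whatever progress a previous (possibly interrupted) run left behind.
    file_index = json.loads(index_path.read_text()) if index_path.exists() else {}

    for i, path in enumerate(file_paths, start=1):
        content = Path(path).read_bytes()
        digest = hashlib.sha256(content).hexdigest()
        if file_index.get(str(path)) == digest:
            continue                      # unchanged since the last saved state
        process_file(path, content)       # e.g. chunk, embed, and upsert
        file_index[str(path)] = digest
        if i % SAVE_INTERVAL == 0:
            index_path.write_text(json.dumps(file_index))

    index_path.write_text(json.dumps(file_index))  # final save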

Part 4: From Search Results to LLM Prompt

Once the vector search returns a ranked list of code chunks, the final step is to assemble the context to be sent with the prompt. This is handled in aye/controller/llm_invoker.py.

First, we create a unique, ranked list of files from the returned chunks. A file that appears multiple times is ranked by its highest-scoring chunk.
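
That deduplication step looks roughly like this, continuing with the ranked_chunks list from the earlier sketch; the "file_path" metadata key is an assumption.

# Collapse chunk-level hits into one score per file, keeping each file's best chunk.
best_score_per_file = {}
for chunk_id, metadata, similarity in ranked_chunks:
    file_path = metadata["file_path"]           # assumed metadata key
    if similarity > best_score_per_file.get(file_path, -1.0):
        best_score_per_file[file_path] = similarity

# Highest-scoring files first.
ranked_files = sorted(best_score_per_file, key=best_score_per_file.get, reverse=True)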

Then, we iterate through this list of files, adding their full content to the context. This is where we manage the context window size with a system of soft and hard limits:

  • Soft Limit (CONTEXT_TARGET_SIZE): We aim to pack about 180KB of context. The loop continues adding files as long as the total size is below this threshold.
  • Hard Limit (CONTEXT_HARD_LIMIT): To prevent API errors from a payload that is too large, we have a hard limit of 200KB. Before adding a file, we check if it would push the total size over this limit. If so, we skip that file and try the next, smaller one in the ranked list.

This logic ensures we prioritize the most relevant files while respecting API limitations.

# aye/controller/llm_invoker.py

# Stop if we've already packed enough context (soft limit).
if current_size > CONTEXT_TARGET_SIZE:
    break

# ...

# Check if adding this file would exceed the hard limit.
if current_size + file_size > CONTEXT_HARD_LIMIT:
    continue # Skip this file and try the next one.

Finally, if a project is small and its total size is under CONTEXT_HARD_LIMIT, there is no need for any of this: we bypass RAG entirely and include every single file in the project that matches the file mask.

Part 5: The Road Ahead

This implementation successfully establishes a robust, performant, and privacy-preserving foundation. With the move to AST-based semantic chunking, we’ve significantly improved the core retrieval logic. However, there are always opportunities for enhancement. Here is our roadmap for making retrieval quality even better.

1. Expanding Language Support and Refining AST Queries

While our tree-sitter implementation covers many popular languages, the quality of chunking is only as good as the AST queries we write.

The Path Forward: We plan to continuously expand the set of supported languages. For existing languages, we will refine our AST queries to handle more edge cases and different coding patterns, ensuring we capture the most logical and self-contained units of code. This is an ongoing process of improvement to maximize the semantic value of our chunks.

2. More Precise Context Assembly

Currently, the system retrieves relevant chunks but then includes the entire content of the parent file in the LLM prompt. While simple and often effective (as it provides broader context), this can be inefficient. If a relevant chunk is found in a large file, the prompt becomes flooded with irrelevant code, which can confuse the LLM and lead to the “lost in the middle” problem where critical information is ignored.

The Path Forward: Our next step is to be more surgical. Today we bring a snippet’s whole file along for the ride, which is like inviting a friend to a party and having them bring their entire extended family. Instead of sending the entire file, we will assemble context directly from the top-k retrieved chunks, perhaps including their immediate neighbors (like a parent function or class definition) to provide local awareness without overwhelming the context window.
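
A sketch of what that chunk-level assembly might look like; the field names and budget handling are illustrative, since this is future work rather than current behavior.

# Hypothetical chunk-level context assembly (future direction, not current behavior).
def assemble_chunk_context(top_chunks, budget_bytes: int) -> str:
    parts, used = [], 0
    for chunk in top_chunks:                       # best-scoring chunks first
        header = f"# {chunk['file_path']} (lines {chunk['start_line']}-{chunk['end_line']})\n"
        piece = header + chunk["text"] + "\n\n"
        if used + len(piece) > budget_bytes:
            continue                               # too big; try a smaller chunk instead
        parts.append(piece)
        used += len(piece)
    return "".join(parts)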

3. Exploring Code-Specific Embedding Models

We chose ONNXMiniLM_L6_V2 for its small footprint and ease of deployment. It is a pragmatic choice for a general-purpose model. However, it has not been specifically trained on source code.

The Path Forward: We will evaluate and benchmark specialized embedding models that are fine-tuned on code corpora. Models from families like BGE or other code-specific encoders could provide a significant lift in retrieval accuracy by better understanding the semantic nuances of programming languages.
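
ChromaDB already ships a SentenceTransformerEmbeddingFunction wrapper, so trying such a model could be a small code change, at the cost of reintroducing the heavier sentence-transformers/PyTorch dependency we currently avoid. The model name below is only an example:

from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

# Example only: a BGE-family model served through sentence-transformers.
embedding_function = SentenceTransformerEmbeddingFunction(
    model_name="BAAI/bge-small-en-v1.5"
)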

4. Advanced Ranking and Re-ranking

The current ranking logic is simple: rank files based on their single highest-scoring chunk. This can be improved.

The Path Forward: We plan to explore more sophisticated ranking techniques. For example, Reciprocal Rank Fusion (RRF) could be used to generate a more robust file score by considering all retrieved chunks from that file, not just the best one. Furthermore, we may implement a second-stage re-ranker. This would involve taking the top N results from the initial vector search and passing them through a lightweight but more powerful cross-encoder model to re-order them for final inclusion in the prompt, boosting precision.
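
For reference, RRF scores each file by summing 1 / (k + rank) over the positions of its chunks in the result list, with k conventionally set to 60. A small sketch of that idea:

# Sketch of Reciprocal Rank Fusion over retrieved chunks (illustrative).
def rrf_file_scores(chunk_file_paths, k: int = 60) -> dict:
    """chunk_file_paths: source file of each retrieved chunk, best match first."""
    scores = {}
    for rank, file_path in enumerate(chunk_file_paths, start=1):
        scores[file_path] = scores.get(file_path, 0.0) + 1.0 / (k + rank)
    return scores

# A file with several well-ranked chunks now beats a file with a single lucky hit.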

Conclusion

Developing the RAG system for Aye Chat has been an exercise in balancing power, privacy, and performance. Our focus has been on creating a local-first, non-intrusive tool that aligns with our core principles. The current implementation provides a solid foundation for code-aware assistance, and the roadmap we’ve laid out shows our path toward making it even more precise and helpful.


About Aye Chat

Aye Chat is an open-source, AI-powered terminal workspace that brings the power of AI directly into your command-line workflow. Edit files, run commands, and chat with your codebase without ever leaving the terminal.

Support Us 🫶

  • Star 🌟 our GitHub repository. It helps new users discover Aye Chat.
  • Spread the word 🗣️. Share Aye Chat on social media and recommend it to your friends.