# Architecture

## Pipeline Overview

PaperRAG processes academic PDFs through a six-stage pipeline:

```
PDF Files
    |
    v
[Parse] -- Docling extraction with adaptive OCR
    |
    v
[Chunk] -- Section-aware text splitting
    |
    v
[Embed] -- Sentence-transformer vectorization
    |
    v
[Store] -- FAISS index with metadata
    |
    v
[Retrieve] -- Similarity search + re-ranking
    |
    v
[Generate] -- LLM answer with citations
```

## Module Responsibilities

### `parser.py` -- PDF Parsing

- Uses Docling for structured document parsing
- **Adaptive OCR**: inspects each PDF with PyMuPDF to detect whether it contains extractable text. Text-based PDFs skip OCR entirely (2-3x speedup), while scanned PDFs enable OCR automatically.
- Supports CSV manifests for pre-supplied metadata (title, authors, abstract, DOI)
- Falls back to raw text extraction on parse failure

### `chunker.py` -- Text Chunking

- Section-aware splitting that respects document structure
- Configurable chunk size (default: 1000 chars) and overlap (default: 200 chars)
- Preserves metadata (file path, section title) on each chunk

### `embedder.py` -- Embedding

- Uses `sentence-transformers/all-MiniLM-L6-v2` by default
- Batch embedding with configurable batch size
- L2 normalization for cosine similarity
- Deterministic seeding for reproducibility

### `vectorstore.py` -- Vector Storage

- FAISS-based vector index
- Tracks file hashes (SHA-256) for incremental indexing
- Versioned index with config snapshot persistence
- Supports per-file removal for handling deleted/updated PDFs

### `retriever.py` -- Retrieval

- Top-k similarity search with configurable score threshold
- Optional Maximal Marginal Relevance (MMR) for result diversity
- Per-paper result limiting to avoid over-representation

### `llm.py` -- LLM Integration

- Supports two local backends:
  - Ollama for standard model names such as `qwen2.5:1.5b`
  - `llama.cpp` via `llama-server` for local `.gguf` files and HuggingFace GGUF repos
- Streaming responses with citation support
- Descriptive error messages for common LLM failures

### `parallel.py` -- Parallel Indexing

- Multiprocessing-based parallel PDF processing
- Uses `spawn` start method (avoids deadlocks with PyTorch/CUDA)
- Per-PDF timeout for hanging documents
- Batch checkpointing for crash recovery

### `cli.py` -- CLI

- Typer-based command-line interface
- Commands: `index`, `review`, `query`, `evaluate`, plus the default REPL entrypoint
- Rich console output with progress bars (tqdm)

### `repl.py` -- Interactive REPL

- prompt-toolkit-based interactive session
- Command history persistence
- Live settings adjustment via slash commands:
  - `/index`, `/topk`, `/threshold`, `/temperature`
  - `/max-tokens`, `/ctx-size`, `/prompt`, `/model`
  - `/config`, `/rc`, `/help`, `/exit`, `/quit`

### `config.py` -- Configuration

- Pydantic v2 models with validation
- Config snapshot save/load for index reproducibility
- RAM-aware worker auto-detection

## Key Design Decisions

### Deterministic Hashing

Every PDF is identified by its SHA-256 hash. Re-running `paperrag index` only processes new or changed files, making incremental indexing fast.

### Adaptive OCR

Rather than a global OCR toggle, PaperRAG inspects each PDF individually using PyMuPDF. This gives optimal speed on mixed collections containing both text-based and scanned documents.

### Crash Recovery

The index is checkpointed after every batch during indexing. If the process crashes (e.g. OOM), restarting the same command resumes from where it left off.

### Spawn Multiprocessing

PaperRAG forces the `spawn` multiprocessing start method to avoid deadlocks caused by forking processes that use PyTorch, CUDA, or OpenMP.

## Design Principles

These principles guide decisions about UX, API shape, and feature scope.

### Progressive Disclosure

Default behaviour is minimal; complexity surfaces only when the user needs it. Running `paperrag review paper.pdf` requires zero flags. The `/focus`, `/topk`, and `/config` commands exist for users who want more control, but they are never required. New features should follow the same pattern: sensible default first, opt-in complexity second.

### Convention over Configuration

The index location auto-derives from the input path (`/.paperrag-index`). Workers auto-detect from available RAM. The LLM backend is inferred from the model name format (Ollama name vs. `.gguf` path vs. HuggingFace repo ID). `.paperragrc` files are optional overrides, not required setup. A new user should be able to run a useful command without reading the configuration docs.

### Incremental by Default

Re-running any command is always safe and cheap. SHA-256 hashes mean unchanged PDFs are skipped automatically during indexing; no `--skip-cached` flag is needed. This principle extends to index saves (atomic writes via `.tmp` + move) and REPL state (re-indexing only resets what changed).

### Ownership of Output

Each layer is responsible for its own console output. `_handle_index` prints its own progress; the `review` command does not pre-announce what `_handle_index` is about to say. Functions that call other functions do not narrate on their behalf. This keeps output coherent and prevents duplicate or contradictory messages as the codebase grows.

### Local-First, No Cloud Dependencies

Everything runs on-device: FAISS (embedded library, no server), Ollama or llama-server (local inference), sentence-transformers (downloaded once, cached). The design treats offline use as the default, not an edge case. Features that require external services should be clearly opt-in.

### Atomic Persistence

The vector store writes to a `.tmp` file and then uses `shutil.move()` to replace the live index. A crash mid-save leaves the previous index intact. The same principle applies to batch checkpointing during indexing: partial progress is always recoverable.
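
The write-then-move pattern described above can be illustrated with a minimal sketch. It is not PaperRAG's actual code: the helper name `atomic_write_bytes`, the `index.bin` file name, and the explicit `fsync` call are assumptions made for illustration. Only the `.tmp` suffix and the use of `shutil.move()` come from the description above.

```python
import os
import shutil
import tempfile
from pathlib import Path


def atomic_write_bytes(target: Path, payload: bytes) -> None:
    """Write payload to a sibling .tmp file, then move it over target.

    Hypothetical helper, not taken from the PaperRAG source.
    """
    tmp_path = target.with_name(target.name + ".tmp")
    with open(tmp_path, "wb") as fh:
        fh.write(payload)
        fh.flush()
        # Assumption: force the bytes to disk before the move so the
        # replacement file is complete when it becomes visible.
        os.fsync(fh.fileno())
    # On POSIX, when both paths are on the same filesystem, this is a
    # rename that replaces the old file in a single step.
    shutil.move(str(tmp_path), str(target))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        index_file = Path(workdir) / "index.bin"  # illustrative file name
        atomic_write_bytes(index_file, b"first snapshot")
        atomic_write_bytes(index_file, b"second snapshot")
        print(index_file.read_bytes())
```

If the process dies before the move, the live file still holds the previous snapshot and only a stale `.tmp` file is left behind. The single-step replacement depends on platform and filesystem: `shutil.move()` falls back to copy-and-delete when a plain rename is not possible, and that fallback is not atomic.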