Architecture¶
Pipeline Overview¶
PaperRAG processes academic PDFs through a six-stage pipeline:
PDF Files
|
v
[Parse] -- Docling extraction with adaptive OCR
|
v
[Chunk] -- Section-aware text splitting
|
v
[Embed] -- Sentence-transformer vectorization
|
v
[Store] -- FAISS index with metadata
|
v
[Retrieve] -- Similarity search + re-ranking
|
v
[Generate] -- LLM answer with citations
Module Responsibilities¶
parser.py – PDF Parsing¶
Uses Docling for structured document parsing
Adaptive OCR: inspects each PDF with PyMuPDF to detect whether it contains extractable text. Text-based PDFs skip OCR entirely (2-3x speedup), while scanned PDFs enable OCR automatically.
Supports CSV manifests for pre-supplied metadata (title, authors, abstract, DOI)
Falls back to raw text extraction on parse failure
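The per-file OCR decision can be sketched as below. This is a minimal illustration rather than PaperRAG's actual code: the function name `needs_ocr` and the `min_chars_per_page` threshold are assumptions. In practice the page texts would come from PyMuPDF (for example, `page.get_text()` for each page of an opened document); here they are passed in directly so the logic stays self-contained.

```python
from typing import Iterable


def needs_ocr(page_texts: Iterable[str], min_chars_per_page: int = 50) -> bool:
    """Return True when the extracted text layer looks too thin to trust.

    `page_texts` holds the text extracted from each inspected page.
    The threshold is a hypothetical default, not a PaperRAG setting.
    """
    texts = list(page_texts)
    if not texts:
        return True  # nothing extractable at all: treat as scanned
    total_chars = sum(len(t.strip()) for t in texts)
    # Average characters per page below the threshold suggests a scanned PDF.
    return total_chars / len(texts) < min_chars_per_page
```

A text-based PDF with a few hundred characters per page would skip OCR under this rule, while a scan whose pages yield empty strings would enable it.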
chunker.py – Text Chunking¶
Section-aware splitting that respects document structure
Configurable chunk size (default: 1000 chars) and overlap (default: 200 chars)
Preserves metadata (file path, section title) on each chunk
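A minimal sketch of overlapping chunking within a single section, using the documented defaults (1000 characters, 200 overlap). The `Chunk` dataclass and `chunk_section` function are hypothetical names. The sketch assumes "section-aware" means chunks never cross a section boundary, so it is called once per section; the real splitter may also respect sentence or paragraph boundaries.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str
    file_path: str       # metadata carried on every chunk
    section_title: str   # metadata carried on every chunk


def chunk_section(text: str, file_path: str, section_title: str,
                  chunk_size: int = 1000, overlap: int = 200) -> List[Chunk]:
    """Split one section into fixed-size windows that share `overlap` chars."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[Chunk] = []
    for start in range(0, len(text), step):
        chunks.append(Chunk(text[start:start + chunk_size], file_path, section_title))
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the section
    return chunks
```

With the defaults, each window starts 800 characters after the previous one, so consecutive chunks share their last and first 200 characters.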
embedder.py – Embedding¶
Uses sentence-transformers/all-MiniLM-L6-v2 by default
Batch embedding with configurable batch size
L2 normalization for cosine similarity
Deterministic seeding for reproducibility
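L2 normalization matters because the inner product of two unit-length vectors equals their cosine similarity, so an inner-product search over normalized embeddings ranks by cosine. The sketch below shows this with plain Python lists. The helper names are illustrative, and a real implementation would normalize whole batches with array operations.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
```

For any two non-zero vectors `a` and `b`, `dot(l2_normalize(a), l2_normalize(b))` matches `cosine(a, b)` up to floating-point error.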
vectorstore.py – Vector Storage¶
FAISS-based vector index
Tracks file hashes (SHA-256) for incremental indexing
Versioned index with config snapshot persistence
Supports per-file removal for handling deleted/updated PDFs
retriever.py – Retrieval¶
Top-k similarity search with configurable score threshold
Optional Maximal Marginal Relevance (MMR) for result diversity
Per-paper result limiting to avoid over-representation
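Maximal Marginal Relevance can be sketched as a greedy loop that trades query relevance against redundancy with results already chosen. The function below is a generic illustration of the technique, not PaperRAG's retriever: `mmr_select`, `lambda_mult`, and the precomputed similarity inputs are assumptions.

```python
from typing import List, Sequence


def mmr_select(query_sims: Sequence[float],
               doc_sims: Sequence[Sequence[float]],
               k: int, lambda_mult: float = 0.5) -> List[int]:
    """Greedy MMR over candidate indices.

    query_sims[i]  -- similarity of candidate i to the query
    doc_sims[i][j] -- similarity between candidates i and j
    lambda_mult    -- 1.0 means pure relevance, 0.0 means pure diversity
    """
    selected: List[int] = []
    remaining = list(range(len(query_sims)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            # Redundancy is the closest match among already selected results.
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sims[i] - (1.0 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

When two candidates are near-duplicates of each other, the second one is penalized after the first is picked, so a less similar but more distinct chunk can take its place.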
llm.py – LLM Integration¶
Supports two local backends:
Ollama for standard model names such as qwen2.5:1.5b
llama.cpp via llama-server for local .gguf files and HuggingFace GGUF repos
Streaming responses with citation support
Descriptive error messages for common LLM failures
parallel.py – Parallel Indexing¶
Multiprocessing-based parallel PDF processing
Uses spawn start method (avoids deadlocks with PyTorch/CUDA)
Per-PDF timeout for hanging documents
Batch checkpointing for crash recovery
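One way to combine a spawn-based pool with a per-PDF timeout is sketched below using only the standard library. The function names, the default timeout, and the result bookkeeping are assumptions for illustration; PaperRAG's own scheduling may differ. Note that in this simple version the timeout bounds how long the caller waits for each result, and a hung worker is only killed when the pool is terminated on leaving the `with` block.

```python
import multiprocessing as mp
from typing import Any, Callable, Dict, Iterable, Tuple


def get_spawn_context():
    """Return a spawn context without changing the global start method."""
    return mp.get_context("spawn")


def process_with_timeout(worker: Callable[[str], Any],
                         pdf_paths: Iterable[str],
                         workers: int = 2,
                         timeout_s: float = 300.0) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Run `worker` on each path; collect results and per-file failures.

    `worker` must be a picklable top-level function, because spawn starts
    fresh interpreters that import it by name.
    """
    ctx = get_spawn_context()
    results: Dict[str, Any] = {}
    failed: Dict[str, str] = {}
    with ctx.Pool(processes=workers) as pool:
        pending = {p: pool.apply_async(worker, (p,)) for p in pdf_paths}
        for path, handle in pending.items():
            try:
                results[path] = handle.get(timeout=timeout_s)
            except mp.TimeoutError:
                failed[path] = "timeout"
            except Exception as exc:  # worker raised; record and move on
                failed[path] = repr(exc)
    # Leaving the with-block terminates the pool, including hung workers.
    return results, failed
```

Because spawn re-imports the main module in each child, a script calling `process_with_timeout` should do so under an `if __name__ == "__main__":` guard.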
cli.py – CLI¶
Typer-based command-line interface
Commands: index, review, query, evaluate, plus the default REPL entrypoint
Rich console output with progress bars (tqdm)
repl.py – Interactive REPL¶
prompt-toolkit-based interactive session
Command history persistence
Live settings adjustment via slash commands:
/index, /topk, /threshold, /temperature
/max-tokens, /ctx-size, /prompt, /model
/config, /rc, /help, /exit, /quit
config.py – Configuration¶
Pydantic v2 models with validation
Config snapshot save/load for index reproducibility
RAM-aware worker auto-detection
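RAM-aware worker detection can be thought of as taking the smaller of two limits: the CPU count and the number of workers that fit in memory. The sketch below is a hypothetical heuristic. The per-worker budget, the reserved headroom, and the function name are assumptions, and the available RAM is passed in rather than measured so the example stays standard-library only.

```python
import os
from typing import Optional


def auto_workers(available_ram_gb: float,
                 ram_per_worker_gb: float = 2.0,
                 reserve_gb: float = 1.0,
                 cpu_count: Optional[int] = None) -> int:
    """Pick a worker count bounded by both CPU cores and usable RAM."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    usable = max(available_ram_gb - reserve_gb, 0.0)  # keep headroom for the OS
    by_ram = int(usable // ram_per_worker_gb)
    return max(1, min(cpus, by_ram))  # always run at least one worker
```

On a machine with 8 cores and 16 GB free, these assumed numbers give 7 workers; with only 2 GB free the result falls back to a single worker.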
Key Design Decisions¶
Deterministic Hashing¶
Every PDF is identified by its SHA-256 hash. Re-running paperrag index processes only new or changed files, which keeps incremental indexing fast.
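The hash comparison behind incremental indexing can be sketched as follows. `file_sha256` uses the standard `hashlib` API; `plan_incremental` and the shape of `known_hashes` (path mapped to the hash stored at the last index run) are assumptions made for illustration.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def file_sha256(path: str, block_size: int = 1 << 20) -> str:
    """Hash a file in blocks so large PDFs are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def plan_incremental(pdf_paths: Iterable[str],
                     known_hashes: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (paths to process, paths to remove from the index)."""
    current = {p: file_sha256(p) for p in pdf_paths}
    # New files have no stored hash; changed files have a different one.
    to_process = [p for p, h in current.items() if known_hashes.get(p) != h]
    # Files that were indexed before but no longer exist on disk.
    to_remove = [p for p in known_hashes if p not in current]
    return to_process, to_remove
```

Unchanged files produce the same digest as before and therefore appear in neither list.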
Adaptive OCR¶
Rather than a global OCR toggle, PaperRAG inspects each PDF individually using PyMuPDF. This gives optimal speed on mixed collections containing both text-based and scanned documents.
Crash Recovery¶
The index is checkpointed after every batch during indexing. If the process crashes (e.g. OOM), restarting the same command resumes from where it left off.
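The resume behaviour can be illustrated with a small checkpoint loop. This is a sketch under stated assumptions, not PaperRAG's implementation: the JSON checkpoint file, the `process_batch` callback, and the function names are all hypothetical, and the checkpoint is swapped in with `os.replace` for brevity.

```python
import json
import os
from typing import Callable, List, Set


def load_done(checkpoint_path: str) -> Set[str]:
    """Read the set of already processed paths, or an empty set on first run."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path, "r", encoding="utf-8") as fh:
        return set(json.load(fh))


def index_in_batches(paths: List[str],
                     process_batch: Callable[[List[str]], None],
                     checkpoint_path: str,
                     batch_size: int = 8) -> Set[str]:
    """Process paths in batches, persisting progress after each batch."""
    done = load_done(checkpoint_path)
    todo = [p for p in paths if p not in done]  # skip work finished earlier
    for i in range(0, len(todo), batch_size):
        batch = todo[i:i + batch_size]
        process_batch(batch)  # may raise; earlier batches stay recorded
        done.update(batch)
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump(sorted(done), fh)
        os.replace(tmp, checkpoint_path)  # swap in the new checkpoint
    return done
```

If the process dies during the third batch, the checkpoint still lists the first two, and calling `index_in_batches` again with the same arguments starts at the third.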
Spawn Multiprocessing¶
PaperRAG forces the spawn multiprocessing start method to avoid deadlocks caused by forking processes that use PyTorch, CUDA, or OpenMP.
Design Principles¶
These principles guide decisions about UX, API shape, and feature scope.
Progressive Disclosure¶
Default behaviour is minimal; complexity surfaces only when the user needs it. Running paperrag review paper.pdf requires zero flags. The /focus, /topk, and /config commands exist for users who want more control, but they are never required. New features should follow the same pattern: sensible default first, opt-in complexity second.
Convention over Configuration¶
The index location auto-derives from the input path (<input-dir>/.paperrag-index). Workers auto-detect from available RAM. The LLM backend is inferred from the model name format (Ollama name vs. .gguf path vs. HuggingFace repo ID). .paperragrc files are optional overrides, not required setup. A new user should be able to run a useful command without reading the configuration docs.
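Two of these conventions are easy to sketch. The index path rule comes straight from the text above; the backend inference is a deliberately simple heuristic written for illustration, and the real detection may be stricter. The function names and the example repo ID are hypothetical.

```python
from pathlib import Path


def default_index_dir(input_dir: str) -> Path:
    """Derive the index location from the input directory."""
    return Path(input_dir) / ".paperrag-index"


def infer_backend(model: str) -> str:
    """Guess the LLM backend from the model string (assumed heuristic)."""
    if model.lower().endswith(".gguf"):
        return "llama.cpp"  # local GGUF file served by llama-server
    if "/" in model:
        return "llama.cpp"  # looks like an "org/name" HuggingFace repo ID
    return "ollama"         # plain name such as "qwen2.5:1.5b"
```

Under this heuristic, `infer_backend("qwen2.5:1.5b")` selects Ollama, while a path ending in `.gguf` or a string shaped like `org/name` selects llama.cpp.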
Incremental by Default¶
Re-running any command is always safe and cheap. SHA-256 hashes mean unchanged PDFs are skipped automatically during indexing, so no --skip-cached flag is needed. This principle extends to index saves (atomic writes via .tmp + move) and REPL state (re-indexing only resets what changed).
Ownership of Output¶
Each layer is responsible for its own console output. _handle_index prints its own progress; the review command does not pre-announce what _handle_index is about to say. Functions that call other functions do not narrate on their behalf. This keeps output coherent and prevents duplicate or contradictory messages as the codebase grows.
Local-First, No Cloud Dependencies¶
Everything runs on-device: FAISS (embedded library, no server), Ollama or llama-server (local inference), sentence-transformers (downloaded once, cached). The design treats offline use as the default, not an edge case. Features that require external services should be clearly opt-in.
Atomic Persistence¶
The vector store writes to a .tmp file and then uses shutil.move() to replace the live index. A crash mid-save leaves the previous index intact. The same principle applies to batch checkpointing during indexing: partial progress is always recoverable.
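The write-then-move pattern can be sketched in a few lines. The helper name `atomic_write_bytes` is hypothetical, and the explicit `fsync` is an extra precaution added here rather than something the text above states. On POSIX systems, when the temporary file and the target sit on the same filesystem, `shutil.move()` performs a rename, which replaces the target in a single step.

```python
import os
import shutil


def atomic_write_bytes(path: str, data: bytes) -> None:
    """Write `data` to `path` without ever exposing a half-written file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())  # push the bytes to disk before the swap
    # Until this line runs, the previous file at `path` is untouched.
    shutil.move(tmp_path, path)
```

A crash before the move leaves only a stray `.tmp` file next to the intact previous index, which the next successful save overwrites.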