Architecture

Pipeline Overview

PaperRAG processes academic PDFs through a six-stage pipeline:

PDF Files
  |
  v
[Parse] -- Docling extraction with adaptive OCR
  |
  v
[Chunk] -- Section-aware text splitting
  |
  v
[Embed] -- Sentence-transformer vectorization
  |
  v
[Store] -- FAISS index with metadata
  |
  v
[Retrieve] -- Similarity search + re-ranking
  |
  v
[Generate] -- LLM answer with citations

Module Responsibilities

parser.py – PDF Parsing

  • Uses Docling for structured document parsing

  • Adaptive OCR: inspects each PDF with PyMuPDF to detect whether it contains extractable text. Text-based PDFs skip OCR entirely (a 2-3x speedup), and OCR is enabled automatically for scanned PDFs.

  • Supports CSV manifests for pre-supplied metadata (title, authors, abstract, DOI)

  • Falls back to raw text extraction on parse failure
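
The adaptive OCR check above can be sketched roughly as follows. This is a minimal illustration, not PaperRAG's actual code: the function names and both thresholds are hypothetical, and the source only states that PyMuPDF is used to detect extractable text. PyMuPDF is imported lazily so the decision logic itself has no third-party dependency.

```python
from typing import Iterable, List


def extract_page_texts(pdf_path: str) -> List[str]:
    """Return the embedded (non-OCR) text of each page using PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily; only needed when reading a real PDF

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def needs_ocr(page_texts: Iterable[str],
              min_chars_per_page: int = 50,        # hypothetical threshold
              min_text_page_ratio: float = 0.5     # hypothetical threshold
              ) -> bool:
    """Guess whether a PDF is scanned, based on how many pages carry real text."""
    pages = list(page_texts)
    if not pages:
        return True  # nothing extractable at all: treat as scanned
    text_pages = sum(1 for t in pages if len(t.strip()) >= min_chars_per_page)
    return text_pages / len(pages) < min_text_page_ratio
```

A caller would pass `extract_page_texts(path)` into `needs_ocr` and enable OCR in the parser only when it returns `True`.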

chunker.py – Text Chunking

  • Section-aware splitting that respects document structure

  • Configurable chunk size (default: 1000 chars) and overlap (default: 200 chars)

  • Preserves metadata (file path, section title) on each chunk
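
As an illustration of section-aware splitting with the default size and overlap, here is a small sketch. The `Chunk` class and `chunk_sections` function are hypothetical names; the sketch assumes sections arrive as (title, text) pairs and splits on raw character offsets, which is simpler than whatever boundary handling the real chunker may apply.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Chunk:
    text: str
    file_path: str       # metadata carried on every chunk
    section_title: str   # metadata carried on every chunk


def chunk_sections(file_path: str,
                   sections: Iterable[Tuple[str, str]],
                   chunk_size: int = 1000,
                   overlap: int = 200) -> List[Chunk]:
    """Split each section separately so no chunk spans two sections."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[Chunk] = []
    for title, body in sections:
        start = 0
        while start < len(body):
            chunks.append(Chunk(body[start:start + chunk_size], file_path, title))
            if start + chunk_size >= len(body):
                break  # this window already reached the end of the section
            start += step  # consecutive windows share `overlap` characters
    return chunks
```

With the defaults, a 1500-character section yields two chunks (characters 0-999 and 800-1499), and the following section starts a fresh chunk of its own.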

embedder.py – Embedding

  • Uses sentence-transformers/all-MiniLM-L6-v2 by default

  • Batch embedding with configurable batch size

  • L2-normalizes embeddings so that the inner product of two vectors equals their cosine similarity

  • Deterministic seeding for reproducibility
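
The reason for normalizing can be shown in a few lines of plain Python: once vectors have unit length, a plain dot product gives the same value as cosine similarity. The helper names below are illustrative only, and a real embedder would operate on arrays rather than lists.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
```

For any two non-zero vectors `a` and `b`, `dot(l2_normalize(a), l2_normalize(b))` agrees with `cosine(a, b)` up to floating-point error.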

vectorstore.py – Vector Storage

  • FAISS-based vector index

  • Tracks file hashes (SHA-256) for incremental indexing

  • Versioned index with config snapshot persistence

  • Supports per-file removal for handling deleted/updated PDFs

retriever.py – Retrieval

  • Top-k similarity search with configurable score threshold

  • Optional Maximal Marginal Relevance (MMR) for result diversity

  • Per-paper result limiting to avoid over-representation
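
A compact sketch of the two post-processing steps named above, MMR and per-paper limiting. It assumes unit-length vectors (so a dot product is a cosine similarity) and uses hypothetical function names and a hypothetical `lambda_mult` default; the retriever's real parameters may differ.

```python
from typing import Dict, List, Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mmr_select(query_vec: Sequence[float],
               doc_vecs: Sequence[Sequence[float]],
               k: int,
               lambda_mult: float = 0.5) -> List[int]:
    """Greedy MMR: trade relevance to the query against similarity to picks so far."""
    candidates = list(range(len(doc_vecs)))
    selected: List[int] = []
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = _dot(query_vec, doc_vecs[i])
            redundancy = max((_dot(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1.0 - lambda_mult) * redundancy
        best = max(candidates, key=score)  # ties go to the earliest candidate
        selected.append(best)
        candidates.remove(best)
    return selected


def limit_per_paper(ranked_paper_ids: Sequence[str], max_per_paper: int) -> List[int]:
    """Keep at most `max_per_paper` results per paper, preserving rank order."""
    counts: Dict[str, int] = {}
    kept: List[int] = []
    for idx, paper_id in enumerate(ranked_paper_ids):
        if counts.get(paper_id, 0) < max_per_paper:
            kept.append(idx)
            counts[paper_id] = counts.get(paper_id, 0) + 1
    return kept
```

Setting `lambda_mult` to 1.0 reduces MMR to plain relevance ranking; lower values penalize near-duplicate chunks more heavily.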

llm.py – LLM Integration

  • Supports two local backends:

    • Ollama for standard model names such as qwen2.5:1.5b

    • llama.cpp via llama-server for local .gguf files and HuggingFace GGUF repos

  • Streaming responses with citation support

  • Descriptive error messages for common LLM failures

parallel.py – Parallel Indexing

  • Multiprocessing-based parallel PDF processing

  • Uses spawn start method (avoids deadlocks with PyTorch/CUDA)

  • Per-PDF timeout for hanging documents

  • Batch checkpointing for crash recovery

cli.py – CLI

  • Typer-based command-line interface

  • Commands: index, review, query, evaluate, plus the default REPL entrypoint

  • Rich console output with progress bars (tqdm)

repl.py – Interactive REPL

  • prompt-toolkit-based interactive session

  • Command history persistence

  • Live settings adjustment via slash commands:

    • /index, /topk, /threshold, /temperature

    • /max-tokens, /ctx-size, /prompt, /model

    • /config, /rc, /help, /exit, /quit

config.py – Configuration

  • Pydantic v2 models with validation

  • Config snapshot save/load for index reproducibility

  • RAM-aware worker auto-detection
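
RAM-aware worker detection could look something like the sketch below. The function name, the per-worker memory budget, and the cap are all assumptions made for illustration; the source does not say how PaperRAG measures available RAM or what budget it uses, so the RAM figure is passed in as a plain argument here.

```python
def auto_workers(available_ram_gb: float,
                 cpu_count: int,
                 ram_per_worker_gb: float = 2.0,  # hypothetical per-worker budget
                 max_workers: int = 8             # hypothetical upper bound
                 ) -> int:
    """Pick a worker count limited by RAM, CPU cores, and a fixed cap."""
    by_ram = int(available_ram_gb // ram_per_worker_gb)
    # Always return at least one worker, even on very small machines.
    return max(1, min(by_ram, cpu_count, max_workers))
```

The point of bounding by RAM rather than by core count alone is that each parsing worker loads its own models, so memory tends to run out before CPU does.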

Key Design Decisions

Deterministic Hashing

Every PDF is identified by its SHA-256 hash. Re-running paperrag index processes only new or changed files, which keeps incremental indexing fast.
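
A minimal sketch of hash-based change detection, assuming the store keeps a mapping from file path to the SHA-256 digest recorded at the last index run. The function names and the shape of that mapping are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def sha256_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large PDFs are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def plan_incremental(pdf_paths: Iterable[Path],
                     known_hashes: Dict[str, str]) -> Tuple[List[Path], List[str]]:
    """Return (files to (re)process, previously indexed paths that no longer exist)."""
    current = {str(p): sha256_of_file(p) for p in pdf_paths}
    to_process = [Path(p) for p, h in current.items() if known_hashes.get(p) != h]
    to_remove = [p for p in known_hashes if p not in current]
    return to_process, to_remove
```

Unchanged files appear in neither list, so a second run over an untouched directory has nothing to parse or embed.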

Adaptive OCR

Rather than a global OCR toggle, PaperRAG inspects each PDF individually using PyMuPDF. This gives optimal speed on mixed collections containing both text-based and scanned documents.

Crash Recovery

The index is checkpointed after every batch during indexing. If the process crashes (for example, by running out of memory), re-running the same command resumes from the last completed batch.
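
The checkpoint-and-resume pattern can be sketched as below. This is an illustration under simplifying assumptions: the checkpoint is a JSON list of finished item names, and `index_with_checkpoints` and its parameters are hypothetical, whereas the real implementation checkpoints the index itself.

```python
import json
from pathlib import Path
from typing import Callable, List, Sequence


def index_with_checkpoints(items: Sequence[str],
                           process: Callable[[str], None],
                           checkpoint: Path,
                           batch_size: int = 2) -> List[str]:
    """Process items in batches, recording finished items after each batch."""
    done: List[str] = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    done_set = set(done)
    pending = [item for item in items if item not in done_set]  # skip finished work
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for item in batch:
            process(item)  # a crash here loses at most the current batch
        done.extend(batch)
        checkpoint.write_text(json.dumps(done))
    return done
```

If `process` raises partway through a batch, the checkpoint still lists every earlier batch, so the next call redoes only the interrupted batch and whatever follows it.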

Spawn Multiprocessing

PaperRAG forces the spawn multiprocessing start method to avoid deadlocks caused by forking processes that use PyTorch, CUDA, or OpenMP.
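
For illustration, the sketch below uses a spawn context together with a per-task timeout. It is one possible shape, not the project's code: it obtains the start method through `multiprocessing.get_context` rather than setting it globally, runs one process per task instead of a pool, and the function names are hypothetical.

```python
import multiprocessing as mp
from typing import Any, Callable


def spawn_context():
    """Return a context whose child processes start fresh instead of forking."""
    return mp.get_context("spawn")


def run_with_timeout(target: Callable[[Any], None], arg: Any, timeout_s: float) -> bool:
    """Run target(arg) in a spawned process; terminate it if it exceeds timeout_s.

    Returns True only if the child finished in time with exit code 0.
    """
    proc = spawn_context().Process(target=target, args=(arg,))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():
        proc.terminate()  # the task is hanging: stop it
        proc.join()
        return False
    return proc.exitcode == 0
```

Because spawned children re-import the main module, `run_with_timeout` should be called from under an `if __name__ == "__main__":` guard, and `target` must be a top-level function that can be pickled.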

Design Principles

These principles guide decisions about UX, API shape, and feature scope.

Progressive Disclosure

Default behaviour is minimal; complexity surfaces only when the user needs it. Running paperrag review paper.pdf requires zero flags. The /focus, /topk, and /config commands exist for users who want more control, but they are never required. New features should follow the same pattern: sensible default first, opt-in complexity second.

Convention over Configuration

The index location auto-derives from the input path (<input-dir>/.paperrag-index). Workers auto-detect from available RAM. The LLM backend is inferred from the model name format (Ollama name vs. .gguf path vs. HuggingFace repo ID). .paperragrc files are optional overrides, not required setup. A new user should be able to run a useful command without reading the configuration docs.
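
Two of these conventions can be sketched in a few lines. The index path rule comes straight from the text above; the backend inference rule is an assumed heuristic (the source names the three model-name formats but not the exact test), and both function names and the returned labels are hypothetical.

```python
from pathlib import Path


def default_index_dir(input_dir: str) -> Path:
    """Derive the index location from the input directory."""
    return Path(input_dir) / ".paperrag-index"


def infer_backend(model: str) -> str:
    """Guess the LLM backend from the model string (assumed heuristic)."""
    if model.lower().endswith(".gguf"):
        return "llama.cpp (local file)"        # path to a local .gguf file
    if "/" in model:
        return "llama.cpp (HuggingFace repo)"  # looks like an owner/repo ID
    return "ollama"                            # plain name such as qwen2.5:1.5b
```

A heuristic like this keeps the common case flag-free, at the cost of needing an explicit override for model names that do not fit any of the three patterns.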

Incremental by Default

Re-running any command is always safe and cheap. SHA-256 hashes mean unchanged PDFs are skipped automatically during indexing, with no --skip-cached flag needed. This principle extends to index saves (atomic writes via .tmp + move) and REPL state (re-indexing only resets what changed).

Ownership of Output

Each layer is responsible for its own console output. _handle_index prints its own progress; the review command does not pre-announce what _handle_index is about to say. Functions that call other functions do not narrate on their behalf. This keeps output coherent and prevents duplicate or contradictory messages as the codebase grows.

Local-First, No Cloud Dependencies

Everything runs on-device: FAISS (embedded library, no server), Ollama or llama-server (local inference), sentence-transformers (downloaded once, cached). The design treats offline use as the default, not an edge case. Features that require external services should be clearly opt-in.

Atomic Persistence

The vector store writes to a .tmp file and then uses shutil.move() to replace the live index. A crash mid-save leaves the previous index intact. The same principle applies to batch checkpointing during indexing: partial progress is always recoverable.
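
The write-then-move pattern described above can be sketched as follows. The function name is hypothetical and the payload is plain bytes; the real store writes a FAISS index plus metadata.

```python
import shutil
from pathlib import Path


def atomic_write_bytes(target: Path, payload: bytes) -> None:
    """Write to a sibling .tmp file, then move it over the live file."""
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_bytes(payload)            # a crash here leaves `target` untouched
    shutil.move(str(tmp), str(target))  # replaces the previous version in one step
```

Keeping the .tmp file in the same directory as the target matters: when both are on the same filesystem, `shutil.move` can use a rename, which on POSIX systems replaces the destination atomically. Across filesystems, or on Windows when the destination already exists, it falls back to copy-and-delete, which is not atomic.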