On-Premises RAG System

Reference architecture for self-hosted document intelligence with full data sovereignty

Last updated: 5 February 2026

A reference architecture for deploying retrieval-augmented generation (RAG) systems entirely within your organization's infrastructure, ensuring complete data sovereignty.

Overview

This architecture enables organizations to deploy AI-powered document intelligence without sending any data to external services. All inference, embedding generation, and vector storage occurs on-premises, making it suitable for:

  • Regulated industries (finance, healthcare, government)
  • Air-gapped environments (defense, critical infrastructure)
  • Data residency requirements (GDPR, data localization laws)

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│                    On-Premises Network                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │   Ollama     │  │   Qdrant     │  │    AnythingLLM      │  │
│  │  (Inference) │  │  (Vectors)   │  │   (UI + Pipeline)   │  │
│  └──────────────┘  └──────────────┘  └──────────────────────┘  │
│         ↑                ↑                      ↑               │
│         └────────────────┴──────────────────────┘               │
│                          │                                       │
│                    ┌─────┴─────┐                                │
│                    │   Users   │                                │
│                    └───────────┘                                │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Components

Component            Role                 Recommended Version
-------------------  -------------------  -------------------
Ollama               Local LLM inference  Latest stable
Qdrant               Vector database      1.x+
AnythingLLM          RAG pipeline + UI    Latest stable
Traefik (optional)   Reverse proxy        3.x
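
The three core components can be brought up together with a container orchestrator. The sketch below is a minimal Docker Compose file; the image names, ports, and volume paths reflect each project's published defaults, but treat them as assumptions to verify against current documentation (and add GPU device mappings where applicable).

```yaml
# Minimal three-service stack; images, ports, and volume paths are
# assumed defaults -- verify before deploying.
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama_models:/root/.ollama
    ports:
      - "11434:11434"
  qdrant:
    image: qdrant/qdrant
    volumes:
      - qdrant_storage:/qdrant/storage
    ports:
      - "6333:6333"
  anythingllm:
    image: mintplexlabs/anythingllm
    depends_on: [ollama, qdrant]
    ports:
      - "3001:3001"
volumes:
  ollama_models:
  qdrant_storage:
```

Named volumes keep models and vectors on persistent storage across container restarts, which matters for the backup procedures in the checklist below.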

Security Boundaries

All data remains within your infrastructure:

  • Documents never leave the network perimeter
  • Embeddings generated locally via Ollama
  • Vectors stored in on-premises Qdrant
  • Queries processed entirely locally
  • Audit logs retained on your infrastructure
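
As a concrete illustration of the second point, embedding generation is a call to the host-local Ollama HTTP API, so nothing in the request path crosses the network perimeter. This is a minimal sketch: the route and payload shape follow Ollama's /api/embeddings API, and the model name is an assumption about what you have pulled locally.

```python
import json
from urllib.request import Request, urlopen

# Everything resolves to the local host: the request, the model, and the
# response never leave the machine. Model name is an assumed example.
OLLAMA_URL = "http://localhost:11434/api/embeddings"
EMBED_MODEL = "nomic-embed-text"  # assumed locally pulled embedding model

def embed_locally(text: str) -> list[float]:
    """Ask the on-host Ollama instance for an embedding vector."""
    body = json.dumps({"model": EMBED_MODEL, "prompt": text}).encode()
    req = Request(OLLAMA_URL, data=body,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)["embedding"]

if __name__ == "__main__":
    print(len(embed_locally("quarterly compliance report")))
```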

Data Flow

  1. Document Upload - Users upload via secure internal interface
  2. Chunking - Documents split into semantic chunks (~500 tokens)
  3. Embedding - Ollama generates embeddings locally (CPU or GPU)
  4. Storage - Vectors stored in Qdrant with metadata
  5. Query - User question embedded, similar chunks retrieved
  6. Generation - Ollama generates response with retrieved context
  7. Response - Answer returned with source citations
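
Step 2 can be sketched as a simple sliding-window splitter. This is an illustrative approximation, not AnythingLLM's actual chunker: it counts words as a rough proxy for tokens, and the 50-word overlap is an assumed value.

```python
# Illustrative sliding-window chunker (step 2). Word count stands in for
# token count; max_tokens=500 matches the ~500-token target above, and
# the overlap is an assumed value, not AnythingLLM's default.
def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of at most max_tokens words."""
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # last window already reached the end of the document
    return chunks
```

With a 1,200-word document this yields three chunks: two full 500-word windows and a 300-word tail, each sharing 50 words with its predecessor.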

Deployment Options

Environment     Stack                            Notes
--------------  -------------------------------  ----------------------------------------------------------
Air-gapped      Ollama + Qdrant + AnythingLLM    No network access needed post-setup
Corporate VPC   Same stack + Traefik             TLS termination at the reverse proxy
Hybrid          Cloud embeddings, local storage  Compromise: faster embeddings, but document text leaves
                                                 the perimeter for embedding; vectors stay local
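
For the Corporate VPC option, TLS termination at Traefik might look like the following dynamic-configuration fragment. The hostname, certificate paths, and backend address are placeholders for illustration, not defaults of any component.

```yaml
# Hypothetical Traefik (v3) dynamic configuration for TLS termination
# in front of the AnythingLLM UI; all names and paths are placeholders.
http:
  routers:
    rag-ui:
      rule: "Host(`rag.internal.example.com`)"
      service: rag-ui
      tls: {}
  services:
    rag-ui:
      loadBalancer:
        servers:
          - url: "http://anythingllm:3001"
tls:
  certificates:
    - certFile: /certs/rag.internal.example.com.crt
      keyFile: /certs/rag.internal.example.com.key
```

Traffic is encrypted from the user's browser to Traefik, then proxied over the internal network to AnythingLLM; inference and vector traffic never touch the proxy.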

Hardware Requirements

Minimum (CPU Inference)

  • RAM: 32GB
  • CPU: 8 cores
  • Storage: 500GB SSD
  • Model: Llama 3 8B or Mistral 7B

Suitable for small workloads (<100 documents, light query load).

Recommended (Single GPU)

  • RAM: 64GB
  • CPU: 16 cores
  • GPU: NVIDIA RTX 4090 (24GB VRAM)
  • Storage: 1TB NVMe
  • Model: Llama 3 70B or Mixtral 8x7B

Supports larger document sets (1000+ documents) and concurrent users.

Enterprise (Multi-GPU)

  • RAM: 128GB+
  • CPU: 32+ cores
  • GPU: 2-4x NVIDIA A100 (80GB each)
  • Storage: 4TB+ NVMe RAID
  • Model: Full-size models with quantization

Production workloads with thousands of users.

Cost Estimate

Item                                One-Time         Monthly
----------------------------------  ---------------  --------
Server hardware (recommended tier)  $10,000-15,000   -
NVIDIA RTX 4090                     $2,000           -
Electricity (~400W continuous)      -                $50-100
Software licenses                   $0               $0
Total                               ~$12,000-17,000  ~$50-100

All software components are open source with no licensing fees.

Implementation Checklist

  • Provision hardware meeting minimum requirements
  • Install Ubuntu Server LTS or RHEL 9
  • Deploy Ollama and pull required models
  • Deploy Qdrant with persistent storage
  • Deploy AnythingLLM with Ollama + Qdrant configured
  • Configure TLS if exposing to internal network
  • Set up backup procedures for vector database
  • Document runbook for operations team
  • Train users on interface and best practices
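
A simple smoke test helps close out the checklist: probe each service's HTTP endpoint from the host. The ports and paths below are assumed defaults (the AnythingLLM ping route in particular is a guess) and should be adjusted to your actual install.

```python
from urllib.request import urlopen
from urllib.error import URLError

# Assumed default ports and health routes for each service; verify
# against your deployment before relying on this in a runbook.
SERVICES = {
    "ollama": "http://localhost:11434/api/tags",      # lists pulled models
    "qdrant": "http://localhost:6333/healthz",        # liveness endpoint
    "anythingllm": "http://localhost:3001/api/ping",  # assumed ping route
}

def check_stack(timeout: float = 3.0) -> dict[str, bool]:
    """Probe each component over HTTP and report which ones respond."""
    status = {}
    for name, url in SERVICES.items():
        try:
            with urlopen(url, timeout=timeout) as resp:
                status[name] = resp.status == 200
        except (URLError, OSError):
            status[name] = False
    return status

if __name__ == "__main__":
    for name, up in check_stack().items():
        print(f"{name}: {'up' if up else 'DOWN'}")
```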

Next Steps

Ready to deploy a RAG system for your organization?

Book a Technical Scoping Call to discuss your specific requirements, hardware constraints, and implementation timeline.
