On-Premises RAG System
Reference architecture for self-hosted document intelligence with full data sovereignty
Last updated: 5 February 2026
A reference architecture for deploying retrieval-augmented generation (RAG) systems entirely within your organization's infrastructure, ensuring complete data sovereignty.
Overview
This architecture enables organizations to deploy AI-powered document intelligence without sending any data to external services. All inference, embedding generation, and vector storage occur on-premises, making it suitable for:
- Regulated industries (finance, healthcare, government)
- Air-gapped environments (defense, critical infrastructure)
- Data residency requirements (GDPR, data localization laws)
Architecture Diagram
```
┌────────────────────────────────────────────────────────────────┐
│                      On-Premises Network                       │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │    Ollama    │  │    Qdrant    │  │     AnythingLLM      │  │
│  │ (Inference)  │  │  (Vectors)   │  │   (UI + Pipeline)    │  │
│  └──────────────┘  └──────────────┘  └──────────────────────┘  │
│         ↑                 ↑                     ↑              │
│         └─────────────────┼─────────────────────┘              │
│                           │                                    │
│                     ┌─────┴─────┐                              │
│                     │   Users   │                              │
│                     └───────────┘                              │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```
Components
| Component | Role | Recommended Version |
|---|---|---|
| Ollama | Local LLM inference | Latest stable |
| Qdrant | Vector database | 1.x+ |
| AnythingLLM | RAG pipeline + UI | Latest stable |
| Traefik (optional) | Reverse proxy | 3.x |
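Once the three services are running, a quick smoke test is to hit each one's HTTP endpoint from inside the network. The sketch below assumes the default ports (Ollama 11434, Qdrant 6333) and AnythingLLM's common default of 3001, all on localhost; adjust hosts and ports to match your deployment.

```python
"""Minimal post-deployment smoke test for the three services.

Assumes default ports (Ollama 11434, Qdrant 6333, AnythingLLM 3001)
and that the script runs on a host inside the same network.
"""
import requests

SERVICES = {
    "Ollama":      "http://localhost:11434/api/tags",    # lists locally pulled models
    "Qdrant":      "http://localhost:6333/collections",  # lists vector collections
    "AnythingLLM": "http://localhost:3001/",             # 3001 is the common default port
}

for name, url in SERVICES.items():
    try:
        resp = requests.get(url, timeout=5)
        status = "OK" if resp.status_code == 200 else f"HTTP {resp.status_code}"
    except requests.RequestException as exc:
        status = f"UNREACHABLE ({exc.__class__.__name__})"
    print(f"{name:12s} {url:45s} {status}")
```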
Security Boundaries
All data remains within your infrastructure:
- Documents never leave the network perimeter
- Embeddings generated locally via Ollama
- Vectors stored in on-premises Qdrant
- Queries processed entirely locally
- Audit logs retained on your infrastructure
Data Flow
1. Document Upload - Users upload documents via a secure internal interface
2. Chunking - Documents are split into semantic chunks (~500 tokens)
3. Embedding - Ollama generates embeddings locally (CPU or GPU)
4. Storage - Vectors are stored in Qdrant with metadata
5. Query - The user's question is embedded and similar chunks are retrieved
6. Generation - Ollama generates a response using the retrieved context
7. Response - The answer is returned with source citations (steps 2-6 are sketched below)
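For orientation, here is a minimal sketch of steps 2-6 against the raw HTTP APIs, using Python with requests and qdrant-client. The model names, chunk size, file name, and collection name are illustrative assumptions; in this architecture AnythingLLM performs this orchestration for you.

```python
"""Sketch of the embed -> store -> retrieve -> generate loop.

Illustrative only: AnythingLLM handles this pipeline in production.
Assumes Ollama on :11434 and Qdrant on :6333; model and file names are examples.
"""
import requests
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

OLLAMA = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"   # any locally pulled embedding model works
CHAT_MODEL = "llama3:8b"
COLLECTION = "documents"

qdrant = QdrantClient(url="http://localhost:6333")

def embed(text: str) -> list[float]:
    # Step 3: generate the embedding locally via Ollama
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": EMBED_MODEL, "prompt": text}, timeout=60)
    return r.json()["embedding"]

# Steps 2-4: chunk a document (naively, ~500 tokens is roughly 2,000 characters)
# and store the vectors with their source text as payload
document = open("policy.txt", encoding="utf-8").read()
chunks = [document[i:i + 2000] for i in range(0, len(document), 2000)]

qdrant.recreate_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=len(embed("probe")), distance=Distance.COSINE),
)
qdrant.upsert(COLLECTION, points=[
    PointStruct(id=i, vector=embed(c), payload={"text": c, "source": "policy.txt"})
    for i, c in enumerate(chunks)
])

# Steps 5-6: embed the question, retrieve similar chunks, generate an answer
question = "What is the retention period for customer records?"
hits = qdrant.search(COLLECTION, query_vector=embed(question), limit=4)
context = "\n\n".join(h.payload["text"] for h in hits)

answer = requests.post(f"{OLLAMA}/api/generate", json={
    "model": CHAT_MODEL,
    "prompt": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
    "stream": False,
}, timeout=300).json()["response"]
print(answer)
```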
Deployment Options
| Environment | Stack | Notes |
|---|---|---|
| Air-gapped | Ollama + Qdrant + AnythingLLM | No network access needed post-setup |
| Corporate VPC | Same stack + Traefik | TLS termination at reverse proxy |
| Hybrid | Cloud embeddings, local storage | Faster embeddings, but document text leaves the network for embedding, weakening data sovereignty |
Hardware Requirements
Minimum (CPU Inference)
- RAM: 32GB
- CPU: 8 cores
- Storage: 500GB SSD
- Model: Llama 3 8B or Mistral 7B
Suitable for small workloads (<100 documents, light query load).
Recommended (GPU Inference)
- RAM: 64GB
- CPU: 16 cores
- GPU: NVIDIA RTX 4090 (24GB VRAM)
- Storage: 1TB NVMe
- Model: Llama 3 70B or Mixtral 8x7B
Supports larger document sets (1000+ documents) and concurrent users.
Enterprise (Multi-GPU)
- RAM: 128GB+
- CPU: 32+ cores
- GPU: 2-4x NVIDIA A100 (80GB each)
- Storage: 4TB+ NVMe RAID
- Model: Full-size models, quantized as needed to fit VRAM
Production workloads with thousands of users.
Cost Estimate
| Item | One-Time | Monthly |
|---|---|---|
| Server hardware (recommended tier) | $10,000-15,000 | - |
| NVIDIA RTX 4090 | $2,000 | - |
| Electricity (~400W continuous) | - | $50-100 |
| Software licenses | $0 | $0 |
| Total | ~$12,000-17,000 | ~$50-100 |
All software components are open source with no licensing fees.
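For context on the electricity line: ~400 W continuous is roughly 290 kWh per month (0.4 kW × 730 h), so the $50-100 range corresponds to an assumed rate of about $0.17-0.34 per kWh; adjust for your local tariff.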
Implementation Checklist
- Provision hardware meeting minimum requirements
- Install Ubuntu Server LTS or RHEL 9
- Deploy Ollama and pull required models
- Deploy Qdrant with persistent storage
- Deploy AnythingLLM with Ollama + Qdrant configured
- Configure TLS if exposing the services to the internal network (e.g. terminate at Traefik)
- Set up backup procedures for the vector database (see the snapshot sketch after this list)
- Document runbook for operations team
- Train users on interface and best practices
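For the vector-database backup item, Qdrant's snapshot API is one workable approach. A minimal sketch, assuming Qdrant on localhost:6333 and an illustrative collection name of "documents":

```python
"""Create a point-in-time Qdrant snapshot for off-box backup.

A minimal sketch: assumes Qdrant on localhost:6333 and a collection named
"documents". Schedule it (cron / systemd timer) and copy the snapshot file
to your normal backup target.
"""
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

# Create a snapshot of the collection on the Qdrant server
snapshot = client.create_snapshot(collection_name="documents")
print(f"Created snapshot: {snapshot.name}")

# List existing snapshots to confirm; the file itself can be fetched via
# GET /collections/documents/snapshots/<name> or from Qdrant's storage volume
for s in client.list_snapshots(collection_name="documents"):
    print(s.name, s.creation_time, s.size)
```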
Next Steps
Ready to deploy a RAG system for your organization?
Book a Technical Scoping Call to discuss your specific requirements, hardware constraints, and implementation timeline.