On-Premises RAG System
Reference architecture for self-hosted document intelligence with full data sovereignty
Last updated: 5 February 2026
A reference architecture for deploying retrieval-augmented generation (RAG) systems entirely within your organization's infrastructure, ensuring complete data sovereignty.
Overview
This architecture enables organizations to deploy AI-powered document intelligence without sending any data to external services. All inference, embedding generation, and vector storage occur on-premises, making it suitable for:
- Regulated industries (finance, healthcare, government)
- Air-gapped environments (defense, critical infrastructure)
- Data residency requirements (GDPR, data localization laws)
Architecture Diagram
```
┌────────────────────────────────────────────────────────────────┐
│                      On-Premises Network                       │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │    Ollama    │  │    Qdrant    │  │     AnythingLLM      │  │
│  │ (Inference)  │  │  (Vectors)   │  │   (UI + Pipeline)    │  │
│  └──────────────┘  └──────────────┘  └──────────────────────┘  │
│         ↑                 ↑                     ↑              │
│         └─────────────────┼─────────────────────┘              │
│                           │                                    │
│                     ┌─────┴─────┐                              │
│                     │   Users   │                              │
│                     └───────────┘                              │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```
Components
| Component | Role | Recommended Version |
|---|---|---|
| Ollama | Local LLM inference | Latest stable |
| Qdrant | Vector database | 1.x+ |
| AnythingLLM | RAG pipeline + UI | Latest stable |
| Traefik (optional) | Reverse proxy | 3.x |
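Once the three services are running, a quick smoke test is to hit each one's HTTP endpoint from inside the network. The sketch below assumes the default ports (Ollama 11434, Qdrant 6333) and AnythingLLM's common default of 3001, all on localhost; adjust hosts and ports to match your deployment.

```python
"""Minimal post-deployment smoke test for the three services.

Assumes default ports (Ollama 11434, Qdrant 6333, AnythingLLM 3001)
and that the script runs on a host inside the same network.
"""
import requests

SERVICES = {
    "Ollama":      "http://localhost:11434/api/tags",    # lists locally pulled models
    "Qdrant":      "http://localhost:6333/collections",  # lists vector collections
    "AnythingLLM": "http://localhost:3001/",             # 3001 is the common default port
}

for name, url in SERVICES.items():
    try:
        resp = requests.get(url, timeout=5)
        status = "OK" if resp.status_code == 200 else f"HTTP {resp.status_code}"
    except requests.RequestException as exc:
        status = f"UNREACHABLE ({exc.__class__.__name__})"
    print(f"{name:12s} {url:45s} {status}")
```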
Security Boundaries
All data remains within your infrastructure:
- Documents never leave the network perimeter
- Embeddings generated locally via Ollama
- Vectors stored in on-premises Qdrant
- Queries processed entirely locally
- Audit logs retained on your infrastructure
Data Flow
1. Document Upload - Users upload documents via a secure internal interface
2. Chunking - Documents are split into semantic chunks (~500 tokens)
3. Embedding - Ollama generates embeddings locally (CPU or GPU)
4. Storage - Vectors are stored in Qdrant with metadata
5. Query - The user's question is embedded and similar chunks are retrieved
6. Generation - Ollama generates a response using the retrieved context
7. Response - The answer is returned with source citations (steps 2-6 are sketched below)
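For orientation, here is a minimal sketch of steps 2-6 against the raw HTTP APIs, using Python with requests and qdrant-client. The model names, chunk size, file name, and collection name are illustrative assumptions; in this architecture AnythingLLM performs this orchestration for you.

```python
"""Sketch of the embed -> store -> retrieve -> generate loop.

Illustrative only: AnythingLLM handles this pipeline in production.
Assumes Ollama on :11434 and Qdrant on :6333; model and file names are examples.
"""
import requests
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

OLLAMA = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"   # any locally pulled embedding model works
CHAT_MODEL = "llama3:8b"
COLLECTION = "documents"

qdrant = QdrantClient(url="http://localhost:6333")

def embed(text: str) -> list[float]:
    # Step 3: generate the embedding locally via Ollama
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": EMBED_MODEL, "prompt": text}, timeout=60)
    return r.json()["embedding"]

# Steps 2-4: chunk a document (naively, ~500 tokens is roughly 2,000 characters)
# and store the vectors with their source text as payload
document = open("policy.txt", encoding="utf-8").read()
chunks = [document[i:i + 2000] for i in range(0, len(document), 2000)]

qdrant.recreate_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=len(embed("probe")), distance=Distance.COSINE),
)
qdrant.upsert(COLLECTION, points=[
    PointStruct(id=i, vector=embed(c), payload={"text": c, "source": "policy.txt"})
    for i, c in enumerate(chunks)
])

# Steps 5-6: embed the question, retrieve similar chunks, generate an answer
question = "What is the retention period for customer records?"
hits = qdrant.search(COLLECTION, query_vector=embed(question), limit=4)
context = "\n\n".join(h.payload["text"] for h in hits)

answer = requests.post(f"{OLLAMA}/api/generate", json={
    "model": CHAT_MODEL,
    "prompt": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
    "stream": False,
}, timeout=300).json()["response"]
print(answer)
```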
Deployment Options
| Environment | Stack | Notes |
|---|---|---|
| Air-gapped | Ollama + Qdrant + AnythingLLM | No network access needed post-setup |
| Corporate VPC | Same stack + Traefik | TLS termination at reverse proxy |
| Hybrid | Cloud embeddings, local storage | Faster embeddings, but document text leaves the network for embedding, weakening data sovereignty |
Hardware Requirements
Minimum (CPU Inference)
- RAM: 32GB
- CPU: 8 cores
- Storage: 500GB SSD
- Model: Llama 3 8B or Mistral 7B
Suitable for small workloads (<100 documents, light query load).
Recommended (GPU Inference)
- RAM: 64GB
- CPU: 16 cores
- GPU: NVIDIA RTX 4090 (24GB VRAM)
- Storage: 1TB NVMe
- Model: Llama 3 70B or Mixtral 8x7B
Supports larger document sets (1000+ documents) and concurrent users.
Enterprise (Multi-GPU)
- RAM: 128GB+
- CPU: 32+ cores
- GPU: 2-4x NVIDIA A100 (80GB each)
- Storage: 4TB+ NVMe RAID
- Model: Full-size models, quantized as needed to fit VRAM
Production workloads with thousands of users.
Cost Estimate
| Item | One-Time | Monthly |
|---|---|---|
| Server hardware (recommended tier) | $10,000-15,000 | - |
| NVIDIA RTX 4090 | $2,000 | - |
| Electricity (~400W continuous) | - | $50-100 |
| Software licenses | $0 | $0 |
| Total | ~$12,000-17,000 | ~$50-100 |
All software components are open source with no licensing fees.
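For context on the electricity line: ~400 W continuous is roughly 290 kWh per month (0.4 kW × 730 h), so the $50-100 range corresponds to an assumed rate of about $0.17-0.34 per kWh; adjust for your local tariff.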
Implementation Checklist
- Provision hardware meeting minimum requirements
- Install Ubuntu Server LTS or RHEL 9
- Deploy Ollama and pull required models
- Deploy Qdrant with persistent storage
- Deploy AnythingLLM with Ollama + Qdrant configured
- Configure TLS if exposing the services to the internal network (e.g. terminate at Traefik)
- Set up backup procedures for the vector database (see the snapshot sketch after this list)
- Document runbook for operations team
- Train users on interface and best practices
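For the vector-database backup item, Qdrant's snapshot API is one workable approach. A minimal sketch, assuming Qdrant on localhost:6333 and an illustrative collection name of "documents":

```python
"""Create a point-in-time Qdrant snapshot for off-box backup.

A minimal sketch: assumes Qdrant on localhost:6333 and a collection named
"documents". Schedule it (cron / systemd timer) and copy the snapshot file
to your normal backup target.
"""
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

# Create a snapshot of the collection on the Qdrant server
snapshot = client.create_snapshot(collection_name="documents")
print(f"Created snapshot: {snapshot.name}")

# List existing snapshots to confirm; the file itself can be fetched via
# GET /collections/documents/snapshots/<name> or from Qdrant's storage volume
for s in client.list_snapshots(collection_name="documents"):
    print(s.name, s.creation_time, s.size)
```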
Next Steps
Ready to deploy a RAG system for your organization?
Book a Technical Scoping Call to discuss your specific requirements, hardware constraints, and implementation timeline.