
LLM Infrastructure for Energy: What Actually Works in Production

By EthosPower Editorial · April 15, 2026 · 7 min read · Verified Apr 15, 2026

The Problem: Cloud LLMs Don't Work in Energy

I spent six months last year trying to convince a Midwest utility's CISO that ChatGPT Enterprise was NERC CIP compliant. We both knew how that conversation would end. Their SCADA historian contains 40 years of operational data—load curves, fault records, equipment performance—that could transform maintenance planning and grid optimization. But sending that data to OpenAI's servers? Not happening.

The energy sector needs LLM capabilities without the cloud dependency. That means running inference locally, keeping embeddings on-premise, and maintaining complete data sovereignty. After deploying five different LLM stacks across utilities and upstream oil & gas, I've settled on an architecture that actually survives production.

This isn't theoretical. This is what I'm running right now at three sites, including one fully air-gapped substation automation network. If you're evaluating LLM infrastructure for energy operations, understanding how open-source AI economics work will save you from expensive mistakes.

Architecture: Four Components, One Philosophy

The stack breaks into four layers, each solving a specific problem:

1. Inference Engine: Ollama

Ollama is the only inference engine I deploy anymore. Version 0.5.x added critical features for energy environments: model preloading (eliminates cold-start latency during shift changes), concurrent request handling (multiple engineers can query simultaneously), and a proper REST API that works with existing monitoring tools.
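Preloading is scriptable against that REST API. Here's a minimal sketch using only the standard library, assuming Ollama's default port; an empty-prompt call to /api/generate with a keep_alive value is the documented way to load a model into VRAM and keep it resident:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default listen address

def preload_payload(model: str, keep_alive: str = "24h") -> dict:
    """An empty prompt makes /api/generate load the model without
    generating anything; keep_alive holds it in VRAM afterward."""
    return {"model": model, "prompt": "", "keep_alive": keep_alive}

def preload(model: str) -> None:
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(preload_payload(model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()  # returns once the model is loaded

# preload("llama3.3:70b")  # e.g. from cron, ahead of a shift change
```

Run it from cron fifteen minutes before a shift change and the first query of the day skips the multi-minute load.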

I run Llama 3.3 70B on a Dell R750 with dual A40 GPUs. That configuration delivers 15-20 tokens/second with 32k context—enough for most operational documents. For smaller deployments, Llama 3.2 3B runs acceptably on CPU-only hardware, which matters when you're deploying into substations with no GPU budget.

The model library lives in /usr/share/ollama/.ollama/models. For air-gapped sites, I pre-populate this directory from a staging server, then transfer via USB. Crude, but NERC CIP auditors understand physical media better than they understand container registries.

2. Knowledge Layer: Qdrant Vector Database

RAG (retrieval-augmented generation) only works if your vector database understands domain-specific context. Qdrant handles our equipment manuals, O&M procedures, and historical fault analysis better than Pinecone or Weaviate ever did.

Critical configuration: I use HNSW indexing with m: 16 and ef_construct: 200. This trades some write performance for query accuracy, which matters when an engineer is troubleshooting a protection relay at 2 AM. Wrong context = wrong answer = extended outage.
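Qdrant exposes those parameters through its REST API. A stdlib-only sketch of collection creation; the collection name is hypothetical, and 768 is nomic-embed-text's output dimension (adjust for a different embedding model):

```python
import json
import urllib.request

QDRANT_URL = "http://localhost:6333"  # same-host deployment, no TLS inside the box

def collection_spec() -> dict:
    """Cosine distance over 768-dim vectors, HNSW tuned for recall:
    m controls graph connectivity, ef_construct the build-time search
    depth. Both are raised above defaults at the cost of slower writes."""
    return {
        "vectors": {"size": 768, "distance": "Cosine"},
        "hnsw_config": {"m": 16, "ef_construct": 200},
    }

def create_collection(name: str) -> None:
    req = urllib.request.Request(
        f"{QDRANT_URL}/collections/{name}",
        data=json.dumps(collection_spec()).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req).read()

# create_collection("relay_manuals")  # hypothetical collection name
```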

Embedding model: nomic-embed-text via Ollama. It's not the most sophisticated embedding model, but it runs locally, handles technical vocabulary reasonably well, and doesn't require a separate Python service. I've tested bge-large-en-v1.5 and e5-mistral-7b-instruct—both produced marginally better retrieval on our test set, but the operational complexity wasn't worth the 3% accuracy gain.
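Because the embedder is just another Ollama model, generating a vector is one HTTP call against the /api/embeddings endpoint. A sketch under the same localhost assumptions as above:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"

def embed_request(text: str, model: str = "nomic-embed-text") -> dict:
    """Request body for Ollama's /api/embeddings endpoint."""
    return {"model": model, "prompt": text}

def embed(text: str) -> list:
    """Return the embedding vector for one chunk of text."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=json.dumps(embed_request(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

# embed("Relay 87T operated on the differential element during the fault.")
```

No separate Python service to babysit: the same Ollama instance that serves chat serves embeddings.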

Qdrant's snapshot functionality is underrated for disaster recovery. I take hourly snapshots during knowledge base updates, daily otherwise. When a junior analyst accidentally deleted our entire relay settings collection last month, restore took four minutes.

3. Application Layer: AnythingLLM

AnythingLLM is where users actually interact with the stack. It's not the most elegant UI, but it handles three critical requirements:

  • Document ingestion with chunking control: Energy procedures aren't blog posts. A protection scheme description might be 40 pages of dense technical content with embedded tables and diagrams. AnythingLLM's chunking settings (I use 800 token chunks with 200 token overlap) preserve logical relationships better than naive splitting.
  • Workspace isolation: Different teams need different contexts. Transmission planning doesn't need to see distribution O&M procedures. Workspace-level access control maps cleanly to existing AD groups.
  • Agent capabilities: The @agent mode connects multiple tools—SQL queries against the outage database, Python scripts for load calculation, direct API calls to the EMS. This matters more than RAG for some use cases.
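The chunking strategy in the first bullet is a sliding window. A simplified sketch, using whitespace tokens as a stand-in for AnythingLLM's real tokenizer:

```python
def chunk_tokens(tokens: list, size: int = 800, overlap: int = 200) -> list:
    """Sliding-window chunking: each chunk shares `overlap` tokens with
    its predecessor, so a table row or clause split at a boundary still
    appears intact in at least one chunk."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # final chunk absorbs the tail; no fragment left over
    return chunks

# Whitespace split stands in for a real tokenizer here.
words = ("CT ratio 1200:5 " * 500).split()   # 1500 "tokens"
chunks = chunk_tokens(words)                 # 3 overlapping chunks
```

With a 40-page protection scheme description, the 200-token overlap is what keeps a setting table from being severed from the paragraph that explains it.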

Deployment: Docker Compose on the same server as Ollama. The application container, vector database, and inference engine communicate over localhost, which simplifies firewall rules and eliminates network latency. Total resource overhead: about 8GB RAM beyond what Ollama requires.

4. Monitoring: Prometheus + Grafana

LLM infrastructure fails in interesting ways. Ollama might be running but unresponsive. Qdrant might accept writes but serve corrupted vectors. AnythingLLM might queue requests indefinitely. Without instrumentation, users just see "AI is broken."

I export three critical metrics:

  • Ollama inference latency (p50, p95, p99): Tracks model performance degradation. When p99 exceeds 30 seconds, I know we're hitting memory pressure.
  • Qdrant query response time: Baseline is 50-200ms. Spikes indicate index issues or resource contention.
  • AnythingLLM workspace query success rate: Failed queries usually mean context retrieval problems, not inference failures.

Alert on p95 latency exceeding 15 seconds or success rate dropping below 90%. These thresholds work for our usage patterns—adjust based on your SLAs.
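In production the percentiles come from Prometheus recording rules, but the alert logic itself is simple enough to sketch standalone (thresholds from above; the function names are mine):

```python
import statistics

def should_alert(latencies: list, successes: int, total: int) -> bool:
    """Fire when p95 inference latency exceeds 15 seconds or the
    workspace query success rate drops below 90% over the window."""
    # quantiles(n=100) returns the 99 percentile cut points; index 94 is p95
    p95 = statistics.quantiles(latencies, n=100)[94]
    success_rate = successes / total if total else 1.0
    return p95 > 15.0 or success_rate < 0.90
```

The same two conditions translate directly into Alertmanager rules; the Python version is just the reference behavior.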

Operational Reality: What Actually Breaks

The architecture above is stable, but production teaches humility. Here's what I've debugged in the last six months:

GPU memory exhaustion: Llama 3.3 70B quantized to Q4_K_M requires about 42GB VRAM for the weights alone. On dual A40s (48GB each, 96GB combined), that sounds comfortable until KV cache enters the picture: a 32k context window requested while another query is in flight pushes allocation past what's available, and Ollama OOMs. Solution: Set OLLAMA_MAX_LOADED_MODELS=1 and OLLAMA_NUM_PARALLEL=2. This limits concurrency but prevents crashes.

Embedding model version drift: We updated nomic-embed-text from v1.0 to v1.5 without re-embedding the knowledge base. Retrieval accuracy dropped 30% overnight because old vectors didn't match new embedding space. Now I version-pin embedding models and treat updates as migration events.

Context window management: Engineers paste entire relay manuals into chat, expecting useful answers. LLMs don't work that way. I added client-side warnings when input exceeds 4k tokens and trained users to ask specific questions. Cultural problem, technical symptom.

Stale knowledge: Our outage procedures updated in March. The RAG system still referenced old versions until June because no one re-ingested the documents. I now run a weekly cron job that checks document modification times and triggers re-embedding automatically.
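The weekly check is a few lines. A sketch with hypothetical paths; the hand-off to re-embedding depends on your ingestion pipeline:

```python
import pathlib
import time

def stale_documents(doc_dir: str, last_ingest: float) -> list:
    """Documents modified since the last ingestion run, by mtime.
    Feed the result to the re-embedding pipeline."""
    root = pathlib.Path(doc_dir)
    return [p for p in root.rglob("*.pdf") if p.stat().st_mtime > last_ingest]

cutoff = time.time() - 7 * 24 * 3600  # previous weekly run
# for doc in stale_documents("/srv/knowledge-base", cutoff):  # hypothetical path
#     queue_for_reembedding(doc)  # hypothetical hook into your ingestion pipeline
```

Mtime comparison is crude next to content hashing, but it catches the failure mode that actually bit us: a procedure revised in place and never re-ingested.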

Deployment Patterns: Three Configurations

I've deployed this stack in three different security contexts:

Corporate IT Network (Standard)

  • Full internet access for model downloads
  • Ollama + AnythingLLM + Qdrant on single Ubuntu 22.04 server
  • Users access via HTTPS through corporate proxy
  • Integrates with existing Okta SSO

OT Network (Restricted)

  • No internet access
  • Models pre-loaded via staging server
  • Jump host access only, with session recording
  • Read-only connection to historian for context retrieval
  • Separate Ollama instance for each operational zone (transmission, distribution, generation)

Air-Gapped Substation (Maximum Isolation)

  • Fully isolated from corporate networks
  • All software and models transferred via encrypted USB
  • Single-purpose hardware (no other applications)
  • Local auth only (no AD integration possible)
  • Manual log export for security review

The air-gapped deployment taught me that LLM infrastructure can work in the most restrictive environments. It's not elegant, but it's functional. NERC CIP auditors actually appreciated the simplicity—no network attack surface to evaluate.

Cost Reality: Hardware vs. Cloud

A Dell R750 with dual A40 GPUs costs about $28,000. Add $3,000 for NVMe storage and you're at $31,000 capital expense. That hardware supports 50-100 concurrent users comfortably.

For comparison, ChatGPT Enterprise is $60/user/month. At 50 users, that's $36,000 annually—and you're still sending data to OpenAI. The hardware pays for itself in 10 months, then runs for 4-5 years.

I'm not including operational costs (power, cooling, admin time) because those exist regardless of whether you run LLMs or traditional apps on the hardware. The incremental cost is negligible.

The Verdict

LLM infrastructure for energy comes down to control vs. convenience. Cloud services are easier to deploy but fundamentally incompatible with NERC CIP, data sovereignty, and air-gapped operations. Self-hosted infrastructure using Ollama, Qdrant, and AnythingLLM requires more upfront effort but delivers capabilities that actually work in production environments.

This stack isn't perfect. The UI could be better. The embedding models could be smarter. Monitoring could be more sophisticated. But it's running in production right now, handling real queries from real engineers, and it hasn't paged me at 3 AM in three months. That's the definition of working infrastructure.

If you're trying to build similar capabilities, the biggest mistake is starting too big. Deploy Ollama with a single model first. Add RAG only when you have a specific knowledge base that needs it. Integrate gradually. The technology works, but organizational readiness matters more than technical sophistication. Start with an honest assessment of where your team actually is before you order GPU servers.

Decision Matrix

Deployment Complexity
  • Ollama + AnythingLLM: Docker Compose, 2hr setup (★★★★☆)
  • OpenAI API + LangChain: API key, immediate (★★★★★)
  • Open WebUI + Ollama: Single container, 1hr setup (★★★★★)

NERC CIP Compliance
  • Ollama + AnythingLLM: Fully air-gap capable (★★★★★)
  • OpenAI API + LangChain: Fails audit requirements (★☆☆☆☆)
  • Open WebUI + Ollama: Fully air-gap capable (★★★★★)

Hardware Requirements
  • Ollama + AnythingLLM: Dual A40 GPUs for 70B (★★★☆☆)
  • OpenAI API + LangChain: Zero on-prem hardware (★★★★★)
  • Open WebUI + Ollama: Same as Ollama (★★★☆☆)

Knowledge Integration
  • Ollama + AnythingLLM: Native Qdrant RAG (★★★★★)
  • OpenAI API + LangChain: Custom RAG implementation (★★★☆☆)
  • Open WebUI + Ollama: Basic document chat (★★☆☆☆)

Production Maturity
  • Ollama + AnythingLLM: 18mo production use (★★★★☆)
  • OpenAI API + LangChain: Mature ecosystem (★★★★★)
  • Open WebUI + Ollama: 12mo production use (★★★☆☆)

Best For
  • Ollama + AnythingLLM: Energy utilities requiring data sovereignty and air-gap capability
  • OpenAI API + LangChain: Non-regulated environments with flexible data policies
  • Open WebUI + Ollama: Individual users or small teams needing a simple local LLM interface

Verdict
  • Ollama + AnythingLLM: The only stack I deploy for NERC CIP environments; proven in production at three sites, including fully isolated OT networks.
  • OpenAI API + LangChain: Technically superior but legally incompatible with energy sector compliance requirements; only viable for corporate IT use cases.
  • Open WebUI + Ollama: Excellent for single-user deployments but lacks enterprise features like workspace isolation and advanced RAG; better for experimentation than production.

