The Conference Circuit Version
Every AI architecture diagram I see at energy conferences shows the same thing: a neat stack with data lakes feeding vector embeddings into some cloud LLM, wrapped in microservices, orchestrated by Kubernetes. The vendor slides promise sub-second retrieval, infinite scale, and seamless integration with your existing systems.
Then you try to deploy it in a substation control house with no internet connection, where the newest server runs CentOS 7, and every configuration change requires a change management board review three weeks out. That's when the architecture diagrams meet reality.
I've spent the last three years at EthosPower deploying AI infrastructure across utilities, renewable operators, and oil & gas facilities. We've put vector databases in NERC CIP High zones, run LLMs on air-gapped networks, and built RAG systems that answer questions about equipment manuals written in 1987. Here's what actually works when you can't just "spin up a cloud instance."
If you're still figuring out where your organisation stands on this journey, the AI Readiness Assessment will save you from discovering your gaps mid-deployment.
The Storage Layer: Why I Stopped Fighting Graph Versus Vector
The first architectural decision everyone obsesses over: graph database or vector database? I wasted six months trying to pick one.
The reality: you need both, and they do completely different jobs.
Vector databases (we deploy Qdrant almost exclusively now) handle semantic search and retrieval. When an operator asks "show me all breaker failures in the last month similar to this one," you're doing vector similarity against embedded incident reports. Qdrant's Rust implementation runs lean enough to deploy on modest hardware—I've got production instances running on 16GB RAM servers in switchgear rooms. The quantization options let you trade accuracy for memory footprint, which matters when you're working with equipment from 2012.
Configuration that works: HNSW index with m=16, ef_construct=100 for most energy datasets under 10 million vectors. Use scalar quantization if you're memory-constrained; the accuracy hit is negligible for equipment documentation and incident reports.
Graph databases (Neo4j, specifically) model relationships that vectors can't represent. Electrical topology, protection coordination, clearance dependencies—these are inherently graph problems. When you need to trace which relays see a fault, or determine what equipment loses power if you open a specific breaker, you need graph traversal, not vector similarity.
I run Neo4j for asset relationships and Qdrant for document retrieval in the same stack. They're not competitors; they're complementary. Neo4j holds the structured knowledge graph of your physical plant. Qdrant indexes the unstructured documentation, procedures, and historical records.
The integration pattern that works: store document metadata and graph node IDs in Qdrant payloads. When you retrieve a semantically similar document, you can immediately query Neo4j for related assets, parent systems, or dependent procedures. This hybrid approach answers questions like "find procedures similar to this one that apply to equipment downstream of breaker 52-3B."
The LLM Layer: Local Models or Bust
In energy operations, you cannot send operational data to external APIs. Not "shouldn't"—cannot. NERC CIP-002 through CIP-011 make this non-negotiable for bulk electric systems. Even non-CIP facilities often have contractual confidentiality requirements.
This means local LLM inference. We deploy Ollama for model management and serving. It's not the only option (vLLM, llama.cpp, LocalAI all work), but Ollama's model library and simple API make it practical for teams who aren't ML engineers.
Current model selection for energy workloads:
- Llama 3.1 8B: General question answering, procedure lookup, basic reasoning. Runs acceptably on CPU-only servers (16-core Xeon, 64GB RAM). Quantized to Q4_K_M for 6GB memory footprint.
- Mistral 7B: Faster inference than Llama for simple queries. We use it for real-time operator assist where sub-2-second response matters.
- Llama 3.1 70B: Complex reasoning, multi-step procedures, regulatory interpretation. Requires GPU (we typically deploy with 2x RTX A5000, 48GB total VRAM). Reserve for high-value use cases where accuracy justifies the hardware.
Do not underestimate inference latency in operational environments. An operator troubleshooting an outage will not wait 30 seconds for an AI response. We target p95 latency under 3 seconds for Q&A, which means aggressive caching and sometimes accepting lower-quality models for speed.
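The caching half of that strategy can be sketched in a few lines. `ollama_generate` here is a stand-in for a real call to Ollama's `/api/generate` endpoint, not actual client code:

```python
# Sketch of the caching strategy: identical operator queries should never
# pay inference cost twice. ollama_generate is a placeholder stub for a
# POST to a local Ollama server, not a real client.
import hashlib

_cache: dict[str, str] = {}

def ollama_generate(prompt: str) -> str:
    """Stub standing in for a request to http://localhost:11434/api/generate."""
    return f"answer for: {prompt}"

def cached_answer(prompt: str) -> tuple[str, bool]:
    """Return (answer, was_cached). Normalise before hashing so trivial
    whitespace and case differences still hit the cache."""
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key in _cache:
        return _cache[key], True
    answer = ollama_generate(prompt)
    _cache[key] = answer
    return answer, False
```

Even naive normalisation like this catches a surprising share of repeated operator queries; anything that skips inference entirely is the cheapest latency win available.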
The RAG Pipeline: Where Theory Meets PDFs Scanned in 1994
Retrieval-Augmented Generation is the only practical pattern for energy AI. You cannot fine-tune models on proprietary procedures and equipment specs—the data volume is too small and changes too frequently. RAG lets you separate the knowledge base (your documents) from the reasoning engine (the LLM).
What the textbooks don't tell you: document ingestion is 80% of the work.
Energy sector documents are nightmares:
- Scanned PDFs with no text layer
- CAD drawings saved as image PDFs
- Protection settings in Excel spreadsheets embedded in Word documents
- Procedures that reference other procedures by document numbers that changed in 2003
- Equipment manuals with part numbers but no actual equipment identifiers
We've built ingestion pipelines in AnythingLLM for simpler deployments and custom n8n workflows for complex cases. The common pattern:
- OCR everything (Tesseract for open-source, Azure Document Intelligence when accuracy matters more than cost)
- Chunk on logical boundaries, not fixed token counts—respect section headers, numbered steps, table boundaries
- Enrich chunks with metadata during ingestion—document type, equipment tags, system identifiers, revision dates
- Store metadata as filterable fields in Qdrant, not just in the vector payload
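The chunking and enrichment steps above can be sketched as follows. The regex and the metadata field names are illustrative assumptions; real procedures need handling for section headers and tables as well:

```python
# Sketch: chunking a procedure on numbered-step boundaries rather than
# fixed token counts, and attaching filterable metadata to each chunk.
# The split pattern and metadata fields are illustrative assumptions.
import re

def chunk_procedure(text: str, doc_meta: dict) -> list[dict]:
    """Split on numbered steps ('1.', '2.', ...) at line starts; each
    chunk inherits document-level metadata for filtered retrieval."""
    pieces = [p.strip() for p in re.split(r"(?m)^(?=\d+\.\s)", text) if p.strip()]
    return [
        {"text": p, **doc_meta, "chunk_index": i}
        for i, p in enumerate(pieces)
    ]

procedure = """1. De-energise transformer T-101 and verify zero voltage.
2. Apply lockout device to breaker 52-3B.
3. Attach tag with date and authorised name."""

chunks = chunk_procedure(procedure, {
    "doc_type": "procedure",
    "equipment_id": "T-101",
    "revision_date": "2023-04-12",
})
```

Each chunk carries the document metadata with it, which is exactly what makes the filtered retrieval below possible.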
The metadata filtering is critical. When an operator asks about a specific transformer, you must filter retrieval to documents tagged with that equipment ID. Vector similarity alone will return content about similar transformers, which is actively dangerous in operations.
The Application Layer: Agents Versus Assistants
The AI agent hype is exhausting. Every vendor pitches "autonomous AI agents" that will "manage your grid."
In three years, I've deployed exactly zero autonomous agents in operational environments. What I've deployed: assistants that augment human decision-making.
The distinction matters. An agent acts independently. An assistant retrieves information, suggests actions, and explains reasoning—but a human makes the final decision. In energy operations, this isn't just conservative practice; it's regulatory and safety reality. No AI is signing off on a switching order or authorising equipment clearance.
What works: task-specific assistants with narrow scope.
- Procedure lookup: "Find the lockout/tagout procedure for this transformer." Retrieves relevant documents, highlights applicable sections, links to related safety requirements.
- Incident similarity: "Show me similar breaker failures." Searches historical incidents, ranks by vector similarity, surfaces common causes and resolutions.
- Regulatory interpretation: "Does this modification require a NERC CIP change request?" Retrieves relevant standards, interprets applicability based on BES categorisation.
These assistants run in AnythingLLM instances deployed on-premises. The multi-user workspace model works well for operations teams—each discipline (protection, SCADA, relay, compliance) gets a workspace with document collections specific to their domain.
For more complex workflows with external system integration, we build in n8n. Example: an assistant that queries the SCADA historian via Modbus, retrieves equipment manuals from the document store, and generates a maintenance report. That's not an autonomous agent; it's a workflow orchestrator with LLM-powered summarisation.
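Stripped of the n8n plumbing, the shape of such a workflow is just explicit steps composed in a fixed order. All three step functions below are stubs for the systems you would actually wire up (Modbus historian, document store, local LLM); the point is the structure, not the integrations:

```python
# Sketch of the workflow-orchestrator pattern (not an autonomous agent):
# each step is an explicit function, the LLM only summarises at the end,
# and a human reviews the output. All three steps are stubs standing in
# for real integrations.
def read_historian(tag: str) -> dict:
    return {"tag": tag, "avg_load_mva": 42.0}       # stub: Modbus/SCADA read

def fetch_manual_excerpt(equipment_id: str) -> str:
    return "Inspect bushings annually."             # stub: document retrieval

def llm_summarise(context: str) -> str:
    return f"DRAFT (human review required): {context}"  # stub: local LLM call

def maintenance_report(equipment_id: str, tag: str) -> str:
    telemetry = read_historian(tag)
    excerpt = fetch_manual_excerpt(equipment_id)
    context = (f"{equipment_id}: avg load {telemetry['avg_load_mva']} MVA. "
               f"Manual: {excerpt}")
    return llm_summarise(context)
```

Note what is absent: no planning loop, no tool selection, no self-directed retries. The workflow decides nothing; it assembles context and drafts text for a human.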
The Security Layer: This Is Not Optional
Deploying AI in NERC CIP environments means treating it like any other cyber asset. If your AI system has network connectivity to a BES Cyber System, it's a Protected Cyber Asset. That means:
- Ports and services hardening (CIP-007-6 R1)
- Security patch management (CIP-007-6 R2)
- Malware prevention (CIP-007-6 R3)
- Security event logging (CIP-007-6 R4)
- System access controls (CIP-005-6, CIP-007-6 R5)
This isn't theoretical. I've seen utilities fail CIP audits because they deployed a "proof of concept" AI tool that wasn't hardened and patched. POCs become production, and auditors don't care about your intentions.
Practical implications:
- Deploy on minimal base OS (Rocky Linux 9, Ubuntu 22.04 LTS)—disable everything you don't need
- Container images must come from verified sources or be built internally—no random Docker Hub pulls
- API authentication is mandatory—even internal services need token-based auth at minimum
- Log all queries and responses for audit trail (CIP-007 R4, CIP-008 incident response)
- Implement role-based access aligned with your ESP access controls
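The logging requirement in particular is cheap to get right from day one. A sketch of JSON-lines audit logging of every query/response pair, the kind of trail a CIP-007 R4 review expects (field names are illustrative, and the `StringIO` handler stands in for a real file or syslog destination):

```python
# Sketch: JSON-lines audit logging of every query/response pair.
# The StringIO buffer is a stand-in for a file or syslog handler;
# field names are illustrative assumptions.
import json
import logging
import time
from io import StringIO

buffer = StringIO()
handler = logging.StreamHandler(buffer)
audit = logging.getLogger("ai_audit")
audit.setLevel(logging.INFO)
audit.addHandler(handler)

def log_interaction(user: str, query: str, response: str) -> None:
    """One JSON object per line: timestamped, attributable, greppable."""
    audit.info(json.dumps({
        "ts": time.time(),
        "user": user,
        "query": query,
        "response": response,
    }))

log_interaction("operator1", "LOTO procedure for T-101?", "See PRC-0042 rev C.")
```

One JSON object per line keeps the trail trivially parseable for incident response without any extra tooling.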
The air-gap question comes up constantly. Yes, you can run this entire stack with zero internet connectivity. Ollama models download once, then run offline. Qdrant and Neo4j don't phone home. AnythingLLM is fully self-contained. The constraint is model updates—you'll need a manual process to transfer new model weights via approved media.
What I'd Do Differently
If I were starting from zero today:
Start with document retrieval only. Deploy Qdrant and a basic RAG pipeline in AnythingLLM. Get operators using AI-powered search before you attempt reasoning or analysis. Build trust in the system's ability to find the right document before you ask them to trust its interpretation.
Delay the graph database. Neo4j adds immense value, but only after you have clean asset data. If your CMMS is a mess and nobody trusts the equipment hierarchy, fixing that is prerequisite work. Vector search on documents delivers value immediately; graph queries require foundational data quality you might not have.
Invest in metadata. The quality of your RAG system is directly proportional to the metadata you capture during ingestion. Equipment IDs, system tags, document types, revision dates—these must be extractable and filterable. This is unglamorous ETL work, but it's the difference between a useful assistant and an expensive keyword search.
Right-size your models. Running a 70B parameter model on CPU is technically possible and practically unusable. Better to run a quantized 8B model that responds in 2 seconds than a larger model that takes 45 seconds. Operators will use the fast one and ignore the slow one, regardless of accuracy differences.
For a realistic view of what this infrastructure costs to build and operate versus SaaS alternatives, run the numbers through the Sovereign Savings Calculator—vendor quotes always hide the long-term lock-in costs.
The Verdict
AI architecture for energy is not about picking the trendiest database or the largest language model. It's about building systems that work in constrained, regulated, high-consequence environments where "move fast and break things" is how people get hurt.
The stack that works: Qdrant for vector storage, Neo4j for graph relationships when you need them, Ollama for local LLM inference, and task-specific assistants that augment human expertise rather than replacing it. Deploy on-premises, harden for CIP compliance from day one, and obsess over metadata quality in your document ingestion.
This isn't the architecture you'll see in vendor slide decks. It's the one that survives audits, runs on hardware you can actually procure, and delivers value to operators who need answers in seconds, not minutes. Start with the AI Implementation Cost Calculator to budget this properly—underestimating infrastructure and data preparation costs is how projects fail.