The Problem Ollama Solves
You need LLM capabilities for document analysis, procedure generation, or incident summarization, but you can't send operational data to OpenAI. Maybe it's NERC CIP restrictions. Maybe it's corporate policy after watching competitors leak sensitive information through ChatGPT. Maybe you're just tired of explaining to procurement why you need another SaaS subscription that processes critical infrastructure data on someone else's hardware.
Ollama gives you a single binary that runs Llama 3.1, Mistral, Qwen, and 100+ other open models on your own metal. No API keys. No usage metering. No data leaving your network perimeter. I've deployed it in control centers that haven't seen internet connectivity in five years, and it just works.
The catch: you're now responsible for the entire stack. Model selection, prompt engineering, context window management, GPU provisioning — that's all on you. If you want someone else to handle that complexity, use a hosted service. If you need complete control and can operate infrastructure, Ollama is the right tool.
Who This Is For
Ollama makes sense when:
- You have NERC CIP Critical Cyber Assets that need AI augmentation
- Your data classification policy prohibits cloud LLM services
- You operate air-gapped networks (substations, generation facilities, offshore platforms)
- You need consistent inference costs regardless of query volume
- You have engineering staff who can manage containerized services
It doesn't make sense when:
- You need GPT-4 level reasoning for complex planning tasks
- You lack GPU resources (inference on CPU is painfully slow)
- You want pre-built agents and workflows (use AnythingLLM for that)
- Your team has no Linux systems administration experience
How to Deploy Ollama in Energy Operations
Initial Setup
I run Ollama on Ubuntu 22.04 LTS servers with NVIDIA A4000 or A5000 GPUs. The A4000 (16GB VRAM) handles 7B parameter models comfortably. The A5000 (24GB VRAM) runs 13B models at acceptable speeds. For 70B models, you need enterprise hardware or multiple GPUs — usually not worth it in operational environments.
Install takes one command:
```
curl -fsSL https://ollama.com/install.sh | sh
```
That's it. No Python virtual environments. No dependency hell. The binary includes the inference engine, model loader, and API server. It auto-detects your GPU and configures CUDA. I've never had an installation fail.
Model Selection Strategy
Start with llama3.1:8b. Pull it with:
```
ollama pull llama3.1:8b
```
This is your baseline. It handles 80% of operational tasks: summarizing incident reports, extracting data from PDFs, answering questions about procedures. Response quality is surprisingly good for a model that runs on modest hardware.
For specialized tasks, I use:
- mistral:7b-instruct for code generation and structured output
- qwen2.5:14b for technical document analysis
- phi3.5:3.8b for edge deployments where VRAM is constrained
Avoid the temptation to jump straight to 70B models. They're slow, resource-hungry, and rarely provide enough additional value to justify the operational complexity. In three years of energy sector deployments, I've never had a client keep a 70B model in production.
Integration Architecture
Ollama exposes its HTTP API on port 11434, including an OpenAI-compatible endpoint under /v1. Point any tool that speaks OpenAI's protocol at http://your-server:11434/v1 and it works. I've integrated it with:
- AnythingLLM for document chat and RAG workflows
- n8n for workflow automation (incident classification, report generation)
- Custom Python scripts for batch processing
- LibreChat for a ChatGPT-like interface that stays on-premise
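For the custom-script path, a minimal Python client against the OpenAI-compatible endpoint might look like the sketch below. The server URL, model choice, and the summarize helper are illustrative assumptions, not part of Ollama itself; only the request shape follows the OpenAI chat completions format.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # adjust host for your network

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload for Ollama."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def summarize(text: str, model: str = "llama3.1:8b") -> str:
    """Send a summarization request to a local Ollama instance (hypothetical helper)."""
    payload = build_chat_request(model, f"Summarize this incident report:\n\n{text}")
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses carry the text under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

Because the endpoint is protocol-compatible, the official openai Python package works too: point its base_url at your server and pass any placeholder API key.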
For NERC CIP environments, deploy Ollama inside your Electronic Security Perimeter. No inbound internet access required. Models download once during initial setup, then the server runs indefinitely without external connectivity. I have Ollama instances that haven't touched the internet in 18 months and still serve thousands of queries daily.
If you're evaluating whether self-hosting makes economic sense for your operation, run the numbers through our SaaS vs Sovereign ROI Calculator — I've seen break-even timelines as short as four months for utilities processing large document volumes.
Operational Configuration
Set the context window and thread count in your Modelfile:
```
FROM llama3.1:8b
PARAMETER num_ctx 8192
PARAMETER num_thread 8
```
Build the customized model with ollama create (for example, ollama create llama3.1-ops -f Modelfile) and run it by name. The default 2048-token context is too small for most energy sector documents. I use 8192 for technical specs and 16384 for regulatory filings. Larger contexts consume more VRAM and slow inference, so test with your actual workload.
Thread count should match your CPU cores. More threads don't always help — I've seen diminishing returns above 8 threads for most models.
What the Output Tells You
Ollama returns JSON with the model's response, tokens processed, and inference timing. The timing data matters more than most people realize.
eval_duration tells you how long inference took (Ollama reports durations in nanoseconds). For a 7B model on an A4000, expect 20-40 tokens per second. If you're seeing less than 10 tokens/sec, something's wrong — probably CPU inference instead of GPU, or memory swapping.
prompt_eval_duration is context loading time. This spikes when you send large documents. If it's taking longer than eval_duration, you're context-limited, not compute-limited. Either reduce your context size or upgrade to a GPU with more VRAM.
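The two checks above reduce to a few lines of arithmetic over the response JSON. This sketch uses the real eval_count, eval_duration, prompt_eval_count, and prompt_eval_duration fields from Ollama's /api/generate response; the sample numbers are invented for illustration.

```python
def throughput(resp: dict) -> dict:
    """Derive performance signals from an Ollama /api/generate response.

    Ollama reports all *_duration fields in nanoseconds."""
    eval_s = resp["eval_duration"] / 1e9
    prompt_s = resp["prompt_eval_duration"] / 1e9
    return {
        "tokens_per_sec": resp["eval_count"] / eval_s,
        # If loading the prompt dominates generation, you are context-limited.
        "context_limited": prompt_s > eval_s,
    }

# Illustrative response fragment (values invented):
sample = {
    "eval_count": 120,
    "eval_duration": 4_000_000_000,         # 4 s generating
    "prompt_eval_count": 6000,
    "prompt_eval_duration": 9_000_000_000,  # 9 s loading a large document
}
stats = throughput(sample)
# stats["tokens_per_sec"] == 30.0 — healthy for a 7B model on an A4000
# stats["context_limited"] is True — shrink the context or add VRAM
```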
I log these metrics to Prometheus and alert when token generation drops below threshold. In production, performance degradation usually means:
- GPU memory fragmentation (restart Ollama)
- Thermal throttling (check your cooling)
- Disk I/O bottleneck (move model storage to NVMe)
Limitations and When NOT to Use Ollama
Ollama excels at inference. It's terrible at everything else.
No fine-tuning support. If you need to train models on proprietary data, use a different stack. Ollama loads pre-trained models. That's it. For energy sector use cases, this usually isn't a problem — prompt engineering and RAG handle 95% of customization needs. But if you're doing cutting-edge research, look elsewhere.
No built-in RAG. Ollama gives you an LLM API. It doesn't retrieve documents, chunk text, generate embeddings, or manage vector databases. That's intentional — it does one thing well. Pair it with AnythingLLM or build your own RAG pipeline with Qdrant and LangChain.
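If you do roll your own pipeline, the first step Ollama won't do for you is chunking. A minimal sketch, assuming crude character-based windows (production pipelines usually split on sentence or section boundaries instead):

```python
def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows for embedding.

    Overlap preserves context that would otherwise be cut at chunk edges."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```

Each chunk then gets embedded, indexed in the vector store, and retrieved at query time to fill the model's context.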
Limited concurrency. Out of the box, Ollama serializes requests to a loaded model (recent releases can serve requests in parallel via the OLLAMA_NUM_PARALLEL environment variable, at the cost of additional VRAM). You can run multiple models simultaneously, but each blocks on inference. For high-throughput applications, deploy multiple Ollama instances behind a load balancer. I typically run 3-4 instances per GPU in production.
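The load-balancing layer can be as heavy as nginx or as light as a round-robin selector in your client code. A minimal sketch, assuming a hypothetical pool of three Ollama processes on consecutive ports:

```python
from itertools import cycle

# Hypothetical instance pool — one Ollama process per port, same host.
INSTANCES = cycle([
    "http://10.0.0.5:11434",
    "http://10.0.0.5:11435",
    "http://10.0.0.5:11436",
])

def next_endpoint() -> str:
    """Pick the next Ollama instance round-robin; a reverse proxy
    (nginx, HAProxy) does the same job at the network layer."""
    return next(INSTANCES)
```

A real deployment would add health checks so a hung instance drops out of rotation instead of stalling every Nth request.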
Model quality ceiling. Open models lag GPT-4 on complex reasoning tasks. If you're doing multi-step planning, legal analysis, or nuanced policy interpretation, you'll be disappointed. For 80% of operational AI tasks — summarization, extraction, classification, simple Q&A — open models are sufficient. For the other 20%, keep a GPT-4 API key for exceptions.
How Ollama Connects to the EthosPower Ecosystem
At EthosPower, Ollama is our inference layer. It provides the LLM runtime that everything else builds on.
AnythingLLM sits in front of Ollama to add RAG capabilities. You point AnythingLLM at your Ollama endpoint, upload your documents, and get a ChatGPT-like interface that only knows your data. We deploy this for procedure manuals, technical specifications, and maintenance logs.
n8n calls Ollama for workflow automation. Incident report arrives via email → n8n extracts PDF → sends to Ollama for classification → routes to appropriate team → generates summary. Zero human intervention until an actual decision is required.
Qdrant stores embeddings when you need semantic search across large document collections. Pull an embedding model with ollama pull nomic-embed-text, have Ollama generate the vectors through its embeddings endpoint, and let Qdrant index them; your RAG pipeline queries both. This is how we handle technical libraries with tens of thousands of documents.
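At its core the retrieval step is just vector math. The sketch below shows the payload shape for Ollama's /api/embeddings endpoint and the cosine similarity function that vector stores like Qdrant apply at scale; the model name is the embedding model mentioned above, and everything else is illustrative.

```python
import math

def embed_request(text: str, model: str = "nomic-embed-text") -> dict:
    """Payload for POST /api/embeddings on an Ollama server."""
    return {"model": model, "prompt": text}

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity — the ranking function behind most vector search."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Identical directions score 1.0; orthogonal directions score 0.0.
# cosine([1.0, 0.0], [1.0, 0.0]) == 1.0
# cosine([1.0, 0.0], [0.0, 1.0]) == 0.0
```

Qdrant replaces the brute-force comparison with an approximate nearest-neighbor index, which is what makes tens of thousands of documents searchable in milliseconds.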
The architecture is deliberately modular. Swap Ollama for vLLM if you need better concurrency. Replace Qdrant with pgvector if you want everything in Postgres. The components speak standard APIs, so you're not locked into any single tool.
The Verdict
Ollama earns its place in your infrastructure when you need LLM capabilities without vendor dependencies or data exfiltration risk. It's not the easiest option — managed services are simpler — but it's the right option when compliance, sovereignty, or economics demand self-hosting.
I deploy Ollama in every EthosPower engagement that involves air-gapped networks or NERC CIP systems. It's reliable, performant enough for operational use cases, and operationally simple once you understand the model selection trade-offs. The main failure mode I see is organizations choosing models that are too large for their hardware, then blaming Ollama for slow inference.
Start with llama3.1:8b on a server with at least 16GB VRAM. Run it for a month on real workloads. Measure token throughput and response quality. Then decide if you need something bigger, smaller, or specialized. Chat with EthosAI if you need help sizing hardware for your specific query patterns.