Self-Hosted LLMs in Production: What It Actually Takes to Cut API Costs
Most businesses running AI workloads hit the same wall: the API bill. A team of 10 using Claude or GPT-4 heavily can easily spend $500 to $5,000 a month. At that spend, the math on self-hosting starts to look interesting — but the pitch decks that say “run it yourself for free” leave out the hard parts.
This post is a case study of our production self-hosted AI stack. Real hardware, real numbers, real tradeoffs. Written for anyone weighing whether to migrate their API-dependent workload to local inference.
The Stack
We run a three-GPU inference server on an Unraid host. The primary workload is a multi-agent AI coding system, a personal assistant with 54 tool integrations, and a content pipeline for four websites.
- GPUs: 3x NVIDIA Tesla V100 (32GB each) — 96GB total VRAM
- Host: Dell R7525, dual EPYC, 256GB RAM
- Inference engine: Ollama 0.18.1 with tensor parallelism
- Primary model: Qwen3-Coder (80B MoE, custom quantization)
- Secondary model: Llama 3.3 (70B) for long-form content
- Voice: OpenAI Whisper (large-v3) for speech-to-text
Four Claude Code workers, a WebSocket assistant backend, and a scheduled content generator all share this inference capacity.
The Numbers
Throughput on the primary 80B model:
| Metric | Value |
|---|---|
| Tokens/sec (single stream) | ~42 |
| Concurrent sessions | 4 |
| Context length (configured) | 8,192 tokens |
| Model load time | ~45 seconds |
| Keep-alive | infinite (prevents reload race) |
Monthly cost comparison, workload-equivalent:
| Category | Cloud API | Self-Hosted |
|---|---|---|
| Inference compute | $200 - $400/mo | ~$35/mo electricity |
| Storage | included | one-time |
| Hardware | — | amortized |
| Data egress | varies | $0 |
The electricity figure is measured, not estimated. Three V100s under mixed load pull roughly 350-450W continuous at our Florida utility rate.
What Self-Hosting Actually Buys You
It’s not just cost. The four things that made the migration worth it, in priority order:
1. Zero data egress. Every prompt, every document, every code diff stays on the network. For clients in healthcare, legal, and finance, this alone pays for the hardware.
2. Predictable latency. Cloud APIs rate-limit, throttle, and have bad afternoons. Local inference has exactly one failure mode: your own box. You control it.
3. Model choice. We can run models that don’t exist as APIs — fine-tuned variants, quantized MoE models, specialized coding models. The frontier API vendors don’t publish everything they train.
4. Cost at scale is flat. API cost scales linearly with usage. GPU cost is fixed until you saturate it. Past a certain volume, the curve crosses and local gets dramatically cheaper.
What It Costs You
Four things the pitch decks don’t mention:
Capacity planning is your problem now. Cloud APIs scale for you. When your team hits peak usage at 2 PM on a Tuesday, you either have the GPU capacity or you don’t. We saturate our three-V100 stack most afternoons. Adding a fourth V100 is on our roadmap.
Model updates are your problem now. When OpenAI ships GPT-5, your users get it automatically. When DeepSeek releases a new coder model, you download 80GB, test it, benchmark against your current stack, and migrate carefully. That’s real engineering time.
Debugging is harder. A cloud API that returns garbage is usually a prompt problem. A local model that returns garbage could be the prompt, the quantization, the inference engine, the GPU driver, the cooling, or CUDA version mismatch. We’ve hit all six.
You will hit the power and heat ceiling. V100s run hot. Older data center GPUs are cheap on the secondary market for a reason — they’re designed for rack-mounted airflow. We run ours in a server chassis with industrial fans. A home office with a desktop tower will not work.
The Build-or-Buy Decision
Here’s the honest framework we use when clients ask:
Stay on the API if:
- You’re spending less than $300/month total
- Your traffic is spiky and infrequent
- Your use case requires frontier model quality (GPT-4, Claude Opus)
- You have no one who wants to own the infrastructure
Consider self-hosting if:
- You’re spending more than $1,000/month on inference
- Your use case tolerates a slightly older model generation
- Data privacy or compliance is a real requirement
- You have a technical team that can troubleshoot infrastructure
- Your workload is predictable and sustained
Hybrid is usually the right answer. Route 80% of queries to a local model (cheap, private, fast). Fall back to a frontier API only when the local model isn’t good enough. Our own stack works this way: routine queries go to Qwen3 locally, hard reasoning goes to Claude via API. Our combined API spend has dropped more than 90% compared to pure-cloud.
The Technical Gotchas
A few lessons learned the hard way, for the engineers who will actually implement this:
Context Length Defaults Will Crash You
Ollama’s default context length is 256K tokens. On a 3x V100 setup running an 80B model, that’s instant OOM. Set OLLAMA_CONTEXT_LENGTH=8192 for chat, bump to 32768 only for long-form tasks. Measured per-request if possible.
Keep-Alive Prevents Race Conditions
Default model eviction after idle caused hard crashes when two requests hit during reload. Set OLLAMA_KEEP_ALIVE=-1 to keep models in VRAM indefinitely. Fewer cold starts, no race conditions.
Tensor Parallelism ≠ Pipeline Parallelism
Multi-GPU speedup is a misconception. With Ollama’s current pipeline parallelism, three V100s give you three times the concurrent capacity — not three times the single-stream speed. If you need single-stream throughput, the model has to fit in one GPU’s VRAM. Plan accordingly.
V100s Have No NVENC
If you were hoping to colocate video encoding workloads (Tdarr, Plex transcoding) on the same GPUs — the V100 skipped the NVENC block. Use Intel QSV or CPU transcoding instead.
GGUF Models Need a Modelfile
Importing a raw GGUF without a proper chat template gives you {{ .Prompt }} — raw completion mode, no stop tokens, leaked training data in the output. Write a Modelfile with the correct chat template (ChatML for Qwen, Llama-2 format for Llama) before importing.
The Migration Path
For clients ready to move, the path is predictable:
- Audit current usage. Log every prompt and response for two weeks. Categorize by complexity: simple (regex-able), medium (small model), hard (frontier model).
- Benchmark a local model on the simple and medium categories. If accuracy is acceptable, you have 70-80% of your traffic ready to migrate.
- Deploy the local stack in parallel with cloud. Route progressively more traffic to local as confidence builds.
- Add monitoring. Latency, error rate, GPU utilization, VRAM pressure, model hit rate. You need to see everything.
- Cut over when stable. Keep cloud as fallback for the hard 20% you didn’t migrate.
The whole migration usually takes 4-8 weeks for a mid-sized workload. The hardware pays for itself in 2-6 months at typical consulting-volume API spend.
Should You Build This Yourself?
If your team has a strong systems engineer who wants to own it — absolutely. The tooling has matured. Ollama, vLLM, and llama.cpp are all production-grade now. Quantization quality has caught up with full-precision models for most use cases.
If you don’t have that engineer — or if your team’s time is better spent on your actual product — this is exactly the kind of project we take on as consulting work. We’ve built and operate the stack described above. We know where the edges are.
Need help deciding? Book a free 30-minute discovery call — we’ll assess your current API usage, sketch a target architecture, and give you a realistic cost/timeline estimate. No pitch, no pressure. Just the numbers.
Enjoyed this post?
Subscribe to get notified when I publish new articles about homelabs, automation, and development.
No spam, unsubscribe anytime. Typically 2-4 emails per month.