Case study — AI Integration

Nova: personal AI assistant.

A self-hosted AI assistant with 56 integrated tools, mobile and voice access, and a routing layer that keeps roughly 90% of queries on local hardware, holding cloud spend at lunch-money levels.

Duration: 8 months in production
Cloud spend: ~$50/mo (vs. an estimated $500+/mo on equivalent SaaS)
Stack: Node.js / React Native / Ollama (local LLM) / Claude API / WebSocket / Docker

01 — The problem

Three frustrations, one bill.

Off-the-shelf AI assistants (ChatGPT, Claude, Gemini, etc.) share three frustrations: they don't know your stack, your data leaves your network, and the bill scales with usage. For someone running a home automation system, a media server, a trading bot, four niche websites, and 48 mobile apps, per-token costs alone would have come to an estimated $500+/month. The privacy story would have been worse: every config file, email draft, and scratchpad would touch a cloud provider.

02 — The solution

Three-tier model routing.

Nova is a custom AI assistant that runs as a containerized service on a Dell R7525 with 3× NVIDIA Tesla V100 GPUs (96 GB VRAM total). It exposes a WebSocket API consumed by a React Native mobile app and a web dashboard. The architectural bet is three-tier model routing:

01

Tier 1 — Regex & deterministic dispatch

Commands like "lights off" or "what's the cache fill?" hit a router that maps directly to a tool call. Latency: <50ms. Cost: zero.

02

Tier 2 — Local LLM (Ollama, qwen3-coder-next, 80B MoE)

Anything ambiguous goes to the local model with the full toolset attached. Latency: 1-3s. Cost: zero at the margin (the GPUs are already paid for).

03

Tier 3 — Cloud API (Claude / GPT)

Only when the local model self-reports low confidence or the task requires the very best reasoning. Latency: 2-5s. Cost: ~$0.02-0.20 per call.
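
The three tiers above can be sketched as a single dispatch function. This is a minimal illustration rather than Nova's actual router: the tool identifiers (`home.lights`, `unraid.cacheFill`), the `Backends` interface, the numeric `confidence` field, and the 0.6 threshold are all assumptions made for the example. Only the confidence-based escalation is modelled; the "task requires the very best reasoning" case would need its own rule.

```typescript
// Hypothetical sketch of three-tier routing. Names and thresholds are assumptions.
type Tier = 1 | 2 | 3;

interface ToolCall {
  tool: string; // hypothetical tool identifier
  args: Record<string, string>;
}

interface LocalAnswer {
  reply: string;
  confidence: number; // assumed self-reported score in [0, 1]
}

interface RouteResult {
  tier: Tier;
  reply: string;
}

// Tier 1: deterministic patterns mapped straight to tool calls.
const TIER1_RULES: Array<{ pattern: RegExp; build: (m: RegExpMatchArray) => ToolCall }> = [
  {
    pattern: /^lights (on|off)$/i,
    build: (m) => ({ tool: "home.lights", args: { state: (m[1] ?? "").toLowerCase() } }),
  },
  {
    pattern: /^what'?s the cache fill\??$/i,
    build: () => ({ tool: "unraid.cacheFill", args: {} }),
  },
];

function matchTier1(query: string): ToolCall | null {
  const text = query.trim();
  for (const rule of TIER1_RULES) {
    const m = text.match(rule.pattern);
    if (m) return rule.build(m);
  }
  return null;
}

// Escalate when the local model's self-reported confidence falls below the threshold.
const CONFIDENCE_THRESHOLD = 0.6; // assumed value, to be tuned per deployment

function shouldEscalate(answer: LocalAnswer, threshold: number = CONFIDENCE_THRESHOLD): boolean {
  return answer.confidence < threshold;
}

// The three backends are injected so the routing logic stays testable.
interface Backends {
  runTool: (call: ToolCall) => Promise<string>;
  askLocal: (query: string) => Promise<LocalAnswer>;
  askCloud: (query: string) => Promise<string>;
}

async function route(query: string, backends: Backends): Promise<RouteResult> {
  const call = matchTier1(query);
  if (call) return { tier: 1, reply: await backends.runTool(call) };

  const local = await backends.askLocal(query);
  if (!shouldEscalate(local)) return { tier: 2, reply: local.reply };

  return { tier: 3, reply: await backends.askCloud(query) };
}
```

Keeping `matchTier1` and `shouldEscalate` as pure functions means the cheap paths can be unit-tested without a GPU or an API key.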

The router records which tier handled each query and writes a daily breakdown. In practice, ~90% of traffic stops at Tier 1 or 2, leaving Tier 3 for the genuinely hard requests where paying for Claude actually moves the needle.
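
One way such a daily breakdown could be kept is a per-day counter keyed by tier. The sketch below is an assumption about the shape of that bookkeeping, not the production code: the `TierLog` class, its in-memory `Map`, and the use of UTC dates are illustrative choices.

```typescript
// Hypothetical per-day tier counter. An in-memory Map stands in for real storage.
type Tier = 1 | 2 | 3;

interface TierBreakdown {
  total: number;
  counts: Record<Tier, number>;
  localShare: number; // fraction of queries that stopped at Tier 1 or 2
}

class TierLog {
  private readonly byDay = new Map<string, Record<Tier, number>>();

  // Record which tier answered a query. Days are bucketed by UTC date.
  record(tier: Tier, at: Date = new Date()): void {
    const day = at.toISOString().slice(0, 10); // "YYYY-MM-DD"
    const counts: Record<Tier, number> = this.byDay.get(day) ?? { 1: 0, 2: 0, 3: 0 };
    counts[tier] += 1;
    this.byDay.set(day, counts);
  }

  // Summarize one day. An unknown day yields zero counts.
  breakdown(day: string): TierBreakdown {
    const counts: Record<Tier, number> = this.byDay.get(day) ?? { 1: 0, 2: 0, 3: 0 };
    const total = counts[1] + counts[2] + counts[3];
    const localShare = total === 0 ? 0 : (counts[1] + counts[2]) / total;
    return { total, counts: { ...counts }, localShare };
  }
}
```

A `localShare` near 0.9 would correspond to the ~90% figure quoted above. A sustained drop is a cue to look at the Tier 1 rules or the local model's tool descriptions before accepting a higher cloud bill.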

03 — What got integrated

56 tools across five categories.

Home & infrastructure

Home Assistant control, Docker container management, Unraid status, server health.

Personal data

Calendar, email triage, read-only password vault access, contact lookup, document search across Nextcloud.

Build & dev

App build status, deploy triggers, Gitea repo browsing, log tailing.

Finance & research

Stock quotes, portfolio diffing, SEC filing pulls, scheduled cron monitors with push alerts.

Communication

Push notifications to phone, email composition with attachments, SMS via Twilio.

04 — Results

Eight months in production.

Cloud cost

~$50/month instead of an estimated $500+/month with an equivalent cloud-only setup.
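
As a rough sanity check, the monthly figure can be set against the per-call range quoted for Tier 3 ($0.02-0.20). The sketch assumes the entire ~$50 goes to Tier 3 calls, which the source does not state. The constant and function names are illustrative.

```typescript
// Back-of-envelope check. Assumes all cloud spend is Tier 3 API calls.
// Amounts are in cents so the arithmetic stays in integers.
const MONTHLY_CLOUD_BUDGET_CENTS = 5000; // ~$50/month
const COST_PER_CALL_CENTS = { low: 2, high: 20 }; // $0.02-$0.20 per call

function callsAffordable(budgetCents: number, costPerCallCents: number): number {
  return Math.floor(budgetCents / costPerCallCents);
}

const maxCalls = callsAffordable(MONTHLY_CLOUD_BUDGET_CENTS, COST_PER_CALL_CENTS.low);
const minCalls = callsAffordable(MONTHLY_CLOUD_BUDGET_CENTS, COST_PER_CALL_CENTS.high);

// prints "Tier 3 budget covers roughly 250-2500 calls per month"
console.log(`Tier 3 budget covers roughly ${minCalls}-${maxCalls} calls per month`);
```

Under that assumption the budget allows somewhere between about 8 and 80 cloud calls a day, which fits Tier 3 being reserved for the hard requests.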

Latency

Sub-second response for ~70% of queries (Tier 1 + warm Tier 2).

Privacy

Personal data never leaves the LAN unless Tier 3 is invoked, and Tier 3 calls are logged and reviewable.

Reliability

99.4% uptime over the last 6 months, including two server reboots and one GPU driver upgrade.

Replaced

ChatGPT Plus, Google Assistant, three separate dashboard apps, a $20/mo notification service, manual SSH-into-server workflows.

05 — What surprised us

Lessons from the build.

Tool definitions matter more than prompts.

The single biggest accuracy gain came from rewriting tool descriptions, not from refining the system prompt. A tool description that says "use this when the user mentions X" elicits correct invocations far more reliably than a vague capability list.
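
To make the contrast concrete, here are two versions of the same tool definition. The `ToolDefinition` shape, the `docker_status` name, and the `hasTriggerClause` lint are hypothetical, written for this illustration. Real tool-calling APIs each have their own schema.

```typescript
// Hypothetical tool-definition shape, for illustration only.
interface ToolDefinition {
  name: string;
  description: string;
  parameters: Record<string, { type: "string" | "number" | "boolean"; description: string }>;
}

// Vague: names a capability but gives the model no cue for when to pick it.
const vagueTool: ToolDefinition = {
  name: "docker_status",
  description: "Docker container management.",
  parameters: {},
};

// Trigger-oriented: states when to call the tool and what it returns.
const triggeredTool: ToolDefinition = {
  name: "docker_status",
  description:
    "Use this when the user mentions a container, a service being down, or asks " +
    "whether something is running. Returns the state of one container, or of all " +
    "containers when no name is given.",
  parameters: {
    container: { type: "string", description: "Optional container name." },
  },
};

// Crude lint: flags descriptions that lack an explicit "use this when" clause.
function hasTriggerClause(tool: ToolDefinition): boolean {
  return /\buse this (tool )?when\b/i.test(tool.description);
}

console.log(hasTriggerClause(vagueTool), hasTriggerClause(triggeredTool)); // prints "false true"
```

A check like this can run in CI across every tool definition, so a new tool cannot ship with a description that only names a capability.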

WebSocket beat both polling and plain HTTP on mobile.

A persistent WebSocket connection lets the assistant push progress updates ("checking your inbox… found 3 new") instead of going silent for 5 seconds while the model thinks. UX delta is large; engineering effort was small.
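
The progress-push pattern can be sketched as a small framing layer over anything that exposes `send(string)`, such as a WebSocket connection. The frame shape (`type`, `requestId`, `text`), the `makeReporter` helper, and the `fetchUnread` callback are assumptions for the example, not Nova's wire protocol.

```typescript
// Hypothetical message frames pushed from server to client.
type ServerFrame =
  | { type: "progress"; requestId: string; text: string }
  | { type: "final"; requestId: string; text: string };

// Anything with send(string) works here, including a WebSocket connection.
interface FrameSink {
  send(data: string): void;
}

// Wraps a sink so handlers can emit typed frames without touching JSON.
function makeReporter(sink: FrameSink, requestId: string) {
  const emit = (frame: ServerFrame): void => sink.send(JSON.stringify(frame));
  return {
    progress: (text: string): void => emit({ type: "progress", requestId, text }),
    final: (text: string): void => emit({ type: "final", requestId, text }),
  };
}

// Example handler. fetchUnread stands in for a real mail tool.
async function checkInbox(
  sink: FrameSink,
  requestId: string,
  fetchUnread: () => Promise<number>,
): Promise<void> {
  const report = makeReporter(sink, requestId);
  report.progress("checking your inbox…");
  const unread = await fetchUnread();
  report.progress(`found ${unread} new`);
  report.final(`You have ${unread} unread messages.`);
}
```

The client renders `progress` frames as transient status text and replaces them when the `final` frame for the same `requestId` arrives.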

Local models are good enough for 90% of personal-assistant work.

The hard part isn't model capability. It's tool routing, prompt engineering, and keeping the conversation context tight. Throwing GPT-4o at a problem that qwen3-coder-next handles locally in 800ms is bad engineering.

06 — What we'd build for you

A stripped-down Nova, your stack.

Most companies don't need 96 GB of GPU memory, but they almost certainly have:

A bunch of internal SaaS subscriptions whose data could be queried by an LLM if it had API access.
A handful of repetitive workflows (status reports, ticket triage, customer FAQ) that an assistant could front-line.
A privacy or compliance constraint that makes "send everything to OpenAI" a non-starter.

For ~$5K-$15K we can build a stripped-down Nova for your stack: 5-15 tools, your authentication, your model preference (cloud, local, or hybrid), deployed to your infrastructure, with documentation for your team to extend.