Local LLM
Active

Self-hosted AI inference with Ollama, running Qwen 2.5 models for general tasks and code generation without cloud dependencies.
Configuration

| Setting | Value |
|---|---|
| Container | ollama/ollama |
| RAM Limit | 16GB |
| API | REST (local only) |
| Access | LAN — not exposed |
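A container setup matching the table above could look roughly like this docker-compose fragment; the service name, volume path, and port binding are illustrative assumptions, not the actual deployment config:

```yaml
# Sketch of a docker-compose service for the configuration above (assumed).
services:
  ollama:
    image: ollama/ollama
    deploy:
      resources:
        limits:
          memory: 16g            # matches the 16GB RAM limit
    volumes:
      - ./ollama-data:/root/.ollama   # persists pulled models across restarts
    ports:
      - "11434:11434"            # Ollama's default API port; firewall keeps it LAN-only
```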
Case Study: Cost-Effective AI Development Workflow
The Challenge
Reduce dependency on expensive cloud AI APIs while maintaining access to capable language models for development tasks, code generation, and document summarization — all while keeping sensitive code local.
The Solution
- ✓ Deployed Ollama in Docker container with dedicated RAM allocation
- ✓ Configured Qwen 2.5 7B models for balance of speed and capability
- ✓ Integrated with Claude Code workers for tiered inference (local first, API fallback)
- ✓ Set up MCP server for standardized model access across tools
- ✓ Implemented prompt routing to use local models for simple tasks
The Results
Available Models
| Model | Parameters | Purpose | VRAM |
|---|---|---|---|
| qwen2.5:72b | 72B | Complex reasoning, primary model for most tasks | ~28GB |
| qwen2.5-coder:32b | 32B | Code generation and analysis | ~20GB |
| llama3.1:8b | 8B | Fast tasks, classification, simple extraction | ~6GB |
| codellama:34b | 34B | Code review and refactoring | ~22GB |
Benefits
Cost Reduction
Free inference for simple tasks that would otherwise use paid API calls. Saves money on summarization, status checks, and simple questions.
Privacy
Sensitive code and data never leaves the local network. No cloud provider sees your prompts or responses.
Speed
Local inference with no network latency. Responses start immediately without waiting for API round-trips.
Availability
Works offline and during API outages. Not dependent on external service availability.
Integrations
| Integration | Description |
|---|---|
| Claude Code Workers | Workers use Ollama for simple tasks before falling back to paid APIs |
| Ollama MCP Server | Model Context Protocol server for standardized LLM access |
| Home Assistant | Potential integration for natural language automation control |