With Ollama and rentabot.chat, you can run a fully self-hosted AI chatbot where no customer data ever leaves your servers. This is ideal for GDPR compliance, healthcare organizations, financial services, and any environment where sending data to third-party APIs is not an option.
Why would you self-host an AI chatbot?
Cloud-hosted chatbots that use OpenAI or Anthropic APIs send every customer message to a third-party server. For many businesses, this is fine. But for some, it's a dealbreaker:
- Regulatory requirements — GDPR, HIPAA, SOC 2, and other frameworks may require data to stay within your infrastructure or specific geographic regions. See our GDPR compliance guide for details.
- Customer trust — some customers (especially enterprise clients) simply won't use your chatbot if they know their queries go to OpenAI
- Air-gapped environments — government, defense, and certain financial systems operate on networks with no internet access
- Cost control — at high volumes, running your own model can be cheaper than paying per-token to cloud API providers
- Vendor independence — you're not affected by provider outages, price changes, or policy updates. Read more about avoiding vendor lock-in.
What is Ollama and why does it matter?
Ollama is an open-source tool that makes running LLMs on your own hardware as simple as running a Docker container. It handles model downloading, GPU acceleration, memory management, and exposes a standard API that's compatible with the OpenAI format.
Before Ollama, self-hosting a model meant dealing with Python dependencies, CUDA drivers, model quantization, and custom inference servers. Ollama reduces that to a single command: ollama run llama3.
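Because the API is OpenAI-compatible, talking to a local Ollama instance looks like talking to any OpenAI-style endpoint. A minimal sketch of building a request against the default `http://localhost:11434` port (the helper function name is ours, not part of Ollama):

```python
import json

OLLAMA_URL = "http://localhost:11434"  # Ollama's default API port

def build_chat_request(model: str, user_message: str) -> tuple[str, bytes]:
    """Build a request body for Ollama's OpenAI-compatible chat endpoint."""
    url = f"{OLLAMA_URL}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,  # set True for token-by-token streaming
    }
    return url, json.dumps(payload).encode()

url, body = build_chat_request("llama3:8b", "How do I reset my password?")
print(url)  # http://localhost:11434/v1/chat/completions
```

You can POST that body with `urllib.request`, `curl`, or any OpenAI client library pointed at the Ollama base URL.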
Architecture overview
Here's how the pieces fit together:
- Your server runs Ollama with your chosen model (Llama 3, Mistral, Qwen, etc.)
- rentabot.chat connects to your Ollama instance instead of OpenAI or Anthropic
- The widget on your website talks to rentabot.chat, which routes requests to your local Ollama server
- Customer messages never leave your network — the entire inference pipeline runs on your hardware
Pro tip
You can run a hybrid setup: use Ollama for sensitive conversations and fall back to cloud APIs for non-sensitive traffic. This gives you the best of both worlds — privacy where you need it, speed and quality everywhere else.
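The hybrid idea boils down to a routing decision per conversation. A hypothetical sketch — the endpoint hostnames and the `sensitive` flag are illustrative, not rentabot.chat's actual API:

```python
# Placeholder endpoints: adjust to your own internal hostname and cloud provider.
LOCAL_OLLAMA = "http://ollama.internal:11434"
CLOUD_API = "https://api.openai.com"

def pick_backend(sensitive: bool) -> str:
    """Route sensitive conversations to the local model, the rest to the cloud."""
    return LOCAL_OLLAMA if sensitive else CLOUD_API

print(pick_backend(True))   # http://ollama.internal:11434
print(pick_backend(False))  # https://api.openai.com
```

In practice the `sensitive` flag might come from the page the widget is embedded on (e.g. a billing portal) or a per-tenant setting.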
Step-by-step setup guide
Prerequisites
- A Linux server (Ubuntu 22.04+ recommended) or macOS machine
- At least 16GB RAM (32GB+ recommended for larger models)
- NVIDIA GPU with 8GB+ VRAM for production use (CPU works for testing)
- Docker and Docker Compose installed
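Since Docker Compose is a prerequisite, you can also run Ollama as a container instead of installing it natively. A minimal sketch of a compose file using the official `ollama/ollama` image — the service and volume names are illustrative:

```yaml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"              # Ollama's default API port
    volumes:
      - ollama_data:/root/.ollama  # persist downloaded models across restarts
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia       # requires the NVIDIA Container Toolkit
              count: all
              capabilities: [gpu]
volumes:
  ollama_data:
```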
1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
2. Pull a model
# Good starting point for customer support
ollama pull llama3:8b
# Higher quality, needs more VRAM
ollama pull llama3:70b
3. Verify the API is running
curl http://localhost:11434/api/tags
4. Configure rentabot.chat to use your Ollama endpoint
In the dashboard, go to your tenant's LLM settings and set the provider to "Ollama." Enter your server's URL (e.g., http://your-server:11434) and select your model. No API key needed.
5. Test the connection
Send a test message through the widget. The response should come from your local Ollama instance. Check the Ollama server logs to confirm (on Linux with systemd: journalctl -u ollama).
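Step 3's `/api/tags` call is also a handy programmatic smoke test: it lists the models Ollama has pulled. A sketch that checks a response for a given model — the sample JSON is trimmed, real responses include more fields such as digest and size:

```python
import json

# Trimmed example of a GET /api/tags response body.
sample = '{"models": [{"name": "llama3:8b"}, {"name": "mistral:7b"}]}'

def has_model(tags_json: str, model: str) -> bool:
    """Return True if `model` appears in Ollama's /api/tags listing."""
    return any(m["name"] == model for m in json.loads(tags_json)["models"])

print(has_model(sample, "llama3:8b"))   # True
print(has_model(sample, "llama3:70b"))  # False
```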
Which models work best for customer support?
- Llama 3 8B — best balance of quality and speed. Fits in 8GB VRAM. Handles most support scenarios well.
- Mistral 7B — slightly faster than Llama 3 8B with comparable quality. Good for high-traffic deployments.
- Qwen 2.5 7B — strong multilingual support if you serve customers in multiple languages.
- Llama 3 70B — near-GPT-4 quality for complex queries. Requires 40GB+ VRAM or multiple GPUs.
Performance considerations
GPU vs CPU inference
GPU acceleration makes a dramatic difference. With a Llama 3 8B model:
- GPU (RTX 3060) — 30-50 tokens/second, ~0.5-1 second first-token latency
- CPU only (modern 8-core) — 5-10 tokens/second, ~2-5 second first-token latency
For production deployments serving real customers, GPU acceleration is strongly recommended. CPU-only mode is acceptable for testing and low-traffic internal tools.
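To see what those numbers mean for a real reply, add first-token latency to generation time. Using the midpoints of the ranges above and a typical 150-token support answer:

```python
def response_time(tokens: int, tok_per_s: float, first_token_s: float) -> float:
    """Rough end-to-end latency: first-token wait plus generation time."""
    return first_token_s + tokens / tok_per_s

# Midpoints of the figures quoted above for Llama 3 8B:
gpu = response_time(150, 40, 0.75)   # RTX 3060
cpu = response_time(150, 7.5, 3.5)   # modern 8-core CPU
print(round(gpu, 1), round(cpu, 1))  # 4.5 23.5
```

Roughly 4.5 seconds on GPU versus over 20 seconds on CPU for the same answer — which is why GPU inference matters for anything customer-facing.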
Scaling considerations
A single Ollama instance with a 7-8B model on an RTX 3060 can handle roughly 5-10 concurrent conversations comfortably. For higher traffic, you can run multiple Ollama instances behind a load balancer.
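One way to fan traffic across instances is a plain nginx reverse proxy. A sketch, not a tuned production config — the hostnames are placeholders, buffering is off so streamed tokens reach clients immediately, and the read timeout is raised because long generations exceed nginx's 60-second default:

```nginx
upstream ollama_pool {
    least_conn;                      # send new requests to the least-busy node
    server ollama-1.internal:11434;  # placeholder hostnames
    server ollama-2.internal:11434;
}

server {
    listen 80;
    location / {
        proxy_pass http://ollama_pool;
        proxy_buffering off;         # don't buffer streamed token output
        proxy_read_timeout 300s;     # allow long generations
    }
}
```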
Self-hosted vs cloud: an honest comparison
| Factor | Self-Hosted (Ollama) | Cloud API (OpenAI/Anthropic) |
|---|---|---|
| Data privacy | Full control — data stays on your servers | Data sent to third-party servers |
| Setup time | 1-2 hours | 5 minutes |
| Ongoing cost | Hardware + electricity (fixed) | Per-token pricing (variable) |
| Model quality | Very good (Llama 3, Mistral) | Excellent (GPT-4o, Claude) |
| Maintenance | You manage hardware and updates | Zero maintenance |
| Latency | 0.5-1s (GPU), 2-5s (CPU) | 0.3-0.8s typical |
| Scaling | Add more hardware | Automatic |
| Offline support | Yes — works air-gapped | No — requires internet |
The right choice depends on your constraints. If data privacy is non-negotiable, self-hosting is the answer. If ease of use and model quality are top priorities, cloud APIs are hard to beat. Many organizations use both — self-hosted for sensitive data, cloud for everything else.
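The cost trade-off in the table is easy to estimate for your own volume. A back-of-the-envelope sketch with illustrative numbers — none of these figures are quotes from any provider's price list:

```python
# Assumptions: plug in your own traffic and pricing.
tokens_per_month = 50_000_000    # monthly token volume
cloud_price_per_1m = 5.00        # $ per 1M tokens
server_cost_per_month = 150.00   # amortized hardware + electricity

cloud_cost = tokens_per_month / 1_000_000 * cloud_price_per_1m
print(cloud_cost)                             # 250.0
print(cloud_cost > server_cost_per_month)     # True: self-hosting wins here
```

Below the break-even volume the fixed hardware cost dominates and cloud APIs are cheaper; above it, the per-token meter keeps running while your server cost stays flat.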
Frequently asked questions
Can I run an AI chatbot without sending data to OpenAI?
Yes. With Ollama, you run open-source LLMs like Llama 3 or Mistral entirely on your own hardware. No customer data leaves your network.
What hardware do I need?
For production: a server with 16GB+ RAM and an NVIDIA GPU (RTX 3060 or better). For testing: CPU-only mode works with 16GB RAM, but expect 2-5 second response times instead of under a second.
Which model should I use for customer support?
Llama 3 8B offers the best balance of quality and speed. It fits in 8GB VRAM and handles most support scenarios well. For higher quality at the cost of more hardware, try Llama 3 70B.
Self-hosting an AI chatbot is no longer a research project — it's a practical option for businesses that need full control over their data. Start with a single GPU server and Ollama, and scale from there.




