
How to Host an LLM in Your Own Cloud: A Step-by-Step Guide

April 28, 2026 · 12 min read

Most conversations about large language models still assume you will call an API someone else runs. That works for prototyping, but production use inside a UK business that handles regulated data, personally identifiable information, or competitive IP hits a wall quickly. The alternative is to deploy the model on infrastructure you control. That is what this guide covers: a step-by-step, technically honest walkthrough of how to host an LLM in your own cloud.

Self-hosting is not a one-click exercise. It demands GPU instance selection, containerised serving, network hardening, and cost discipline. But the outcome is an endpoint that stays inside your VPC, obeys your compliance rules, and does not generate unpredictable per-token bills. By the end, you will have a blueprint for production deployment, and you will understand where the risk of getting it wrong sits.

Why Host an LLM Yourself?

The first reason is data control and UK GDPR compliance. When your model processes customer data or internal documents, you need to know where every byte lives and who can touch it. A public API puts your prompt text onto another company's servers, in a jurisdiction you probably have not assessed. Hosting on your own cloud account keeps data inside your boundary, which simplifies Data Protection Impact Assessments and satisfies the data minimisation principle.

Cost predictability matters just as much. API pricing looks cheap per 1,000 tokens, but the bill scales linearly with usage, and usage is hard to forecast. Self-hosting exchanges variable OpEx for a fixed infrastructure cost. You pay for the instance whether it processes ten queries or ten thousand. For teams with steady workloads, that trade-off works in your favour.

Then there is latency. A model running in your own cloud region, maybe inside the same VPC as your application, responds in tens of milliseconds rather than hundreds. That difference is material for real-time chat, document processing pipelines, or any user-facing feature where responsiveness defines the experience. Finally, self-hosting opens the door to fine-tuning on proprietary data. You can adapt an open-weight model to your industry terminology, support-specific workflows, or embed institutional knowledge that would never leave your network.

Choosing the Right LLM for Self-Hosting

The open-weight model landscape moves fast. As of early 2026, the practical shortlist for most UK teams includes Meta's Llama 3 family, Mistral's models, TII's Falcon, and Google's Gemma. Llama 3 8B and 70B are the most widely benchmarked and have the richest ecosystem of quantised variants and serving recipes. Mistral's 7B and Mixtral 8x7B offer strong multilingual performance and an efficient architecture. Falcon 40B and 180B exist but are heavier to run; the 7B variant remains the sensible entry point. Gemma 2 models from Google come with permissive terms and perform well at smaller scales.

Size and performance trade-offs are the first filter. A 7B or 8B parameter model runs comfortably on a single consumer-class GPU or an A10G cloud instance. It handles summarisation, classification, Q&A, and simple extraction tasks. A 13B model needs more VRAM, typically an A100 40GB. A 70B or Mixtral-style mixture-of-experts model demands multiple GPUs or an H100 node. Matching model size to your use case keeps costs sane. Do not pay for 70B parameters if an 8B model with a well-crafted prompt delivers acceptable quality.

Quantised versions save money and open up hardware options. GGUF format, used by llama.cpp, compresses weights to 4-bit or 5-bit precision with minimal quality loss. AWQ and GPTQ formats target GPU deployment with similar intent. An 8B model quantised to 4-bit fits inside 6 GB of VRAM, making it runnable on an instance you might already have. Always check licence terms. Llama 3's community licence permits commercial use in most cases. Mistral's Apache 2.0 licence is unambiguous. Falcon uses the TII Falcon Licence, which is permissive but carries some restrictions. Read the licence before you commit a production pipeline to any model.
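As a concrete sketch of the quantised route, the commands below pull a 4-bit GGUF build of Llama 3 8B from the Hugging Face Hub and serve it with llama.cpp's built-in HTTP server. The repository and file names are illustrative; browse the Hub for a current build before copying them.

```bash
# Fetch a 4-bit GGUF build of Llama 3 8B. The repo and file names
# are illustrative; check the Hugging Face Hub for current builds.
pip install -U "huggingface_hub[cli]"
huggingface-cli download QuantFactory/Meta-Llama-3-8B-Instruct-GGUF \
  Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --local-dir ./models

# Serve it with llama.cpp's HTTP server. -ngl offloads layers to a
# GPU when one is present; -c sets the context window in tokens.
./llama-server -m ./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf \
  -c 4096 -ngl 33 --host 127.0.0.1 --port 8080
```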

Hardware Requirements and Cloud Instance Selection

GPU instances are the default choice for LLM serving. On AWS, G5 instances with NVIDIA A10G GPUs provide 24 GB of VRAM per GPU, ideal for 7B and 13B models. P4d instances with A100 40GB or 80GB GPUs handle 70B models. P5 instances with H100 GPUs are available for the largest workloads but come at a premium. On Azure, the NCas T4 v3 series packs T4 GPUs with 16 GB VRAM each, suitable for smaller quantised models. NC A100 v4 series steps up to A100 for larger work. On GCP, G2 VMs offer L4 GPUs with 24 GB VRAM, a strong price-performance choice. A3 VMs with H100 GPUs are available for demanding inference.

CPU-only inference is viable for smaller models using llama.cpp. An 8B quantised model runs acceptably on a compute-optimised instance with 16 vCPUs and 32 GB RAM. This is not the path for latency-sensitive production workloads, but it works for batch processing, internal tooling, or evaluation where GPU cost is hard to justify. Memory requirements for GPU inference follow a simple rule: at FP16, a model needs roughly two bytes of VRAM per parameter (14 GB for 7B parameters, 26 GB for 13B, 140 GB for 70B). Quantised models halve or quarter that. Add at least 1 to 2 GB for the KV cache and runtime overhead, more at long context lengths.
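That rule of thumb turns into a back-of-envelope check before you price an instance. A minimal sketch, assuming two bytes per parameter at FP16, half a byte at 4-bit, and a flat overhead allowance:

```bash
# Back-of-envelope VRAM estimate: parameters (billions) times bytes
# per parameter, plus a flat allowance for KV cache and overhead.
estimate_vram_gb() {
  local params_b=$1 bytes_per_param=$2 overhead_gb=2
  awk -v p="$params_b" -v b="$bytes_per_param" -v o="$overhead_gb" \
    'BEGIN { printf "~%.1f GB\n", p * b + o }'
}

estimate_vram_gb 8 2     # 8B at FP16  -> ~18.0 GB
estimate_vram_gb 8 0.5   # 8B at 4-bit -> ~6.0 GB
estimate_vram_gb 70 2    # 70B at FP16 -> ~142.0 GB
```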

Cloud GPU costs range from roughly £0.80 per hour for a single A10G spot instance to £5.00 per hour for an on-demand A100 80GB. Reserved instances cut that by 30 to 50 percent. On-premise comparison requires factoring in hardware depreciation, power, cooling, and the engineering time to maintain it. For a single persistent endpoint, cloud is typically simpler; for a cluster of always-on models, the maths of co-location or on-premise gets interesting. Most UK SMEs start in the cloud and evaluate the on-premise case once the workload stabilises.

Setting Up the Environment on AWS, Azure, or GCP

Start with a purpose-built VPC. Isolate the inference subnet from your application subnet, and restrict traffic to the ports you need. On AWS, launch a G5 instance with an Ubuntu 22.04 or Amazon Linux 2 AMI. Install the NVIDIA drivers and CUDA toolkit, then add the Docker runtime with GPU support. Use Terraform to describe the instance, security group, IAM role, and any associated S3 buckets for model weights. An EKS cluster with GPU node groups is an alternative if you already run Kubernetes workloads. SageMaker's real-time inference endpoints abstract much of this but add cost and reduce control; we prefer the EC2 direct route for maximum transparency.
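For the EC2 direct route on Ubuntu 22.04, host preparation looks roughly like this. The driver version is a snapshot and the repository setup follows NVIDIA's documented steps, so check their current instructions before copying.

```bash
# Host preparation on Ubuntu 22.04. The driver version is a
# snapshot; check NVIDIA's current recommendation for your GPU.
sudo apt-get update
sudo apt-get install -y nvidia-driver-535 docker.io

# The NVIDIA Container Toolkit lets Docker containers see the GPU.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify the GPU is visible from inside a container before
# installing any serving framework.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```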

On Azure, create an NCas T4 v3 VM with the NVIDIA GPU driver extension. Standard networking rules apply: allow inbound on the inference port from your app subnet only, and deny public internet access. Azure Machine Learning's managed online endpoints are an option, but they wrap the infrastructure in a pricing model that can surprise you. GCP offers G2 VMs with L4 GPUs and a well-documented CUDA driver installation path. Cloud NAT for private subnets is configured differently from its AWS and Azure equivalents, so plan your outbound connectivity for package installation. Across all three clouds, the principle is the same: infrastructure as code, minimal public exposure, and GPU drivers validated before installing any serving framework.

Deploying the LLM with Popular Frameworks

Ollama is the fastest path to a working single-node endpoint. Install with one script, pull a model from its library (`ollama pull llama3:8b`), and you have a local API. It uses GPU acceleration automatically where available, falls back to CPU, uses GGUF quantisation internally, and exposes an OpenAI-compatible endpoint. For internal tools, small teams, or evaluation workloads, it is hard to beat. The trade-off is lower throughput under concurrent load.
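The whole path from install to first response fits in a few commands; Ollama listens on port 11434 by default:

```bash
# Install Ollama and pull the model.
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3:8b

# Query the native API...
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3:8b", "prompt": "Summarise UK GDPR in one sentence.", "stream": false}'

# ...or the OpenAI-compatible endpoint.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3:8b", "messages": [{"role": "user", "content": "Hello"}]}'
```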

vLLM is the production-grade option when throughput and latency matter. It implements continuous batching and PagedAttention, packing multiple requests into the same forward pass to maximise GPU utilisation. Deploy it as a Docker container with a model from Hugging Face: `vllm serve meta-llama/Meta-Llama-3-8B`. The server exposes an OpenAI-compatible API. vLLM supports tensor parallelism across multiple GPUs, which is how you serve a 70B model without running out of memory. Throughput improvement over naive serving is often 10x or more under load.
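A typical single-GPU deployment of the official container looks like the sketch below. Image tags and flags evolve, so treat this as a snapshot and check the vLLM documentation for the current invocation.

```bash
# Serve Llama 3 8B from the official vLLM container. The Hugging
# Face token is needed for gated model downloads; mounting the cache
# means restarts do not re-download the weights.
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --tensor-parallel-size 1

# Smoke test through the OpenAI-compatible API.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "prompt": "Hello", "max_tokens": 16}'
```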

Text Generation Inference (TGI) from Hugging Face is another strong contender, especially if you want tight integration with the Hugging Face Hub. It supports flash attention, watermarks, and guidance. llama.cpp is the CPU-friendly backbone. It runs GGUF models efficiently, offers GPU offloading for partial acceleration, and works on modest hardware. For deployment, wrap it with something like llama-cpp-python and a FastAPI server to expose the REST endpoint. Which framework you pick depends on your team's stack: if you have Kubernetes and need high concurrency, vLLM; if you want a single binary and no fuss, Ollama; if you are CPU-bound, llama.cpp.
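For the llama.cpp route, llama-cpp-python ships a ready-made OpenAI-compatible FastAPI server, so the wrapper can be a single command. A minimal sketch, reusing the GGUF file from earlier:

```bash
# llama-cpp-python bundles an OpenAI-compatible FastAPI server,
# so exposing a GGUF model over REST is one command.
pip install 'llama-cpp-python[server]'
python -m llama_cpp.server \
  --model ./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf \
  --host 0.0.0.0 --port 8000 --n_ctx 4096
```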

Securing Your Private LLM Endpoint

Network isolation comes first. Place the inference instance in a private subnet, accessible only via an internal load balancer or an API gateway within the same VPC. Security group rules should allow only the application tier on the inference port, typically 8000. Never expose the endpoint directly to the public internet unless you have a zero-trust proxy and strong authentication in front.
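On AWS, that rule set comes down to a single ingress grant plus a verification check; the group IDs below are placeholders:

```bash
# Allow only the application tier's security group to reach the
# inference port (group IDs are placeholders).
aws ec2 authorize-security-group-ingress \
  --group-id sg-0inference0example \
  --protocol tcp --port 8000 \
  --source-group sg-0apptier00example

# Confirm nothing opens the port to 0.0.0.0/0.
aws ec2 describe-security-groups \
  --group-ids sg-0inference0example \
  --query 'SecurityGroups[].IpPermissions'
```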

API authentication is not built into most serving frameworks, so you need to layer it. Options include an API Gateway with a key, an internal reverse proxy that validates JWT tokens, or an AWS SigV4 signed request via API Gateway. IAM roles for machine-to-machine calls are cleaner than static keys. TLS termination is essential even inside the VPC. Use a self-signed or private CA certificate on the serving container and enforce HTTPS on the reverse proxy. Finally, enable structured logging on the inference server and feed it into CloudWatch, Azure Monitor, or your existing observability stack. Monitor prompt latency, token counts, and error rates so you can spot anomalies before users complain.
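As one concrete layering: vLLM can enforce a static key itself, and a private certificate is a single openssl command. A sketch with placeholder names; in production the key would come from a secrets manager, not an environment variable.

```bash
# Issue a private certificate for the internal hostname
# (the CN is a placeholder; a private CA is the cleaner option).
openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
  -keyout llm.key -out llm.crt -subj "/CN=llm.internal"

# vLLM can terminate TLS and require a bearer token directly.
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --api-key "$LLM_API_KEY" \
  --ssl-keyfile llm.key --ssl-certfile llm.crt

# Clients present the key as a bearer token over HTTPS.
curl --cacert llm.crt \
  -H "Authorization: Bearer $LLM_API_KEY" \
  https://llm.internal:8000/v1/models
```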

Cost Optimisation for Ongoing Hosting

Spot instances for GPU workloads are the obvious first lever. A10G spot prices are often 60 percent below on-demand, and vLLM or TGI can restart quickly if interrupted. Combine spot with a small on-demand or reserved instance for baseline capacity. Reserved instances for steady-state inference bring a 30 to 50 percent discount. Commit to a one-year term on a single A10G instance and the monthly cost drops from roughly £600 to around £350.
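Requesting the same G5 capacity on the spot market is a one-flag change at launch time; the IDs below are placeholders:

```bash
# Launch a g5.xlarge on the spot market (all IDs are placeholders).
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type g5.xlarge \
  --subnet-id subnet-0inference0example \
  --security-group-ids sg-0inference0example \
  --instance-market-options 'MarketType=spot,SpotOptions={SpotInstanceType=one-time}'
```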

Auto-scaling based on request load prevents over-provisioning. If your traffic is bursty, a scaling policy that adds GPU nodes when queue depth rises and scales to zero during idle periods can slash cost. The architectural pattern that matters most: keep GPU instances off when inactive. A serverless-like trigger, where a queued request powers on a stopped instance, works for batch workloads but introduces cold-start latency. Measure cost per query. Log token counts and map them to instance-hour cost so you can quantify whether a model upgrade or quantisation change is worth it. Tracking this metric turns infrastructure decisions into a business conversation.
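Cost per query is then simple arithmetic over numbers you already log. A sketch with illustrative figures:

```bash
# Cost per query = instance cost per hour / queries served per hour.
# Figures are illustrative: a £0.80/hour spot A10G serving 2,000
# queries an hour costs £0.0004 per query.
awk -v rate_gbp=0.80 -v qph=2000 \
  'BEGIN { printf "£%.4f per query\n", rate_gbp / qph }'
```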

Common Pitfalls and How to Avoid Them

Underestimating VRAM is the most common mistake. An FP16 13B model may fit in 26 GB, but add the KV cache for a few concurrent long contexts and 40 GB disappears quickly. Always benchmark with realistic context lengths before committing to an instance size. Data egress costs surprise teams who serve users across regions or from the cloud to on-premise networks. On AWS, egress to the internet costs roughly £0.07 (about $0.09) per GB. It accumulates. Keep traffic inside the same region and VPC whenever possible.
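To see how fast the cache eats headroom, take a 13B-class model: assuming roughly 40 layers and a 5,120-wide hidden state (figures for illustration), each FP16 token costs about 0.8 MB of KV cache, so one 4,096-token context consumes over 3 GB before a second request is batched. A sketch of that arithmetic:

```bash
# KV cache per token ≈ 2 (K and V) x layers x hidden_size x 2 bytes
# at FP16. Layer and hidden-size figures are illustrative for a
# 13B-class model.
awk -v layers=40 -v hidden=5120 -v ctx=4096 'BEGIN {
  per_token_mb = 2 * layers * hidden * 2 / (1024 * 1024)
  printf "%.2f MB/token, %.1f GB at %d tokens\n",
         per_token_mb, per_token_mb * ctx / 1024, ctx
}'
```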

Model updates and versioning trip up teams that treat the inference server as a static binary. Models improve. The Llama 3.1 release brought measurable quality gains over the original Llama 3. Version your model weights, store them in a bucket or registry, and script the deployment so you can roll back. Security misconfigurations are the last and most damaging pitfall. An exposed port on a GPU instance is an invitation. Use infrastructure as code, peer review the security groups, and run penetration tests against your endpoint. Assume misconfiguration is the default state and verify it is not.
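A minimal versioning discipline is just consistent, immutable prefixes in object storage; bucket and version names below are placeholders:

```bash
# Publish weights under an immutable, versioned prefix.
aws s3 sync ./models/llama3-8b/ s3://example-llm-weights/llama3-8b/v1.1/

# Deploy by pulling an explicit version; rollback is the same
# command with the previous tag.
aws s3 sync s3://example-llm-weights/llama3-8b/v1.1/ /opt/models/current/
```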

Bringing It All Together

A self-hosted LLM is a viable, often cost-effective way to put AI into production for a regulated UK business. The steps are well-defined: select an open-weight model matched to your use case, provision GPU instances with infrastructure as code, deploy through a framework like vLLM or Ollama, harden the network, and instrument the cost. Done carefully, you get a private, compliant, and predictable service that your teams can build on.

But the gap between a working prototype and a production endpoint that survives a security audit, scales with demand, and stays within budget is wider than most walkthroughs admit. That gap is where we work. We deploy private LLMs inside your cloud account, on your infrastructure, with your data never leaving your boundary. We handle instance selection, model quantisation, serving framework configuration, network hardening, and cost optimisation, all as a fixed-price engagement. If you would rather the outcome than the build, start with our AI capability audit to understand where self-hosted AI fits into your wider architecture.

Frequently asked questions

What hardware do I need to host an LLM? The minimum for a production-quality 7B or 8B model is a single GPU with 16 to 24 GB of VRAM: an NVIDIA A10G on an AWS G5 instance, an L4 on a GCP G2 VM, or a T4 (16 GB, best suited to quantised variants) on an Azure NCas T4 v3. Quantised models can run on machines without dedicated GPUs, though latency will increase. For 70B models, expect to provision multiple A100 or H100 GPUs.

Can I host a large language model on a CPU? Yes, for smaller models (7B to 13B parameters) using a 4-bit or 5-bit quantised GGUF file through llama.cpp. A compute-optimised instance with 16 vCPUs and 32 GB RAM can serve a quantised 8B model with acceptable speed for internal tools. Production workloads with concurrent users will struggle on CPU-only inference.

How much does it cost to host an LLM on AWS? Running a single g5.xlarge instance (one A10G GPU) on-demand costs approximately £600 per month. Reserved instances reduce that to around £350. Spot instances can bring it below £200 per month. Add networking, storage, and any managed service premiums. A 70B model on a p4d.24xlarge or equivalent can exceed £25,000 per month.

Which LLM is best for self-hosting? Meta's Llama 3 8B is the most common starting point: strong performance, permissive licence, and broad framework support. Mistral 7B offers competitive quality with Apache 2.0 licensing. The best model depends on your use case, accuracy requirements, and language needs; test multiple on your own data before committing.

How do I secure my private LLM endpoint? Place it in a private subnet, restrict access to your application tier via security groups, enforce TLS, and authenticate all requests (API keys, IAM roles, or JWT). Monitor access logs and prompt metrics. Infrastructure as code plus peer review reduces the risk of misconfiguration.

Talk to Arx Certa