Running Mistral:7b LLM on OpenShift¶
Feb 05, 2024
10 min read
Running your local LLM model with Ollama on your OpenShift cluster.
Ollama is an open-source tool that simplifies running and managing Large Language Models (LLMs) locally. It provides a simple API for managing model lifecycles and making inference requests.
Mistral 7B is a powerful Large Language Model trained by the French start-up Mistral AI that currently outperforms other models of the same size. By combining Mistral with Ollama on OpenShift, you can run a production-grade LLM environment within your own infrastructure.
All the files and configurations referenced in this article are available in the openshift-ollama GitHub repository.
Running Mistral:7B on OpenShift¶
Architecture Overview¶
Our deployment consists of the following components:
An Ollama server pod with GPU access for model execution
Persistent storage for model weights and cache
Internal and external routes for API access
Optional integration with client applications (Telegram bot, VSCode plugins, iOS app)
Deploying Ollama¶
First, we create a dedicated ollama namespace that will hold all the resources:
apiVersion: v1
kind: Namespace
metadata:
  name: ollama
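If the manifest is saved to a file (namespace.yaml is just an illustrative name), apply it with:
$ oc apply -f namespace.yaml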
We then create the PersistentVolumeClaim for model storage and the Deployment, using the official Ollama container image (https://hub.docker.com/r/ollama/ollama):
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-storage
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 100Gi

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          imagePullPolicy: Always
          ports:
            - name: http
              containerPort: 11434
              protocol: TCP
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /.ollama
              name: ollama-storage
          resources:
            limits:
              nvidia.com/gpu: 1
      restartPolicy: Always
      volumes:
        - name: ollama-storage
          persistentVolumeClaim:
            claimName: ollama-storage
Note the nvidia.com/gpu: 1 resource limit, which assigns one NVIDIA GPU to the Ollama container; this requires worker nodes that advertise the nvidia.com/gpu resource (typically provided by the NVIDIA GPU Operator).
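Before scheduling the pod, it is worth confirming that at least one worker node actually advertises this resource; a quick check (the node name is a placeholder) looks like this, and the resource should appear under both Capacity and Allocatable:
$ oc describe node <gpu-node-name> | grep nvidia.com/gpu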
Exposing the Ollama API endpoint¶
We expose the Ollama API endpoint both internally (for other containers like our Telegram Bot) and externally (for services such as VSCode plugins or mobile apps).
Warning
Security note: The external endpoint in this setup is only exposed on a private network accessed via VPN. Since we haven’t configured authentication for this endpoint, exposing it to a public IP would create a significant security risk.
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  type: ClusterIP
  selector:
    name: ollama
  ports:
    - port: 80
      name: http
      targetPort: http
      protocol: TCP

---
kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: ollama
  namespace: ollama
  labels: {}
spec:
  to:
    kind: Service
    name: ollama
  tls: null
  port:
    targetPort: http
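Once the route is admitted, a quick smoke test confirms the API is reachable from outside the cluster; Ollama answers the root path with a short status message:
$ curl http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/
Ollama is running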
Running Mistral LLM on Ollama¶
Now that our Ollama service is up and running, we can pull the Mistral:7b model using the exposed API endpoint:
$ curl -X POST http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/pull -d '{"name": "mistral:7b"}'
Note
Depending on your network and storage speed, this may take several minutes while Ollama downloads and prepares the model.
After the download is complete, we can validate that our model has been successfully loaded:
$ curl -s http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/tags |jq .
{
"models": [
{
"name": "mistral:7b",
"model": "mistral:7b",
"modified_at": "2024-02-03T19:44:00.872177836Z",
"size": 4109865159,
"digest": "61e88e884507ba5e06c49b40e6226884b2a16e872382c2b44a42f2d119d804a5",
"details": {
"parent_model": "",
"format": "gguf",
"family": "llama",
"families": [
"llama"
],
"parameter_size": "7B",
"quantization_level": "Q4_0"
}
}
]
}
The response contains important information:
format: GGUF (GPT-Generated Unified Format) - optimized for inference
family: llama - the base architecture family
parameter_size: 7B - the model has 7 billion parameters
quantization_level: Q4_0 - the precision level of the model weights
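With the model loaded, we can send a first prompt through Ollama's /api/generate endpoint to confirm that inference works end to end (the prompt here is just an example):
$ curl -s http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/generate -d '{
  "model": "mistral:7b",
  "prompt": "Explain what a Kubernetes Operator is in one sentence.",
  "stream": false
}' | jq -r .response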
Creating Custom Models with Specializations¶
One of Ollama’s powerful features is the ability to create custom models with specialized behaviors without retraining. Let’s create a model specialized in OpenShift documentation:
$ curl http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/create -d '{
"name": "ocplibrarian",
"modelfile": "FROM mistral:7b\nSYSTEM You are a Librarian, specialized in retrieving content from the OpenShift documentation."
}'
This creates a new model called “ocplibrarian” that inherits all capabilities from Mistral:7b but has been given a specific persona and purpose through the SYSTEM instruction.
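To see the persona in action, we can query the new model through the /api/chat endpoint (the same endpoint our Telegram bot will use later); the question is only illustrative:
$ curl -s http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/chat -d '{
  "model": "ocplibrarian",
  "messages": [{"role": "user", "content": "How do I expose a Service outside the cluster?"}],
  "stream": false
}' | jq -r .message.content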
Alternative Models: OpenHermes 2.5¶
Mistral is one of many models you can run on Ollama. Let’s also try OpenHermes 2.5, an instruction-tuned variant of Mistral that offers improved performance on many tasks:
$ curl -X POST http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/pull -d '{"name": "openhermes2.5-mistral:7b-q4_K_M"}'
After pulling the model, we can create a customized version with a specific personality:
$ curl http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/create -d '{
"name": "hermes2",
"modelfile": "FROM openhermes2.5-mistral:7b-q4_K_M\nSYSTEM You are \"Hermes 2\", a conversational AI assistant developed by Teknium. Your purpose is to assist users with helpful, accurate information and engaging conversation."
}'
The model is now available as “hermes2” and carries the specified system prompt that defines its behavior.
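Listing the tags again should now show the pulled and custom models side by side:
$ curl -s http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/tags | jq -r '.models[].name'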
Interacting with the LLM¶
Building a Telegram Bot Interface¶
For a practical demonstration of the Ollama API, let’s create a simple Telegram bot that serves as an interface to our LLM. We’ll call it Tellama.
The core functionality retrieves user messages from Telegram, forwards them to the Ollama API, and returns the LLM’s responses:
import os

import requests
from telegram import Update
from telegram.ext import ContextTypes

# Ollama chat endpoint, injected via the ollama-secret Secret (see below)
OLLAMA_ENDPOINT = os.environ["OLLAMA_ENDPOINT"]


async def chat(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    # Forward the user's message to the Ollama /api/chat endpoint
    data = {
        "model": "mistral:7b",
        "messages": [
            {
                "role": "user",
                "content": update.message.text
            }
        ],
        "stream": False
    }
    response = requests.post(OLLAMA_ENDPOINT, json=data)

    if response.status_code == 200:
        # Non-streaming responses return the answer in message.content
        ollama_response = response.json().get('message').get('content')
        await update.message.reply_text(ollama_response)
    else:
        await update.message.reply_text('Sorry, there was an error processing your request.')
This implementation is intentionally simple and has some limitations:
No conversation history or context is maintained between messages
No support for streaming responses
Limited error handling
These limitations could be addressed in a more robust implementation, but this serves as a good starting point for testing the API.
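As an illustration of the first limitation, here is a minimal sketch of keeping a per-chat conversation history in memory; it is a hedged example rather than part of the actual Tellama code, and the chat_histories structure is hypothetical:
# Hypothetical sketch: keep a short rolling history per Telegram chat
chat_histories: dict[int, list[dict]] = {}

async def chat(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    history = chat_histories.setdefault(update.message.chat_id, [])
    history.append({"role": "user", "content": update.message.text})

    # Send only the last few turns to keep the prompt short
    data = {"model": "mistral:7b", "messages": history[-10:], "stream": False}
    response = requests.post(OLLAMA_ENDPOINT, json=data)

    if response.status_code == 200:
        answer = response.json()["message"]["content"]
        history.append({"role": "assistant", "content": answer})
        await update.message.reply_text(answer)
    else:
        await update.message.reply_text('Sorry, there was an error processing your request.')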
Deploying the Telegram Bot on OpenShift¶
Now let’s containerize our Python script using a UBI (Universal Base Image) container and deploy it to our OpenShift cluster:
# Build the container image
cd tellama
podman build .
# Get the image ID and OpenShift registry URL
export image_id=$(echo $(podman images --format json |jq .[0].Id) | cut -c 2-13)
export ocp_registry=$(oc get route default-route -n openshift-image-registry -ojsonpath='{.spec.host}')
# Push to the OpenShift internal registry
podman login -u <user> -p $(oc whoami -t) ${ocp_registry}
podman push $image_id docker://${ocp_registry}/ollama/tellama:latest
After creating a new Telegram bot by messaging @BotFather, we create a Secret containing the Telegram token:
apiVersion: v1
kind: Secret
metadata:
  name: ollama-secret
  namespace: ollama
stringData:
  TELEGRAM_TOKEN: <your_telegram_token>
  OLLAMA_ENDPOINT: "http://ollama/api/chat"
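Alternatively, the same Secret can be created directly from the CLI:
$ oc create secret generic ollama-secret -n ollama --from-literal=TELEGRAM_TOKEN=<your_telegram_token> --from-literal=OLLAMA_ENDPOINT=http://ollama/api/chat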
We can now deploy the image to the OpenShift cluster, replacing the image: value below with your own registry route (the $ocp_registry value from above):
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tellama
  namespace: ollama
spec:
  selector:
    matchLabels:
      name: tellama
  template:
    metadata:
      labels:
        name: tellama
    spec:
      containers:
        - name: tellama
          image: default-route-openshift-image-registry.apps.da2.epheo.eu/ollama/tellama:latest
          imagePullPolicy: Always
          resources: {}
          envFrom:
            - secretRef:
                name: ollama-secret
      restartPolicy: Always
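To verify that the bot started correctly, check the pod status and logs (the label matches the manifest above):
$ oc get pods -n ollama -l name=tellama
$ oc logs deployment/tellama -n ollama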
Integrating with Development Tools¶
Developer Experience with VSCode Extensions¶
With our Ollama service running, we can integrate it with various development tools. Here are some VSCode extensions that work well with self-hosted LLMs:
Continue: A VSCode extension similar to GitHub Copilot that can leverage your self-hosted LLM for code completion, explanation, and generation.

You can install Continue directly from the VSCode extension marketplace. After installation, you’ll need to configure it to use your self-hosted Ollama service:
Create or edit the ~/.continue/config.json configuration file
Add Ollama as your LLM provider with the appropriate endpoint
Note
Replace the example endpoint with your actual Ollama route:
oc get route ollama -n ollama -ojsonpath='{.spec.host}'
{
"models": [
{
"title": "Mistral",
"provider": "ollama",
"model": "mistral:7b",
"apiBase": "http://ollama-ollama.apps.da2.epheo.eu"
}
]
}
commitollama is a specialized VSCode extension that generates meaningful Git commit messages using your self-hosted LLM.
Configuration is straightforward through your VSCode settings:
{
"commitollama.custom.endpoint": "http://ollama-ollama.apps.da2.epheo.eu",
"commitollama.model": "custom",
"commitollama.custom.model": "mistral:7b"
}
With this configuration, you can simply stage your changes and let commitollama analyze the diff to generate descriptive commit messages.

Mobile Integration: Enchanted for iOS¶
For on-the-go access to your self-hosted LLM, you can use Enchanted, an open-source iOS application specifically designed for Ollama compatibility.
Enchanted offers a ChatGPT-like interface for interacting with your self-hosted models, providing a clean, intuitive mobile experience with features like:
Conversation history
Different model selection
Conversation context management
Share and export options
Note
Since our Ollama endpoint is exposed on a private network, your iOS device can connect either through a local WiFi network or by configuring a VPN connection to your network.


Performance Considerations¶
GPU Selection¶
While any NVIDIA GPU can run these models, performance varies considerably:
RTX 3080/3090 or better: Excellent for running multiple 7B models
CPU-only: Functional but with much slower inference speeds (5-10x slower)
Model Quantization¶
Quantization reduces model precision to improve performance at a small cost to accuracy:
Q4_0: Fastest inference, lowest VRAM usage (4-bit quantization)
Q5_K_M: Good balance between quality and performance
Q8_0: Higher quality results but requires more VRAM and slower inference
Memory Requirements¶
Typical memory requirements for a 7B parameter model:
Full precision (FP16): ~14GB VRAM
Q8_0 quantization: ~7GB VRAM
Q4_0 quantization: ~4GB VRAM
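These figures follow roughly from bytes-per-weight times parameter count: FP16 uses 2 bytes per weight (2 × 7B ≈ 14 GB), Q8_0 about 1 byte (≈ 7 GB), and Q4_0 about half a byte (≈ 3.5 to 4 GB), plus some additional VRAM for the context (KV cache).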
Troubleshooting¶
For common issues and their solutions, refer to the troubleshooting guide.
Common challenges include:
GPU detection issues
Memory limitations
Network connectivity problems
Model loading failures
Security Considerations¶
When deploying LLM services like Ollama, consider these important security aspects:
Authentication¶
The default Ollama API has no built-in authentication. For production use, consider:
Implementing a reverse proxy with authentication
Using OpenShift network policies to restrict access (see the sketch after this list)
Creating application-specific API keys
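As a sketch of the second option, the following NetworkPolicy only allows traffic to the Ollama pod from other pods in the ollama namespace (such as the Telegram bot) and from the OpenShift ingress controllers; adapt it to your cluster before use:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-restrict-ingress
  namespace: ollama
spec:
  podSelector:
    matchLabels:
      name: ollama
  ingress:
    - from:
        # Pods in the same namespace (e.g. the tellama bot)
        - podSelector: {}
        # The OpenShift router, so the external Route keeps working
        - namespaceSelector:
            matchLabels:
              network.openshift.io/policy-group: ingress
      ports:
        - protocol: TCP
          port: 11434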
Data Privacy¶
One of the main advantages of self-hosting LLMs is data privacy:
All prompts and completions remain within your infrastructure
No data is sent to third-party services
Sensitive information can be processed safely
However, remember that the model itself may contain biases or potentially generate problematic content. Implement appropriate guardrails for your specific use case.
Resource Isolation¶
Use OpenShift’s resource management features to ensure your Ollama deployment:
Has appropriate resource limits to prevent cluster resource exhaustion
Is isolated from other critical workloads
Has priority classes set according to your needs
Self-hosting LLMs in this way is particularly valuable for:
Organizations with data privacy requirements
Development teams building AI-powered applications
Edge computing scenarios with limited internet connectivity
Research and experimentation with different LLM models
As LLMs continue to evolve, having a local deployment option provides both flexibility and cost control compared to purely cloud-based alternatives.