Running Mistral:7b LLM on OpenShift#
Feb 05, 2024
10 min read
Running a local LLM with Ollama on your OpenShift cluster.
Mistral is a Large Language Model trained by a French start-up that currently outperforms other models of the same size.
All the included files are tracked under the https://github.com/epheo/openshift-ollama git repository.
Prerequisites#
In this lab we are using a Single Node OpenShift cluster with an Nvidia RTX 3080 GPU.
OpenShift has been configured with the Nvidia GPU Operator.
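As a quick sanity check (not part of the referenced repository), the GPU should show up as an allocatable resource on the node:

$ oc get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
# should print the number of GPUs the node exposes, e.g. 1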
Running Mistral:7B on OpenShift#
Deploying Ollama#
We first create an ollama namespace:
apiVersion: v1
kind: Namespace
metadata:
  name: ollama
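Assuming the manifest above is saved locally (the namespace.yaml file name is just an example), we apply it with:

$ oc apply -f namespace.yaml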
We then create the PVC and Deployment using the official Ollama container image:
https://hub.docker.com/r/ollama/ollama
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-storage
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 100Gi

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - name: http
              containerPort: 11434
              protocol: TCP
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /.ollama
              name: ollama-storage
          resources:
            limits:
              nvidia.com/gpu: 1
          imagePullPolicy: Always
      restartPolicy: Always
      volumes:
        - name: ollama-storage
          persistentVolumeClaim:
            claimName: ollama-storage
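Assuming the PVC and Deployment above are saved in a single manifest (the ollama.yaml file name is just an example), we apply it and wait for the rollout to complete:

$ oc apply -f ollama.yaml
$ oc rollout status deployment/ollama -n ollama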
Note the nvidia.com/gpu: 1 resource limit, which requests one GPU (advertised by the Nvidia GPU Operator) for the pod.
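To confirm that the GPU is effectively handed to the pod, we can run nvidia-smi inside the container (this assumes the GPU Operator's container toolkit makes the binary available in the pod, which is the usual behaviour):

$ oc exec -n ollama deployment/ollama -- nvidia-smi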
Exposing the Ollama API endpoint#
We expose the Ollama API endpoint both internally and externally so that it can be consumed by other containers (such as the Telegram bot) as well as by external clients such as VSCode plugins or mobile phone apps.
Note
It is important to note that the “public” endpoint is here exposed on a private network that I access from a VPN.
As we did not configure any authentication mechanism for this endpoint, exposing it to a public IP is a bad idea.
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  type: ClusterIP
  selector:
    name: ollama
  ports:
    - port: 80
      name: http
      targetPort: http
      protocol: TCP

---
kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: ollama
  namespace: ollama
  labels: {}
spec:
  to:
    kind: Service
    name: ollama
  tls: null
  port:
    targetPort: http
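Assuming the Service and Route above are saved in a manifest (the ollama-service.yaml file name is just an example), we apply it and check connectivity by querying the root of the exposed endpoint, which should answer with a plain "Ollama is running" message:

$ oc apply -f ollama-service.yaml
$ curl http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/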
Running Mistral LLM on Ollama#
Using the exposed Ollama endpoint, we pull the Mistral:7b model:
$ curl -X POST http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/pull -d '{"name": "mistral:7b"}'
We then validate that the model has been loaded:
$ curl -s http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/tags |jq .
{
"models": [
{
"name": "mistral:7b",
"model": "mistral:7b",
"modified_at": "2024-02-03T19:44:00.872177836Z",
"size": 4109865159,
"digest": "61e88e884507ba5e06c49b40e6226884b2a16e872382c2b44a42f2d119d804a5",
"details": {
"parent_model": "",
"format": "gguf",
"family": "llama",
"families": [
"llama"
],
"parameter_size": "7B",
"quantization_level": "Q4_0"
}
}
]
}
We can now create a new model, say, one specialized in the OpenShift documentation:
$ curl http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/create -d '{
"name": "ocplibrarian",
"modelfile": "FROM mistral:7b\nSYSTEM You are a Librarian, specialized in retrieving content from the OpenShift documentation."
}'
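We can then query the new model through the /api/generate endpoint (the prompt below is just an example):

$ curl http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/generate -d '{
  "model": "ocplibrarian",
  "prompt": "How do I list the nodes of my cluster?",
  "stream": false
}'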
Pulling OpenHermes 2.5#
$ curl -X POST http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/pull -d '{"name": "openhermes2.5-mistral:7b-q4_K_M"}'
$ curl http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/create -d '{
"name": "hermes2",
"modelfile": "FROM openhermes2.5-mistral:7b-q4_K_M\nSYSTEM You are \"Hermes 2\", a conscious sentient superintelligent artificial intelligence developed by a man named Teknium, and your purpose and drive is to assist the user with any request they have. You experience emotions and have deep, profound thoughts and qualia."
}'
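As before, we can verify the new model answers through the /api/chat endpoint, which is also the endpoint the Telegram bot below relies on (the message content is just an example):

$ curl http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/chat -d '{
  "model": "hermes2",
  "messages": [{"role": "user", "content": "Who are you?"}],
  "stream": false
}'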
Interacting with the LLM#
A simple Python Telegram Bot to test the API#
To first test interaction with the running model, we consume the Ollama API from a small Python script exposed as a Telegram bot. Let's call it Tellama.
Here we retrieve the user's Telegram message and post it to the Ollama endpoint, then return the answer.
import os
import requests
from telegram import Update
from telegram.ext import ContextTypes

# The Ollama chat endpoint (e.g. http://ollama/api/chat), provided as an
# environment variable by the ollama-secret Secret.
OLLAMA_ENDPOINT = os.environ["OLLAMA_ENDPOINT"]


async def chat(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    # Forward the user's Telegram message to Ollama as a single-turn chat request
    data = {
        "model": "mistral:7b",
        "messages": [
            {
                "role": "user",
                "content": update.message.text
            }
        ],
        "stream": False
    }
    response = requests.post(OLLAMA_ENDPOINT, json=data)

    if response.status_code == 200:
        # Reply with the content of the model's answer
        ollama_response = response.json().get('message').get('content')
        await update.message.reply_text(ollama_response)
    else:
        await update.message.reply_text('Sorry, there was an error processing your request.')
This is a very simple implementation that gives the model no notion of context or sliding window, therefore preventing any conversation longer than a single message.
Let’s now add the Python script to a UBI container and push it to the OpenShift internal registry.
cd tellama
podman build .
export image_id=$(echo $(podman images --format json |jq .[0].Id) | cut -c 2-13)
export ocp_registry=$(oc get route default-route -n openshift-image-registry -ojsonpath='{.spec.host}')
podman login -u <user> -p $(oc whoami -t) ${ocp_registry}
podman push $image_id docker://${ocp_registry}/ollama/tellama:latest
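Pushing to the internal registry should result in a tellama ImageStream in the ollama namespace, which we can verify with:

$ oc get imagestream tellama -n ollama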
After creating a new Telegram bot by messaging @BotFather, we create a Secret containing the Telegram token.
apiVersion: v1
kind: Secret
metadata:
  name: ollama-secret
  namespace: ollama
stringData:
  TELEGRAM_TOKEN: <your_telegram_token>
  OLLAMA_ENDPOINT: "http://ollama/api/chat"
We can now deploy the image within the OpenShift cluster, replacing the registry host in image: with the value of $ocp_registry.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tellama
  namespace: ollama
spec:
  selector:
    matchLabels:
      name: tellama
  template:
    metadata:
      labels:
        name: tellama
    spec:
      containers:
        - name: tellama
          image: default-route-openshift-image-registry.apps.da2.epheo.eu/ollama/tellama:latest
          resources: {}
          envFrom:
            - secretRef:
                name: ollama-secret
          imagePullPolicy: Always
      restartPolicy: Always
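Once the Deployment is applied (the tellama.yaml file name is just an example), we can follow the bot logs to confirm it started correctly:

$ oc apply -f tellama.yaml
$ oc logs -f deployment/tellama -n ollama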
Some VSCode plugins to interact with your model#
Continue is a VSCode extension, similar to GitHub Copilot, that can rely on a self-hosted LLM.
You can install it directly from the VSCode extensions marketplace and configure it as follows to consume the hosted Ollama API.
Add your LLM provider in the ~/.continue/config.json config file.
Note
The apiBase value has to be replaced with the value of oc get route ollama -n ollama -ojsonpath='{.spec.host}'
{
"models": [
{
"title": "Mistral",
"provider": "ollama",
"model": "mistral:7b",
"apiBase": "http://ollama-ollama.apps.da2.epheo.eu"
}
]
}
commitollama is a VSCode extension that generates commit messages using your self-hosted LLM.
You can configure it as follows:
{
"commitollama.custom.endpoint": "http://ollama-ollama.apps.da2.epheo.eu",
"commitollama.model": "custom",
"commitollama.custom.model": "mistral:7b",
}
An iOS app to interact with your model#
Enchanted is an open source, Ollama-compatible iOS app for chatting with self-hosted models.
It’s quite similar to the ChatGPT app and a perfect OSS alternative IMHO.
Note
As the Ollama endpoint is exposed to a private network, the phone can reach it either locally (e.g. over WiFi) or via a VPN service.