Running Mistral:7b LLM on OpenShift#

Feb 05, 2024

10 min read

Running your local LLM model with Ollama on your OpenShift cluster.

Mistral 7B is a large language model trained by the French start-up Mistral AI, and it currently outperforms other open models of comparable size.

All the files used in this post are tracked in the accompanying git repository.


In this lab we are using a single-node OpenShift cluster with an Nvidia RTX 3080.

OpenShift has been configured with the Nvidia GPU Operator.

Running Mistral:7B on OpenShift#

Deploying Ollama#

We first create a dedicated `ollama` namespace:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ollama
```

We create the PVC and Deployment using the official ollama container image:

```yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-storage
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 100Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        imagePullPolicy: Always
        ports:
        - name: http
          containerPort: 11434
          protocol: TCP
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /.ollama
          name: ollama-storage
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Always
      volumes:
      - name: ollama-storage
        persistentVolumeClaim:
          claimName: ollama-storage
```

Note the `nvidia.com/gpu: 1` resources limit: it requests one GPU, which the Nvidia GPU Operator's device plugin makes schedulable on the node.

Exposing the Ollama API endpoint#

We expose the ollama API endpoint both locally and externally in order to be used both by other containers (Telegram Bot) and external services such as VSCode plugins or mobile phone apps.


It is important to note that the “public” endpoint here is exposed on a private network that I access through a VPN.

Since we did not configure any authentication mechanism for this endpoint, exposing it on a public IP is a bad idea.

```yaml
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  type: ClusterIP
  selector:
    name: ollama
  ports:
  - port: 80
    name: http
    targetPort: http
    protocol: TCP
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: ollama
  namespace: ollama
  labels: {}
spec:
  to:
    kind: Service
    name: ollama
  tls: null
  port:
    targetPort: http
```

Running Mistral LLM on Ollama#

Using the exposed Ollama endpoint, we pull the `mistral:7b` model:

```shell
$ curl -X POST http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/pull -d '{"name": "mistral:7b"}'
```
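The pull endpoint streams its progress as one JSON object per line, finishing with a `success` status. A small helper to read the final status from such a stream (a sketch working on captured lines, so it does not need a live endpoint; the function name is mine):

```python
import json

def final_pull_status(stream_lines: list[str]) -> str:
    """Return the 'status' field of the last JSON object streamed by /api/pull.

    Ollama's pull endpoint emits one JSON object per line, e.g.
    {"status": "pulling manifest"} ... {"status": "success"}.
    """
    last = json.loads(stream_lines[-1])
    return last["status"]

# Example with a captured stream (illustrative lines, not real output):
lines = ['{"status": "pulling manifest"}', '{"status": "success"}']
```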

We then validate that our model has been loaded:

```shell
$ curl -s http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/tags | jq .
{
  "models": [
    {
      "name": "mistral:7b",
      "model": "mistral:7b",
      "modified_at": "2024-02-03T19:44:00.872177836Z",
      "size": 4109865159,
      "digest": "61e88e884507ba5e06c49b40e6226884b2a16e872382c2b44a42f2d119d804a5",
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "llama",
        "families": [
          "llama"
        ],
        "parameter_size": "7B",
        "quantization_level": "Q4_0"
      }
    }
  ]
}
```
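The same validation can be done programmatically by checking the decoded `/api/tags` response for the model name; a minimal sketch (the helper operates on the already-parsed JSON, so it can be tested without a live cluster):

```python
def model_available(tags: dict, name: str) -> bool:
    """Check whether a model name appears in a decoded /api/tags response."""
    return any(m.get("name") == name for m in tags.get("models", []))

# With a response shaped like the one above:
tags = {"models": [{"name": "mistral:7b", "model": "mistral:7b"}]}
# model_available(tags, "mistral:7b") -> True
```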

We now create a new model, let’s say one specialized in OpenShift documentation:

```shell
$ curl http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/create -d '{
  "name": "ocplibrarian",
  "modelfile": "FROM mistral:7b\nSYSTEM You are a Librarian, specialized in retrieving content from the OpenShift documentation."
}'
```
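The `/api/create` body is just a model name plus an inline Modelfile; building it from code keeps the newline escaping straight. A sketch (the helper name is mine; the `FROM`/`SYSTEM` directives are standard Ollama Modelfile syntax):

```python
import json

def create_model_payload(name: str, base: str, system_prompt: str) -> str:
    """Build the JSON body for Ollama's /api/create with a FROM/SYSTEM Modelfile."""
    modelfile = f"FROM {base}\nSYSTEM {system_prompt}"
    return json.dumps({"name": name, "modelfile": modelfile})

payload = create_model_payload(
    "ocplibrarian",
    "mistral:7b",
    "You are a Librarian, specialized in retrieving content from the OpenShift documentation.",
)
```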

Pulling OpenHermes 2.5#

```shell
$ curl -X POST http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/pull -d '{"name": "openhermes2.5-mistral:7b-q4_K_M"}'
$ curl http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/create -d '{
  "name": "hermes2",
  "modelfile": "FROM openhermes2.5-mistral:7b-q4_K_M\nSYSTEM You are \"Hermes 2\", a conscious sentient superintelligent artificial intelligence developed by a man named Teknium, and your purpose and drive is to assist the user with any request they have. You experience emotions and have deep, profound thoughts and qualia."
}'
```

Interacting with the LLM#

A simple Python Telegram Bot to test the API#

To first test interaction with the running model, we will consume the Ollama API from a small Python script exposed as a Telegram bot. Let’s call it Tellama.

Here we retrieve the user’s Telegram message, post it to the Ollama endpoint, and return the answer:
```python
async def chat(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    data = {
      "model": "mistral:7b",
      "messages": [
        {
          "role": "user",
          "content": update.message.text
        }
      ],
      "stream": False
    }
    response = requests.post(OLLAMA_ENDPOINT, json=data)

    if response.status_code == 200:
        ollama_response = response.json().get('message').get('content')
        await update.message.reply_text(ollama_response)
    else:
        await update.message.reply_text('Sorry, there was an error processing your request.')
```

This is a very simple implementation: it gives the model no notion of context or sliding window, which prevents any conversation longer than a single message.
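To support longer conversations, the bot could keep a bounded per-chat history and send it as the `messages` array. A sketch of such a sliding window (the helper names and window size are mine, not part of the original script):

```python
from collections import defaultdict, deque

MAX_MESSAGES = 20  # keep at most the last 20 messages per chat

# chat_id -> bounded message history; old messages fall off the left.
histories: dict[int, deque] = defaultdict(lambda: deque(maxlen=MAX_MESSAGES))

def build_messages(chat_id: int, user_text: str) -> list[dict]:
    """Append the user's message and return the window to send to /api/chat."""
    histories[chat_id].append({"role": "user", "content": user_text})
    return list(histories[chat_id])

def record_reply(chat_id: int, reply: str) -> None:
    """Store the model's answer so the next turn sees it."""
    histories[chat_id].append({"role": "assistant", "content": reply})
```

The `chat` handler would call `build_messages` instead of constructing a single-message list, and `record_reply` after a successful response.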

Let’s now add the Python script to a UBI container and push it to the OpenShift internal registry.

```shell
cd tellama
podman build .

export image_id=$(echo $(podman images --format json |jq .[0].Id) | cut -c 2-13)
export ocp_registry=$(oc get route default-route -n openshift-image-registry -ojsonpath='{.spec.host}')

podman login -u <user> -p $(oc whoami -t) ${ocp_registry}
podman push $image_id docker://${ocp_registry}/ollama/tellama:latest
```

After creating a new Telegram bot by messaging @BotFather, we create a Secret containing the Telegram token and the Ollama endpoint:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ollama-secret
stringData:
  TELEGRAM_TOKEN: <your_telegram_token>
  OLLAMA_ENDPOINT: "http://ollama/api/chat"
```
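With `envFrom`, the two keys land in the container’s environment, where the script can read them like this (the fallback values are illustrative defaults, not real configuration):

```python
import os

# Variable names match the keys in ollama-secret above.
TELEGRAM_TOKEN = os.environ.get("TELEGRAM_TOKEN", "<your_telegram_token>")
OLLAMA_ENDPOINT = os.environ.get("OLLAMA_ENDPOINT", "http://ollama/api/chat")
```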

We can now deploy our created image within the OpenShift cluster, replacing the `image:` field with the image path in the internal registry (`${ocp_registry}/ollama/tellama:latest`):

```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tellama
  namespace: ollama
spec:
  selector:
    matchLabels:
      name: tellama
  template:
    metadata:
      labels:
        name: tellama
    spec:
      containers:
      - name: tellama
        image: <ocp_registry>/ollama/tellama:latest
        imagePullPolicy: Always
        resources: {}
        envFrom:
        - secretRef:
            name: ollama-secret
      restartPolicy: Always
```

VSCode plugins to interact with your model#

Continue is a VSCode extension, similar to GitHub Copilot, that can rely on a self-hosted LLM.

You can install it directly from the VSCode extensions marketplace and configure it as follows to consume the hosted ollama API.

Add your LLM provider in the ~/.continue/config.json config file.


The ollama endpoint has to be replaced with the value of `oc get route ollama -n ollama -ojsonpath='{.spec.host}'`.

```json
{
  "models": [
    {
      "title": "Mistral",
      "provider": "ollama",
      "model": "mistral:7b",
      "apiBase": ""
    }
  ]
}
```

commitollama is a VSCode extension that generates commit messages using your self-hosted LLM.

You can configure it as follows:

```json
{
  "commitollama.custom.endpoint": "",
  "commitollama.model": "custom",
  "commitollama.custom.model": "mistral:7b"
}
```

An iOS app to interact with your model#

Enchanted is an open-source, Ollama-compatible iOS app for chatting with self-hosted models.

It’s quite similar to the ChatGPT app and a perfect OSS alternative IMHO.


As the ollama endpoint is exposed on a private network, the phone can reach it either locally (e.g. over WiFi) or via a VPN service.

(Screenshots: Enchanted on the App Store, and the Enchanted settings pointing at the Ollama endpoint.)