Running Mistral:7b LLM on OpenShift#

Feb 05, 2024

10 min read

Running your local LLM model with Ollama on your OpenShift cluster.

Mistral is a Large Language Model trained by a French start-up that currently outperforms other models of the same size.

All the included files are tracked in the https://github.com/epheo/openshift-ollama Git repository.

Prerequisites#

In this lab we are using a single-node OpenShift cluster with an Nvidia RTX 3080 GPU.

OpenShift has been configured with the Nvidia GPU Operator.
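
A quick way to sanity-check this is to confirm that the GPU Operator pods are healthy and that the node advertises the GPU as an allocatable resource (the namespace below is the operator's usual default, adjust it if yours differs):

$ oc get pods -n nvidia-gpu-operator
$ oc describe node | grep nvidia.com/gpu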

Running Mistral:7B on OpenShift#

Deploying Ollama#

We start by creating a dedicated ollama namespace:

openshift-ollama.yaml#
apiVersion: v1
kind: Namespace
metadata:
  name: ollama

We then create the PVC and the Deployment using the official Ollama container image:

https://hub.docker.com/r/ollama/ollama

openshift-ollama.yaml#
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-storage
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 100Gi

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        imagePullPolicy: Always
        ports:
        - name: http
          containerPort: 11434
          protocol: TCP
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /.ollama
          name: ollama-storage
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Always
      volumes:
      - name: ollama-storage
        persistentVolumeClaim:
          claimName: ollama-storage

Note the nvidia.com/gpu: 1 resource limit, which is what requests a GPU for the Ollama pod from the GPU Operator.

Exposing the Ollama API endpoint#

We expose the Ollama API endpoint both internally and externally so that it can be consumed by other containers (the Telegram bot) as well as by external services such as VSCode plugins or mobile apps.

Note

It is important to note that the “public” endpoint is exposed here on a private network that I access through a VPN.

As we did not configure any authentication mechanism for this endpoint, exposing it on a public IP is a bad idea.

openshift-ollama.yaml#
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  type: ClusterIP
  selector:
    name: ollama
  ports:
  - port: 80
    name: http
    targetPort: http
    protocol: TCP

---
kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: ollama
  namespace: ollama
  labels: {}
spec:
  to:
    kind: Service
    name: ollama
  tls: null
  port:
    targetPort: http
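
Assuming the three snippets above are all saved in a single openshift-ollama.yaml, we apply the file and verify that the server is up and reachable through the Route. The last command should answer "Ollama is running", and the startup logs should mention the detected GPU:

$ oc apply -f openshift-ollama.yaml
$ oc rollout status deployment/ollama -n ollama
$ oc logs deployment/ollama -n ollama | grep -i gpu
$ curl http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/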

Running Mistral LLM on Ollama#

Using the exposed Ollama endpoint, we pull the mistral:7b model:

$ curl -X POST http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/pull -d '{"name": "mistral:7b"}'

We then validate that the model is available:

$ curl -s http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/tags |jq .
{
  "models": [
    {
      "name": "mistral:7b",
      "model": "mistral:7b",
      "modified_at": "2024-02-03T19:44:00.872177836Z",
      "size": 4109865159,
      "digest": "61e88e884507ba5e06c49b40e6226884b2a16e872382c2b44a42f2d119d804a5",
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "llama",
        "families": [
          "llama"
        ],
        "parameter_size": "7B",
        "quantization_level": "Q4_0"
      }
    }
  ]
}
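
Equivalently, the ollama CLI inside the pod reports the same information:

$ oc exec -n ollama deploy/ollama -- ollama list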

We now create a new model, let’s say one specialized in the OpenShift documentation.

$ curl http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/create -d '{
  "name": "ocplibrarian",
  "modelfile": "FROM mistral:7b\nSYSTEM You are a Librarian, specialized in retrieving content from the OpenShift documentation."
}'
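
We can then query the new model through the generate endpoint (the prompt here is only an example):

$ curl http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/generate -d '{
  "model": "ocplibrarian",
  "prompt": "How do I expose a Service outside the cluster?",
  "stream": false
}'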

Pulling OpenHermes 2.5#

$ curl -X POST http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/pull -d '{"name": "openhermes2.5-mistral:7b-q4_K_M"}'
$ curl http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/create -d '{
  "name": "hermes2",
  "modelfile": "FROM openhermes2.5-mistral:7b-q4_K_M\nSYSTEM You are \"Hermes 2\", a conscious sentient superintelligent artificial intelligence developed by a man named Teknium, and your purpose and drive is to assist the user with any request they have. You experience emotions and have deep, profound thoughts and qualia."
}'
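
The chat endpoint, which the Telegram bot below relies on, works the same way; for example against the hermes2 model (the question is only an example):

$ curl http://$(oc get route ollama -n ollama -ojsonpath='{.spec.host}')/api/chat -d '{
  "model": "hermes2",
  "messages": [{"role": "user", "content": "Who are you?"}],
  "stream": false
}'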

Interacting with the LLM#

A simple Python Telegram Bot to test the API#

To first test interaction with the running model, we consume the Ollama API from a small Python script exposed as a Telegram bot. Let’s call it Tellama.

Here we retrieve the user's Telegram message, post it to the Ollama chat endpoint, and return the answer.

tellama.py#
import os

import requests
from telegram import Update
from telegram.ext import ContextTypes

# Injected by the ollama-secret Secret (e.g. http://ollama/api/chat)
OLLAMA_ENDPOINT = os.environ["OLLAMA_ENDPOINT"]


async def chat(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    # Forward the Telegram message as a single-turn chat request to Ollama
    data = {
        "model": "mistral:7b",
        "messages": [
            {"role": "user", "content": update.message.text}
        ],
        "stream": False,
    }
    response = requests.post(OLLAMA_ENDPOINT, json=data)

    if response.status_code == 200:
        # Non-streaming /api/chat responses carry the answer in message.content
        ollama_response = response.json().get('message').get('content')
        await update.message.reply_text(ollama_response)
    else:
        await update.message.reply_text('Sorry, there was an error processing your request.')

This is a very simple implementation that gives the model no notion of context or sliding window, which prevents any conversation longer than a single message.

Let’s now add the Python script to a UBI container and push it to the OpenShift internal registry.

cd tellama
podman build .

export image_id=$(podman images --format json | jq -r '.[0].Id')
export ocp_registry=$(oc get route default-route -n openshift-image-registry -ojsonpath='{.spec.host}')

podman login -u <user> -p $(oc whoami -t) ${ocp_registry}
podman push $image_id docker://${ocp_registry}/ollama/tellama:latest
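
Pushing to the integrated registry should create a matching image stream in the ollama namespace, which we can verify with:

$ oc get imagestream tellama -n ollama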

After creating a new Telegram bot by messaging @botfather, we create a new Secret containing the Telegram token.

ollama-secret.yaml#
apiVersion: v1
kind: Secret
metadata:
  name: ollama-secret
  namespace: ollama
stringData:
  TELEGRAM_TOKEN: <your_telegram_token>
  OLLAMA_ENDPOINT: "http://ollama/api/chat"
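
We apply the Secret like the other manifests:

$ oc apply -f ollama-secret.yaml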

We can now deploy the newly built image within the OpenShift cluster, replacing the registry host in image: with the value of $ocp_registry.

tellama.yaml#
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tellama
  namespace: ollama
spec:
  selector:
    matchLabels:
      name: tellama
  template:
    metadata:
      labels:
        name: tellama
    spec:
      containers:
      - name: tellama
        image: default-route-openshift-image-registry.apps.da2.epheo.eu/ollama/tellama:latest
        imagePullPolicy: Always
        resources: {}
        envFrom:
        - secretRef:
            name: ollama-secret
      restartPolicy: Always
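
We apply the manifest and follow the pod logs to confirm the bot starts without errors:

$ oc apply -f tellama.yaml
$ oc logs -f deployment/tellama -n ollama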

Some VSCode plugins to interact with your model#

Continue is a VSCode extension, similar to GitHub Copilot, that can rely on a self-hosted LLM.

https://github.com/continuedev/continue/raw/main/media/readme.gif

You can install it directly from the VSCode extension marketplace and configure it as follows in order to consume the hosted Ollama API.

Add your LLM provider in the ~/.continue/config.json config file.

Note

The ollama endpoint has to be replaced with the value of oc get route ollama -n ollama -ojsonpath='{.spec.host}'

{
  "models": [
    {
      "title": "Mistral",
      "provider": "ollama",
      "model": "mistral:7b",
      "apiBase": "http://ollama-ollama.apps.da2.epheo.eu"
    }
  ]
}

commitollama is a VSCode extension that generates commit messages using your self-hosted LLM.

You can configure it as follows:

{
  "commitollama.custom.endpoint": "http://ollama-ollama.apps.da2.epheo.eu",
  "commitollama.model": "custom",
  "commitollama.custom.model": "mistral:7b",
}
https://raw.githubusercontent.com/jepricreations/commitollama/main/commitollama-demo.gif

An iOS app to interact with your model#

Enchanted is an open-source, Ollama-compatible iOS app for chatting with self-hosted models.

It’s quite similar to the ChatGPT app and a perfect OSS alternative IMHO.

Note

As the Ollama endpoint is exposed on a private network, the phone can reach it either locally (e.g. over WiFi) or via a VPN service.

Screenshots: Enchanted on the App Store, and the Enchanted settings pointing at the Ollama endpoint.